Postgenomic technologies for genomic and proteomic analysis in biological and medical research

Over the 15 years since the decoding of the human genome, a large number of individual genomes have been sequenced. Targeted sequencing, that is, sequencing of selected genome regions, has been widely used both in research and in medical practice. Various types of genetic analysis are starting to enter daily clinical routine. At the same time, the price of sequencing is decreasing and, as a result, the amount of genetic information available to researchers and physicians is increasing. Together, these processes create the need for databases for the centralized storage of genetic information, which is crucial for synchronizing and validating the work of various institutions. One of the first such databases was dbSNP, created and supervised by the US National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). However, the available methods for studying associations between DNA polymorphisms and various phenotypic manifestations do not cover the most important layer of regulation of biological processes: the proteome. The high-throughput proteomic methods now being developed will make it possible to identify the driver mutations that make the greatest contribution to the phenotype of the studied object. The application of integrated genome and proteome analysis to the diagnosis and treatment of cancer is currently one of the most important research goals. This approach will make it possible to identify new genetic biomarkers that could be used for reliable prediction of treatment response, for assessing the risk of the most important diseases, and for the development of novel medications. This review presents recent advances in proteomic and genomic approaches to the development of more sensitive diagnostic and prognostic biomarkers that can be translated into improved clinical care and treatment.


Introduction
Over the 15 years since the decoding of the human genome, a large number of individual genomes have been sequenced. Targeted sequencing, that is, sequencing of selected genome regions, has been widely used both in research and in medical practice. Various types of genetic analysis are starting to enter daily clinical routine. At the same time, the price of sequencing is decreasing and, as a result, the amount of genetic information available to researchers and physicians is increasing. Together, these processes create the need for databases for the centralized storage of genetic information, which is crucial for synchronizing and validating the work of various institutions. One of the first such databases was dbSNP, created and supervised by the US National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Currently, the database contains more than 23.7 million genetic polymorphisms, of which 14.5 million are validated.
For most diseases associated with changes in DNA, there is currently no clear understanding of the specific genetic mechanisms of their development. In other words, it can be reliably established that a disease is due to hereditary factors, but the specific mechanism of inheritance remains unexplored. This explains the vague formulations found in official medical guidelines for diagnosing such diseases. For example, recommendations for the early diagnosis of type 2 diabetes mellitus (T2DM) from the International Diabetes Association state that T2DM is determined by hereditary factors. The authors recommend that patients undergo genetic testing without specifying the nature, methodology, references, or expected results of the test. Oncological diseases stand apart because of their specificity: several mechanisms of cancer, and key genes whose dysfunction can cause cancer, are known precisely, which makes it much easier to establish the cause of a hereditary form of cancer and even to develop a genetic testing method.
Even for diseases whose genetic mechanisms of predisposition and inheritance are reasonably well established, clear practical guidelines are missing. Surprisingly, the deadliest class of diseases, cardiovascular diseases (CVD), is not covered by clinical guidelines despite being fairly well studied (more than a hundred genetic polymorphisms that determine the inheritance of CVD have been identified to date). In developed countries, mandatory procedures for assessing cardiovascular risk, based on biochemical and questionnaire parameters, began to appear only during the past two decades. At the same time, clinical and scientific research continues in these areas. For a number of diseases, genetic predictor polymorphisms were discovered during clinical trials and are stored in the ClinVar database with the "risk factor" flag. These risk factors fall into two classes: "pathogenic" and "likely pathogenic". There are also results of the most advanced and promising studies that complement and extend the clinical trial data.
Advances in genomic technologies have greatly facilitated the understanding of the genetic mechanisms of a number of diseases and have contributed to the discovery of new biomarkers. The combination of proteomic and genomic technologies is essential for detecting biomarkers for the early diagnosis of diseases associated with DNA damage, as well as for general biological research and for developments in biotechnology and the food industry. Recent work on detecting socially significant diseases from the human genome using advanced genomic technologies, such as PCR and next-generation sequencing (NGS), has shown promising results. Similarly, proteomics can revolutionize the diagnosis and screening of socially significant diseases on the basis of new proteomic databases that include somatic variants and post-translational modifications. Thus, the proteomic technologies being developed can be used as an addition to classical research methods (Panis, Pizzatti, Souza, & Abdelhay, 2016). Moreover, the use of several proteomic and genomic biomarkers, rather than a single gene or protein, can significantly improve diagnostic accuracy and predictive ability, which allows adequate monitoring of the response to treatment and can be an important milestone on the way to personalized medicine (Jackson & Chester, 2015; Larijani, Perani, Alburai'si, & Parker, 2015).
However, the available methods for studying associations between DNA polymorphisms and various phenotypic manifestations do not cover the most important layer of regulation of biological processes: the proteome. The high-throughput proteomic methods now being developed will make it possible to identify the driver mutations that make the greatest contribution to the phenotype of the studied object. The application of integrated genome and proteome analysis to the diagnosis and treatment of cancer is currently one of the most important research goals. This approach will make it possible to identify new genetic biomarkers that could be used for reliable prediction of treatment response, for assessing the risk of the most important diseases, and for the development of novel medications. This review presents recent advances in proteomic and genomic approaches to the development of more sensitive diagnostic and prognostic biomarkers that can be translated into improved clinical care and treatment (Tanase, Albulescu, & Neagu, 2015).

Methods for studying the genome
Personalized medicine can be defined as follows: all diagnostic tools, types and combinations of therapy, treatment procedures (including surgery), medical recommendations and, in the future, new types of drugs that are created and applied on the basis of knowledge about the individual characteristics of a particular patient. Before the human genome was decoded, such medicine could not exist because too little was known about the features of the genome, transcriptome, proteome, and mechanisms of human immunity (Day & Siu, 2016).
DNA sequencing has revolutionized molecular biology, medicine, genomics, and related fields. The first sequencing method, proposed by Frederick Sanger in 1977, made possible the subsequent development of new and improved DNA sequencing platforms (Sanger, Nicklen, & Coulson, 1977). These technologies, along with a variety of computational tools for data analysis and interpretation, have helped researchers better understand the genomes of various organisms. They have made sequencing a powerful yet feasible research tool that can now be used efficiently even in small laboratories, without the need for large sequencing centers.

Classic sequencing methods
Even after the advent of next-generation sequencing methods, it is still considered that most DNA sequence data were obtained using first-generation technologies. Although these technologies are slower and more expensive (even after automation), they are still used in studies where increased accuracy is required. The first-generation technologies were the chemical sequencing method of Maxam and Gilbert (Maxam & Gilbert, 1977) and the sequencing-by-synthesis method developed by Sanger (Sanger et al., 1977).

Maxam-Gilbert sequencing method
This method first appeared in 1977 and is also known as the "chemical degradation" method. Chemical reagents act on specific bases of existing DNA molecules, which leads to subsequent cleavage at those bases. In this method, DNA is labeled with radioactive phosphorus at the 5' or 3' end. The next step is to obtain single-stranded DNA. This can be done by restriction cleavage, which produces sticky ends in DNA, or by denaturation at 90 °C in the presence of DMSO. The sample is divided into four aliquots, and a partial hydrolysis reaction is carried out in each, producing breaks at the positions of one of four different nucleotides or their combinations.
Ukrainian Journal of Ecology, 9(4), 2019
Fig. 1 Maxam-Gilbert sequencing method (Verma et al., 2016)
Currently, Maxam-Gilbert sequencing is not used in practice because of its low speed and laborious procedure.

Sanger sequencing technology
This method is also known as chain-termination sequencing. Sanger sequencing has played a crucial role in understanding the genetic landscape of the human genome. It was developed by Frederick Sanger in 1975 and commercialized in 1977 (Sanger et al., 1977). The technique is based on the use of dideoxyribonucleoside triphosphates (ddNTPs), which lack the 3'-hydroxyl group. The process uses seven components: a single-stranded DNA template to be sequenced, primers, Taq polymerase for amplification of the template, reaction buffer, deoxynucleotides (dNTPs), fluorescently labeled ddNTPs, and DMSO (used to denature secondary structures in the DNA chain). Because the 3'-OH group is absent in an incorporated ddNTP, the phosphodiester bond between the C3' position of the last nucleotide and the C5' phosphate of the next dNTP cannot form, which terminates the chain at this point (Fig. 2a). Polyacrylamide gel electrophoresis is used to separate the products of each reaction by length in four parallel lanes (Fig. 2b).
In 1987, Applied Biosystems, Inc. (ABI) released the first automated DNA sequencing machine, the ABI Model 370, developed by Leroy Hood and Mike Hunkapiller, which could generate read lengths of up to 350 base pairs. In 1995, ABI released the ABI PRISM 310 genetic analyzer, which simplified the inconvenient and laborious process of preparing gels, installing components, and loading samples. Swerdlow and Gesteland developed the capillary sequencer, which uses capillaries filled with polyacrylamide gel instead of gel plates. A general view of the resulting electropherograms is shown in Fig. 2c. Sequencers currently on the market use 4, 16, 48, 96 or 384 capillaries simultaneously. As the number of capillaries increases, the read length and sequencing speed also increase (Verma et al., 2016).
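The chain-termination principle described above can be illustrated with a minimal Python sketch. It is a toy simulation, not a model of any real instrument: the function names, the termination probability `dd_fraction`, and the number of simulated molecules are illustrative assumptions. Each simulated strand is extended along the template until a ddNTP happens to be incorporated; sorting the resulting fragments by length, as electrophoresis does, recovers the sequence of the synthesized strand.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sanger_fragments(template, n_molecules=2000, dd_fraction=0.1, seed=0):
    """Simulate chain termination on many copies of one template.

    `template` is written in the order in which the polymerase reads it.
    Returns a dict mapping each terminating base to the set of fragment
    lengths observed for that base (one "lane" per ddNTP).
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    lanes = {base: set() for base in "ACGT"}
    for _ in range(n_molecules):
        for length, template_base in enumerate(template, start=1):
            new_base = COMPLEMENT[template_base]
            # With probability dd_fraction a ddNTP is incorporated instead
            # of a dNTP; it has no 3'-OH group, so extension stops here.
            if rng.random() < dd_fraction:
                lanes[new_base].add(length)
                break
    return lanes


def read_gel(lanes):
    """Read the synthesized strand by ordering fragments from short to long."""
    base_by_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_by_length[length] for length in sorted(base_by_length))


if __name__ == "__main__":
    lanes = sanger_fragments("ATGCCGTA")
    print(read_gel(lanes))  # complement of the template, base by base
```

The sketch groups fragments by their terminating base, which corresponds to the four lanes of the gel; with fluorescently labeled ddNTPs the same information is carried by the color of each fragment.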

Next Generation Sequencing
The emergence of next-generation sequencing (NGS) methods has made the difficult task of sequencing much easier and faster. Fast and economically viable NGS technologies have become much more popular than their slow and laborious first-generation counterparts. In combination with bioinformatics, these methods have significantly increased the speed of data collection and the amount of data collected. They have also simplified DNA sample preparation for sequencing, because transformation of E. coli is no longer required (Kamps et al., 2017).

454 (Roche) pyrosequencing
454 sequencing, introduced in 2005, was the first NGS technique. It uses pyrosequencing, which is based on the emission of light from a cascade of reactions that follows the release of pyrophosphate. First, the DNA duplex is cut into smaller fragments, and adapters complementary to the primer sequences are ligated to both ends of each fragment. These adapters act as primer-binding sites and initiate the sequencing process. Each DNA fragment is attached to a microsphere in an emulsion, so that the ratio of DNA fragments to microspheres in the reaction volume is 1:1. Each fragment is then amplified by emulsion PCR, and after several cycles many copies of the DNA molecule are synthesized on each microsphere (Ronaghi, 1998).
Immobilized enzymes (DNA polymerase, ATP-sulfurylase, luciferase and apyrase) are added to the wells of a microplate, each well containing one microsphere covered with copies of a DNA fragment. Then each of the four dNTPs is applied in turn. A complementary dNTP is incorporated into the growing chain by DNA polymerase. This process is accompanied by the release of pyrophosphate (PPi), which is converted into ATP by ATP-sulfurylase. In the presence of the generated ATP, luciferase converts luciferin to oxyluciferin with the emission of a light signal whose intensity is proportional to the amount of ATP. The intensity of the light signal is recorded by a CCD camera. As soon as the signal has been received and processed, apyrase degrades the remaining nucleotides and ATP, and the next nucleotide is added (Fig. 3).
Recently, this technique has been further improved by the introduction of paired-end sequencing. The adapters are linked to both ends of the fragmented DNA, which allows the fragment to be read from both ends. The main advantage of this method is its long read length. In contrast to other NGS technologies, 454 sequencing gives a read length of up to 400 base pairs and can generate more than 1,000,000 reads per run. This method is useful for the de novo assembly of genomes and the study of metagenomes (Petrosino et al., 2009). The main disadvantage of this technology is the unreliable quality of reads containing homopolymers, because the amount of light produced by 8-10 repeated nucleotides makes it impossible to accurately deduce the length of the homopolymer region.
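The flow-based readout described above can be sketched in a few lines of Python. This is an idealized illustration under stated assumptions: the function names and the flow order `TACG` are hypothetical choices, and the signal is taken to be exactly equal to the number of bases incorporated in a flow, with no noise. Under that assumption the read is recovered by rounding each signal; in practice the signals of long homopolymers become hard to tell apart, which is the limitation noted above.

```python
def flowgram(read, flow_order="TACG"):
    """Return the ideal flowgram of `read` (the strand being synthesized).

    Nucleotides are applied one at a time in the cyclic `flow_order`.
    Each flow yields a signal equal to the number of consecutive bases
    incorporated, which is 0 when the next base does not match.
    """
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not covered by the flow order: {sorted(unknown)}")
    flows = []
    position = 0
    flow_index = 0
    while position < len(read):
        nucleotide = flow_order[flow_index % len(flow_order)]
        run_length = 0
        # A homopolymer run is incorporated within a single flow.
        while position < len(read) and read[position] == nucleotide:
            run_length += 1
            position += 1
        flows.append((nucleotide, run_length))
        flow_index += 1
    return flows


def decode(flows):
    """Reconstruct the read by rounding each signal to a whole number of bases."""
    return "".join(nucleotide * round(signal) for nucleotide, signal in flows)


if __name__ == "__main__":
    flows = flowgram("TTACGGG")
    print(flows)
    print(decode(flows))
```

A natural extension, not shown here, is to add noise that grows with the signal and observe how often `decode` miscounts long runs.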

Illumina (Solexa)
The Solexa sequencing platform (later acquired by Illumina, Inc.) was first developed by the British chemists Shankar Balasubramanian and David Klenerman and was commercialized in 2006. It is based on the principle of sequencing by synthesis (Zhou et al., 2010). Genomic DNA is first fragmented, and adapters are ligated to both ends. The barcoded and labeled DNA fragments are then loaded onto a flow cell, where one end of each DNA fragment hybridizes to a complementary oligonucleotide that is covalently attached to the surface of the cell. The opposite end of each single-stranded DNA hybridizes with an adjacent complementary oligonucleotide. The fragment is then amplified in a process called bridge PCR (Fig. 4). For the next PCR cycle, the template strand and the newly synthesized complementary strand are denatured so that amplification can start again. After several amplification cycles, millions of dense clusters of duplex DNA are generated in each lane of the flow cell. The cell is then ready for sequencing (Fig. 5) (Lizard et al., 2017).

The amplified DNA is denatured, primers are annealed, and second-strand synthesis begins with the incorporation of complementary dNTPs. Each of the four dNTPs is labeled with a different fluorophore and carries a reversible block at the 3' end. Before the next base is attached, Tris-2-carboxyethyl-phosphine (TCEP) is added to remove the fluorophore from the previous dNTP and to remove the block at the 3' end of the nucleotide (Fig. 6).

Ion Torrent
Ion Torrent sequencing is based on the formation of covalent bonds in a growing DNA chain, catalyzed by DNA polymerase. Incorporation of each nucleotide releases pyrophosphate and a hydrogen ion (H+). This, in turn, lowers the pH of the medium, and the change is detected in order to determine the DNA sequence (Malapelle et al., 2015). The semiconductor sequencing chip contains microwells filled with microspheres that carry clonally amplified single-stranded template DNA molecules. The wells are then supplied with DNA polymerase and, one type at a time, unmodified dNTPs. If a complementary nucleotide is incorporated into the growing chain, a hydrogen ion is released. An ion-sensitive field-effect transistor (ISFET) is located below each microwell. These sensors determine the pH change by measuring the potential difference, and each change in pH is recorded. Before the start of the next cycle, unbound dNTP molecules are washed out. The next type of dNTP is added and the cycle repeats (Korlach et al., 2010).
In 2010, Ion Torrent Systems, Inc. was founded. The company developed a sequencer for small studies (Ion Torrent PGM), which is used for targeted sequencing of different sections of the genome and for sequencing small (bacterial and viral) genomes, as well as a sequencer for more extensive studies, the Ion Proton. An advantage of these devices is a single platform for preparing sequencing libraries, based on emulsion PCR. Sample preparation can also be automated using the Ion Chef station, also developed by Ion Torrent.
The key advantages of the technology are the relatively low cost of running the device and the great flexibility of the platform: the sequencing protocols can be changed significantly and reagents can be replaced, which allows sequencing to be adapted to the specific tasks of the researcher. The disadvantages of the technology include a relatively high error rate when sequencing homopolymer DNA regions. This is because the relation between the change in the pH of the medium and the length of the homopolymer region is nonlinear. In 2012, Ion Torrent was acquired by Thermo Fisher Scientific, after which an improved version of the Ion Proton (Ion GeneStudio S5) was developed. Currently, there are versions of both sequencers designed for clinical use.
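Why a nonlinear response makes homopolymers hard to call can be shown with a small numerical sketch. The saturating response function below, together with its parameters `s_max` and `k`, is a purely hypothetical assumption chosen for illustration; it is not taken from the instrument or from the cited literature. The point is only qualitative: if the signal grows more slowly than the homopolymer length, the difference between the signals for n and n + 1 bases shrinks as n increases, so the same measurement noise causes more miscalls for longer runs.

```python
def ph_signal(n, s_max=10.0, k=12.0):
    """Hypothetical saturating response to a homopolymer of length n.

    The functional form s_max * n / (n + k) and both parameter values
    are assumptions made for this illustration only.
    """
    return s_max * n / (n + k)


def step_sizes(max_len=10, s_max=10.0, k=12.0):
    """Signal difference between homopolymers of length n + 1 and n,
    for n = 1 .. max_len - 1."""
    return [
        ph_signal(n + 1, s_max, k) - ph_signal(n, s_max, k)
        for n in range(1, max_len)
    ]


if __name__ == "__main__":
    for n, step in enumerate(step_sizes(), start=1):
        print(f"{n} -> {n + 1} bases: signal increases by {step:.3f}")
```

With any response of this shape, the step between consecutive lengths decreases monotonically, whereas a strictly linear response would give a constant step.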
The main advantage of SMRT technology is the extremely long read length (from 8,000 to 30,000 nucleotides). This can significantly reduce the number of reading errors associated with PCR amplification of fragments and simplifies the de novo assembly of genomes. The disadvantages of PacBio include the weight and size of the device (it weighs more than a ton) and the high cost of sequencing.

Oxford Nanopore
The concept of nanopores and their use in sequencing was developed in the mid-1990s. After many years of research and development of the technology, Oxford Nanopore licensed it in 2008 (Zascavage et al., 2019). Nanopores are nanometer-wide channels that can be of three types:
- biological: pores formed by a pore-forming protein in a membrane (e.g., alpha-hemolysin);
- solid: pores formed in a synthetic material or obtained chemically (e.g., silicon and graphite);
- hybrid: pores formed by a biological agent, such as a pore-forming protein, encapsulated in a synthetic material.
In contrast to the aforementioned sequencers, Oxford Nanopore does not require labeling of nucleotides or detection of their incorporation. The method is based on the modulation of the ion current as a DNA molecule passes through a nanopore. Since different nucleotides have different sizes, they block the ion current to different degrees for a certain period. Once these changes have been detected, the sequence of the molecule can be determined (Fig. 9) (Haque et al., 2013).
Fig. 9 Oxford Nanopore sequencing technology (Verma et al., 2016)

Sequencing platforms comparison
Table 1 presents the comparative characteristics of different sequencing technologies. The most commonly used and representative sequencing instruments are analyzed.

Methods for studying the proteome
After the successful completion of the Human Genome Project, the Human Proteome Organization (HUPO) officially launched the global Human Proteome Project (HPP), which aims to map the entire set of human proteins. The main efforts are aimed at the quantitative analysis of enzymes, their distribution, their intracellular localization, and their interaction with other biomolecules under different physiological conditions. As a general experimental strategy, the HPP research group has focused on three main methods (Omenn et al., 2017):
- mass spectrometric methods;
- methods based on the use of antibodies;
- analysis of proteomic databases.
This review describes the main proteomic platforms used to detect cancer biomarkers.

Proteomic assays based on mass spectrometry methods
Mass spectrometry technology has recently developed to the point where it can be used to evaluate the whole human proteome. Mass spectrometric methods have played an important role in the discovery of protein biomarkers of cancer and other diseases (Cho, 2017). Initially, proteomic studies were based on two-dimensional gel electrophoresis followed by mass spectrometry. This sequential approach greatly facilitated the identification of peptide sequences in proteins that were present in different amounts on gels. Subsequently, the analysis of proteomic samples was developed as one of the approaches to biomarker discovery. Recognition of patterns in mass spectra has allowed researchers and clinicians to use bioinformatics to diagnose cancer (Li et al., 2017). One of these methods is surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS). This method can be used to analyze the mass of a protein directly, without enzymatic cleavage. Samples are dried, after which a laser ionizes the crystallized peptides. The ions are then accelerated by an external electric field and directed into the flight tube. The detector registers the ions when they reach the end of the tube, and the results are processed using specialized software (Fig. 10) (Liu, 2011).
Mass spectrometry can be combined with liquid chromatography (LC) separation; this combination is called LC/MS. In an LC/MS assay, whole proteins present in complex biological samples are broken down by enzymes into peptide fragments, and LC/MS is then used to identify thousands of proteins in biological samples such as tissue, serum, plasma, or urine. LC/MS-based methods that can be used for a comprehensive analysis of the cleaved peptides are called shotgun proteomics (Adaway et al., 2015; Lopes et al., 2017).
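The enzymatic cleavage step of a shotgun workflow can be mimicked in silico. The sketch below assumes that the enzyme is trypsin, which the text above does not name, and applies the commonly used simplified rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function name and the example sequence are hypothetical, and missed cleavages are ignored.

```python
def tryptic_digest(protein):
    """Split a protein sequence (one-letter codes) into tryptic peptides.

    Simplified rule (an assumption of this sketch): cleave on the
    C-terminal side of K or R, except when the next residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep the C-terminal peptide if the sequence does not end in K or R.
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


if __name__ == "__main__":
    # Hypothetical sequence used only to exercise the rule.
    print(tryptic_digest("MKRPAKLR"))
```

In a database search, the peptides predicted in this way for every protein in a reference proteome are compared with the peptides observed by LC/MS in order to infer which proteins were present in the sample.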
Mass spectrometry is a quantitative method. Its advantages are high accuracy and stability. Its disadvantage is the relatively high cost of purchasing and maintaining the equipment. The detection limit is from 1 pg of protein in 1 μl of liquid.

Antibody-based proteomic assays
One of the objectives of proteomics is the creation of specific antibodies that can recognize each protein of the human proteome. Antibody-based proteome analysis plays a key role in the detection and confirmation of cancer biomarkers. In particular, it contributes to the high-throughput assessment of cancer biomarkers and provides a logical strategy for the systematic generation and use of specific antibodies for the study of the proteome. The Human Protein Atlas project was created to systematically generate specific antibodies on a global scale and to use these antibodies to study the corresponding proteins and protein isoforms (Fagerberg et al., 2014). The use of antibodies for protein profiling on a global scale is an intuitive approach that should facilitate the systematic study of the cancer proteome. Antibody-based approaches can be combined with a wide range of high-throughput assays, such as immunohistochemistry (IHC), tissue microarrays (TMA) and protein microarrays.
TMA is a method of assembling multiple tissue samples in a single paraffin block for the simultaneous evaluation of several biomarkers using IHC; TMA can potentially become a rapid molecular method that uses a large-scale library of antibodies to study the relationship between molecular biomarkers and clinical outcomes (Fagerberg et al., 2014). Protein microarrays can be divided into two main classes: forward-phase protein arrays (FPPA) and reverse-phase protein arrays (RPPA). RPPA is a high-throughput, antibody-based method for detecting protein expression in cell or tissue lysates. To some extent, this method is similar to Western blotting (Isik & Ercan, 2017). Western blotting has historically been widely used to detect the expression of individual proteins; however, it requires a relatively large amount of protein sample per run, which makes it unsuitable when only limited patient tissue samples are available for clinical studies. Therefore, there is an urgent need to improve the sensitivity of the detection strategy (Kim, 2017). In addition, to make the most of valuable clinical samples, high-throughput analysis had to be developed. RPPA technology provides increased sensitivity, minimal sample requirements and multiplex analysis. RPPA is a promising method for the highly sensitive detection of important proteins or markers in biological or clinical specimens. Its advantage is the possibility of personalized molecular profiling of patients with an automated high-throughput system (Creighton & Huang, 2015). The technology makes it possible to analyze protein samples from patients with a limited number of blood cells, as well as laser-capture biopsies, cell cultures, serum, urine, synovial fluid and vitreous humor. Depending on the type of microarray, from 20 pg to 1 ng of a protein sample can be used for analysis, and several thousand samples can be analyzed simultaneously on one slide (Fig. 11) (Creighton & Huang, 2015).
Antibody-based methods of proteomic analysis are quantitative. Their advantages are high accuracy, stability, and low cost. Their disadvantage is a longer analysis time. The detection limit is from 20 pg of protein in 1 μl of liquid.

Proteome databases
Information about the molecular genetic mechanisms of cancer is being accumulated on a large scale. The initial goals of large-scale research were to sequence the entire genome and to map the human transcriptome. Recently, information about the human proteome has attracted increasing attention. The molecular and functional complexity of the human proteome creates problems for researchers and requires bioinformatic resources specifically designed to collect and integrate the currently available data. The global Human Proteome Project aims to provide a complete atlas of human proteins in their biological context. It generates publicly available data and information resources, which, in turn, support further exploration of the human proteome. The "Human Proteome" is built on a knowledge-based database in order to integrate information obtained with the basic protein research methods described above. For the knowledge-based proteomic approaches, the HPP working group decided to use the UniProtKB/Swiss-Prot, PRIDE, PeptideAtlas, GPMDB and Human Protein Atlas databases as the main data sources (Thul & Lindskog, 2017).

Conclusion
The key results of the Human Genome and Human Proteome projects, together with a number of studies of genomic and proteomic disorders in various diseases carried out on high-throughput analysis platforms, now make it possible to obtain fundamentally new information about how genomic disorders manifest at higher levels, namely the transcriptomic and proteomic levels.
Understanding the relationship between the genome and the proteome is necessary for developing new methods of diagnosing diseases associated with genetic disorders, for research in systems biology, for applied research and development in biotechnology and food products, and for creating new methods of therapy and introducing new technologies into medical practice.