Bioinformatics Techniques used in Hepatitis C Virus Research

Hepatitis C is widespread and can cause life-threatening disease. Researchers from various fields have developed candidate vaccines, but these are not very effective because of genotypic variation in the hepatitis C virus (HCV), and they are not widely affordable. In-silico approaches are therefore of great importance in designing and testing model vaccines. In this study, we surveyed the bioinformatics tools and methodologies used in HCV research. Tools and databases commonly used by researchers were reviewed to give an overall picture of the bioinformatics techniques, computational biology tools and databases used in HCV research. We also examined the statistical methods used by different research groups. This paper provides an up-to-date picture of the computational approaches used to explore hepatitis C treatment.

Hepatitis C, a life-threatening disease, is caused by the hepatitis C virus (HCV). It characteristically affects the liver, and infected individuals can develop acute or chronic infection. The illness ranges from a mild condition lasting a few weeks to a serious lifelong disease. HCV is transmitted through transfusion of blood from an infected individual, the use of non-sterilized medical equipment, and the sharing of syringes or needles between individuals. Globally, 130-150 million people have chronic hepatitis C infection 1 . Of these, a considerable number develop liver cirrhosis and cancer. Close to 500,000 people die from this infection each year 2 . Antiviral medicines can cure about 90% of HCV-infected individuals, which reduces the risk of death from liver cirrhosis and cancer, but access to diagnosis and treatment is very poor. Research towards a first globally available vaccine is still in progress 3 . HCV was identified in 1989 by expression cloning of immunoreactive cDNA isolated from the infectious non-A, non-B hepatitis agent. HCV belongs to the family Flaviviridae and the genus Hepacivirus. Its genetic material is positive-sense single-stranded RNA. The HCV genome is a 9.6 kb RNA with highly structured 5' and 3' ends ( Fig. 1). The 5' end is a conserved 341-nucleotide non-coding region that folds into a complex structure containing four major domains. Flaviviruses encode a long open reading frame, with a 5' cap and conserved RNA structures in both the 5' and 3' untranslated regions that are essential for replication and translation of the viral genome 4 . The structural HCV proteins are currently designated core, E1, E2 and p7. The non-structural proteins are NS2, NS3, NS4A, NS4B, NS5A and NS5B.
In flaviviruses, the genomic RNA is translated into a single polyprotein precursor consisting of three structural proteins, capsid (C), precursor membrane (prM) and envelope (E), and seven non-structural proteins, NS1, NS2a, NS2b, NS3, NS4a, NS4b and NS5, arranged in the order C-prM-E-NS1-NS2a-NS2b-NS3-NS4a-NS4b-NS5. The mature, infectious virion contains only the structural proteins; the non-structural proteins are involved in polyprotein processing, viral RNA synthesis and virus morphogenesis.
The HCV core protein is mainly involved in the assembly of the nucleocapsid 5 . It consists of 191 highly conserved amino acids and is divided into three major domains. Domain 1 (amino acids 1-117) is rich in basic amino acids, which enhance dimerization of the viral RNA, leading to formation of the nucleocapsid 6 . Domain 2 (amino acids 118-174) contains fewer basic amino acids and is more hydrophobic, whereas Domain 3 (amino acids 175-191) is highly hydrophobic. Domains 2 and 3 are involved in lipid transport and in interaction with other HCV proteins, namely E1, E2 and NS5A 7 .
The envelope protein E1 is a transmembrane glycoprotein with a C-terminal domain; it regulates membrane permeability changes and membrane association 8 . E2 is a receptor-binding protein with 11 N-glycosylation sites. It is responsible for attachment of the viral particle to the surface of the host cell. The N-terminal region of E2 is referred to as the receptor-binding region 9 . The p7 protein is a hydrophobic transmembrane protein that forms pores and alters membrane permeability, allowing the release of viral particles and so promoting infection. It is also involved in the late phase of the viral replication cycle 10 . The non-structural protein NS2 interacts with E1, E2, p7 and other non-structural proteins to favor viral assembly. It recruits the envelope proteins to the viral assembly site and promotes viral assembly 11 .
The N-terminal protease domain (NS3pro) of non-structural protein 3 (NS3) has a role in proteolytic processing, and the C-terminal region contains the RNA triphosphatase, RNA helicase and RNA-stimulated NTPase domains required for RNA replication. The serine protease domain of NS3 has a major role in the replicative cycle of Flavivirus. Reported data indicate that approximately 66% of the population of Northern India is infected with NS3a 12 .
NS4A forms complexes with viral proteins such as NS3, NS4B and NS5A. It acts as a cofactor for NS3, increasing its enzymatic activities. It also aids viral replication at the endoplasmic reticulum membrane by forming a complex with NS4B and NS5A 13 . NS4B interacts with NS5B to alter its polymerase activity and has been implicated in carcinogenesis. It is involved in formation of the membranous structure that serves as a platform for viral replication 14 .
NS5A is a proline-rich phosphoprotein essential for viral replication and assembly 15 . It is divided into three domains. Domain 1 is the zinc-binding domain, which forms a homodimer through contacts near the N-termini. This domain is also involved in binding RNA during replication and may have a role in switching between viral replication and translation. Domain 2 is responsible for inhibiting the protein kinase PKR, which is induced in response to IFN, whereas domain 3 is poorly conserved 16 . NS5B is the RNA-dependent RNA polymerase responsible for initiating the replication cycle. Cooperation between NS5B and p7 increases virion infectivity, with a decrease in the sphingomyelin level of the virion 17 .
In-silico approaches play a major role in HCV research, from sequencing the genome to designing vaccine models before their implementation in the wet lab. Our study focuses on reviewing the bioinformatics tools and techniques used in HCV research and on evaluating how bioinformatics has shaped it.

METHODS
To find and analyze the bioinformatics tools and techniques recently used in hepatitis research, various keywords and search conditions were applied. From the query results, relevant research papers were reviewed and the related information was extracted. The extracted information was categorized into (1) databases and tools, (2) gene expression analysis and (3) sequence alignment methods. The PubMed database was used for this work, initially with 19 keywords; the 19 queries were later reduced to only 2. Query 1: ((hcv) OR hepatitis c) AND epitope, with 1376 total hits. Query 2: ((anti hcv) OR anti hepatitis c) AND peptide, with 3422 total hits.
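The two reduced queries can be composed programmatically. The following is a minimal sketch, assuming the queries are written in balanced form and submitted to the public NCBI E-utilities ESearch endpoint; the helper names `build_query` and `esearch_url` are hypothetical, and the sketch only builds the request URL without contacting the server or reproducing the reported hit counts.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (public; not contacted in this sketch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(synonyms, topic):
    """Join the synonyms with OR, then require the topic term with AND."""
    first, *rest = synonyms
    grouped = f"({first})"
    for term in rest:
        grouped = f"({grouped} OR {term})"
    return f"{grouped} AND {topic}"


def esearch_url(query, retmax=0):
    """Build a PubMed ESearch URL; retmax=0 asks for the hit count only."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


query_1 = build_query(["hcv", "hepatitis c"], "epitope")
query_2 = build_query(["anti hcv", "anti hepatitis c"], "peptide")

print(query_1)
print(query_2)
print(esearch_url(query_1))
```

Hit counts returned today would differ from those reported here, since PubMed grows continuously.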
In addition to the bioinformatics tools and techniques used in HCV studies, the statistical methods used by the researchers are also reported. The frequency of each of the above categories across the research papers was calculated to determine which category was used most often.
We categorized the information into databases and tools, gene expression analysis and sequence alignment methods, and discuss each part separately. The main objective of the categorization was to present the collected information clearly; other researchers might categorize it differently. The tools used most frequently by the researchers are discussed along with those that were most crucial to the research.

HCV Databases and Resources
Here we discuss the databases, resources and tools used by researchers in HCV studies. Most researchers who followed an in-silico approach relied on these databases to carry out their work.

euHCVdb (European Hepatitis C Virus Database)
The database was developed by Combet et al 18 and became available in 2006. It allows researchers to analyze the genetic and structural variability of HCV sequences, which has implications for vaccine and drug design. The euHCVdb collaborates internationally with the corresponding US and Japanese databases. It gives access to the amino acid sequences of HCV proteins, their 3D structures and functional analyses. The database extends the HCVDB (Hepatitis C Virus Database), which was developed in 1999 19 . The euHCVdb is automatically annotated and updated every month from the EMBL (European Molecular Biology Laboratory) nucleotide sequence database. It can be accessed at http://euhcvdb.ibcp.fr.

The Los Alamos hepatitis C sequence database
The database was developed by Kuiken et al 20 at Los Alamos National Laboratory, US, in 2004. It is updated monthly from the HCV sequences in GenBank. Sequence annotation includes genotype, subtype, comparison to the reference HCV-H strain, sampling date, country, city and sampled tissue. Patient information includes age, gender, ALT level, HLA type, co-infection with HIV and hepatitis B, infection date, city, country, treatment and results. The database can be accessed at http://hcv.lanl.gov.

HVDB (Hepatitis Virus Database)
This database was developed by Shin-I et al 21 . It combines HCV, HBV and HEV databases. When it was made available to the scientific community, it held around 44000 HCV (hepatitis C virus), 11000 HBV (hepatitis B virus) and 1600 HEV (hepatitis E virus) sequences. The sequences were retrieved from the INSDC (International Nucleotide Sequence Database Collaboration). The HCV master database contains sequences sourced from DDBJ (DNA Data Bank of Japan). A reference map (loci, sub-regions) is generated by comparing each sequence retrieved from DDBJ to the reference HCV genome. The sequence is then placed in one of three divisions: the C division, the E1 division or the NS5 division.

HCVpro
Kwofie et al 22 developed this database, which became available in 2011. It provides information about HCV protein-protein interactions and contains manually verified entries for hepatitis virus-virus and virus-human protein interactions. The data were sourced from the literature and from other databases. The database provides extensive information on the structure and function of HCV proteins, together with the development stage of drugs and vaccines. It also provides information on hepatocellular carcinoma genes and their protein products, which are linked to Gene Ontology, pathways and OMIM (Online Mendelian Inheritance in Man) and cross-referenced with other important annotations.

IEDB (Immune Epitope Database) resource
This resource provides various tools for identifying and analyzing epitopes. In one study it was used to predict MHC class I epitopes from the input sequence with the ANN (artificial neural network) method. The query sequence is split into overlapping peptides, and the binding of each peptide is then assessed by the prediction method selected for the analysis 23 .
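The peptide-generation step that precedes binding prediction can be sketched as follows. This is a minimal illustration, assuming a fixed peptide length of 9 residues and a sliding step of one residue; the function name and the short input fragment are hypothetical, and the binding prediction itself (the ANN model) is not reproduced.

```python
def overlapping_peptides(sequence, length=9):
    """Return (start_position, peptide) pairs; positions are 1-based."""
    sequence = sequence.strip().upper()
    return [
        (start + 1, sequence[start:start + length])
        for start in range(len(sequence) - length + 1)
    ]


# Short fragment used only for illustration.
toy_fragment = "MSTNPKPQRKTK"
peptides = overlapping_peptides(toy_fragment)

for position, peptide in peptides:
    print(position, peptide)
```

A sequence of n residues yields n - 8 peptides of length 9, each of which would be passed to the chosen prediction method.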

ENCODE project database
This database was recently used to build a bioinformatics pipeline. Raw ChIP-seq data were obtained from the database and used to find allele-specific binding of transcription factors. The study concerned genetic protection against liver fibrosis (caused by HCV) through reduced allele-specific expression of the MERTK gene 24 .

VirusMINT database
This database stores information on viral-human protein interactions at the cellular level. As of 2008 it held 5000 interactions, of which 490 were unique viral interactions from 110 viral strains. Query results are presented as a graphical view. The database mainly contains interactions between human proteins and proteins from viruses such as hepatitis C virus, human immunodeficiency virus 1, papilloma virus and other viruses that are infectious or oncogenic in humans. The stored interactions were manually curated from MINT, IntAct and the HIV-1 Human Protein Interaction Database 25 .

VirHostNet
This publicly available knowledge base contains virus-virus, virus-host and host-host protein interaction information. It contains 2671 non-redundant virus-virus and virus-host interactions from 180 different viral species, originally curated from the literature, and 10672 human protein interactions (68252 non-redundant entries) curated from publicly available data 26 .

ViPR (Virus Pathogen Database and Analysis Resource)
This database stores information on families of human pathogenic viruses, including Flaviviridae (HCV), other positive-sense and negative-sense single-stranded RNA viruses, and double-stranded RNA viruses. Its main objective is to provide a single resource that serves multiple virus research communities. ViPR contains manually annotated information such as sequences, epitopes and 3D protein structures obtained from GenBank, UniProt, PDB (Protein Data Bank), IEDB (Immune Epitope Database), PubMed and the Gene Ontology Consortium 27 .
A variety of databases were used in our search process, including the database search capabilities available under resources (Table 1).

Important HCV Information Sources
This part covers HCV information resources that support HCV-infected individuals and raise public awareness, including information on diagnosis and treatment. Some of these organizations also offer special training programs for HCV-infected individuals.

Resource: Hepatitis Foundation International
Address: http://www.hepatitisfoundation.org/
Scope: Hepatitis Patient Registry Network.
Audience: Patients, Health care professionals, General public
Description: The Hepatitis Foundation International (HFI) was established in 1944. It is a non-profit organization that aims to eradicate viral hepatitis among the 550 million people affected around the world. It also provides information on the prevention of chronic liver disease and on habits and practices that negatively affect the liver.

Resource: HCV advocate
Address: http://hcvadvocate.org/
Scope: To provide accurate information and support to groups affected by HCV and HIV/HCV co-infection, including medical providers.
Audience: Patients, Health care professionals, General public
Description: A hepatitis C support project and registered, non-funded organization established by Alan Franciscus in 1997. It is currently a reputed and well-classified HCV publication in the U.S. It also offers training programs to the HCV-infected population.

Resource: Hep C
Address: http://hepc.liverfoundation.org/
Scope: To provide information on diagnosis and treatment and to support HCV-infected individuals.
Audience: Patients, Health care professionals, General public
Description: Hep C is the American Liver Foundation's online resource center, which provides information and support to the HCV-positive population.

Resource: Centers for Disease Control and Prevention
Address: http://www.cdc.gov/hepatitis/index.htm
Scope: HCV statistical and training information center
Audience: Patients, Health care professionals, General public
Description: Contains information about all forms of hepatitis, including statistical information, hepatitis outbreak information and the training programs offered.

Resource: The Hepatitis Foundation of New Zealand
Address: http://www.hepatitisfoundation.org.nz/hepc
Scope: Hepatitis information center
Audience: Patients, Health care professionals, General public
Description: A non-profit organization maintained by the Ministry of Health, New Zealand. It cares for people who have hepatitis B or C and offers two courses, the "Hepatitis B follow-up programme" and the "Hepatitis C standard programme". The foundation has carried out important screening, vaccination and research programmes in New Zealand during its 30 years of work.

Resource: Public Health Agency of Canada
Address: http://www.phac-aspc.gc.ca/hepc/
Scope: HCV information center
Audience: Patients, Health care professionals, General public
Description: The agency is maintained by the Government of Canada to inform people about various infectious diseases, their causes and their treatment in order to reduce the frequency of infectious diseases, and to bring international research and development to the people of Canada. It also serves as a platform for sharing the knowledge of Canadian experts around the globe.

Tools used in HCV study

STRING
This protein-protein interaction network analysis tool was used in a study to obtain interaction information for HCV and human proteins. STRING was used to build a network from which novel genes related to HCV and to the trace element metabolic process could be identified 28 .

Tagident
This tool is used to determine the molecular weight and isoelectric point of an unknown protein by comparing the query with the sequences in the UniProtKB/Swiss-Prot protein sequence database. This information indicates the approximate location of the protein on a 2-dimensional electrophoresis gel. In one HCV study the tool was used to identify the location of proteins run on a 2D gel 29 .
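The molecular weight part of such a comparison can be illustrated with a short calculation. This is a sketch under stated assumptions, not the Tagident implementation: it uses commonly tabulated average residue masses (which should be checked against a reference before any real use), a hypothetical tolerance parameter for matching a candidate sequence to a mass observed on a gel, and it leaves out the isoelectric point.

```python
# Average residue masses in daltons (commonly tabulated values).
RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886,
    "C": 103.1388, "E": 129.1155, "Q": 128.1307, "G": 57.0519,
    "H": 137.1411, "I": 113.1594, "L": 113.1594, "K": 128.1741,
    "M": 131.1926, "F": 147.1766, "P": 97.1167, "S": 87.0782,
    "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.01528  # one water per chain accounts for the free termini


def molecular_weight(sequence):
    """Average molecular weight of an unmodified peptide chain, in daltons."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER_MASS


def within_mass_window(sequence, observed_mass, tolerance=0.10):
    """True if the computed mass lies within the given fraction of the observed mass."""
    return abs(molecular_weight(sequence) - observed_mass) <= tolerance * observed_mass


print(round(molecular_weight("GG"), 3))
print(within_mass_window("GG", observed_mass=132.0))
```

Candidates that fall inside the mass window would then be narrowed further by isoelectric point, which requires a pKa-based calculation not shown here.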

Propred
This tool predicts MHC epitopes. It uses quantitative matrices whose scores are derived from experiments. Each matrix reflects the properties of every amino acid and its position within an epitope, and the input peptide is scored against it. In one study these tools were used extensively, along with other methods, to determine which HCV epitopes are dominant in a group of the South African population, on the basis of the binding scores and other parameters that the tools report 30 .
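Scoring a peptide against a quantitative matrix can be sketched as a sum of position-specific contributions. The example below is illustrative only: the matrix values are made up, the function names are hypothetical, and additive scoring over 9-residue windows is an assumption rather than a description of Propred's internals.

```python
def score_peptide(peptide, matrix):
    """Sum the matrix entry for each residue at its position (missing entries count as 0)."""
    return sum(matrix.get((pos, aa), 0.0) for pos, aa in enumerate(peptide, start=1))


def rank_peptides(sequence, matrix, length=9):
    """Score every window and return (score, start_position, peptide), best first."""
    windows = [
        (score_peptide(sequence[i:i + length], matrix), i + 1, sequence[i:i + length])
        for i in range(len(sequence) - length + 1)
    ]
    return sorted(windows, reverse=True)


# Made-up scores keyed by (position, residue); real matrices come from binding experiments.
TOY_MATRIX = {(1, "L"): 1.0, (2, "L"): 2.0, (9, "V"): 1.5}

ranked = rank_peptides("ALLGPKPQRVA", TOY_MATRIX)
for score, start, peptide in ranked:
    print(f"{start}\t{peptide}\t{score}")
```

In practice a threshold on the score decides which windows are reported as predicted binders.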

BCPred
This tool predicts B-cell epitopes. It uses the BCPred algorithm, here with a window size of 9 amino acids and 75% specificity. In one study the tool predicted a total of 19 epitopes, of which 12 were antigenic, as confirmed with VaxiJen version 2.0. Such evaluation can help in eliciting the desired immune response. For T-cell epitope prediction the authors used the online tool EpiJen, which confirms epitopic properties, and identified a total of 6 epitopes 31 .

ClinProTools v2.0
This tool was used for data analysis in a study that asked whether serum proteome profiling can detect treatment-related changes in individuals infected with HCV-1b. It was used to normalize and recalibrate the spectra obtained from peptide profiling by MALDI-TOF/MS, and to analyze the data statistically and visually 32 .

Mfold and I-TASSER
Both of these tools were used in a study to determine the 2D and 3D structures of the HCV- 33 . I-TASSER was used to obtain the tertiary (3D) structure of each amino acid sequence 34 .

SVM (Support Vector Machine) Model
In recent studies, machine learning methods such as SVM were used to predict interactions between hepatitis C virus proteins and human proteins. Cui's SVM model achieved an average accuracy above 80% using as its feature the number of times each run of three consecutive amino acids occurs in a protein sequence. Emamjomeh's SVM model, using features such as amino acid composition, evolutionary information, PTM (post-translational modification) information, tissue information, pseudo amino acid composition and network centrality measures, obtained an accuracy of 83% on the human-HCV protein-protein interaction dataset (the same dataset used by Cui) 35 .
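The tripeptide-count feature can be sketched in a few lines. This is an illustration under assumptions: it counts raw overlapping tripeptides over the 20 standard amino acids and concatenates the viral and human vectors, whereas the published models may group residues or normalize counts differently. The function names are hypothetical, and the classifier itself is not shown.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Fixed feature order: all 20**3 = 8000 possible tripeptides.
TRIPEPTIDES = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]


def tripeptide_counts(sequence):
    """Count every overlapping run of three consecutive residues."""
    seq = sequence.upper()
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))


def feature_vector(sequence):
    """Counts arranged in the fixed tripeptide order, ready for a classifier."""
    counts = tripeptide_counts(sequence)
    return [counts.get(t, 0) for t in TRIPEPTIDES]


def pair_features(viral_sequence, human_sequence):
    """Describe one candidate virus-human pair by concatenating both vectors."""
    return feature_vector(viral_sequence) + feature_vector(human_sequence)


print(tripeptide_counts("AAAAC"))
print(len(pair_features("MSTNPK", "ACDEFG")))
```

Each labelled protein pair then becomes one row of the training matrix passed to an SVM implementation.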

IPA (Ingenuity Pathway Analysis)
This network analysis was performed in a recent study to classify proteins by their location inside the cell. It also reports the possible biological, molecular and biochemical functions of each protein 36 .

PatchDock
This program performs docking between two molecules, which can be proteins, DNA, peptides or drugs. The molecular surfaces are divided into patches according to their shape, and the patches of one molecule are matched against those of the other by a shape-matching algorithm. Docking proceeds in three steps: molecular shape representation, surface patch matching, and filtering and scoring 37,38 . In study 39 , docking of the drug vedroprevir to the active site of the HCV 1a NS3/4A protease was confirmed.
The tools and databases used by researchers in their HCV studies are summarized in Table 2. The databases keep researchers informed of each new development in HCV research. Among the bioinformatics tools, Tagident was used to find the pI value and molecular weight of proteins, and tools such as Propred and BCPred were used to predict HCV epitopes. Algorithms such as the active paths algorithm, available as a Cytoscape plug-in, were used to find the network of human and HCV protein interactions. The bioinformatics techniques used in HCV research were broadly divided into four types, namely sequence alignment, clustering/phylogeny, gene expression and databases or database searches, and are listed with the reference numbers of the papers in which they were employed ( Table 3). Across all the reviewed papers, databases were used most frequently, as they allow researchers to remain up to date with current findings in HCV research. Gene expression analysis allowed researchers to study the progression of HCV infection over time.
The bioinformatics software mainly employed in HCV research is listed in Table 4. Across the literature reviewed, no software was mentioned in more than one study except the BLOSUM matrix, which was mentioned in both 40 and 41 . The BLOSUM matrix reflects the number of times a particular substitution takes place in related proteins 42 . Table 5 highlights the statistical tools used by researchers in their HCV studies. The t-test and the test for equal variances were carried out with the Minitab software 44 .
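How substitution counts become BLOSUM-style scores can be sketched as a log-odds calculation. The example assumes the usual form, a scaled base-2 logarithm of the observed pair frequency over the frequency expected by chance, rounded to an integer; the frequencies below are made-up numbers, not values from any published matrix, and the function name is hypothetical.

```python
from math import log2


def log_odds_score(pair_frequency, frequency_a, frequency_b, scale=2.0):
    """Scaled log2 ratio of observed to expected pair frequency, rounded to an integer."""
    expected = frequency_a * frequency_b
    return round(scale * log2(pair_frequency / expected))


# Made-up frequencies: a pair seen 8 times more often than chance scores positively,
# a pair seen 4 times less often than chance scores negatively.
print(log_odds_score(0.02, 0.05, 0.05))
print(log_odds_score(0.000625, 0.05, 0.05))
```

Positive scores mark substitutions that occur more often than chance in aligned related proteins, and negative scores mark those that occur less often.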

Gene Expression analysis
Gene expression analyses are performed to determine the expression levels of different genes during infection. Techniques such as microarrays and genetic linkage mapping were used. One study examined gene expression at different time points during HCV infection in a chimpanzee 48 . The fold change in the expression level of IFN-inducible genes was measured: expression had increased 100-fold at day 7 and had returned to normal by week 8.
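The fold-change measure used in such time-course studies is a simple ratio, often reported on a log2 scale. The sketch below uses hypothetical intensity values chosen only to reproduce a 100-fold increase; it is not data from the cited study.

```python
from math import log2


def fold_change(infected, baseline):
    """Ratio of expression during infection to baseline expression."""
    return infected / baseline


def log2_fold_change(infected, baseline):
    """Same ratio on a log2 scale, where 0 means no change."""
    return log2(infected / baseline)


# Hypothetical signal intensities for one IFN-inducible gene.
baseline_signal = 5.0
day7_signal = 500.0
week8_signal = 5.0

print(fold_change(day7_signal, baseline_signal))
print(log2_fold_change(day7_signal, baseline_signal))
print(log2_fold_change(week8_signal, baseline_signal))
```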

Tools for gene expression analysis

MIDAS (Microarray Data Analysis System)
This software was developed by the TM4 group. It is used to analyze the raw gene expression values from Spotfinder through normalization (including global, iterative linear regression and LOWESS normalization) and data analysis by t-test and MAANOVA 49 .

SAM (Significance Analysis of Microarrays)
This tool is an Excel add-in used to analyze various microarrays, such as cDNA or oligo arrays, SNP (Single Nucleotide Polymorphism) arrays and protein arrays. It relates microarray expression data to clinical parameters, including treatment, diagnosis categories and survival time 50 . In a microarray gene expression analysis of HCV genotype 3a in patients with early liver fibrosis and cirrhosis, MIDAS was used for normalization and SAM was used to identify significantly expressed genes during the fibrosis stages 51 .

MAS (Microarray Analysis Suite)
This tool is used for normalization and estimation of microarray data from the Affymetrix GeneChip. Normalization is done by linear regression 52,53 .

RMA (Robust Multichip Average)
This software analyzes gene expression values from the Affymetrix GeneChip. The process is divided into background adjustment, quantile normalization and summarization 54 . MAS and MIDAS were used for microarray gene expression analysis in hepatitis C virus and HCC (hepatocellular carcinoma) patients. RMA has better precision, consistency and specificity than MAS in detecting differential gene expression, while MAS performs reasonably well on the brighter probe sets 55 .
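The quantile normalization step of RMA can be illustrated on small arrays. This is a minimal sketch of that one step only, assuming each chip is a list of probe intensities of equal length; background adjustment and summarization are not shown, ties are broken arbitrarily, and the function name is hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays."""
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")

    # Mean of the k-th smallest intensity across arrays, for each rank k.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered from dimmest to brightest on this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(chips))
```

After this step every chip has identical sorted intensities, so remaining differences between chips reflect probe rank rather than overall brightness.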

Sequence Alignment Methods
Sequence alignment methods enable researchers to find homology between the DNA or amino acid sequences of different organisms and to infer the evolutionary relationships between them. In pairwise sequence alignment two sequences are compared with each other, whereas in multiple sequence alignment several sequences are compared with one another or with a single sequence. In 1994 a group of researchers published a paper in which they sequenced the whole NS3b protein 56 . NS3b was described as the causative agent of HCV infection in North India. Pairwise and multiple sequence alignment tools were used by the researchers 57 . Abida's research concluded that the non-structural proteins of HCV control the activity of the virus inside the host and are significant in the pathogenesis of hepatitis. The work of Vikas helps us to understand the interaction of human and viral proteins during infection. Following an in-silico approach, and starting from partially identified protein transcripts, they found matches in the UniProt database, which enabled them to design epitopes 58 . In their work, the UniProt database was used to obtain viral protein sequences in a short period of time. Bioinformatics tools such as Propred and Propred-1 were then used to predict epitopes from the protein sequences. The resulting epitopes were aligned with the multiple sequence alignment tool ClustalW. These bioinformatics tools were used to obtain information on viral proteins at the gene level.
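Pairwise alignment, as described above, can be illustrated with the classic Needleman-Wunsch dynamic programming recurrence. The sketch below computes only the optimal global alignment score, with a simple match/mismatch/gap scheme chosen for illustration; the tools named in the reviewed studies use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch), keeping one row at a time."""
    # Row 0: aligning an empty prefix of `a` against each prefix of `b` costs only gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = previous[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = current[j - 1] + gap   # b[j-1] aligned to a gap
            current.append(max(diagonal, gap_in_b, gap_in_a))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Multiple sequence alignment programs such as ClustalW build on pairwise scores of this kind to decide the order in which sequences are merged.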
The study in 59 aimed to find the most pathogenic viral protein by means of protein-protein interactions. Using sequences obtained from the UniProt database, the authors predicted the interactions of these protein sequences with human proteins using the HCVpro database.
A phylogenetic tree, also called an evolutionary tree, shows the evolutionary relationships among different biological species or molecules. In general, a phylogenetic tree is based on similarities and differences in the physical or genetic features of organisms. Phylogenetic trees contain a great deal of information about the inferred evolutionary relationships among the structural and non-structural proteins of hepatitis C virus (HCV).
In Figure 2, the horizontal dimension gives the amount of genetic change: the longer a branch in the horizontal dimension, the larger the amount of variation.
Moreover, the strength of the similarities (relationships) varies. For instance, among the seven proteins, the similarity between NS5a and NS1 is stronger than with NS4a, that of NS4a is stronger than with E1, that of E1 is stronger than with E2, and that of E2 is stronger than with NS4b and NS3. Furthermore, this evolutionary relationship is significant in the analysis of the T-cell epitopes of hepatitis C virus for the following reasons: (1) it enables us to predict the cause of the immunogenicity of the virus; (2) it enables us to predict why some proteins have more epitopes than others; and (3) it enables us to predict which features of amino acids should be given more weight in designing the model.

CONCLUSIONS
The present review examined the use of bioinformatics in HCV research. Bioinformatics has played a major role, from determining the sequences of HCV proteins, the pathways through which those proteins interact with human proteins, and gene expression levels at different stages of HCV infection, to the identification and in-silico design of vaccines against HCV strains. This review should help beginners who want to work in this field, as well as those interested in working on HCV with bioinformatics as their field of study.