An analytical study on the identification of N-linked glycosylation sites using machine learning model

N-linked glycosylation is the most common type of glycosylation. It plays a significant role in identifying diseases such as type I diabetes and cancer, and it helps in drug development. Most proteins cannot perform their biological and physiological functions without undergoing this modification. Because experimental identification has limitations, it is essential to identify such sites with computational techniques. This study analyzes and synthesizes the progress made in discovering N-linked sites using machine learning methods. It also explores the performance of currently available tools for predicting such sites. Almost seventy research articles published in recognized journals of the N-linked glycosylation field were shortlisted after a rigorous filtering process. The findings of these studies are reported from multiple aspects: publication channel, feature set construction method, training algorithm, and performance evaluation. Moreover, the literature survey was used to develop a taxonomy of N-linked sequence identification. Our study focuses on the performance evaluation criteria, and the importance of N-linked glycosylation motivates us to discover resources that use computational methods instead of experimental methods, which have their limitations.


INTRODUCTION
Glycosylation is considered one of the most complex types of post-translational modification (PTM) in eukaryotic cells (Akmal, Rasool & Khan, 2017; Yang et al., 2019). Post-translational modification occurs when a protein, after synthesis, undergoes different types of changes; without these modifications, proteins cannot perform their physiological functions properly. Nearly 200 different types of PTM have been discovered. Glycosylation is the most important among them because it plays a vital role in biological functions such as cell communication, protein folding, and recognition of antigens, and approximately 50% of human proteins are glycosylated (Akmal, Rasool & Khan, 2017; Akmal et al., 2020; Yang et al., 2019). Glycosylation sites are highly relevant for cancer discovery as well as for further drug development (He, Wei & Zou, 2019; Hwang et al., 2020). Glycosylation sites are classified into five types: N-linked, O-linked, C-linked, glypiation and phosphoglycosylation (Lei, Tang & Du, 2017). Identifying such sites is therefore important.
Such sites can be identified by various techniques, which are broadly classified into experimental and computational methods (Audagnotto & Dal Peraro, 2017). Experimental methods require an understanding of cell biology and the functions of cell structure (Hwang et al., 2020). The well-known techniques used for experimental identification are radioactive labelling, chromatin immunoprecipitation (ChIP), mass spectrometry (MS) and liquid chromatography (LC) (Akmal et al., 2020; Hwang et al., 2020; Naseer et al., 2020a). In computational methods, researchers extract valuable information from the structure of protein sequences and apply artificial intelligence algorithms to predict the relevant glycosylation or other PTM sites (Hamby & Hirst, 2008; He, Wei & Zou, 2019; Shek, Kotidis & Betenbaugh, 2021; Naseer et al., 2021b; Murad et al., 2021).
N-linked glycosylation is the primary glycosylation type, as 90% of glycosylated sites are N-linked (Akmal, Rasool & Khan, 2017). Usually, N-glycans are attached to glycoproteins on asparagine residues within the Asn-X-Ser/Thr sequon, where X can be any amino acid residue except proline (Zhang et al., 2021b; Alkuhlani et al., 2021). N-linked glycans play a vital role in both intrinsic and extrinsic functions (Alkuhlani et al., 2021). Apart from improving a protein's stability, they provide a structural component to the cell surface. N-glycans also mediate cell-to-cell interaction and control the glycoprotein in the cellular environment (Naseer et al., 2020b). N-linked glycans help in the identification of various diseases such as type I diabetes, cancer, rheumatoid arthritis, and Crohn's disease (Alkuhlani et al., 2021; Naseer et al., 2020a; Khan et al., 2020b). It is therefore important to identify such sites, but identification using experimental techniques is time-consuming and expensive (Coff et al., 2020; Akmal et al., 2020; Qiu et al., 2018). Consequently, researchers have developed several computational models based on artificial neural networks (ANN) to predict N-linked sites (Le, Sandag & Ou, 2018; Butt et al., 2016; Alkuhlani et al., 2021). Although a few reviews of N-linked prediction models exist, they mainly focus on the algorithm used to train the model and pay less attention to feature set construction and performance metrics, as shown in Table 1. These reviews also only analyzed models developed up to 2019.
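The Asn-X-Ser/Thr sequon rule described above can be illustrated with a short pattern scan. This is only a minimal sketch: the function name and the toy sequence are hypothetical, and a match marks a candidate site only, since the presence of a sequon does not by itself establish that the asparagine is glycosylated.

```python
import re

# N-X-S/T sequon: Asn, then any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping sequons (e.g. "NNST") all be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of Asn residues that start an N-X-S/T sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Hypothetical toy sequence, not a real protein.
example = "MNGTQNPSANNSTK"
candidate_positions = find_sequons(example)
print(candidate_positions)
```

The Asn at position 6 of the toy sequence is skipped because it is followed by proline, which is the exclusion stated in the sequon definition.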
The glycosylated region of an N-linked site appears at a specific location within the protein sequence. A protein sequence is a chain of amino acids, and each of the 20 known amino acids is represented by a specific alphabetic character (Qiu et al., 2018; Yang et al., 2019; Kumari, Kumar & Kumar, 2018). In the computational approach, useful information must be extracted from these sequences to construct the feature vector (Butt, Rasool & Khan, 2017; Chien et al., 2020; Hamby & Hirst, 2008; Naseer et al., 2021b). The feature vectors of glycosylated and non-glycosylated N-linked sites exhibit certain patterns, and these patterns are identified through various machine learning techniques (algorithms) (Taherzadeh et al., 2019; Tran, Pham & Ou, 2021; Hayat & Khan, 2011; Park et al., 2019; Xiang, Zou & Zhao, 2021; Dimeglio et al., 2020).
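As an illustration of turning a sequence into a feature vector, the sketch below cuts a fixed-length window around a candidate asparagine and one-hot encodes it. This is one common generic encoding, not necessarily the one used by any of the reviewed studies; the flank size, the padding symbol and the function names are assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
PAD = "X"  # assumed padding symbol for positions beyond the sequence ends


def window_around(sequence: str, position: int, flank: int = 3) -> str:
    """Return the window centred on a 1-based position, padded with PAD."""
    i = position - 1
    left = sequence[max(0, i - flank):i]
    right = sequence[i + 1:i + 1 + flank]
    left_pad = PAD * (flank - len(left))
    right_pad = PAD * (flank - len(right))
    return left_pad + left + sequence[i] + right + right_pad


def one_hot(window: str) -> list[int]:
    """Encode each residue as a 20-dimensional indicator; PAD is all zeros."""
    vector: list[int] = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


# Hypothetical toy sequence; position 2 is the Asn of an N-G-T sequon.
window = window_around("MNGTQNPSANNSTK", position=2, flank=3)
features = one_hot(window)
print(window, len(features))
```

A window of 2 * flank + 1 residues always yields a vector of fixed length (here 7 * 20 = 140), which is what most training algorithms require as input.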
Evidence of the organism type also helps in the successful identification of such sites (Huang & Li, 2018).
The existing reviews are compared from various perspectives, such as quality assessment score, availability of an N-linked model, feature set construction method, model training algorithm, species type, performance metric and target repositories, as shown in Table 1. For reliability, this study considers only review articles accepted in recognized journals (Barukab et al., 2019). This comparison establishes the need for the present survey.
The rationale of our work is to provide a comprehensive systematic literature review (SLR) on the identification of N-linked sites that brings out the details of existing computational models. Researchers have made numerous efforts to identify such sites computationally in the recent past. Their work has been reviewed by a few authors to assess the effectiveness of the proposed models for predicting N-linked sites (Shek, Kotidis & Betenbaugh, 2021; Alkuhlani et al., 2021; Audagnotto & Dal Peraro, 2017).
These authors focused primarily on the feature set construction algorithm and the training algorithm, with little or no focus on quality assessment criteria, performance metric evaluation, or the species used in the reviewed articles to predict N-linked sites. The proposed systematic review offers novel features such as targeting channel, quality assessment score, new classification criteria, and performance evaluation based on accuracy, sensitivity, and specificity after evaluating the studies empirically. This SLR will help medical scientists in the targeted identification of cancer and type I diabetes for treating patients, and will help pharmacists in effective drug development by choosing an accurate predictor of N-linked sites. Furthermore, it will help researchers to develop more accurate and efficient predictive models by analyzing the techniques used in existing work.
The article is organized as follows: the methodology adopted to conduct the survey, along with the objectives and research questions, is presented in "Survey methodology". The analysis of the research questions is described in "Assessment and discussion". "Discussion and future direction" presents a synthesis of the reviewed literature. Finally, the article is concluded in "Conclusion".

SURVEY METHODOLOGY
The survey methodology consists of three phases: plan, conduct of the review, and conclusion, as shown in Fig. 1.

Review plan
The process followed to conduct the review is shown in Fig. 2.

Review conduct
The review was conducted in the following steps: (a) searching for relevant primary studies in different search venues; (b) selecting relevant research articles from the search results of the previous step through predefined inclusion/exclusion criteria; (c) assigning each selected article a score based on defined quality parameters; and (d) backward snowballing to include further important articles.

Automated search in digital library
The relevant research articles were retrieved through a systematic search in which both automatic and manual searches were performed. Google Scholar was used as a digital venue to retrieve the relevant research articles. To obtain appropriate and relevant search results, a keyword-based search was applied to the digital venue. Keywords for the primary and secondary terms were selected based on the RQs listed in Table 2. The Boolean operators 'AND' and 'OR' were used to build the query string, and the keyword-based search query is shown in Fig. 3. The keywords are organized into three groups of similar keywords to capture as many relevant studies as possible, as mentioned in Table 1. In the final search query, the AND operator joins the groups and the OR operator joins the keywords within a group.
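The grouping rule (OR within a group, AND across groups) can be sketched as a small helper that assembles the query string. The keywords are the ones reported in the search query of this review; the function name is hypothetical, and the exact bracket and quoting syntax accepted by each digital library is an assumption that may need adjusting per venue.

```python
def build_query(groups: list[list[str]]) -> str:
    """Join keywords with OR inside a group and join the groups with AND."""
    clauses = []
    for keywords in groups:
        quoted = " OR ".join(f'"{keyword}"' for keyword in keywords)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


# The three keyword groups reported in the review's search query.
keyword_groups = [
    ["n linked", "Post translation modification"],
    ["Glycosylation sites", "Glycan"],
    ["prediction model", "Artificial Intelligence", "Neural Network", "Deep Learning"],
]
query = build_query(keyword_groups)
print(query)
```

Keeping the groups as data makes it easy to drop a group for a library that returns too few results, which is one possible reading of why the reported queries differ between libraries.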
Listing 1 ["n linked" OR "Post translation modification"] AND ["Glycosylation sites" OR "Glycan"] AND ["prediction model" OR "Artificial Intelligence" OR "Neural Network" OR "Deep Learning"]
Primary keywords were selected as key identifiers for N-linked prediction models. Secondary and additional keywords were chosen alongside the primary keywords. The combinations of keywords and Boolean operators that were developed are given in Table 3.

Table 2 Research questions and motivations.
RQ1 Which are the relevant publishing channels for N-linked glycosylation research? Which channel types and geographical areas target this research?
Motivation: To identify
• High-quality publishing venues.
• Scientometric analysis based on meta information including research type, approaches and validation methods.
RQ2 Which are the existing prediction models (tools) used for the identification of N-linked glycosylation sites, and for which kinds of species are these sites identified?
Motivation: To help researchers identify diseases (e.g., cancer and type 1 diabetes) and support drug discovery through a cost-effective and time-saving approach.
RQ3 Which algorithms or methods are used to construct the N-linked feature vector?
Motivation: To understand the in-depth structure of protein sequences in order to extract useful information for training the model.

Inclusion Criteria
b) It must target one of the research questions mentioned in Table 2.
c) It is published in a journal or in a preprint repository since 2017.
d) It should contain a computational or semi-computational approach for prediction.

Exclusion Criteria
a) Eliminate articles that do not address N-linked glycosylation or glycosylation in general.
b) Eliminate articles that identify N-linked sites purely through biological experimentation.
c) Eliminate books that appear in the results of the search query.

Table 3 Search queries applied to the digital libraries.
Digital library | Search query | Applied filter
IEEE Xplore | ("n linked" OR "Post translation modification") AND ("prediction model" OR "Artificial Intelligence" OR "Neural Network" OR "Deep Learning") | 2017-2021
Springer link | ("n linked" OR "Post translation modification") AND ("Glycosylation sites" OR "Glycan") AND ("prediction model" OR "Artificial Intelligence" OR "Neural Network" OR " |
Google scholar | ("n linked" OR "Post translation modification") AND ("Glycosylation sites" OR "Glycan") AND ("prediction model" OR "Artificial Intelligence" OR "Neural Network" OR "Deep Learning") |

Quality assessment as selection criteria
Quality assessment (QA) is a major step in conducting any systematic review. In this study, a questionnaire was designed to measure the quality of the selected articles. The score is computed on the following criteria:
a) The study is awarded a score of 1 if an N-linked predictive tool has been developed, otherwise 0.
b) The study is awarded a score of 2 if the method developed to extract features from the data is based on a computational approach, 1 for a hybrid approach and 0 for an experimental approach.
c) The study is awarded a score of 1 if the computational method used for training is provided, otherwise 0.
d) A score of 1 is awarded if the data set used is available, otherwise 0.
e) A score of 1 is awarded if the organism type is available, otherwise 0.
f) The studies were rated by taking conference and journal rating lists into account. The possible scores for publication venue are shown in Table 4.
The resultant score for each study was calculated by aggregating the points of all questions. Articles achieving a minimum score of 5 were included in the review.
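The scoring rule can be written down as a small function. This is a sketch under stated assumptions: the field names are hypothetical, and the venue score for criterion (f) is passed in by the caller because its possible values come from Table 4 and are not reproduced here.

```python
from dataclasses import dataclass

# Points for criterion (b), as described in the quality assessment criteria.
FEATURE_POINTS = {"computational": 2, "hybrid": 1, "experimental": 0}
MIN_SCORE = 5  # inclusion threshold stated in the review


@dataclass
class Study:
    tool_developed: bool         # criterion (a)
    feature_approach: str        # criterion (b): key of FEATURE_POINTS
    training_method_given: bool  # criterion (c)
    dataset_available: bool      # criterion (d)
    organism_given: bool         # criterion (e)
    venue_score: int             # criterion (f): value looked up in Table 4


def quality_score(study: Study) -> int:
    """Aggregate the points of criteria (a) to (f)."""
    return (
        int(study.tool_developed)
        + FEATURE_POINTS[study.feature_approach]
        + int(study.training_method_given)
        + int(study.dataset_available)
        + int(study.organism_given)
        + study.venue_score
    )


def is_included(study: Study) -> bool:
    """A study is kept when its aggregate score reaches the threshold."""
    return quality_score(study) >= MIN_SCORE


# Hypothetical study used only to exercise the functions.
example = Study(True, "hybrid", True, False, True, venue_score=2)
print(quality_score(example), is_included(example))
```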

Selection based on snowballing
After the quality assessment, backward snowballing was applied to extract relevant articles from the references of the selected articles. The articles by Kumar et al. (2020) and Ilyas et al. (2019) were shortlisted after applying the inclusion/exclusion criteria and the quality assessment.

Review report
The identification of glycosylation sites, especially N-linked sites, is a very important domain. Therefore, in this review a systematic and empirical method is adopted to extract the relevant articles from the digital libraries mentioned in Table 3, using the query string shown in Listing 1. Almost 800 articles remained after removing the articles published before 2017.
The shortlisted articles were then filtered based on title, abstract and introduction, and the full article was examined where required. Articles with fewer than four pages and irrelevant articles were eliminated. The results of the primary search, filtering and inspection phases, covering five digital libraries, are presented in Table 5.
After this preprocessing, the inclusion/exclusion test was performed and the quality assessment score was then computed. Articles with a score of at least five were included in this study, giving a total of 70, as listed in Table 6.

ASSESSMENT AND DISCUSSION
In this section, the research questions are analyzed based on the 70 primary studies.

ASSESSMENT OF Q1:
Which are the relevant publishing channels for N-linked glycosylation research? Which channel types and geographical areas target this research?
To find the relevant publishing channels, channel types and geographical aspects of N-linked glycosylation site research, meta information is required. For this purpose, the channel type, publishing year and geographical distribution of the selected studies are presented. The importance of the topic can be gauged from the yearly number of publications in the domain: 28 of the 70 articles, or 40% of the selected articles, were published in 2021, as shown in Fig. 4.
Figure 5 shows that the largest share of studies was published in recognized journals, followed by international conferences.
It is observed that 42 of the 70 studies were published in different regions of Europe, as shown in Fig. 6. The quality assessment score of each finalized study, awarded according to the criteria defined in the quality assessment section, is shown in Table 7. Only studies meeting the minimum threshold are listed. Articles published in Q1 journals achieve the highest scores, which will help researchers find relevant publishing venues for N-linked and other glycosylation site prediction studies. Almost 50% of the studies achieve a score of eight or above, which shows the relevance of the studies selected through the developed query string.
The overall classification results and QA scores are presented in Table 6. The finalized articles are classified on seven parameters: research type (solution proposed or review article), empirical type (computational approach, experimental approach based on biological studies, or hybrid approach based on computational and biological study), glycosylation type, species type, method (used for feature extraction), algorithm (used to train the predictive model) and tool (developed for prediction).
Furthermore, the sources of the finalized studies and the total number and percentage of studies per publication source are given in Table 8.

ASSESSMENT OF Q2:
Which are the existing prediction models (tools) for the identification of N-linked glycosylation sites, and for which kinds of species are these sites identified?
This research question examines the tools available for identifying N-linked glycosylation sites and the species for which they can identify such sites. N-linked glycans sit within a hierarchy under PTM: PTM is classified into various types, of which glycosylation is one, and glycosylation is further classified into five groups, of which N-linked is one. This hierarchy is summarized in Fig. 7. It is observed that there are 13 studies including (Chien et al., 2020; Taherzadeh et …

Table 9 Prediction tools and target species in the selected studies.
Study | Year | Species | Tool | Description
He, Wei & Zou (2019) | 2019 | Not mentioned | Provided | Computational model used to identify N-linked sites; the species is not mentioned.
Yang et al. (2019) | 2019 | Human | Awesome | Hybrid approach developed to identify PTM sites for human.
Le, Sandag & Ou (2018) | 2018 | Human | PTM Transporter | Computational approach developed to identify PTM sites, including N-linked sites, for human.
Audagnotto & Dal Peraro (2017) | 2017 | Not mentioned | Provided | Computational model used to identify N-linked sites; the species type is missing.
Ruiz-Blanco et al. (2017) | 2017 | Human | Sequon | Computational method to identify N-linked sites for human.
It is important to specify the species for which these tools operate, so this information was also extracted from the selected studies. Some authors (He, Wei & Zou, 2019; Audagnotto & Dal Peraro, 2017; Shek, Kotidis & Betenbaugh, 2021; Carpenter et al., 2022) did not mention the organism type, while the others did. Most of those that mention it use human data for site identification, as shown in Table 9.

ASSESSMENT OF Q3:
Which algorithms or methods are used to construct the N-linked feature vector?
Data are the major component in developing any machine learning model (Mahmood et al., 2020; Naseer et al., 2020a, 2020b; Khan et al., 2020b). In bioinformatics, there are two major sources of data on which a model can be developed: existing repositories such as UniProt (protein repository) and GenBank (nucleotide sequences), and experimental data obtained from specific biological experiments. A dataset obtained from either source needs preprocessing to construct the feature vector. More accurate features help to develop a more efficient model (Barukab et al., 2019; Butt & Khan, 2019; Hussain, Rasool & Khan, 2020; Shah & Khan, 2020). For this reason, the feature construction method used to predict N-linked sites in the selected articles is taken as a parameter of this study.
Most of the authors used a computational feature extraction approach, while a few used experimental data obtained from mass spectrometry, human plasma and physicochemical methods, as mentioned in Table 10. It is observed that most researchers (Akmal, Rasool & Khan, 2017; Chien et al., 2020; Taherzadeh et al., 2019; Liu et al., 2019; Li et al., 2019; Bojar et al., 2021b; Lundstrøm et al., 2022; Park et al., 2019; Le, Sandag & Ou, 2018; Suga, Nagae & Yamaguchi, 2018; Dimeglio et al., 2020; Magaret et al., 2019; Kumar & Gilula, 1986; Perpetuo et al., 2021; Huang & Li, 2018) used the statistical moment method, which combines protein sequence, structure and function with other parameters such as the position relevance of sequences, to construct the feature matrix from the protein dataset. The other computational methods used to construct features in the selected articles are the word embedding vector technique, UbiSite-XGBoost, similarity voting, CfsSubSetEval, kernel density estimation, correlation subset and graph methods, as mentioned in Table 10.
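To make the idea of moment-based features concrete, the sketch below computes raw and central moments of a numerically encoded peptide. It is a simplified, one-dimensional illustration only and is not the exact procedure of any reviewed study: the residue-to-number encoding, the maximum order and the function names are all assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed encoding: each residue maps to its 1-based index in AMINO_ACIDS.
CODE = {aa: index + 1 for index, aa in enumerate(AMINO_ACIDS)}


def raw_moment(values: list[float], order: int) -> float:
    """k-th raw moment: the mean of x**k."""
    return sum(v ** order for v in values) / len(values)


def central_moment(values: list[float], order: int) -> float:
    """k-th central moment: the mean of (x - mean)**k."""
    mean = raw_moment(values, 1)
    return sum((v - mean) ** order for v in values) / len(values)


def moment_features(peptide: str, max_order: int = 3) -> list[float]:
    """Raw moments of orders 1..max_order, then central moments of orders 2..max_order."""
    values = [float(CODE[aa]) for aa in peptide]  # KeyError on non-standard residues
    raw = [raw_moment(values, k) for k in range(1, max_order + 1)]
    central = [central_moment(values, k) for k in range(2, max_order + 1)]
    return raw + central


# Hypothetical three-residue peptide used only to exercise the functions.
print(moment_features("ACD"))
```

Moments summarize a variable-length sequence in a fixed number of values, which is one reason such features are convenient inputs for a classifier.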

ASSESSMENT OF Q4:
Which algorithms or methods are used to train the N-linked computational model?
The choice of training algorithm is the most important factor affecting the performance of any predictive model (Butt & Khan, 2019; Hussain, Rasool & Khan, 2020; Malebary & Khan, 2021). It is therefore necessary to know which types of algorithm are being used to develop N-linked prediction models. For this purpose, the algorithm used to train the model in each selected article is recorded as a parameter of this review, as mentioned in Table 11.

ASSESSMENT OF Q5:
How effective are the existing models at predicting N-linked sites?
Result comparisons present the performance of the various models, from which conclusions can be drawn along specific dimensions. In this systematic review, the performance of the N-linked models in the selected articles is compared. The parameters used for the comparison are (a) availability of the data set, (b) accuracy, (c) sensitivity, (d) specificity, (e) availability of the developed tool, and (f) comparison on independent data, together with the type of glycosylation, as mentioned in Table 12. It is observed that most of the authors (Kotidis & Kontoravdi, 2020; Sha et al., 2019; Park et al., 2019; Antonakoudis et al., 2021; Zhang et al., 2021b; Wang et al., 2017; Kumar et al., 2020; Ilyas et al., 2019; Sugár et al., 2021; Bojar et al., 2021a; Perpetuo et al., 2021; Huang & Li, 2018) did not provide results or did not follow the given performance metrics in their studies. Among the remaining studies, Chien et al. (2020) and Hwang et al. (2020) did not provide the data set on which the experiments were performed.
(Continued) Graph CNN | Glycosylation | LectinOracle, a model combining transformer-based representations for proteins and graph convolutional neural networks for glycans to predict their interaction.

DISCUSSION AND FUTURE DIRECTION
This section summarizes and discusses the details of this systematic literature review on the identification of N-linked sites.

TAXONOMY HIERARCHY
The objective of this study was to analyze the current progress in identifying N-linked glycosylation sites. To achieve this objective, a taxonomy was built based on the coding scheme given in Table 13, after critically analyzing 70 articles selected through a systematic approach. The coding scheme covers the aspects relevant to this study: feature set construction method, machine learning model training algorithm and performance evaluation. These aspects are further divided into sub-levels that show the depth of each aspect and its role in the efficient identification of N-linked sites. The coding scheme helped to construct the taxonomy shown in Fig. 8, which supports further investigation of the domains and sub-domains it identifies.

GENERAL OBSERVATION AND FUTURE DIRECTION
Several observations can be drawn from the findings of this SLR, based on the taxonomy shown in Fig. 8. Various RQs were developed, each addressing a key factor in the identification of N-linked sites. Trends and findings in the identification of such sites are set out below, together with future directions.
(a) Feature set construction method. The performance of a computational model depends heavily on the quality of the feature set extracted from the data set, which is later used to train the machine learning model (Saeed, Mahmood & Khan, 2018; Khan et al., 2019; Naseer et al., 2021a). Discriminating features help the model to learn well and then make the right prediction. It is therefore important to discover techniques that extract useful information from the dataset. Authors have used various methods to construct the feature set; the most widely used are protein sequence features, protein structure features, statistical moments, word embedding and similarity voting. The majority of the authors (Liu et al., 2019; Bojar et al., 2021b; Magaret et al., 2019; Bojar et al., 2021a) used only sequence-based information about the protein to train the model. It is also observed that some authors (Akmal, Rasool & Khan, 2017; Taherzadeh et al., 2019; Li et al., 2019; Park et al., 2019; Murad et al., 2021) combined multiple features, such as sequence, structural and statistical features, to construct the feature vector. More than 50% of the selected research articles that received 10 points in the quality assessment used such a combination of features. The new techniques adopted in recent research articles are the word embedding vector, graph statistical features along with similarity voting, and Chou's five-step method. Researchers can use these feature extraction techniques to improve the performance of N-linked prediction models or of any PTM site identification model.
(b) Machine training algorithm. After feature extraction, the most significant part of a computational model is the method used to train the machine learning model (Hussain, Rasool & Khan, 2020; Barukab et al., 2022; Khan et al., 2020a). The performance of the model is affected most by the training technique. An appropriate learning algorithm, combined with a good feature extraction method, results in a model that predicts independent data with high accuracy. The development of an appropriate machine learning method is therefore essential. Researchers have proposed various methods to predict N-linked sites accurately. The most widely used methods include the artificial neural network (ANN).
(c) Performance evaluation. Once the model has been trained, it is validated on independent data to evaluate its performance. There are various ways to measure the validity of a model; the most significant metrics are accuracy, sensitivity and specificity. Sensitivity measures the proportion of actual positives that the model identifies correctly (the true positive rate), while specificity measures the proportion of actual negatives identified correctly (the true negative rate). In this study, performance is evaluated on these metrics. Around 40% of the authors did not validate their model on any of these metrics, and only 20% of the authors reported all of them. Predictive models specialized to N-linked sites have better accuracy than those in which the PTM type is not specified or is generalized. The highest accuracy, approximately 99%, was achieved by Akmal, Rasool & Khan (2017) on these evaluation criteria. That study also reports sensitivity and specificity of 99.8% and 99.9% respectively, but it did not provide a web server. Hwang et al. (2020) claim an accuracy of 99% along with a sensitivity of 100%, but did not provide a working tool, the dataset, or a comparison of results with other predictors. The most efficient predictive models with an available web server are the Sequon model of Ruiz-Blanco et al. (2017) and the Sprint-Gly model of Taherzadeh et al. (2019), with accuracies of 97.5% and 97% respectively.
The Sequon model was trained on human protein sequences only, while Sprint-Gly is equally effective for both human and rat species. Sprint-Gly is therefore considered a reliable model among the currently available web servers.
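The three metrics used in this comparison follow directly from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the function names and the toy labels are hypothetical, and the code assumes the evaluation set contains at least one positive and one negative example (otherwise a division by zero occurs).

```python
def confusion_counts(y_true: list[int], y_pred: list[int]) -> tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels: 1 = glycosylated, 0 = not."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, sensitivity (TP rate) and specificity (TN rate)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical labels on an independent test set.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(evaluate(y_true, y_pred))
```

Reporting all three values matters because glycosylation datasets are often imbalanced, so a high accuracy alone can hide a low sensitivity.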

Future direction
Bioinformatics is an emerging field with many problems that need computational rather than experimental solutions. As mentioned earlier, researchers have identified almost 200 types of PTM, which play a key role in various biological functions. Apart from N-linked glycosylation, other types of glycosylation such as O-linked and C-linked also play a vital role in protein function and in various drug discovery techniques. This is an opportunity for researchers, the pharmaceutical industry and academia to develop efficient computational models for problems that need better computational solutions. A few of the existing problems that need to be addressed are given below.
(a) Identify the O-linked glycosylation sites for threonine and serine using ANN.
(b) How can the performance of C-linked glycosylation prediction be enhanced through existing neural network classifiers?
(c) Develop a comprehensive predictive model to classify the type of glycosylation.
(d) How effective are the existing classifiers at predicting other PTM sites?

CONCLUSION
The significance of N-linked glycosylation motivates the discovery of such sites using computational methods instead of experimental methods, which have their limitations. In this systematic study, existing work on identifying such sites was examined, covering the possible challenges and their solutions. Research articles related to the keywords associated with N-linked glycosylation were retrieved from five major digital libraries. The search query applied to the digital libraries returned more than 800 articles, and 70 articles remained for further analysis after the filtering process. The results show that approximately 75% of the articles were published in recognized journals and the rest in top conferences. It was observed that more than 40% of the articles were published in American journals, followed by the Middle East with 20%. Most of the selected studies focused on the feature construction method and the training algorithm, and focused less on performance evaluation criteria and the development of a tool or web server. The major shortcomings of any SLR relate primarily to search strategy, poor classification, and inaccurate data extraction. In this SLR, these deficiencies were addressed by applying the search query to five major digital libraries to reduce bias. The results of the search queries were then filtered through well-defined inclusion/exclusion criteria.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
The authors received no funding for this work.