Prediction of Conotoxin Type Based on Long Short-term Memory Network

Background: Conotoxins are valuable peptides that target ion channels and neuronal receptors. They have proven to be effective drugs for treating a range of diseases, but identifying the type of a toxin through traditional wet-lab experiments is complicated, inefficient and costly. Using machine learning to identify conotoxin types can change this situation. Methods: We propose a method that predicts the type of a conotoxin from its sequence information using a long short-term memory (LSTM) network. The method takes only the conotoxin peptide sequence as input and uses the character embedding technique from text processing to map the sequence automatically to a feature vector representation, from which features are extracted for training and prediction. Results: On the test set, the method reaches an accuracy of 0.80 and an AUC (area under the ROC curve) of 0.817. On the same test set, the KNN algorithm reaches an AUC of 0.641. Conclusions: The algorithm requires no manual feature extraction or feature reconstruction, which simplifies its design, and it exploits the ability of LSTM to capture long-term dependencies in the conotoxin sequence, so that the conotoxin type is predicted better. Combining conotoxin sequence information with LSTM outperforms the KNN classification algorithm.


Background
Conus is a genus of venomous, carnivorous marine mollusks found in tropical seas [1]. There are more than 500 species of Conus in the world, and their venom contains at least 50,000 active peptides. The secreted toxins (called conotoxins) are mainly used for predation and defense [2]. Conotoxins are extremely toxic and can cause animals to tremble, convulse, and even become paralyzed and die. Other estimates put the number at more than 700 species secreting more than 100,000 toxins. However, experiments to date have confirmed and recorded relatively few conotoxins (about 3,000 peptides) [3]. Conotoxins have strong biological activity and novel chemical structures. They are extremely selective for ligand-gated or voltage-gated ion channels [4], can distinguish between closely related ion channel types, and are widely used as pharmacological reagents in ion channel research. Because some conotoxins can kill many kinds of worms [5], they have potential for breeding new insect-resistant crop varieties or for development as peptide insecticides. Conotoxins have therefore become a new source for drug development and a powerful tool for pharmacology and neuroscience [6], and they rank first in research on animal neurotoxins. They are called "the treasure house of marine drugs", have attracted wide attention, and have broad development prospects.
According to their target sites [7], conotoxins can be divided into three categories: (1) conotoxins that act on ligand-gated ion channels; (2) conotoxins that act on voltage-gated ion channels, also called voltage-sensitive channels; (3) conotoxins that act on other receptors [8]. There are more than 300 ion channels in living cells. Many important functions of life, such as heartbeat, sensory conduction and central nervous system response, are controlled by cell signaling through various ion channels. Ion channel dysfunction can cause a variety of diseases, such as epilepsy, arrhythmia and type II diabetes.
These diseases are mainly treated with drugs that regulate the relevant ion channels [9]. Ion channels are also an important target for the treatment of viral diseases. Because of their importance to human life, ion channels have become the second most common drug development target. Three ion channels are the usual targets of toxins: potassium (K) channels, sodium (Na) channels, and calcium (Ca) channels. Based on function and target, conotoxins can be divided into three types: (i) K channel-targeting type; (ii) Na channel-targeting type; (iii) Ca channel-targeting type [10].
Due to the explosive growth of protein sequence data [11], traditional wet-lab methods can no longer meet the need for rapid identification of protein sequences. Yuan et al. developed a feature selection technique based on the binomial distribution and used radial basis function networks to predict the type of ion channel targeted by a toxin. Subsequently [12], they developed a predictor (iCTX-Type) to improve prediction accuracy. Zhang et al. applied mixed features to the prediction problem. Wang et al. combined analysis of variance and correlation (AVC) with support vector machines to reduce attribute redundancy and improve prediction accuracy and calculation speed. However, none of these methods can be used to predict the type of conotoxin defined by its target ion channel. For example, δ-conotoxin-like Ac6.1 and ω-conotoxin-like Ai6.2 both belong to toxin group C1. However, the former targets voltage-gated sodium channels, while the latter targets voltage-gated calcium channels [13].
To solve this problem, this article proposes a method that identifies the three types of conotoxins from their sequence information alone. We propose a deep learning model based on a long short-term memory (LSTM) neural network to predict the class of a conotoxin [14], and use word embedding to represent the conotoxin sequence as a vector, because a protein sequence can be viewed as a natural language. Effective features are extracted from the conotoxin sequence. To further evaluate the performance of the model, it is compared with the existing machine learning model SVM [15]. The experimental results show that the method has good prediction performance and is suitable for the classification and prediction of conotoxins. The workflow is shown in Fig. 1.
In this paper, word embedding and LSTM are combined to construct a model for conotoxin type prediction, exploiting the strengths of LSTM in sequence modeling and long-term memory and of word embedding in sequence representation.

Experiment 1: Model parameter optimization
Because the training set is small, this paper uses cross-validation for the experiments. For effective validation, accuracy and ROC are used as evaluation measures [22].
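The cross-validation used here splits the training data into five subsets, as described in the evaluation section. A minimal sketch of such a split is given below, assuming a plain shuffled split without stratification by class; the function names and the random seed are illustrative and are not taken from the paper.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split indices 0..n_samples-1 into k shuffled folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


# 152 is the training-set size reported in the Data set section.
splits = list(cross_validation_splits(152, k=5))
for fold_number, (train_idx, val_idx) in enumerate(splits, start=1):
    print(fold_number, len(train_idx), len(val_idx))
```

With only 25 potassium channel sequences in the training set, a stratified split that keeps class proportions equal across folds may be preferable in practice.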
Predicting which ion channel a conotoxin targets is a three-class classification problem: given the conotoxin sequence, the model decides whether it targets a potassium, sodium or calcium ion channel. The softmax activation function is therefore used when compiling the model [23].
First, an appropriate word embedding dimension is determined. Based on the characteristics of the collected conotoxin sequence data, embedding dimensions of 60 and 90 are compared. The ROC curves of the model during validation are shown in Fig. 3.

Experiment 2: Distribution of accuracy and loss function on training set and test set
Experiment 3: Algorithm performance comparison
KNN is one of the most commonly used classification algorithms. It predicts well and is not sensitive to outliers. Because the conotoxin data set collected in this paper may contain some erroneous data, KNN is used as the comparison algorithm. The ROC curves of the two methods on the test set are shown in Fig. 5. The area under the LSTM curve is larger than the area under the KNN curve, indicating that LSTM classifies more accurately than KNN and that the LSTM-based method is better suited than KNN to the conotoxin classification problem.
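The comparison above rests on the area under the ROC curve. The sketch below shows one common way to compute it: the AUC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. How the paper extends AUC to three classes is not stated, so the one-vs-rest treatment here is an assumption, and the labels and scores are made-up toy values.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(true_classes, class_scores, classes):
    """Per-class AUC: one class is positive, the other two are negative."""
    result = {}
    for k, name in enumerate(classes):
        labels = [1 if t == name else 0 for t in true_classes]
        scores = [row[k] for row in class_scores]
        result[name] = roc_auc(labels, scores)
    return result


# Hypothetical toy predictions; columns follow the order in CLASSES.
CLASSES = ("K", "Na", "Ca")
toy_true = ["K", "Na", "Ca", "Na", "Ca", "K"]
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
    [0.3, 0.4, 0.3],
    [0.1, 0.5, 0.4],
    [0.4, 0.3, 0.3],
]
toy_auc = one_vs_rest_auc(toy_true, toy_scores, CLASSES)
print(toy_auc)
```

The same function can be applied to the scores of any classifier, so an LSTM and a KNN model evaluated on one test set can be compared on equal terms.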

Discussion
When the embedding dimension is 90, the area under the ROC curve (AUC) is the largest and the prediction performance is the best (Fig. 3). Therefore, the dimension of the word embedding space is set to 90. With the number of training epochs set to 10, the accuracy and loss curves of the model on the training set and the independent test set are shown in Fig. 4. On both sets, the accuracy curves and the loss curves stay close to each other, indicating that there is no overfitting and that the model generalizes well. Combined with Fig. 4(a), and considering the class imbalance of the data, the proposed method has some reference value in terms of both accuracy and ROC, which further shows its superiority in the three-class classification of conotoxins (Fig. 5).

Conclusion
Based on the characteristics of the conotoxin sequence, this paper uses an LSTM algorithm with word embedding to classify and predict conotoxins. The algorithm requires no manual feature extraction or feature reconstruction, which simplifies its design, and it exploits the ability of LSTM to capture long-term dependencies in the conotoxin sequence to provide better classification performance. The experimental results show that the proposed algorithm can effectively assign conotoxins to the three categories.

Data set
The conotoxin sequences and their functions used in this experiment were collected from UniProt [16]. To improve data quality, we first restricted the collection to conotoxins whose function is to target potassium, calcium or sodium channels [17]. Conotoxins whose target is not clearly annotated on UniProt were discarded, except for a few that we were able to identify ourselves. In the end, we obtained 192 conotoxins: 74 calcium channel-targeting, 84 sodium channel-targeting and 34 potassium channel-targeting. The training set consists of 60 calcium channel, 67 sodium channel and 25 potassium channel conotoxins. The test set consists of 14 calcium channel, 17 sodium channel and 9 potassium channel conotoxins. The details of the data set are shown in Table 1.
Sequence characterization
The algorithm does not need the physical and chemical properties of amino acids to be determined manually through wet-lab experiments. It uses only the conotoxin character sequence as input and applies word embedding training as follows. The conotoxin sequence is first divided into individual characters, and each character is mapped to an integer. Because the length of conotoxin sequences is not fixed, we set a fixed maximum length according to the data set and encode every sequence to that length; when an encoded sequence is shorter than the maximum length, it is padded with 0 at the end. Word embedding training is then carried out through the neural network, and the 20 amino acid letters are mapped to the word embedding vector space, so that each character corresponds to a vector representation. These steps can be completed automatically with the Tokenizer API provided by Keras.
Each conotoxin sequence can be coded as an M × N matrix, where M is the set sequence length and N is the set embedding space vector dimension.
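The encoding described above can be illustrated without any deep learning library. The sketch below maps each amino acid letter to an integer, pads with 0 at the end, and looks up one vector per position to obtain the M × N matrix. It only approximates what the Keras `Tokenizer` (with `char_level=True`) and `pad_sequences` (with `padding="post"`) would do: the alphabetical index assignment, the toy peptide, and the fixed untrained embedding table are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid letters

# Index 0 is reserved for padding, so real characters start at 1.
# Assumption: alphabetical numbering; Keras numbers characters by frequency.
CHAR_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def encode_sequence(sequence, max_length):
    """Map each residue to an integer and zero-pad at the end to max_length."""
    codes = [CHAR_TO_INDEX[aa] for aa in sequence.upper()]
    if len(codes) > max_length:
        raise ValueError("sequence is longer than max_length")
    return codes + [0] * (max_length - len(codes))


def embed(codes, embedding_table):
    """Look up one N-dimensional vector per position, giving an M x N matrix."""
    return [embedding_table[c] for c in codes]


# Hypothetical toy peptide with M = 8 and N = 3; the paper uses N = 60 or 90.
toy_sequence = "CKGKGA"
max_len = 8
embedding_dim = 3

# Fixed, untrained table: row 0 (padding) is all zeros, rows 1..20 are
# arbitrary values. In a real model these rows are learned during training.
table = [[0.0] * embedding_dim] + [
    [0.01 * i * (d + 1) for d in range(embedding_dim)]
    for i in range(1, len(AMINO_ACIDS) + 1)
]

codes = encode_sequence(toy_sequence, max_len)
matrix = embed(codes, table)
print(codes)
print(len(matrix), "x", len(matrix[0]))
```

A sequence containing a letter outside the 20 standard amino acids raises a `KeyError` in this sketch; a real pipeline would need an explicit rule for such characters.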
LSTM three-class prediction model
LSTM is a recurrent neural network with a special structure that is effective for handling long-range dependencies in sequences [18]. It is composed of a group of unit modules with a memory function. Each unit module consists of an input gate, a forget gate and an output gate, which control the input, filtering and output of information. These gates enable LSTM to automatically extract and learn the long-range correlations in a sequence that are useful for the overall classification task. Predicting the class of a conotoxin from its sequence is exactly this kind of sequence classification problem, so LSTM is suitable for the classification of conotoxins.
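To make the role of the three gates concrete, the following sketch implements a single LSTM time step in plain Python using the standard LSTM equations. It is an illustration only, not the model used in the paper: the weights are fixed hypothetical values rather than trained ones, and the names `lstm_step`, `run_lstm` and `make_params` are chosen here.

```python
import math


def sigmoid(x):
    """Logistic function, squashing a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    """Return weights @ vector + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with input, forget and output gates."""
    z = list(x) + list(h_prev)  # concatenate input and previous hidden state
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]  # input gate
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]  # forget gate
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]  # output gate
    g = [math.tanh(a) for a in affine(params["W_g"], params["b_g"], z)]  # candidate
    # Keep part of the old memory (forget gate) and add new content (input gate).
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # The output gate decides how much of the memory is exposed.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_lstm(sequence, hidden_size, params):
    """Feed a sequence of vectors through the cell; return the last hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h


def make_params(input_size, hidden_size, scale=0.1):
    """Hypothetical fixed weights; a real model learns these by training."""
    width = input_size + hidden_size
    params = {}
    for g, name in enumerate(("i", "f", "o", "g")):
        params["W_" + name] = [
            [scale * math.sin(g + r + col) for col in range(width)]
            for r in range(hidden_size)
        ]
        params["b_" + name] = [0.0] * hidden_size
    return params


# Three time steps of 2-dimensional embedded characters (toy values).
toy_inputs = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
final_state = run_lstm(toy_inputs, hidden_size=4, params=make_params(2, 4))
print(final_state)
```

In a classifier, the final hidden state would be passed to an output layer with one unit per class.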
This article classifies conotoxins that target three ion channels: potassium, calcium and sodium. The output activation function is therefore softmax rather than sigmoid [19]. Sigmoid works well for two-class problems, but softmax is better suited to multi-class problems.
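The difference can be seen in a small example. Softmax turns three raw output scores into probabilities that sum to 1, so the classes compete with each other, whereas applying sigmoid to each score separately gives three independent values with no such constraint. The scores below are hypothetical and the class order is an assumption.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ("potassium", "sodium", "calcium")  # assumed output order


def predict_class(logits):
    """Return the most probable class name together with all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


# Hypothetical raw scores from the last layer for one conotoxin sequence.
label, probs = predict_class([0.5, 2.0, 1.0])
print(label, [round(p, 3) for p in probs])
```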
The overall process of the proposed classification and prediction algorithm is shown in Fig. 2. First, the amino acid characters that appear in the conotoxin sequences are mapped automatically to the embedding vector space through neural network training, so that each amino acid character corresponds to a vector representation. Then each conotoxin sequence is represented as the corresponding matrix. Finally, the matrix is used as the input of the LSTM model for training and learning.

Evaluation method and evaluation index
Cross-validation and an independent test data set are used to verify the performance of the algorithm. Cross-validation divides the training data into five subsets [20]. Each time, one subset is used as the validation set and the remaining four are combined as the training set. This process is repeated 5 times, so that each subset serves as the validation set once. The evaluation indicators of the algorithm are: 1) true positive rate (TPR); 2) false positive rate (FPR); 3) accuracy (correct index); 4) the ROC curve [21] and the area under it (AUC). The formulas for the indicators are as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

where TP is the number of positive samples predicted as positive, FP is the number of negative samples predicted as positive, TN is the number of negative samples predicted as negative, and FN is the number of positive samples predicted as negative.
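As a worked example of these indicators, the sketch below counts TP, FP, TN and FN for one class treated as positive and derives TPR, FPR and accuracy from the counts using their standard definitions. Treating each of the three classes as positive in turn (one-vs-rest) is an assumption, since the paper does not say how the binary indicators are applied to three classes; the labels are toy values.

```python
def confusion_counts(true_labels, predicted_labels, positive):
    """Count TP, FP, TN, FN with one class treated as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t != positive and p != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (TPR, FPR, accuracy) computed from the four counts."""
    tpr = tp / (tp + fn) if tp + fn else 0.0  # TP / (TP + FN)
    fpr = fp / (fp + tn) if fp + tn else 0.0  # FP / (FP + TN)
    acc = (tp + tn) / (tp + fp + tn + fn)     # correct / all samples
    return tpr, fpr, acc


# Hypothetical toy labels for eight test sequences.
toy_true = ["Ca", "Ca", "Na", "Na", "K", "K", "Na", "Ca"]
toy_pred = ["Ca", "Na", "Na", "Na", "K", "Ca", "Na", "Ca"]

counts = confusion_counts(toy_true, toy_pred, positive="Ca")
tpr, fpr, acc = rates(*counts)
print(counts, tpr, fpr, acc)
```

Sweeping a decision threshold over the predicted scores and plotting TPR against FPR at each threshold gives the ROC curve.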

Compliance with Ethical Standards
Research involving human participants and/or animals: This article does not contain any studies with human participants performed by any of the authors.
Funding: There is no funding for this study.
Conflict of Interest: The authors declare that they have no conflict of interest.

Figure 1 The flowchart of the proposed method
Model accuracy curve and loss function curve