A deep learning model for fish classification

Abstract


Introduction
Fish are one of the most widely studied groups of aquatic organisms; about 27,683 fish species have been catalogued into six classes, 62 orders and 540 families worldwide [1,2]. Fish taxonomy and rapid species identification are the two fundamental premises of fishery biodiversity and fishery resources management, and also an important part of marine biodiversity. As a traditional classification method, morphological identification has successfully described nearly one million species on Earth, which has laid a good foundation for species classification and identification [3,4]. However, routine morphological classification poses a challenge for fish owing to four limitations. First, because of individual, sex and geographical differences, phenotypic plasticity and genetic variability in the characters used for fish discrimination can result in incorrect classification [5]. Second, with the deterioration of the ecological environment and the disturbance of human activities, many fishery resources have been seriously damaged, making it more difficult to collect fish specimens, especially for species with scarce natural resources [6,7]. Third, some fishes show only subtle differences in body shape, color pattern, scale size and other externally visible morphological features, which leads to species being confused with one another. Finally, the use of identification keys demands not only professional taxonomic knowledge but also such extensive experience that misdiagnoses are common [8]. Given these limitations of morphology-based methods, a new technology for fish classification is needed.
The genomic approach is a new taxonomic technique that combines molecular biology with bioinformatics and uses DNA sequences as 'barcodes' to differentiate organisms [5]. The DNA-based barcoding method is accessible to non-specialists. Studies over more than 15 years have shown the effectiveness of DNA barcode technology, and it has been extensively used in various fields such as species identification [9], discovery of new or cryptic species [10,11], phylogeny and molecular evolution [12], biodiversity survey and assessment [13,14], customs inspection and quarantine [15], and conservation biology [16].

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.15.431244; this version posted February 15, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
In the field of species classification, DNA barcoding uses a short gene segment, the COI sequence, to build global standard dataset platforms, universal technical rules and identification systems for animal taxonomy [1]. The COI gene has a high evolution rate, obvious interspecific variation, relative conservation within species, good primer universality and easy amplification [17]. Therefore, the COI gene has been widely employed as an effective DNA barcode for species classification in varied animal lineages, including birds [18,19], mosquitoes [20,21], marine fish [22-24] and freshwater fish [25-27]. DNA barcodes based on the COI gene can identify marine fish with up to 98% accuracy and freshwater fish with 93% accuracy [28]. The DNA barcode approach has thus proven to be a valuable molecular tool for fish classification.
However, because COI gene sequences are complex and high-dimensional, analyzing them appropriately and extracting accessible information that allows fishes to be classified correctly is a major challenge. This issue requires a multidisciplinary approach to process DNA sequences and analyze the information they contain. Deep learning, which learns and extracts useful representations from raw data, trains a model and then uses the model to make predictions, has made great progress in recent years [29]. Therefore, in this paper we propose a novel DNA-barcode-based approach, called the ESK-model, that uses deep learning to classify fish from different families and determine which fishes are regarded as the outgroup. To verify the effectiveness of the model, three families with many species and obvious interspecific variation were selected as the datasets.
First, the model preprocesses the original data by representing the COI gene sequences as a matrix and then converting them into numerical data. Second, the model learns from these data using the Elastic Net-Stacked Autoencoder (EN-SAE) model and obtains an outgroup score for each fish. Finally, a kernel density estimation (KDE) model is used to generate a threshold and, based on this threshold, to predict which fish are outgroup. The main contributions of our paper are as follows:
• We introduce a deep learning model, based on DNA barcodes, that classifies fish from different families and determines which fish are outgroup, and that is effective and robust.
• To address the overfitting caused by the limited number of COI gene samples for species in the same family, an Elastic Net is used to increase the generalization ability of the model.
• We employ the EN-SAE model to obtain outgroup scores. The decision threshold is learned automatically from organisms in the same family by the KDE model. An original predictor is proposed based on these outgroup (anomaly) scores, whereas other classification works often overlook the importance of learning the threshold automatically.
• We quantitatively evaluate the performance of our approach, and the results demonstrate that our ESK-model outperforms state-of-the-art methods.

Data description
The COI sequences from three dominant families of fish in this study were

Data definition
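To facilitate subsequent processing, DNA sequences can be represented by a matrix. The COI sequences for each family were formulated as follows:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix}$$

where n denotes the number of samples and m denotes the number of features in each species.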

One-hot code
One-hot encoding converts categorical variables into a form that machine learning algorithms can readily use, namely a combination of 0s and 1s [30]. Therefore, the model encodes the matrix into numerical data using one-hot code. The COI gene is composed of four bases, A, T, C and G. Each coded base is a 1×4 vector of the form [0, 0, a_i, 0], in which the single element a_i = 1 marks the base and all other elements are 0. The four bases were thus formulated as in Eq (3).

In stage one, there are two main tasks: (1) preprocessing the raw data by representing the COI gene sequences as a matrix and (2) performing one-hot code on the matrix, because the features of each fish species need to be transformed into numerical data. Finally, the preprocessed data are used as inputs for stage two.
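The stage-one preprocessing can be illustrated with a minimal Python sketch. The position assigned to each base (`BASE_ORDER`) and the toy sequences are assumptions made for illustration; the source does not show which position of the 1×4 vector belongs to which base. The function names are hypothetical.

```python
# Assumed base-to-position mapping; the paper's Eq (3) may order the bases differently.
BASE_ORDER = "ATCG"


def one_hot_base(base):
    """Return the 1x4 one-hot vector for a single base (A, T, C or G)."""
    vec = [0, 0, 0, 0]
    # str.index raises ValueError for anything other than A/T/C/G (e.g. N or gaps).
    vec[BASE_ORDER.index(base.upper())] = 1
    return vec


def one_hot_sequences(sequences):
    """Encode n aligned sequences of length m as an n x (4*m) numeric matrix."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [[bit for base in seq for bit in one_hot_base(base)]
            for seq in sequences]


# Toy example: two hypothetical 4-bp fragments.
toy_sequences = ["ATCG", "GGCA"]
x = one_hot_sequences(toy_sequences)
```

Each row of `x` is one sample; a 565-bp COI fragment would yield a row of 4 × 565 = 2,260 values under this scheme.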

An overview of the ESK-model
In stage two, a deep learning network, EN-SAE, is used to learn deep features from the data preprocessed in stage one. The EN-SAE model compresses the digitized data into a latent representation from which the input is reconstructed, then calculates the difference between the input and the output to obtain an outgroup score for each fish. Finally, the outgroup scores are used as inputs for stage three.
In stage three, the KDE technique is used to learn the relationship among the scores from stage two and then to fit the data distribution according to the properties of the outgroup scores. After that, the KDE model determines which fish are ingroup and which are outgroup based on the threshold.

Learning deep features and computing outgroup scores by EN-SAE
A traditional AE is a three-layer neural network consisting of an input layer, a hidden layer and an output layer. The structure of the AE is symmetric, that is, the input layer and the output layer have the same number of nodes, and the dimensions of each node are also the same [31]. The purpose of the AE is to compress the input data while preserving the information needed to reconstruct the input, using the backpropagation algorithm to update the weights so that the output is as similar to the input as possible [32]. However, the output data are not sufficient to yield a rewarding representation of the input. The reconstruction criterion with a three-layer structure cannot guarantee the extraction of useful features, as it can lead to the obvious solution "simply copy the input" [33]. The SAE can largely solve this problem.
(1) Encoder: in this step, the activation function $\sigma_e$ maps the input data vector x to the hidden representation h, which compresses the input data and retains the more useful information. The typical form is an affine mapping followed by a nonlinearity:

$$h = \sigma_e(wx + b)$$

where x denotes the input data vector, w is the weight matrix connecting the input layer to the hidden layer, b is the bias vector of the nodes of the latent layer, and $\sigma_e$ represents the activation function, such as Sigmoid, ReLU or Tanh.
(2) Decoder: in this step, the hidden representation h is mapped into the reconstruction vector y. The typical form is as follows:

$$y = \sigma_d(w'h + b')$$

where $w'$ is the weight matrix connecting the latent layer to the output layer, $b'$ is the bias vector, and $\sigma_d$ represents the activation function.
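The encoder and decoder mappings described above can be sketched as a single forward pass. This is an illustrative sketch only: the layer sizes, the random (untrained) weights and the choice of Sigmoid for both activation functions are assumptions, and the helper names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic activation, one of the options the text lists for sigma_e and sigma_d."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Compute weights * vec + bias, with weights given as a list of rows."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, vec)) + b_i
            for row, b_i in zip(weights, bias)]


def encode(x, w, b):
    """Encoder: h = sigma_e(w x + b)."""
    return [sigmoid(z) for z in affine(w, b, x)]


def decode(h, w_prime, b_prime):
    """Decoder: y = sigma_d(w' h + b')."""
    return [sigmoid(z) for z in affine(w_prime, b_prime, h)]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these by backpropagation."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]    # two one-hot encoded bases
w, b = random_matrix(3, 8, rng), [0.0] * 3       # 8 inputs -> 3 hidden nodes
w_prime, b_prime = random_matrix(8, 3, rng), [0.0] * 8   # 3 hidden -> 8 outputs

h = encode(x, w, b)               # compressed hidden representation
y = decode(h, w_prime, b_prime)   # reconstruction, same length as x
```

With untrained weights the reconstruction `y` is meaningless; the point is only that `h` is lower-dimensional than `x` and that `y` has the same shape as `x`, which is what makes a reconstruction error well defined.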
The loss function is defined to measure the reliability of the SAE. The SAE is trained to reconstruct the features of the input, and the weights of the encoder and decoder are adjusted to minimize the error between the output and the input. The loss function is therefore represented by the mean squared error as follows:

$$L(x, y) = \frac{1}{n}\sum_{i=1}^{n}\lVert x_i - y_i\rVert^2$$

The L1-norm penalty, as used in Lasso regression, contributes to generating a sparse matrix. It is defined as $\lVert w\rVert_1 = \sum_j |w_j|$, where $\lVert w\rVert_1$ is the sum of the absolute values of the elements of the weight vector w. The L2-norm penalty is defined as $\lVert w\rVert_2^2 = \sum_j w_j^2$, where $\lVert w\rVert_2^2$ is the sum of the squares of the elements of the weight vector w. During training, we usually tend to make the weights as small as possible, because a model with small parameters is generally believed to be simpler and to fit different data effectively. Thus, the L2-norm can avoid overfitting to some extent and improve the generalization of the model so that it adapts to different fish families.
On the basis of the proposed EN-SAE model, an outgroup score can be defined for each species to measure whether a fish is outgroup. The higher the outgroup score, the more likely the fish is to be treated as outgroup.
Therefore, the outgroup scores can be calculated by the following formula:

$$S(x) = \lVert x - y\rVert^2 + \lambda_1\lVert w\rVert_2^2 + \lambda_2\lVert w\rVert_1$$

where $\lambda_1$ is a parameter to adjust the L2-norm and $\lambda_2$ is a parameter to adjust the L1-norm.
The EN-SAE model projects high-dimensional features into low-dimensional features step by step to obtain a higher-level representation of the COI sequences, which is significantly more suitable for extracting features and expressing the original data.
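The way reconstruction error and the Elastic Net penalty combine into a score can be sketched as follows. The exact form of the score is an assumption here: the sketch adds the squared reconstruction error of each sample to a penalty in which `lam1` scales the squared L2-norm and `lam2` scales the L1-norm, following the parameter description in the text. The function names and the toy values are hypothetical.

```python
def reconstruction_error(x, y):
    """Squared Euclidean distance between an input x and its reconstruction y."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def elastic_net_penalty(weights, lam1, lam2):
    """lam1 * (sum of squared weights) + lam2 * (sum of absolute weights)."""
    l2_squared = sum(w * w for row in weights for w in row)
    l1 = sum(abs(w) for row in weights for w in row)
    return lam1 * l2_squared + lam2 * l1


def outgroup_scores(inputs, outputs, weights, lam1, lam2):
    """One score per sample: reconstruction error plus the shared penalty term."""
    penalty = elastic_net_penalty(weights, lam1, lam2)
    return [reconstruction_error(x, y) + penalty
            for x, y in zip(inputs, outputs)]


# Toy values: the first sample is reconstructed poorly, the second perfectly.
inputs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
outputs = [[0.5, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.0]]
weights = [[0.5, -0.5], [1.0, 0.0]]

scores = outgroup_scores(inputs, outputs, weights, lam1=0.1, lam2=0.01)
```

Because the penalty depends only on the weights, it shifts every score by the same amount; under this assumed form, the ranking of samples is driven entirely by how poorly each one is reconstructed.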

Analyzing the outgroup scores by using KDE
KDE borrows its intuitive approach from the familiar histogram and is among the most common nonparametric density estimation techniques. KDE smooths the data points and then fits the distribution from the properties of the data themselves. The decision threshold is determined by the KDE model based on the outgroup scores, and the classification of each fish then follows from this threshold.
Given the outgroup scores vector s obtained from the EN-SAE model, KDE estimates the probability density function (PDF) p(s) in a nonparametric way:

$$p(s) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{s - s_i}{h}\right) \tag{8}$$

where n is the size of the training dataset, $\{s_i\}$, i = 1, 2, …, n, is the training dataset's outgroup scores vector, K(⋅) is the kernel function, and h is the bandwidth.
There are many kinds of kernel function; the Epanechnikov function is among the most commonly used in density estimation and performs well. Therefore, the Epanechnikov kernel is used to estimate the PDF:

$$K(u) = \begin{cases} \frac{3}{4}\left(1 - u^2\right), & |u| \le 1 \\ 0, & \text{otherwise} \end{cases} \tag{9}$$

After obtaining p(s) from the training outgroup scores vector s by KDE, the cumulative distribution function (CDF) F(s) can be defined as follows:

$$F(s) = \int_{-\infty}^{s} p(t)\,dt \tag{10}$$

Given a significance level parameter α ∊ [0,1] and the CDF, a decision threshold $s_\alpha$ can be found that satisfies the following formula:

$$F(s_\alpha) = 1 - \alpha \tag{11}$$

If the outgroup score of a species meets the condition $s \ge s_\alpha$, the species is considered outgroup; otherwise, it is ingroup. Repeated experiments suggest setting the significance level parameter α to 0.05.
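The thresholding procedure can be sketched in plain Python as follows. The sketch assumes the standard Epanechnikov kernel, uses its closed-form integral for the CDF, and finds the threshold by bisection. The bandwidth, the training scores and the function names are hypothetical choices for illustration, not values from the study.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def epanechnikov_cdf(u):
    """Closed-form integral of the kernel from -infinity to u."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return 0.25 * (2.0 + 3.0 * u - u ** 3)


def kde_pdf(s, train_scores, h):
    """Kernel density estimate p(s) with bandwidth h."""
    return sum(epanechnikov((s - si) / h) for si in train_scores) / (len(train_scores) * h)


def kde_cdf(s, train_scores, h):
    """Cumulative distribution F(s) of the kernel density estimate."""
    return sum(epanechnikov_cdf((s - si) / h) for si in train_scores) / len(train_scores)


def decision_threshold(train_scores, h, alpha=0.05, tol=1e-9):
    """Find s_alpha with F(s_alpha) = 1 - alpha by bisection (F is non-decreasing)."""
    lo, hi = min(train_scores) - h, max(train_scores) + h   # F(lo) = 0, F(hi) = 1
    target = 1.0 - alpha
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, train_scores, h) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical training scores from ingroup fish and a hypothetical bandwidth.
train_scores = [0.009, 0.010, 0.011, 0.012, 0.013]
bandwidth = 0.002
s_alpha = decision_threshold(train_scores, bandwidth, alpha=0.05)

# Scores at or above the threshold are labeled outgroup.
labels = ["outgroup" if s >= s_alpha else "ingroup" for s in [0.011, 0.050]]
```

The bandwidth controls how far the fitted density extends beyond the largest training score, so it directly affects where the threshold falls; the source does not state how h was chosen.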
The ESK-model algorithm is summarized in Algorithm 1.

Algorithm 1 ESK-model
Input: the COI sequences of each family
Output: the outgroup in matrix x
Step 1: Preprocessing data

Evaluation method
To test the performance of the proposed model, the samples were divided into four situations based on the actual classification and the classification predicted by the ESK-model.

Additionally, Table 3 lists the detailed data corresponding to Figs 4-6. The results in Table 3 show that the outgroup scores of the proposed model with five layers on the three datasets were 0.0193, 0.0197 and 0.01, respectively. Moreover, after the number of AEs increased from 3 to 5, the outgroup scores on the three datasets decreased by approximately 29.04%, 41.02% and 16.90%, respectively. These results indicate that the proposed method achieves low scores when identifying fish from different families and that the outgroup scores gradually stabilize.

Performance evaluation with different methods
We compared our method, the ESK-model, with four state-of-the-art algorithms, namely one-class support vector machine (OC-SVM) [35], K-nearest neighbor (KNN) [36], isolation forest (iForest) [37] and autoencoder (AE) [38], to evaluate performance on the task of sorting fishes from different families based on DNA barcodes. Cross-validation was used for model training, and the confusion matrices of the different models on the three datasets are shown in Fig 10. To show the specific relationship between our method and the other four methods, we use histograms to compare the three evaluation metrics.
The most surprising finding was that the proposed model could accurately classify fish from different families. EN-SAE is used to calculate the outgroup scores; when the outgroup score is high, the probability of a fish being identified as belonging to another family increases. Because the number of fish samples belonging to the same family far exceeds the number from other families, EN-SAE can fit and learn the characteristics of the ingroup fish well during training. Conversely, the number of fishes from other families is relatively small, so they are fitted poorly, resulting in higher outgroup scores.

Conclusion
In this study, we proposed the ESK-model, which fuses the EN-SAE model and KDE technology to classify fish from different families using DNA barcodes. The experimental results and findings demonstrate the effectiveness of the proposed model.
The main results and findings of this paper are as follows:
(1) The outgroup scores leveled off when the number of stacked AEs was set to five.
(2) Adding Elastic Net can prevent overfitting more effectively and improve the

An overview of the proposed model, the ESK-model, is shown in Fig 1. It consists of three stages: (1) the data preprocessing stage, (2) the stage of learning deep features and computing the outgroup score of each species, and (3) the stage of deciding the threshold based on the outgroup scores and classifying fishes from different families.

Fig 1. An overview of ESK-model. Three-dimensional visualization of data is

The SAE model builds a deep neural network based on the AE by stacking several AEs, feeding the hidden representation of one AE as the input of the next. In other words, the compressed features of the hidden layer are passed to the next AE for training. In this way, layer-by-layer training compresses the input features while extracting more meaningful features of the COI sequences. The decoder can then reconstruct the input with a sufficiently small difference. The structure of the SAE is shown in Fig 2.
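The stacking idea, in which each AE receives the hidden representation of the previous one, can be sketched as a chain of encoders. This sketch shows only the data flow through untrained layers; the layer widths, the Sigmoid activation and the helper names are assumptions, and the layer-by-layer training itself is omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(n_in, n_out, rng):
    """Random (untrained) encoder parameters: an n_out x n_in weight matrix and a bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def encode(vec, layer):
    """Apply one encoder: sigmoid(weights * vec + bias)."""
    weights, bias = layer
    return [sigmoid(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def stacked_encode(x, layers):
    """Feed the hidden representation of each AE into the next one."""
    representations = []
    h = x
    for layer in layers:
        h = encode(h, layer)
        representations.append(h)
    return representations


rng = random.Random(0)
layer_sizes = [16, 12, 8, 6, 4, 2]   # input width followed by five hypothetical hidden widths
layers = [make_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

x = [1.0, 0.0, 0.0, 0.0] * 4         # four one-hot encoded bases
representations = stacked_encode(x, layers)
```

In layer-by-layer training, each AE would first be trained to reconstruct its own input before its hidden output is passed on; here the five encoders simply shrink the representation from 16 values to 2.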
Represent the DNA sequences by a matrix
Encode the matrix into a numeric type as matrix x
Step 2: Training EN-SAE model
Set the number of stacked AEs L.

Impact of the number of stacked AEs on the classification performance
In deep learning, the number of layers in a model is a critical factor because it directly affects the performance of the model. After all of the COI sequences were prepared, the impact of the number of stacked AEs in our model on the classification performance was assessed. The trends of the outgroup scores with 3 to 8 stacked AEs on Sciaenidae, Barbinae and Mugilidae are shown in Figs 4-6. The experimental results in Fig 4 demonstrate that, on Sciaenidae, the outgroup scores decreased rapidly as the number of AEs increased while the number of AEs was fewer than five, and gradually stabilized when the number of AEs was greater than five. The outgroup scores on the other two datasets showed the same trend as those on Sciaenidae. These results indicate that the best classification performance is reached when five AEs are stacked.

Fig 10. Confusion matrix of five models on three datasets.
A significant experimental result was that the ESK-model achieved the best discrimination performance when the number of stacked AEs was set to five. There are several possible reasons for this result. The features of the COI fragment cannot be fully learned when only a few AEs are stacked. As the number of AEs increases, the proposed model can learn deeper hidden features of the DNA sequences. When the number of AEs increased to five, the outgroup scores decreased sharply. Experiments showed that further increasing the number of AEs did not improve performance. The performance tended to be stable when the number of AEs was more than five because the deep features had already been fully learned. Hence, the optimal number of stacked AEs in the ESK-model was five. Another notable experimental result was that Elastic Net can improve the performance of the proposed model. A good deep learning model usually requires abundant data for training, but because of the difficulty of obtaining COI sequences of fishes from different families, overfitting on small datasets becomes a serious problem. Solving the overfitting problem when training on small datasets is therefore of great importance. The proposed model uses Elastic Net to solve the overfitting problem and improve the generalization ability of the model. Moreover, the genetic characteristics of fish are high-dimensional data, which are time-consuming to train on, and directly combining a set of fully connected AEs often fails to extract useful information. Elastic Net provides sparse connections and can also save training time. Therefore, Elastic Net can improve the performance of the proposed model.
Therefore, they are more likely to be treated as outgroup by the KDE model. At the same time, the comparison with other algorithms further confirms that the proposed model performs better in fish classification. These positive results and findings suggest that the ESK-model, based on deep learning and DNA barcode technology, can effectively classify fish from different families.

Table 1. Summary of datasets.
collected, in which the length of the COI gene was 565 bp. Twenty homologous sequences from Sphyraena pinguis and Sphyraena jello (Mugiliformes) were designated as the outgroup. The species of the experimental samples of Mugilidae are shown in S3 Table.
[34], it can be used to choose more meaningful representations. During training, there are too many features to tell which ones contribute most to the model. We therefore dropped the connections whose contribution is so small that removing them has no impact on the model [34]. This reduces training time and lets the model learn more useful features.

In Table 2, the four situations are illustrated with a confusion matrix. True positive (TP) is the number of outgroups that are correctly classified as outgroup. True negative (TN) is the number of ingroups that are correctly classified as ingroup. False positive (FP) is the number of ingroups that are wrongly classified as outgroup. False negative (FN) is the number of outgroups that are wrongly classified as ingroup.
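The three metrics reported later (Accuracy, Recall and F-measure) can be computed from these four counts. The sketch below assumes the standard definitions, since the source's own formulas are not shown; the function name and the example counts are hypothetical and are not results from the study.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Return (accuracy, recall, f_measure) from confusion-matrix counts.

    Outgroup is treated as the positive class, matching the TP/TN/FP/FN
    definitions in the text. Undefined ratios are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return accuracy, recall, f_measure


# Hypothetical counts: 20 outgroup and 100 ingroup samples.
accuracy, recall, f_measure = evaluation_metrics(tp=18, tn=95, fp=5, fn=2)
```

Because outgroup samples are much rarer than ingroup samples, accuracy alone can look high even when many outgroups are missed, which is why Recall and F-measure are useful alongside it.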

Table 2. Confusion matrix.

Table 3. The outgroup scores with different numbers of AEs on three datasets.
To evaluate the effect of Elastic Net on model performance, the Stacked Autoencoder-Kernel Density Estimation (SK) model and the ESK-model were compared in Figs 7-9. The evaluation method was defined in the previous section. As shown in Figs 7-9, all evaluation indicators of the ESK-model were higher than those of the SK-model, which does not include Elastic Net.

Table 4. The evaluation metrics of the SK and ESK models.
Additionally, Table 5 exhibits the detailed data corresponding to Figs 11-13. As shown in Figs 11-13, the ESK-model provides stable and efficient results on the three datasets and achieves the highest Accuracy, Recall and F-measure. These results show that the ESK-model is superior to the other methods.

.9710 0.9694 0.9845
Note that the best result is typeset in bold. The order of the evaluation metrics is as follows: Accuracy, Recall, F-measure.

DNA barcode with the use of representative data to classify fishes from different families and distinguish the outgroup. In this section, we discuss and analyze the experimental results and findings.