An internet traffic classification method based on echo state network and improved salp swarm algorithm

Internet traffic classification is fundamental to network monitoring, service quality and security. In this paper, we propose an Internet traffic classification method based on the Echo State Network (ESN). To enhance the identification performance, we improve the Salp Swarm Algorithm (SSA) and use it to optimize the ESN. First, Tent mapping with reverse learning, a polynomial operator and a dynamic mutation strategy are introduced to improve the SSA, which enhances its optimization performance. Then, the advanced SSA is utilized to optimize the hyperparameters of the ESN, including the size of the reservoir, sparse degree, spectral radius and input scale. Finally, the optimized ESN is adopted to classify Internet traffic. The simulation results show that the proposed ESN-based method outperforms traditional machine learning algorithms in terms of per-class metrics and overall accuracy.


INTRODUCTION
Table 1: Summary of related work.

| Work | Simple description | Comments |
|---|---|---|
| Aceto et al. (2010) | Classify the network traffic of a specific port into the corresponding network application | Affected by dynamic ports |
| Finsterbusch et al. (2013) | Identify the application by analyzing the protocol signature in the payload | Large computational overhead; may cause unnecessary privacy disputes |
| Liu et al. (2020) | Utilize a bidirectional GRU to extract the forward and backward features of byte sequences in the session, and then employ an attention mechanism to weight the features according to their contributions | |
| Wang et al. (2017) | Cut the original data traffic and input it into a CNN to classify network traffic | |
| Yang et al. (2018) | Adopt a CNN to extract high-dimensional features of the network traffic, and then extract the representative features from these features based on the AE | |
| Zeng et al. (2019) | Utilize CNN, LSTM and SAE to extract the spatial, temporal and coding features of the original traffic, and combine these features to achieve a comprehensive understanding of the original traffic | |
| Lotfollahi et al. (2020) | Use header information and payload data to train a CNN and an SAE, respectively | |
| Ren, Gu & Wei (2021) | Divide a large classification into small classifications by using a tree structure | Simplifies the training process; easy implementation and fast training speed |

Compared with traditional ML algorithms, the neural network-based methods have achieved better identification results. However, these methods suffer from long training time and high computational cost, since the adopted neural networks have complex structures and their weights have to be determined through multiple iterations. The related studies are summarized in Table 1, including brief descriptions of each approach and comments.

The echo state network (ESN), as a kind of neural network, adopts a reservoir composed of sparsely connected neurons as the hidden layer to perform a high-dimensional and nonlinear representation of the input data (Grigoryeva & Ortega, 2018). It only needs to train the weights from the reservoir to the output layer, which simplifies the training process and solves problems of traditional neural networks such as complex training and the difficulty of determining the network structure. The ESN, with its easy implementation and fast training speed, has good application prospects in time series forecasting (Wang et al., 2018; Long, Zhang & Li, 2019; Zhang et al., 2019; Duan & Wang, 2015). In this paper, we propose an Internet traffic classification method based on ESN.
Although the echo state network has advantages in solving the above problems, it also has some problems of its own. For example, the stability of the reservoir affects the generalization ability of the network and can easily cause problems such as over-fitting. Hu, Wang & Lv (2020) proposed an improved ESN that combines ESN with deep learning and uses the efficient learning ability of deep learning to improve the stability of the reservoir. When ESN is used to classify network traffic, hyperparameters such as the input scale and sparse degree are of great importance to the classification performance, and the results obtained with different hyperparameter configurations vary greatly. Therefore, selecting the optimal hyperparameters is critical to the results of network traffic classification. The traditional approach sets the hyperparameters of the ESN manually. However, this approach is time-consuming, and it is difficult to select the optimal hyperparameters in this way.

Swarm Intelligence (SI) optimization algorithms are inspired by the collective behavior of creatures. Since the individuals in the population can interact and share information, SI algorithms are flexible and converge quickly, and can provide satisfactory solutions when applied to the automatic optimization of hyperparameters. At present, the Genetic Algorithm (GA) (Zhong et al., 2017), the Particle Swarm Algorithm (PSO) (Chouikhi et al., 2017), the Fruit Fly Optimization Algorithm (FOA) (Tian, 2020; Zhang et al., 2020), the Differential Evolution Algorithm (DE) (Hu, Wang & Tao, 2021) and the Grey Wolf Optimizer (GWO) (Kohli & Arora, 2018) have been adopted to automatically optimize the hyperparameters of ESN.
However, the reservoir of ESN contains many nodes and its search space is large, hence the above-mentioned algorithms are not well suited to optimizing hyperparameters with a large range of values. The Salp Swarm Algorithm (SSA) is a new type of SI algorithm that simulates the foraging behavior of salps. It guides the search by simulating the population behavior of salps sailing and foraging in the ocean, and has the advantages of high optimization accuracy, strong search ability and good robustness. SSA has been applied to function optimization (Rizk-Allah et al., 2019), combination optimization (Abualigah et al., 2020) and shortest path solution (Ateya et al., 2019), where it has been able to obtain the optimal solution. Mirjalili et al. (2017) evaluated SSA on 19 well-known mathematical functions and 2 optimization problems, and compared it with state-of-the-art SI algorithms such as PSO, GA and FOA. The results show that SSA has better optimization performance than the other SI algorithms. However, like other SI algorithms, SSA also has some shortcomings, such as inadequate spatial search at the early stage and reduced population diversity during the iterative process. To enable SSA to find the optimal solution more accurately, we introduce Tent mapping with reverse learning, a polynomial operator and a dynamic mutation strategy to improve SSA, and then adopt the advanced SSA to optimize the important hyperparameters of ESN automatically.

Our main contributions are as follows:
(1) Network traffic classification plays a vital role in analyzing user behavior, enhancing network controllability, improving service quality and ensuring network security. Considering that ESN has advantages such as excellent classification performance, fast training speed and easy implementation, we propose a classification method of Internet traffic based on ESN.
(2) We improve the SSA. Firstly, Tent mapping with reverse learning is introduced to initialize the population, which makes the distribution of the initial population positions uniform and improves the search efficiency of the SSA. Secondly, a polynomial operator is used to maximize the diversification of the search domain and improve the global exploration ability of the SSA. Finally, a dynamic mutation strategy is adopted to increase the population diversity at the later stage and keep the SSA from falling into a local optimum.
(3) Hyperparameters of ESN such as input scale and sparse degree are of great importance to the classification performance. The reservoir of ESN contains many nodes and its search space is large, thus the optimization effect of the traditional hyperparameter optimization methods still needs to be improved. We use the advanced SSA to optimize the important hyperparameters of ESN.
The rest of the paper is organized as follows: 'Materials & Methods' introduces the ESN and discusses its important hyperparameters. Then, we improve the SSA and describe the network traffic classification method. In 'Results', we perform experiments to verify the effectiveness of the proposed scheme. Finally, we summarize our work in 'Conclusion'.

ESN
An ESN is composed of an input layer, a reservoir and an output layer. The reservoir contains hundreds of sparsely connected neurons, and the connection weights between neurons are randomly generated and fixed. Figure 1 shows the ESN structure.
In the state equation and output equation of the ESN (Han & Xu, 2017), $u(t) \in \mathbb{R}^{M \times 1}$ is the input vector, $y(t) \in \mathbb{R}^{M \times 1}$ is the output vector, $b_x \in \mathbb{R}^{N \times 1}$ is the input bias, and $b \in \mathbb{R}^{M \times 1}$ is the output bias. The state $x(t) \in \mathbb{R}^{N \times 1}$ at the current time is calculated from the input vector $u(t)$ at the current time $t$ and the state of the reservoir $x(t-1)$ at the previous time $t-1$. $\phi(\cdot)$ is the activation function of the neuron, for which the Sigmoid function or the tanh function can be selected. The elements of the input-reservoir connection weight matrix $W_{in} \in \mathbb{R}^{N \times K}$ lie in the interval $[-1, 1]$. $W_x \in \mathbb{R}^{N \times N}$ is the internal connection weight matrix of the reservoir. $W_{back} \in \mathbb{R}^{N \times L}$ is the output-reservoir connection weight matrix. $W_{out} \in \mathbb{R}^{L \times (K + N \times L)}$ is the output connection weight matrix. $W_{in}$, $W_x$ and $W_{back}$ are generated randomly and remain unchanged during the training phase of the ESN. The network only needs to train the output connection weight matrix $W_{out}$, which reduces the computational complexity.
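For reference, a commonly used form of the ESN state and output equations that uses the symbols defined above is sketched below. It is an assumption, not a transcription of the equations in Han & Xu (2017): it takes $u(t)$ to have $K$ entries and $y(t)$ to have $L$ entries so that the shapes agree with $W_{in}$ and $W_{back}$, and it assumes that the readout acts on the concatenation of the input, the reservoir state and the previous output.

```latex
\begin{align*}
x(t) &= \phi\bigl(W_{in}\,u(t) + W_{x}\,x(t-1) + W_{back}\,y(t-1) + b_{x}\bigr) \\
y(t) &= W_{out}\,\bigl[\,u(t);\; x(t);\; y(t-1)\,\bigr] + b
\end{align*}
```

Under this form, only $W_{out}$ (and the output bias $b$) is fitted, typically by linear or ridge regression on the collected reservoir states.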
The core of ESN is the reservoir. Its performance depends on four crucial hyperparameters: the size of the reservoir N , spectral radius R, sparse degree D and input scale S. How to select these hyperparameters is very important (Duan & Wang, 2015).
(1) Size of the reservoir. The size of the reservoir $N$, the most important hyperparameter affecting the performance of ESN, is the number of neurons in the reservoir. Up to a point, a larger number of neurons gives better classification performance. However, if the number of neurons is too large, overfitting will be caused.
(2) Spectral radius. The spectral radius $R$ is the largest absolute value among the eigenvalues of the internal connection weight matrix $W_x$ of the reservoir. $R < 1$ is a necessary condition to ensure the stability of the network.
(3) Sparse degree. The sparse degree $D$ indicates the sparsity of the neuron connections. The neurons in the reservoir are sparsely connected rather than fully connected. The larger the value is, the stronger the nonlinear approximation ability is.
(4) Input scale. The input scale $S$ is the scale factor applied before the data is input into the reservoir, and represents the range of the input connection weights. According to Eq. (1), it determines the working interval of the activation function and the extent to which the input data affect the state of the reservoir. It is usually in the interval $[0, 1]$.
The performance of ESN relies heavily on the above hyperparameters, and the results obtained through different hyperparameter configurations vary greatly.
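To make the role of the four hyperparameters concrete, the following is a minimal NumPy sketch of how $N$, $R$, $D$ and $S$ could enter the construction of a reservoir. It is illustrative only and not the authors' implementation: the function names are hypothetical, $D$ is interpreted here as the fraction of non-zero reservoir connections, the output feedback $W_{back}$ and the biases are omitted, and the readout is fitted by ridge regression, which is a common choice rather than something stated in the text.

```python
import numpy as np

def build_reservoir(n_inputs, N, R, D, S, seed=0):
    """Create input and reservoir weights from the four hyperparameters."""
    rng = np.random.default_rng(seed)
    # Input weights drawn uniformly from [-S, S] (S = input scale).
    W_in = rng.uniform(-S, S, size=(N, n_inputs))
    # Dense random weights, then keep each connection with probability D
    # (assumption: D is the connection density of the reservoir).
    W_x = rng.uniform(-1.0, 1.0, size=(N, N))
    mask = rng.random((N, N)) < D
    W_x = W_x * mask
    # Rescale so that the spectral radius of W_x equals R.
    rho = np.max(np.abs(np.linalg.eigvals(W_x)))
    if rho > 0:
        W_x = W_x * (R / rho)
    return W_in, W_x

def reservoir_states(U, W_in, W_x):
    """Run the reservoir over inputs U of shape (T, n_inputs)."""
    x = np.zeros(W_x.shape[0])
    states = np.empty((U.shape[0], W_x.shape[0]))
    for t, u in enumerate(U):
        x = np.tanh(W_in @ u + W_x @ x)  # state update without feedback
        states[t] = x
    return states

def train_readout(states, Y, ridge=1e-6):
    """Fit the output weights by ridge regression (only trained part)."""
    X = np.hstack([states, np.ones((states.shape[0], 1))])  # bias column
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y).T  # shape (n_outputs, N + 1)
```

Only `train_readout` involves fitting; the reservoir itself stays fixed once it has been generated, which is why the four hyperparameters have such a strong influence on the result.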

The advanced SSA
SSA is a new type of SI algorithm with the advantages of high optimization accuracy, good robustness and high convergence rate. SSA divides the population into leaders and followers. They form a salp chain to perform population optimization. To enable it to find the optimal solution more accurately, we improved the SSA.

Population initialization
The uniform distribution of the population can effectively maintain the population diversity and improve the optimization performance. The initial population of SSA is generated randomly. Lack of prior knowledge leads to uneven distribution and poor initial population diversity.
To enhance population diversity and search efficiency, the Tent mapping with reverse learning is introduced to initialize the population. The Tent mapping has the characteristics of randomness, ergodicity and regularity. Thus, it can generate an initial salp population with rich diversity. Then, the reverse learning strategy is applied to this population to generate the reverse population. At last, the population generated by the Tent mapping and its reverse population are merged and sorted, and the salps with better fitness values are selected to form the initial population. This expands the search range of the population, reduces invalid search, and improves the search efficiency. The population generated by the Tent mapping follows Arora & An (2019), where $x_d$ and $x_{d+1}$ are the respective values of the $d$th and $(d+1)$th dimensions of the population generated by the Tent mapping. The reverse learning strategy is then applied to the population generated by the Tent mapping; in the resulting reverse population, $x'_{d+1}$ is the value of the $(d+1)$th dimension, $u_d$ is the upper limit of the $d$th dimension, and $l_d$ is the lower limit of the $d$th dimension.
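The initialization described above can be sketched as follows. This is a hedged illustration under several assumptions: the function name is hypothetical, a skewed tent map with breakpoint `a = 0.7` is used (the symmetric map with breakpoint 0.5 collapses to zero in binary floating point after a few dozen iterations), the reverse individual is taken as `lower + upper - x` (the usual opposition-based form), and a smaller fitness value is treated as better.

```python
import numpy as np

def tent_reverse_init(fitness, n_pop, lower, upper, a=0.7, seed=0):
    """Tent-map population plus its reverse population; keep the best n_pop."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size

    # Chaotic sequence along the dimensions: x_{d+1} is computed from x_d.
    chaos = np.empty((n_pop, dim))
    x = rng.uniform(0.01, 0.99, size=n_pop)  # one starting value per salp
    for d in range(dim):
        x = np.where(x < a, x / a, (1.0 - x) / (1.0 - a))
        chaos[:, d] = x

    # Map the chaotic values in [0, 1] onto the search range.
    pop = lower + chaos * (upper - lower)
    # Reverse (opposition) population within the same bounds.
    reverse = lower + upper - pop

    # Merge, sort by fitness and keep the better half.
    merged = np.vstack([pop, reverse])
    scores = np.array([fitness(p) for p in merged])
    best = np.argsort(scores)[:n_pop]  # ascending: smaller is better here
    return merged[best]
```

For example, `tent_reverse_init(lambda p: float(np.sum(p ** 2)), 10, [-5, -5, -5], [5, 5, 5])` returns ten three-dimensional positions inside the bounds, ordered from best to worst fitness.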

Leader position update
Adequate global exploration helps the algorithm to obtain better optimization results. The traditional SSA conducts global exploration by introducing random numbers into the leader position update. However, these random numbers are purely random and cannot fully perform global exploration. A polynomial operator can maximize the diversification of the search domain and enhance the convergence speed of the SSA at the later stage. Therefore, we introduce a polynomial operator into the leader position update. In the formula of the improved leader position update, $F_d$ is the food position (i.e., the optimal position of a salp in the population) on the $d$th dimension; $u$ is a random number in the interval $[0, 1]$; $\eta_m$ is the distribution index; $X'^{\,l}_{d}$ is the position of the updated leader on the $d$th dimension; and $X^{l}_{d}$ is the position of the current leader on the $d$th dimension.
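As an illustration of how a polynomial operator with distribution index $\eta_m$ turns a uniform random number $u$ into a perturbation, the sketch below uses the standard polynomial mutation factor. How the paper combines this factor with the food position $F_d$ is not recoverable from the text, so the update `food + delta * (upper - lower)`, the default `eta_m = 20` and the function names are assumptions made only for this example.

```python
import numpy as np

def polynomial_delta(u, eta_m):
    """Standard polynomial-mutation factor in [-1, 1] for u in [0, 1]."""
    u = np.asarray(u, dtype=float)
    exponent = 1.0 / (eta_m + 1.0)
    low = (2.0 * u) ** exponent - 1.0            # used when u < 0.5
    high = 1.0 - (2.0 * (1.0 - u)) ** exponent   # used when u >= 0.5
    return np.where(u < 0.5, low, high)

def update_leader(food, lower, upper, eta_m=20.0, rng=None):
    """Hypothetical leader update: perturb the food position polynomially."""
    rng = np.random.default_rng() if rng is None else rng
    food = np.asarray(food, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    u = rng.random(food.shape)
    delta = polynomial_delta(u, eta_m)
    # Assumed combination: step around the food, scaled by the search range.
    new_pos = food + delta * (upper - lower)
    return np.clip(new_pos, lower, upper)
```

A larger `eta_m` concentrates the perturbation near zero, so leaders stay closer to the food position; a smaller value spreads them more widely over the search domain.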

Follower position update
At the later stage of iteration, followers gather near the current food source, which reduces the population diversity and the search ability of the SSA. To avoid premature convergence at the later stage, we introduce a dynamic mutation strategy into the follower position update, which increases the diversity of the salp population at the later stage and improves the convergence accuracy of the SSA. Researchers have proposed a variety of mutation operators, such as Gaussian mutation and Cauchy mutation (Li et al., 2017). Compared with the Gaussian operator, the Cauchy operator has longer wings (heavier tails) and can generate random numbers over a larger range, so that the SSA has a greater chance of escaping a local optimum. In addition, less time is needed to search the nearby area when the peak value is low. Therefore, we introduce Cauchy mutation into the follower position update.
In the formula of the improved follower position update, $X^{m}_{d}$ and $X^{m-1}_{d}$ are the respective positions of the $m$th and $(m-1)$th followers on the $d$th dimension before the update; $X'^{\,m}_{d}$ is the position of the $m$th follower on the $d$th dimension after the update; $\eta$ is the mutation weight, which decreases as the number of iterations increases; $T$ is the maximum number of iterations; $t$ is the current iteration number; $\lambda$ is a constant whose value is 10; and $C(0, 1)$ is a random number generated by the Cauchy operator with scaling parameter 1.
If mutation occurs too frequently during the iteration process, it hinders the convergence of the algorithm. Therefore, a mutation trigger mechanism is introduced: the Cauchy mutation is applied to a follower position only if the mutated position has a better fitness value; otherwise, the unmutated position is kept.
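The follower update with the mutation trigger can be sketched as below. The averaging of a follower with its predecessor is the standard SSA follower rule; everything else is an assumption for illustration, because the formula itself is not reproduced in the text: the weight is taken as `exp(-lam * t / T)`, the mutation is applied as `base + eta * C(0, 1) * base`, a smaller fitness value is treated as better, and the function names are hypothetical.

```python
import numpy as np

def mutation_weight(t, T, lam=10.0):
    """Assumed decreasing mutation weight: exp(-lam * t / T)."""
    return float(np.exp(-lam * t / T))

def update_follower(x_m, x_prev, t, T, fitness, lower, upper,
                    lam=10.0, rng=None):
    """Follower update with a Cauchy mutation that is kept only if it helps."""
    rng = np.random.default_rng() if rng is None else rng
    # Standard SSA follower rule: midpoint of the follower and its predecessor.
    base = 0.5 * (np.asarray(x_m, dtype=float) + np.asarray(x_prev, dtype=float))
    eta = mutation_weight(t, T, lam)
    cauchy = rng.standard_cauchy(base.shape)  # C(0, 1) samples
    mutated = np.clip(base + eta * cauchy * base, lower, upper)
    # Trigger mechanism: accept the mutation only when the fitness improves.
    return mutated if fitness(mutated) < fitness(base) else base
```

Because the mutated position is accepted greedily, the returned position is never worse than the unmutated midpoint under the chosen fitness function.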
The steps of the advanced SSA (ASSA) are shown in Algorithm 1.
Algorithm 1: ASSA.
Input: population size NP, maximum number of iterations T
Output: the optimal position of a salp in the population
Adopt the Tent mapping with reverse learning to initialize the salp population.
for t = 1 to T do
    Calculate the fitness values of the salps in the population.
    Sort the salps in the population according to their fitness values.
    Choose the food: the salp position with the best fitness is regarded as the food position.
    Choose leaders and followers: after the food is selected, NP − 1 salps remain in the population. The better-ranked half of them are regarded as leaders, and the others as followers.
    Update the leader positions according to Eq. (5).
    Update the follower positions according to Eq. (9).
end for

In Algorithm 1, we first employ the Tent mapping with reverse learning to initialize the salp population. Then, we calculate the fitness values of the salps in the population and sort the salps according to their fitness values. The salp position with the best fitness value is regarded as the food position. After the food position is selected, NP − 1 salps are left in the population. The half with the better fitness values are regarded as leaders, and the others are regarded as followers. We update the positions of the leaders and followers according to Eqs. (5) and (9), respectively. These steps are repeated until the stopping condition is met (e.g., the maximum number of iterations is reached). Finally, the salp position with the best fitness value is output.

Performance analysis of ASSA
To verify the performance of the ASSA, two typical benchmark functions, Sphere and Griewank, are selected for function optimization and convergence tests. Sphere is a unimodal function and Griewank is a multimodal function. We compare ASSA with PSO, GA and SSA. The number of iterations is set to 500 for every algorithm. The results of the four algorithms on the two functions are shown in Figs. 2 and 3. It can be seen from Figs. 2 and 3 that ASSA has obvious advantages in convergence speed and accuracy compared with PSO, GA and SSA. Therefore, the performance of the ASSA is significantly improved over these traditional optimization algorithms.
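For readers who want to reproduce such a convergence test, the two benchmark functions in their standard textbook form are given below. The dimensionality and search ranges used in the paper are not stated here, so they are left to the caller.

```python
import numpy as np

def sphere(x):
    """Unimodal Sphere function: sum of squares, minimum 0 at the origin."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))

def griewank(x):
    """Multimodal Griewank function, minimum 0 at the origin."""
    x = np.asarray(x, dtype=float)
    i = np.arange(1, x.size + 1)
    return float(1.0 + np.sum(x ** 2) / 4000.0 - np.prod(np.cos(x / np.sqrt(i))))
```

Both functions have their global minimum of 0 at the origin, so the best fitness value found by an optimizer directly measures its remaining error.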

Scheme design
We use the ASSA-optimized ESN for Internet traffic classification. The basic idea is as follows: the ASSA is utilized to find the salp position with the best fitness value. At the end of the iteration, each dimension of the salp position is assigned to the corresponding hyperparameter of the reservoir of the ESN, which establishes the network traffic classification model (Fig. 4). The steps of ESN-based network traffic classification are as follows:
Step 1: Pre-process the network traffic classification dataset. Two pre-processing operations are applied.
(1) One-hot coding. We implement one-hot coding for discrete features.
(2) Min-max normalization. The large difference between the data of the same attribute affects the training of the network. Therefore, we perform min-max normalization on continuous features.
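The two pre-processing operations can be sketched in NumPy as follows. The function names are hypothetical, and the choice to reuse the training-set minimum and maximum for the testing set is an assumption reflecting common practice, not something stated in the text.

```python
import numpy as np

def min_max_normalize(X, x_min=None, x_max=None):
    """Scale each continuous feature (column) to [0, 1]."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0) if x_min is None else np.asarray(x_min, dtype=float)
    x_max = X.max(axis=0) if x_max is None else np.asarray(x_max, dtype=float)
    # Constant columns would divide by zero; give them a span of 1 instead.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span, x_min, x_max

def one_hot(values, categories=None):
    """Encode a discrete feature as one indicator column per category."""
    values = np.asarray(values)
    categories = np.unique(values) if categories is None else np.asarray(categories)
    encoded = (values[:, None] == categories[None, :]).astype(float)
    return encoded, categories
```

Returning `x_min`, `x_max` and `categories` lets the same transformation be applied later to the testing set.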
Step 2: Divide the dataset into a training set and a testing set.
Step 3: Train the ESN and adjust its hyperparameters.
(1) Initialize the parameters of the ASSA, such as the salp population size and the maximum number of iterations. Use the Tent mapping with reverse learning to initialize the salp population. Each dimension of the individual in the population represents a hyperparameter of the ESN, and different hyperparameters have different ranges. Therefore, each dimension of the individual is constrained.
(2) Assign each dimension of the individual to the corresponding hyperparameter of the ESN: the size of the reservoir N, spectral radius R, sparse degree D and input scale S.
(3) Calculate the fitness values of the salps in the population according to the training samples and the fitness function, and sort the fitness values to find the salp position with the optimal fitness value. If the stopping condition is met, go to Step 4. Otherwise, update the positions of the leaders and followers and return to (2). The fitness function is the overall accuracy of network traffic classification.
Step 4: Input the testing data into the trained ESN model, and then get the classification result of each sample.
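The mapping in Step 3 from a salp position to ESN hyperparameters can be sketched as follows. The function names and the callback `evaluate_accuracy` are hypothetical, and apart from the reservoir size range [30, 300], the bounds in the usage note are placeholders rather than values taken from the paper.

```python
import numpy as np

def decode_position(position, bounds):
    """Clip a 4-dimensional salp position into the hyperparameter ranges."""
    names = ("N", "R", "D", "S")  # reservoir size, spectral radius, sparse degree, input scale
    params = {}
    for name, value in zip(names, position):
        low, high = bounds[name]
        params[name] = float(np.clip(value, low, high))
    params["N"] = int(round(params["N"]))  # the reservoir size is an integer
    return params

def fitness_of_position(position, bounds, evaluate_accuracy):
    """Fitness of a salp: overall accuracy of the ESN built from its position."""
    params = decode_position(position, bounds)
    # evaluate_accuracy is a user-supplied callable (hypothetical) that trains
    # an ESN with these hyperparameters and returns its overall accuracy.
    return evaluate_accuracy(**params)
```

With `bounds = {"N": (30, 300), "R": (0.1, 0.99), "D": (0.01, 0.5), "S": (0.0, 1.0)}`, a position that leaves the search range is pulled back to the nearest valid value before the ESN is built.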

Experimental dataset
To verify the effectiveness of the proposed scheme, we conduct experiments on two public datasets, the Moore dataset (Lopez-Martin et al., 2017) and the NISM dataset (Demertzis & Iliadis, 2016), both of which are derived from raw traffic data. Because they were collected many years apart, they have different data terminals and IP addresses, which enables an effective evaluation of the generality of the proposed scheme. Each dataset is split into a training set and a testing set. The proportion of each category in the training set and the testing set is consistent with that of the original dataset. 100,000 samples are randomly selected as the testing set, and the others form the training set.
(1) Moore dataset. The Moore dataset comes from the traffic flowing through the network outlet of a biological institute from 0 to 24 h on August 20, 2003. 377,526 network samples are obtained from the 24 h of traffic by a sampling algorithm. They are divided into 12 application classes. Each sample contains 249 attributes, the last of which is the category of the sample. Table 2 reports the Moore dataset statistics.
(2) NISM dataset. The NISM dataset comes from the network traffic of the Information Technology Operations Center of the U.S. Military Academy in 2013. The dataset contains 713,851 network traffic samples, which are divided into 11 application classes. The NISM dataset statistics are shown in Table 3.

Evaluation index
We use the following evaluation indexes to evaluate the classification performance. The samples in the training set are divided into $m$ application classes. For class $i$, $TP_i$ is the number of samples of class $i$ that are correctly classified as belonging to class $i$, $FN_i$ is the number of samples of class $i$ that are misjudged as other classes, and $FP_i$ is the number of samples from other application classes that are misjudged as belonging to class $i$. The evaluation indexes are defined as follows.
(1) The accuracy rate of class $i$:
$$\mathrm{Accuracy}_i = \frac{TP_i}{TP_i + FP_i}$$
(2) The recall rate of class $i$:
$$\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$
(3) The F-measure of class $i$:
$$F\text{-measure}_i = \frac{2 \times \mathrm{Accuracy}_i \times \mathrm{Recall}_i}{\mathrm{Accuracy}_i + \mathrm{Recall}_i}$$
(4) The overall accuracy rate:
$$\mathrm{Overall\ accuracy} = \frac{\sum_{i=1}^{m} TP_i}{\sum_{i=1}^{m} (TP_i + FN_i)}$$
Among the above indexes, the per-class accuracy and recall rate reflect the classification performance of the proposed scheme for each class. The F-measure, as the harmonic average of the accuracy and the recall rate, gives a more comprehensive evaluation of the classification ability. In addition, the overall accuracy reflects the proportion of correctly classified samples among all samples.
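These indexes can be computed from a confusion matrix as in the sketch below. The function names are hypothetical, and the sketch assumes the convention that rows are true classes and columns are predicted classes; a class with no predictions or no samples is given a score of 0 rather than raising a division error.

```python
import numpy as np

def _safe_div(a, b):
    """Element-wise a / b, returning 0 where the denominator is 0."""
    return np.divide(a, b, out=np.zeros_like(a), where=b > 0)

def per_class_metrics(conf):
    """conf[i, j] = number of class-i samples predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf).copy()
    fp = conf.sum(axis=0) - tp  # other classes predicted as class i
    fn = conf.sum(axis=1) - tp  # class-i samples predicted as another class
    accuracy = _safe_div(tp, tp + fp)   # per-class accuracy rate
    recall = _safe_div(tp, tp + fn)
    f_measure = _safe_div(2.0 * accuracy * recall, accuracy + recall)
    overall = float(tp.sum() / conf.sum())
    return {"accuracy": accuracy, "recall": recall,
            "f_measure": f_measure, "overall_accuracy": overall}
```

For a two-class confusion matrix `[[8, 2], [1, 9]]`, the per-class accuracy rates are 8/9 and 9/11, the recall rates are 0.8 and 0.9, and the overall accuracy is 17/20.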

(1) Experiments of ESN hyperparameter optimization
The range of ESN hyperparameters is set as follows: the size of the reservoir is in the range [30, 300]. The experimental results show that, compared with the above algorithms, SSA has better optimization performance. Therefore, referring to the literature (Hu, Wang & Tao, 2021), the population size is set to 21 and the maximum number of iterations to 20. The optimal hyperparameter values of the ESN selected by ASSA on the Moore dataset and the NISM dataset are shown in Table 4.
We utilize PSO, FOA, GA, GWO, SSA and ASSA to optimize the hyperparameters of the ESN. The fitness function, which has a direct impact on the optimal solution of the model, is usually defined by the actual problem. For the network traffic classification problem, the overall accuracy is taken as the fitness function. The fitness curves of the six algorithms during the iteration process on the Moore dataset and the NISM dataset are shown in Figs. 5 and 6, respectively.
(2) Comparison of different machine learning algorithms
To verify the effectiveness of the proposed scheme, it is compared with different ML algorithms, namely SVM, SAE, CNN, GRU and Deep Belief Networks (DBN), on the Moore dataset and the NISM dataset, respectively. The parameter values of the comparison algorithms are given in Table 5, and the training and testing times of the methods are reported in Table 12.

DISCUSSION
The behavior of the ASSA in Figs. 5 and 6 can be explained as follows. The ASSA initializes the population using the Tent mapping with reverse learning, which promotes a uniform distribution of the initial population and improves the search efficiency. Then, it introduces a polynomial operator to maximize the diversification of the search domain, which improves the global exploration ability. Finally, the ASSA introduces a dynamic mutation strategy to increase the population diversity at the later stage and avoid local optima. Therefore, the ASSA attains a better fitness value.
It can be seen from Fig. 7 that the proposed scheme has obvious advantages over traditional ML methods on the two datasets in class accuracy, class recall rate, class F-measure and overall accuracy. The reasons for the better classification performance of the proposed scheme are as follows:
(1) The ESN processes information by simulating the thinking mode of the human brain and has characteristics such as self-organization, self-learning and self-adaptation. It adopts a reservoir composed of sparsely connected neurons as the hidden layer to perform a high-dimensional and non-linear representation of the input data. It only trains the weights from the reservoir to the output layer, which simplifies the training process.
(2) The Tent mapping with reverse learning, the polynomial operator and the dynamic mutation strategy are introduced to improve the SSA, which improves the global exploration ability of the algorithm and keeps it from falling into a local optimum. The ASSA is then used to automatically optimize the hyperparameters of the ESN and can accurately find the optimal hyperparameters of the ESN.
It can be seen from Table 12 that, because ASSA is used to optimize the hyperparameters of the ESN, the training time of our method is longer than that of the other approaches. The trained model is then tested on the testing set. The testing time of our method does not differ much from that of the other approaches, and the difference becomes very small when the testing time is averaged over each sample. In addition, our method has obvious advantages over the other approaches in terms of per-class metrics and overall accuracy. Therefore, the gap between the classification time of our method and that of the other methods is acceptable.

CONCLUSION
We propose a classification method of Internet traffic based on ESN. Firstly, Tent mapping with reverse learning, a polynomial operator and a dynamic mutation strategy are introduced to improve the SSA. Then, the advanced SSA is used to optimize the hyperparameters of the ESN, such as the size of the reservoir and the spectral radius. Finally, the optimized ESN is adopted to classify network traffic. We evaluate the performance of the proposed scheme on the Moore and NISM datasets, and compare it with the SVM, SAE, CNN, GRU and DBN algorithms in terms of per-class metrics and overall accuracy. Experimental results show that our method has advantages in multiple evaluation metrics compared with these traditional ML algorithms and effectively improves the accuracy of network traffic classification. Our method needs further experimental verification before it can be deployed in the Internet environment. In the future, how to improve the real-time performance of the network traffic classification method still needs further research.
• Wenwen Sun performed the experiments, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
• Jie Tian analyzed the data, prepared figures and/or tables, and approved the final draft.
• Xiyuan Zheng analyzed the data, prepared figures and/or tables, and approved the final draft.
• Shaopeng Guan conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.