Securing Cloud Computing from Flash Crowd Attack Using Ensemble Intrusion Detection System



Introduction
Cloud Computing (CC) has enhanced computational approaches that rely on virtualization. The literature offers various definitions of CC. Specifically, the National Institute of Standards and Technology (NIST) [1] described CC as "a virtualized pay-as-you-go computing model to assist the prevalent, efficient, and desired network access to a shared pool of customizable computing resources that could be distributed at a high rate with the lowest management action or service provider interaction". Examples of these resources include storage, networks, servers, services, and applications. The same attributes that make CC attractive, such as high demand and flexibility, are targeted by adversaries, and the cloud environment is vulnerable to a diverse range of security threats [2]. Because the cloud offers on-demand access to its services [3], attackers attempt to launch an organized Distributed Denial of Service (DDoS) attack on CC servers that resembles a legitimate flash crowd event, which could affect the availability of cloud services. This resemblance enables the attacker to initiate the attack without being detected. The unavailability of services has a significant impact on the revenue and business of service providers, since customers who are dissatisfied with the Quality of Service (QoS) are likely to move to other providers [4].
A flash crowd refers to a significant number of individuals who gather in one location for the same purpose in a short period. In computer networks, flash crowd denotes the increased website traffic within a relatively brief time. This situation occurs as a result of unique phenomena, which include breaking news and the delivery of a popular product. In some cases, a flash event happens when a well-known site is connected to a smaller site, which leads to a substantial rise in traffic identified as a flash-dot effect [5].
Flash events and flash-dot impact the server operation of websites and the network infrastructure, since congestion at the network layer could prevent several user requests from reaching the server. The requests may arrive at the server after a significant delay due to retransmissions and packet loss. Certain web server configurations are not capable of managing the number of requests generated by a flash event [6]. As a result, users who attempt to access the website during a flash event would be dissatisfied due to long waits or the failure of their requests to reach the site. The severity of the phenomenon increases when an attacker attempts to evade the defense mechanism by imitating the traffic pattern of legitimate users in a flash event [7]. As illustrated in Fig. 1, attacks of this kind that occur during flash events are described as flash crowd attacks. Detecting flash crowd attacks during flash events in CC is a primary challenge for web servers, as they need to differentiate between legitimate user requests for the event and malicious requests. This differentiation is difficult to achieve using existing methods, which can result in delayed responses to legitimate users or even a crash of the entire web server. To overcome this challenge, this article proposes a new approach to detect flash crowd attacks with greater accuracy. The contributions of this approach are:
• Enhancing the current body of literature concerning the detection of flash crowd attacks in CC.
• Proposing an adapted version of the White Shark Optimizer (WSO) for selecting the most significant feature subset.
The remainder of this article is organized as follows. Section 2 analyses the existing related works and highlights the research gaps. Section 3 discusses the approach proposed in this research in detail. Section 4 presents the findings and their discussion. The article ends with a conclusion in Section 5.

Related Works
CC supports auto-scaling, which scales resources on demand in a dynamic manner. However, this attribute causes critical financial losses to customers when attacks take place on their purchased instances. Among the most critical and most frequently used attacks on the cloud are TCP SYN Economic Denial of Sustainability (EDOS) attacks [8]. Notable defensive initiatives against these attacks have been made in the previous decade. Several simulation platforms were suggested for measuring and analyzing the effect of EDOS attacks on the CC environment (CCE). To identify traffic anomalies and prevent emerging categories of DDoS attacks, various techniques are employed, including the artificial intelligence-based approach [9][10][11], statistical anomaly identification [12][13][14], machine learning-based approach [15,16], data mining approach [17][18][19][20][21], classifier-based approach [22,23], hybrid anomaly detection [24,25], and signature-based detection [26,27]. Based on the comparative summary of all the methods presented in Table 1, machine learning anomaly identification is applied in this article.
Anomaly detection using statistical methods is a useful technique for identifying unusual traffic patterns, especially in terms of resource and computation efficiency. This method involves comparing incoming network traffic statistics to normal network traffic patterns to identify any anomalies. Once an anomaly is detected, statistical inference tests are utilized to assess the reliability of the patterns [28]. However, this method is not considered "adaptive," as shown in Table 1. There have been efforts by researchers to develop methods that combine the efficiency of statistical anomaly detection with the adaptive nature of Software-Defined Networking (SDN). In the context of defending against DDoS attacks, a popular approach is the use of TCP SYN cookies [29], which can effectively block TCP SYN flooding attacks on a server. This technique can be implemented on a cloud instance to prevent such attacks while also reducing the financial cost of the instance due to resource usage. When a TCP SYN attack with a payload is used, the inbound traffic accepted by the instance may require a large amount of bandwidth. After accepting a large number of TCP SYN requests, the instance processes these packets to identify the attack, which may also consume instance resources and result in charges for resource usage for the client.
Gaurav et al. [30] proposed a method called EDOS-Shield to mitigate EDOS attacks, which uses a virtual firewall that maintains lists of client IP addresses as either "whitelisted" or "blacklisted" based on their classification using a Graphic Turing Test (GTT). Clients who pass the GTT are included in the whitelist, while those who fail are added to the blacklist. However, this approach has the disadvantage of creating overhead, causing delays for legitimate users trying to access the CC. Shawahna et al. [31] proposed a reactive method called EDOS-Attack Defense Shell (ADS), which is designed to block NAT-based attackers by using the port number and IP address of the attacker device and blocking requests from that port number. The authors used a trust factor calculation based on the GTT to determine whether a request was an attack or not. EDOS-ADS can identify clients using their port number and IP address, and it also effectively handles IP spoofing to allow legitimate users to access services. However, there are several issues with this approach. One issue is that when the attacker starts a new request, the NAT router assigns a different port number to the attacker, allowing them to continue the attack from a different port. Another issue is that the GTT involves a different channel assignment for every request, causing the server to generate numerous puzzles for a high number of requests, which could amplify the attack if the puzzles are not solved in time. The GTT response time is 13.06 s, which allows illegitimate users to generate a massive volume of requests from one source and occupy a massive number of channels for the GTT. Additionally, URL redirection adds an overhead of 0.63 s.
Bawa et al. [32] introduced an IDS, called EDOSEMM, to detect EDOS attacks in a CC. The model consists of three main modules, one of which is the data preparation module, which processes and organizes the flows of incoming packets, which can create overhead. This module deals with both UDP and HTTP attack traffic as well as legitimate traffic. The model uses Hellinger distance and entropy approaches to accurately detect anomalies. A mitigation approach for SYN flooding was suggested by Mendonça et al. [33], which is based on SDN and is executed on a controller. This approach involves utilizing a threshold value to identify and prevent TCP SYN attackers. Once the controller detects that the number of SYN requests from a specific host has exceeded the threshold value, it automatically blacklists and blocks the host. However, this method is solely dependent on the threshold value, which may lead to the blocking of legitimate users due to network disturbances or other related factors.
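The threshold rule described above can be illustrated with a minimal sketch. It is not the implementation from [33]: the function name, the fixed observation window, and the addresses are hypothetical, and the sketch only counts SYN requests per source and blacklists any source that exceeds the threshold.

```python
from collections import Counter

def blacklist_by_syn_threshold(syn_sources, threshold):
    """Return the set of source hosts whose SYN count exceeds `threshold`."""
    counts = Counter(syn_sources)
    return {host for host, n in counts.items() if n > threshold}

# Hypothetical observation window: source address of each incoming SYN packet.
window = ["10.0.0.5"] * 120 + ["10.0.0.7"] * 8 + ["10.0.0.9"] * 101
blocked = blacklist_by_syn_threshold(window, threshold=100)
print(sorted(blocked))
```

In this toy window the host with 101 requests is blocked alongside the host with 120, which shows how a legitimate client that slightly exceeds the threshold (for example, because of retransmissions) would be treated as an attacker.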
In [34], a proposal was made to enhance security measures for the Industrial Internet of Things (IIoT) because of its decentralized architecture. The authors suggested a prediction model that utilizes Deep Learning (DL) and is based on sparse evolutionary training (SET) to forecast various types of cybersecurity attacks, including intrusion detection, data type probing, and DoS. The proposed SET-based model achieved high performance within a short timeframe (i.e., 2.29 ms). Furthermore, in a real scenario of IIoT security, the detection rate was enhanced by an average of 6.25% in comparison with state-of-the-art models. In addition, [35] explores the application of SDN in enhancing intelligent machine learning methodologies for IDS. The authors propose a new IDS for SDN, called HFS-LGBM IDS, that utilizes a hybrid Feature Selection (FS) algorithm to obtain the optimal subset of network traffic features and a LightGBM algorithm to detect attacks, aiming to address security concerns associated with SDN. Based on the experimental outcomes on the NSL-KDD benchmark dataset, the proposed system surpasses current methods in terms of accuracy, precision, recall, and F-measure. The authors emphasize the importance of having accurate, high-performing, and real-time systems to tackle the risks linked to SDN.
The paper [36] puts forward a DL-based IDS that can detect diverse kinds of attacks on IoT devices. The proposed IDS demonstrated excellent effectiveness in identifying various types of attacks, with a detection accuracy of 93.74% for both simulated and real intrusions. The overall detection rate of this IDS is 93.21%, which is deemed satisfactory in terms of enhancing the security of IoT networks. On the other hand, [37] introduces a new FS technique that improves the performance of Deep Neural Network-based IDS. This approach prunes features based on their importance, which is derived from a fusion of statistical importance measures. The performance of this approach has been evaluated on various datasets and through statistical tests, providing evidence of its effectiveness. The approach contributes to the field of securing IoT and offers a novel technique to enhance performance and improve security against vulnerabilities and threats.

Preprocessing
To develop a precise IDS model, several actions should be conducted before the data is included to train the model. Notably, pre-processing is crucial for developing an effective IDS method and reducing the computationally intensive processes. In this study, the following actions were conducted for data preparation:

Data Normalization
To compare the attributes, which had different ranges, the data was standardized using Z-score normalization, as shown in Eq. (1), so that every standardized attribute has a mean value of 0 and a standard deviation of 1 [38]:
$$Z = \frac{x - M}{\sigma} \quad (1)$$
where x is the original attribute value, M denotes the mean, and σ is the standard deviation of the given values.
Normalization is a process applied to dataset samples in IDS to standardize them, making them more consistent and easier to analyze. This includes techniques such as scaling, centering, transforming, and removing outliers and missing data. This article presents both binary and multiclass categorizations. In the binary experiment, normal traffic was given a binary value of 0, while every malicious packet was given a value of 1. Each attack in the multiclass categorization was assigned a distinct integer value.
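The standardization and binary labelling described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names, the toy values, and the use of "BENIGN" as the name of the normal class are assumptions.

```python
import numpy as np

def zscore(X):
    """Standardize each column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    return (X - mean) / std

def encode_binary(labels, normal_label="BENIGN"):
    """Map the normal class to 0 and every attack class to 1."""
    return np.array([0 if lab == normal_label else 1 for lab in labels])

# Hypothetical toy flows: two attributes with very different ranges.
X = [[10.0, 2000.0], [20.0, 4000.0], [30.0, 6000.0]]
y = ["BENIGN", "DDoS", "BENIGN"]
Z = zscore(X)
labels = encode_binary(y)
```

After the transformation both columns of `Z` are on the same scale, so neither attribute dominates purely because of its original range.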

Data Reshaping
A CNN requires input in the form of a 3-D image with width, height, and channel dimensions. However, network traffic is a 1-D vector, which is not compatible with the architecture of a CNN. Therefore, a transformation is necessary to convert the shape of the input packet to the dimensions required by a CNN. For the subset of 48 attributes, the 48-D vector was converted into 8 × 6 images, while the 9-D vector input was converted into 3 × 3 images. Since this article only uses grayscale images with a single channel, the channel number was set to 1.
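The reshaping step can be sketched with NumPy as follows. The helper name is hypothetical, and two details are assumptions rather than statements from this article: the first image dimension (8 or 3) is treated as the height, and the channel axis is placed last, as in the default Keras image layout.

```python
import numpy as np

def to_image(vectors, height, width):
    """Reshape (n_samples, height*width) vectors to (n_samples, height, width, 1)."""
    vectors = np.asarray(vectors, dtype=float)
    if vectors.shape[1] != height * width:
        raise ValueError("feature count must equal height * width")
    return vectors.reshape(-1, height, width, 1)

# Hypothetical batches of two samples each.
batch48 = np.arange(2 * 48).reshape(2, 48)
batch9 = np.arange(2 * 9).reshape(2, 9)
img48 = to_image(batch48, 8, 6)
img9 = to_image(batch9, 3, 3)
print(img48.shape, img9.shape)  # (2, 8, 6, 1) (2, 3, 3, 1)
```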

White Shark Optimizer
The WSO is an algorithm that uses mathematical models based on the characteristics of great white sharks to solve optimization problems within a fixed search space [39]. It is a meta-heuristic algorithm that aims to balance the exploration process and exploitation process of the search space, using the search agents to find the best results. The process and pseudocode of WSO are illustrated in Figs. 3 and 4 respectively. The key concepts and foundations of WSO are inspired by the hunting behaviors of great white sharks, such as their highly developed senses of smell and hearing, which they use to locate and pursue their prey. Three characteristics of white sharks were adapted to locate their prey (e.g., the optimal food source). These characteristics are: (1) the movement towards prey based on the wave hesitation that occurs after the prey moves, which involves the white shark using its senses of smell and hearing to make an undulating movement towards the prey; (2) scavenging for prey in deep ocean areas, where the white shark navigates to the prey's location and gets close to the optimal prey; and (3) detecting the prey once it is within close-proximity, using fish school behavior to move towards the best white shark in the vicinity of the optimal prey. If the prey is not found, the location of each white shark would determine the optimal solution.

Principal Component Analysis
PCA is a technique used to decrease the number of dimensions in a dataset by identifying Principal Components (PCs), which are the directions that explain the highest variance in the data. It is a linear, unsupervised transformation technique that creates new features in a new subspace using orthogonal axes [40]. Fig. 5 demonstrates that the 1st PC has the largest variance, followed by lower variances for the subsequent PCs. The purpose of PCA is to retain as much information from the initial data as feasible while reducing its dimensionality. PCA transforms the initial d-dimensional dataset X into a new k-dimensional space Y (where k ≤ d) using a transformation matrix W [41]. This transformation matrix is obtained through the linear Eigen-decomposition technique, which involves calculating the Eigenvalues and Eigenvectors (PCs) of the covariance matrix ($XX^{T}$). The Eigenvectors represent the directions of the data, and the Eigenvalues represent the magnitude of the variance along those directions. To obtain the columns of the matrix W, each Eigenvector is assigned to a column, with the Eigenvalues being used to determine their order [42]. The Eigen-decomposition method breaks the covariance matrix down into three other matrices:
$$XX^{T} = B D B^{T} \quad (2)$$
where B is a square matrix (d × d) consisting of the Eigenvectors, D is a diagonal matrix (d × d) whose main-diagonal elements are the Eigenvalues and whose remaining elements are zero, and $B^{T}$ is the transpose of matrix B.
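The Eigen-decomposition route to PCA described above can be sketched with NumPy. This is a minimal illustration under stated assumptions: samples are stored as rows (so the covariance is computed over columns), the function name and the synthetic data are hypothetical, and `numpy.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_eig(X, k):
    """Project X (n_samples x d) onto its k leading principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenpairs, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]       # reorder by decreasing eigenvalue
    W = eigvecs[:, order[:k]]               # d x k transformation matrix
    return Xc @ W, eigvals[order]

# Hypothetical data: five features with decreasing spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
Y, eigvals = pca_eig(X, k=2)
```

The variance of each projected column of `Y` equals the corresponding Eigenvalue, which is why ordering the Eigenvectors by Eigenvalue keeps the most informative directions first.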

Ensemble Classifier

CNN
CNN is a neural network architecture used for computer vision tasks, which utilizes a technique called convolution to efficiently process visual data. As shown in Fig. 6, the CNN architecture has three core layers: the first is the convolution layer, the second is the pooling layer, and the last is the fully connected layer [43]. In the convolution layer, a filter is applied to the input data by multiplying it with a set of weights, creating a new two-dimensional array called a feature map. The filter is moved over the input using a step size called the "stride", which determines the size of the output feature map. This process is repeated, creating multiple feature maps, which are then processed by the next layers of the CNN, as illustrated by Eq. (3):
$$s = \theta(w * x + b) \quad (3)$$
where θ represents the non-linear activation function, x the input data, b the bias term, s the feature map, and w the weight of the kernel function. In CNN, the ReLU function is commonly used to set all negative values in the feature map to zero, thereby introducing non-linearity into the convolutional layers. A CNN typically includes multiple convolutional layers, with the initial layer designed to detect basic features like edges or corners, while the later layers capture more advanced features. However, multiple convolutional layers may cause the output dimension to become smaller than the input, resulting in loss of information after a certain number of layers. To address this issue, the padding technique can be employed by adding a border around the image. Two types of padding exist: "same" and "valid". The "same" padding method involves adding a border around the image to ensure that the input and output images are the same size. For this to hold with a unit stride, the padding size should satisfy the following equation:
$$p = \frac{f - 1}{2}$$
where f represents the filter size while p denotes the padding size.
The valid convolution technique involves the utilization of the original image without incorporating any zero-pixel padding surrounding the input matrix. In CNN, the pooling layer is responsible for reducing the spatial size of convolved features. This can be achieved through two methods: max pooling or average pooling. Max pooling involves selecting the maximum value from a portion of the input image that corresponds to the kernel filter, whereas average pooling takes the average value instead. Max pooling is typically preferred because it can effectively reduce dimensionality and remove noise from the image. Following this step, the output is subjected to a fully connected layer for classification. Several architectures have been developed to enhance the performance of CNN, such as AlexNet [44], LeNet [45], GoogLeNet [46], VGGNet [47], ZFNet [48], and ResNet [49].
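The effect of "same" and "valid" padding and of max pooling on the feature-map size can be checked with a small NumPy sketch. It is illustrative only and is not the architecture used in this article: it assumes a single channel, a square kernel of odd size f, a padding of p = (f - 1)/2 for "same" convolution with unit stride, and a hypothetical averaging kernel.

```python
import numpy as np

def conv2d(x, w, b=0.0, stride=1, padding="valid"):
    """Single-channel 2-D convolution (cross-correlation) followed by ReLU."""
    f = w.shape[0]                      # square f x f kernel assumed
    if padding == "same":
        p = (f - 1) // 2                # keeps the size for odd f, stride 1
        x = np.pad(x, p)                # zero border of width p
    out_h = (x.shape[0] - f) // stride + 1
    out_w = (x.shape[1] - f) // stride + 1
    s = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + f, j * stride:j * stride + f]
            s[i, j] = np.sum(patch * w) + b
    return np.maximum(s, 0.0)           # ReLU sets negative responses to zero

def max_pool(s, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h, w = s.shape[0] // size, s.shape[1] // size
    return s[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

x = np.arange(48, dtype=float).reshape(8, 6)  # an 8 x 6 single-channel input
w = np.ones((3, 3)) / 9.0                     # hypothetical averaging kernel
same = conv2d(x, w, padding="same")
valid = conv2d(x, w, padding="valid")
pooled = max_pool(same)
print(same.shape, valid.shape, pooled.shape)  # (8, 6) (6, 4) (4, 3)
```

With a 3 × 3 kernel, "valid" convolution shrinks the 8 × 6 input to 6 × 4, while "same" padding preserves the size and leaves the reduction to the pooling layer.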

LightGBM
LightGBM is a machine learning model developed by Microsoft in 2017, based on Gradient Boosting Decision Trees (GBDT). GBDT combines weak learners into a strong learner and uses only regression trees as its Decision Trees (DT). Each DT is fitted to the residuals left by all previous trees. The training process for LightGBM is depicted in Fig. 7, where the residuals of the target value become the target for the next round of learning, so that each tree is trained to predict the residuals. The final predicted output is a combination of the outputs of multiple DTs. Although GBDT has shown success in many machine learning tasks, it can experience decreased precision and efficiency with increasing data volume. To tackle this problem, Microsoft introduced the LightGBM algorithm, which maintains prediction precision, significantly improves prediction speed, and reduces memory usage [50].
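The residual-fitting idea behind GBDT can be shown with a deliberately small sketch. It is not LightGBM itself: it assumes a squared-error loss, a single input feature, one-split regression trees (stumps), and a hypothetical step-shaped target, and all function names are illustrative.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression tree to residuals r on a single feature x."""
    best = None
    for t in np.unique(x)[:-1]:                 # candidate thresholds
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]                             # (threshold, left value, right value)

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def boost(x, y, n_trees=50, lr=0.1):
    """Each stump is fit to the residuals left by the trees before it."""
    base = y.mean()
    pred = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_trees):
        residual = y - pred                     # target for the next tree
        stump = fit_stump(x, residual)
        pred += lr * predict_stump(stump, x)    # add the scaled correction
        stumps.append(stump)
    return base, stumps, pred

x = np.linspace(0.0, 1.0, 40)
y = np.where(x > 0.5, 3.0, 1.0)                 # hypothetical step-shaped target
base, stumps, pred = boost(x, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - pred) ** 2)
```

The final prediction is the initial mean plus the sum of the scaled stump outputs, so the training error shrinks as more trees are added.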

Figure 7: Generation strategy of LightGBM
Traditional GBDT algorithms can be slow and require a lot of memory, since they sort the feature values and enumerate all possible split points to find the optimal one. The LightGBM algorithm solves this problem by using a histogram algorithm. This algorithm divides continuous feature values into k intervals and chooses the split points among these k values. As a result, the LightGBM algorithm trains faster and is more space efficient than traditional GBDT. Furthermore, the decision trees generated by the histogram algorithm have a regularization effect, which can help prevent overfitting.
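A minimal sketch of histogram-based split finding is given below. It is a simplification and not LightGBM's actual routine: it assumes equal-width bins, a squared-error style gain computed from per-bin gradient sums and counts, and hypothetical data and names.

```python
import numpy as np

def histogram_split(x, grad, k=8):
    """Choose the best split among k - 1 bin boundaries, not every distinct value."""
    edges = np.linspace(x.min(), x.max(), k + 1)     # k equal-width bins (assumption)
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, k - 1)
    grad_sum = np.bincount(bins, weights=grad, minlength=k)
    count = np.bincount(bins, minlength=k)
    total_g, total_n = grad_sum.sum(), count.sum()
    best_gain, best_edge = -np.inf, None
    left_g = left_n = 0.0
    for b in range(k - 1):                           # boundary after bin b
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n
        if gain > best_gain:
            best_gain, best_edge = gain, edges[b + 1]
    return best_edge

x = np.linspace(0.0, 1.0, 100)
grad = np.where(x > 0.5, 1.0, -1.0)                  # hypothetical gradients
split = histogram_split(x, grad, k=8)
print(split)
```

Only k - 1 candidate boundaries are evaluated here instead of 99 distinct feature values, which is the source of the speed and memory savings, and the coarser candidate set is also what gives the mild regularization effect.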

Ensembling using CNN and LightGBM
A model that combines the LightGBM algorithm and CNN is proposed. The ensemble process is shown in Fig. 3. The features from the dataset are extracted and refined by passing through the convolutional layer of CNN. Then, the output of the flattening layer is fed to the LightGBM model for classification and additional analysis. By combining these two methods, the proposed model achieves better prediction performance.
The ensemble classifier follows this procedure for categorization:
• The dataset resulting from preprocessing, feature selection, and dimensionality reduction is split into training and testing sets.
• The training data is fed into the developed CNN model for pre-training to obtain the parameters of the convolutional and fully connected layers.
• The parameters of the CNN convolutional layers are then frozen, and the output of the flattening layer is used as the input to LightGBM for further training.
• The test dataset is then classified, a confusion matrix is generated, and performance metrics are computed.

Dataset
CICIDS 2017 [51] is utilized to determine the performance of the proposed IDS. It comprises benign traffic and up-to-date common attacks that resemble true real-world data (PCAPs). It also includes the outcomes of the network traffic analysis performed with CICFlowMeter, with flows labeled according to the timestamp, source and destination IPs, source and destination ports, protocols, and attack.

Experimental Environment
The experiment was carried out using Python and the Keras library with TensorFlow. The experimental setup used to assess the model parameters is shown in Table 2.

Evaluation Metrics
As depicted in Table 3, the confusion matrix is used to calculate the evaluation metrics for the proposed model to ensure its effectiveness. Five common evaluation metrics, including accuracy (AC), false-positive rate (FPR), and false-negative rate (FNR), are utilized to determine the effectiveness of the model. These performance measurements are calculated using the confusion matrix of a 2-class classifier. The equations below define the named metrics for the introduced IDS:
$$AC = \frac{TP + TN}{TP + TN + FP + FN}$$
$$FPR = \frac{FP}{FP + TN}$$
$$FNR = \frac{FN}{FN + TP}$$
where the number of true positives is represented as TP, false negatives as FN, true negatives as TN, and false positives as FP [38,52,53].
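The metrics can be computed directly from the four confusion-matrix counts. The sketch below assumes the standard definitions; precision, recall, and F1 are included because the comparison with other IDSs reports them, and the function name and counts are hypothetical.

```python
def binary_metrics(tp, tn, fp, fn):
    """Standard rates derived from a 2-class confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),   # benign flows wrongly flagged as attacks
        "fnr": fn / (fn + tp),   # attacks that were missed
    }

# Hypothetical counts, not results from this article.
m = binary_metrics(tp=90, tn=85, fp=15, fn=10)
```

Recall and the false-negative rate always sum to 1, so a low false-negative rate is equivalent to a high detection rate for the attack class.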

Analysis and Findings
In this section, a thorough analysis of the findings acquired through the models suggested in this study is presented. The effectiveness of the suggested models was evaluated on the CICIDS 2017 benchmark dataset through a sequence of experiments. This section illustrates the ability of the suggested models to identify Flash Crowd attacks. Furthermore, several experiments were performed in which the suggested learning models were tested with a new unlabeled category of attack, while the remaining attacks were employed during training. In every experiment, the effectiveness of the proposed model is compared against state-of-the-art models. The IDSs used in this evaluation comprised: DT-IDS [6], bGWbPS-IDS [38], RF-IDS [54], RT-AMD IDS [55], and LSTM-IDS [56]. The default settings from the scikit-learn and TensorFlow libraries were employed to implement these IDSs. The performance comparison was also conducted between the proposed IDS and these state-of-the-art models in terms of accuracy, precision, recall, and F1-measure using the CICIDS 2017 benchmark. The results of this comparison are presented in Table 4. According to these results, the performance of the proposed IDS was superior to that of the majority of state-of-the-art IDSs, and the hybrid models demonstrated even better performance. The use of a reduced set of features resulted in improved detection model performance.
To ensure fairness, the comparison was conducted on both the training and testing datasets, using the CICIDS 2017 benchmark dataset. Furthermore, the proposed IDS had lower time consumption than other IDSs. Time consumption depends on factors such as the number of features used for training and testing, the number of hidden layers, and the number of neurons per layer, and using a CNN reduces it by addressing the issue of parameter explosion through shared parameters. While training a DL model can be challenging and costly, using a Graphics Processing Unit (GPU) accelerator can significantly improve computational speed, which has increased by more than 10 times in recent years and is expected to continue improving with advancements in GPU architectures and specialized training chips. The convolutional layer of the CNN is also used for feature extraction and filtering, and the LightGBM model is applied to the flattening layer output for classification and additional analysis. These steps improve the model's prediction accuracy.
In comparison to other state-of-the-art IDSs, the CNN-LightGBM required shorter training and testing durations. Even where training is long, the training duration does not have a significant effect on the operation of the deployed model. Determining the ideal set of hyper-parameters may be among the most time-consuming steps in developing a machine-learning model, and the same holds for DL. Once values are found that work well on the training data and generalize well to the test data, the hyper-parameters do not need to be re-tuned manually during real-time identification. Meanwhile, the short detection period denotes the strength of hybrid approaches for anomaly identification, particularly in virtual environments such as CC. These environments are vulnerable to disruptions because of their specific structure. Therefore, rapid and efficient identification approaches should be applied to handle attacks quickly, before the core of the environment is targeted and critical consequences take place across the entire network.

Discussion and Limitations
The purpose of this article is to propose an IDS that uses the WSO and ensemble classifier to address security issues in a CC environment, particularly Flash crowd events. One of the challenges faced by deep and machine learning models during the training phase is overfitting, which occurs when the model learns from noisy and biased samples that do not accurately represent the patterns of interest. Regularization techniques can help to mitigate this issue and improve prediction, but they do so by focusing on individual weight values rather than the relationships between matrix entries, which can make small changes in an attribute more significant in the forecast. To overcome these limitations, the authors used feature engineering which involves selecting and extracting relevant attributes to improve DL performance for attack identification. The experimental results showed that the introduced IDS outperformed previous approaches and achieved superior precision for both binary and multiclass detection. The IDS consists of four phases: preprocessing, feature engineering, a hybrid ML and DL classifier, and a detection stage.
CNN was applied to reduce the number of training parameters, which led to the development of a new model that can identify network intrusions without high computational cost. Moreover, CNN is capable of reducing the dimension of the input attributes with the use of the pooling layer. Several DL and ML algorithms were employed for the categorization tasks to obtain accurate outcomes. The suggested IDS was developed to perform binary and multiclass categorizations to differentiate between the types of attacks. However, the following limitations were present in the suggested model:
• Despite the importance of precision and the other assessment metrics for the evaluation of model performance, these metrics are not adequate without the actual application of the created model in the CC environment.
• Assessing the network performance in terms of resource use, throughput, and time delay is highly crucial for an intensive test of the capability of the suggested intrusion identification.
• In future work, the suggested IDS would be applied in a real CC environment, followed by a test of its potential to manage the identified attacks in an actual event.
In sum, using WSO in combination with an ensemble classifier in attack detection offers several advantages, including:
• Optimal Model Selection: WSO can be utilized to identify the most effective individual classifiers for inclusion in the ensemble, based on their performance on the training data, resulting in a stronger overall ensemble.
• Rapid Convergence: WSO can facilitate a quicker convergence of the ensemble classifier to the optimal solution compared to other optimization algorithms, thereby reducing computational time for attack detection.
• Improved Robustness: The combination of WSO and an ensemble classifier provides enhanced robustness to inconsistent or noisy data, as the ensemble can counterbalance the impact of outliers while WSO can help the model rapidly converge to the optimal solution.
• Better Handling of Imbalanced Data: In situations where one class is underrepresented, WSO can assist in selecting the best individual classifiers to handle such imbalanced data, further boosting the performance of the ensemble classifier in attack detection.
• Elevated Model Performance: By optimizing the parameters of the individual classifiers and ensemble using WSO, the overall performance of the model can be enhanced, resulting in more accurate and dependable attack detection.

Conclusion and Future Work
The volume of services and information available on the Internet is extensive, which contributes to high exchange traffic. Such excessive traffic can disrupt the network. Flash Crowd attacks, specifically the DDoS-based attacks that use seemingly legitimate HTTP requests to overwhelm the victim's resources, are regarded as among the most disruptive attacks on the providers and users of online services. The victim is flooded with service requests created by the attacker through DDoS tools. Furthermore, the challenge of detecting Flash Crowd attacks increases in the presence of a Flash Event. Owing to the similarity of the two anomalies (the legitimate Flash Crowd and the Flash Crowd attack), the attack can evade the detection system. However, the suggested detection method has proven effective in identifying Flash Crowd attacks in CC and has shown higher performance than other state-of-the-art methods. This method has also created opportunities for future studies in the field of application-layer attack detection. Future directions of this work include the hybridization of LightGBM with a deep learning model, the use of ensemble feature selection, and the utilization of automatic data augmentation and transfer learning to enhance detection performance.