A new multi-label dataset for Web attacks CAPEC classification using machine learning techniques

Context: Many datasets exist for training and evaluating models that detect web attacks, labeling each request as normal or attack. However, web attack protection tools must also provide additional information on the type of attack detected, in a clear and simple way. Objectives: This paper presents a new multi-label dataset for classifying web attacks based on the CAPEC classification, a new way of extracting features based on ASCII values, and the evaluation of several combinations of models and algorithms. Methods: Using a new feature-extraction method that computes the average of the sum of the ASCII values of the characters in each field of a web request, several combinations of algorithms (LightGBM and CatBoost) and multi-label classification models are evaluated to provide a complete CAPEC classification of the web attacks that a system is suffering. The training and test data used to train and evaluate the models come from the new SR-BH 2020 multi-label dataset. Results: Calculating the average of the sum of the ASCII values of the characters that make up a web request proves useful for numeric encoding and feature extraction. The new SR-BH 2020 multi-label dataset allows the training and evaluation of multi-label classification models, as well as the CAPEC classification of the various attacks that a web system is undergoing. The combination of the two-phase model with the MultiOutputClassifier module of the scikit-learn library, together with the CatBoost algorithm, shows its superiority in classifying attacks in the different criticality scenarios. Conclusion: Experimental results indicate that combining machine learning algorithms with multi-phase models leads to improved prediction of web attacks. Also, the use of a multi-label dataset is suitable for training learning models that provide information about the type of attack.


Introduction & motivation
Every year there are significant increases in the number of attacks against web servers and applications; e-commerce platforms, financial and government institutions, large corporations, etc. are targeted by web attacks for economic or ideological reasons. According to Cisco (Cisco, 2018), 14.5 million DDoS attacks are expected in 2022. Also, SQL Injection (SQLi) and Cross-Site Scripting (XSS) attacks are easy and powerful methods for attacking a web site (Johari and Sharma, 2012). The impact of cyber-attacks suffered by companies threatened their viability in 17% of cases, reported specialist insurer Hiscox (Hiscox, 2021), with their website being the first point of entry in 29% of cases.
Several technologies and systems exist to prevent and detect attacks on web servers and applications: misuse detection systems, with large rule and vulnerability signature databases that must be continuously updated, and anomaly detection systems, which interpret deviations from expected patterns of user and application behavior as evidence of malicious activity. Recently, there has been a significant increase in scientific interest in anomaly detection techniques applied to web intruder detection (Sureda Riera et al., 2020). To successfully train and evaluate models based on anomaly detection techniques, specific web traffic datasets are needed; a major drawback in the study of web attack prevention and detection is the lack of public datasets to audit and validate the studies performed in this field (Sureda Riera et al., 2020); the DARPA dataset and those belonging to the KDD family have been widely criticized in different studies (Brugger, 2007; Mahoney and Chan, 2003; McHugh, 2000; Tavallaee et al., 2009). The CSIC-2010 dataset has become one of the most popular in recent years for testing protection systems against web attacks. This dataset is generated from synthetic traffic and contains 36,000 requests labeled as normal and more than 25,000 labeled as anomalous (Torrano-Gimenez et al., 2009).
Most datasets are composed of artificially generated traffic and, to the best of our knowledge, all datasets available for training and/or evaluating machine learning models label each request only in terms of normality or attack, without specifying in any case which type(s) of attack is/are being suffered. The authors strongly believe in the need for datasets that collect this type of additional information and that allow the training of machine learning models that classify attacks based on internationally accepted classification criteria, such as CAPEC. In this way, by providing the incident response team with the CAPEC classification of the attack, response times and effectiveness will be improved, as the specific attack pattern and possible mitigations can be queried.
For this reason, one of the main achievements of this work is the generation of a new dataset (the SR-BH 2020 dataset) that collects different types of attacks, built from real traffic data (collected by a honeypot exposed to the Internet for 12 days), with multi-labels that report either the normality of the request or the CAPEC classification of the type or types of attack that the web request represents. This dataset, to our knowledge, is the first one that allows the training and evaluation of multi-label machine learning models and algorithms that can provide the CAPEC classification of the attack(s) that a web application is suffering.
One of the fundamental stages in any data science project is the preprocessing of the data that will be used to train the machine learning models; this stage includes the selection of the relevant characteristics of the dataset that allow a high level of model efficiency to be achieved when making the prediction. In the case of a dataset with web traffic data, it is necessary to numerically encode the various fields of each web request, so that different statistical techniques can be applied to the resulting numerical values. In this study, we present a new form of numerical encoding consisting of calculating the average of the sum of the ASCII values of each of the characters that make up each field of a web request.
Several supervised machine learning algorithms and techniques applied to intrusion detection have been studied over the years, most notably algorithms using ensembles of decision trees in combination with gradient boosting (Tama et al., 2020; Vu et al., 2019); two popular algorithms that provide a gradient boosting framework are LightGBM (Ke et al., 2017) and CatBoost (Dorogush et al., 2018). In this paper, we evaluate (using different metrics, depending on the criticality levels of several scenarios) the performance of these algorithms in predicting web attacks, such that they report the normality of the request or the CAPEC classification(s) of the attack. As this is a multi-label classification task, since a single request may contain more than one type of attack, the SR-BH 2020 dataset will be used and different multi-label classification models will be combined with the LightGBM and CatBoost algorithms.
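As a rough illustration of how a multi-label model can be assembled from binary learners, the sketch below implements the binary-relevance strategy that scikit-learn's MultiOutputClassifier follows: one independent classifier is trained per label, so a request can receive several CAPEC labels at once. The MajorityStub class is a hypothetical stand-in for LightGBM or CatBoost, used only to keep the example self-contained.

```python
class MajorityStub:
    """Trivial classifier: always predicts the most frequent class seen in fit()."""
    def fit(self, X, y):
        self.majority_ = max(set(y), key=y.count)
        return self
    def predict(self, X):
        return [self.majority_ for _ in X]

class BinaryRelevance:
    """Train one binary classifier per label column (the MultiOutputClassifier idea)."""
    def __init__(self, base_cls):
        self.base_cls = base_cls
    def fit(self, X, Y):
        n_labels = len(Y[0])
        self.estimators_ = []
        for j in range(n_labels):
            column = [row[j] for row in Y]          # j-th label of every sample
            self.estimators_.append(self.base_cls().fit(X, column))
        return self
    def predict(self, X):
        per_label = [est.predict(X) for est in self.estimators_]
        return [list(labels) for labels in zip(*per_label)]

X = [[1.0], [2.0], [3.0]]
Y = [[1, 0], [1, 0], [1, 1]]                        # two labels per request
model = BinaryRelevance(MajorityStub).fit(X, Y)
print(model.predict([[1.5]]))                       # → [[1, 0]]
```

Replacing MajorityStub with a real gradient boosting classifier gives the one-phase setup evaluated in the paper; the wrapper itself stays unchanged.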
This paper makes the following contributions:

• Construction of the SR-BH 2020 dataset, a new multi-label dataset for Web attack detection and prediction based on CAPEC attack patterns, suitable for training multi-label classification models.
• A new way of encoding web request data using the mean of the sum of the ASCII values of the characters.
• Design of one-phase, two-phase, and customized multi-label classification machine learning models.
• Study of the behavior of novel algorithms applying the designed multi-label classification models.
• A ranking of the different combinations of algorithms/models to be applied in the protection of different scenarios, according to their criticality levels, using different evaluation metrics.
The rest of this work is structured as follows: Section 2 describes background and related work. Section 3 gives an overview of the process followed in this work. Section 4 presents the materials and methods used, including the dataset used for the experiments and method evaluation, the system followed for feature extraction from the dataset, and a description of the different models applied to the algorithms. Section 5 shows the model results and considerations. The conclusions of the study are provided in Section 6.

Background
This section provides an overview of the WAF and RASP web protection tools and the similarities and differences between them, a presentation of the characteristics of the algorithms analyzed in this study, namely LightGBM and CatBoost, and a description of the metrics and criticality scenarios used to evaluate the performance of the various models and algorithms.

Web application protection
One of the most widely used methods for Web application protection is the implementation of Web Application Firewall (WAF) tools. A WAF is deployed between the application and the requesting user, inspecting the incoming traffic at the application layer of the OSI model, looking for attack patterns, and eventually blocking incoming malicious traffic. WAF devices work by checking incoming traffic against a database of signatures or rules, so keeping that database up to date is critical. They are independent of the programming language of the web application, as they act before the malicious traffic gets to execute the code. Different ways of circumventing the protection provided by a WAF have been proposed (Ristic, 2022; OWASP): normalization methods, HTTP Parameter Pollution (HPP), HTTP Parameter Fragmentation (HPF), logical AND/OR requests, replacing the SQL functions that trigger WAF signatures with their synonyms, using comments, case changing, and triggering a buffer overflow / WAF crash. Most of these methods focus on protocol-layer exploits that attempt to take advantage of small differences between how the WAF, web servers, and backend applications see traffic.
Runtime Application Self-Protection (RASP) tools are defined by Gartner (2022) as "a security technology that is built or linked into an application or application runtime environment, and is capable of controlling application execution and detecting and preventing real-time attacks". These tools combine real-time contextual awareness of the factors that have led to the application's current behavior (Dubey, 2016; Steiner et al., 2017). They work by "embedding" the security in the application server to be protected, intercepting all calls to the system to check that they are secure, so that they depend entirely on the programming language of the application to be protected.
Both technologies work in the application layer, filtering the HTTP protocol. WAF is a black-box technology (it does not need access to the logic of the application; it intercepts the calls and responses to the logic of the application to be protected, performing a syntactic analysis to detect attacks). On the other hand, RASP is a white-box technology that works by installing an agent on the application server, examining how variables in the application's process and code stack get their values in order to predict, based on that information, whether an attack is taking place.
Several attempts have been made to improve the effectiveness of WAF solutions. The model proposed by Moosa (2010) is based on an Artificial Neural Network (ANN) that defends against SQL injection attacks. While in the training phase the ANN is exposed to a set of normal and harmful data, in the working phase the trained ANN is embedded into the WAF, thus protecting the entire web server.
In an attempt to improve the effectiveness of WAF rule sets, Auxilia and Tamilselvan (2010) propose a negative security model, which monitors applications for security anomalies, uncommon behaviors, and common web application attacks.
Taint analysis is one of the techniques used in RASP solutions. Haldar et al. (2005) propose the analysis of data external to the web application, marking it as untrusted (tainted data). They identify, monitor, and prevent the inappropriate use of this type of data at runtime in the web application to be protected, developing a heuristic that instruments the java.lang.String class to propagate the taintedness of strings, as well as to mark certain strings as untainted. Halfond et al. (2008) propose the use of positive tainting, identifying and tracing trusted data at execution time, while conventional tainting is centered on untrusted data. In this way, False Negatives (FN) are completely eliminated, at the cost of an increase in False Positives (FP). This approach also makes use of syntax-aware evaluation, so that data usage regulation is applied at development time based on its syntax in the query string, requiring only the deployment of the web application using the MetaStrings library.
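The taint-propagation idea can be sketched in a few lines: external values are marked tainted, taint propagates through string operations, and a security-sensitive sink refuses tainted data. This is a toy model of the concept only, not the java.lang.String instrumentation described above; the TaintedStr wrapper and run_query sink are hypothetical names introduced for illustration.

```python
class TaintedStr:
    """String wrapper carrying a taint flag, as in taint-tracking RASP tools."""
    def __init__(self, value, tainted):
        self.value = value
        self.tainted = tainted
    def __add__(self, other):
        # Concatenation is tainted if either operand is tainted.
        return TaintedStr(self.value + other.value,
                          self.tainted or other.tainted)

def run_query(fragment):
    """Hypothetical sensitive sink: refuses any tainted input."""
    if fragment.tainted:
        raise ValueError("tainted data reached a sensitive sink")
    return "executed: " + fragment.value

trusted = TaintedStr("SELECT * FROM users WHERE id = ", tainted=False)
user_input = TaintedStr("1 OR 1=1", tainted=True)   # external (untrusted) data
query = trusted + user_input
print(query.tainted)                                # → True
```

Positive tainting, as in Halfond et al., would invert the default: everything starts tainted, and only explicitly trusted data is marked clean.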

Algorithms
LightGBM. Light Gradient Boosting Machine, abbreviated as LightGBM, is an open-source library that provides a fast, distributed, high-performance gradient boosting environment based on a decision tree algorithm. Ke et al. (2017) introduced two key concepts:

• Gradient-based One-Side Sampling (GOSS): A modified version of the gradient boosting method that uses only large-gradient data instances. GOSS can obtain fairly accurate information gain estimates with a much smaller data size, which speeds up training and reduces the computational complexity of the method.
• Exclusive Feature Bundling (EFB): This approach groups sparse (mostly null) features that are mutually exclusive, thus acting as a feature selection method.
In standard decision tree algorithms, such as C4.5 (Ross Quinlan et al., 1994) and CART (Breiman et al., 1984), nodes are expanded in order of depth (level-wise tree growth) through the "divide and conquer" strategy, using a prefixed order (usually from left to right).
LightGBM, on the other hand, belongs to the so-called "best-first decision trees" (leaf-wise tree growth), where nodes are expanded in the best order rather than in a fixed order. In this case, the best split node at each step is added to the tree. The best node is a node that is not classified as terminal, that is, a node that minimizes the Gini impurity index among all nodes that can be split (Shi, 2007).
LightGBM uses a histogram-based algorithm: it groups successive feature values into discrete bins and builds feature histograms during training, which speeds up this process and reduces memory usage. Following the leaf-wise approach produces a much more complex tree than the level-wise approach, which is a key factor in achieving higher accuracy, but it can sometimes lead to overfitting. LightGBM aims to reduce the complexity of histogram construction by using GOSS and EFB to reduce the sampled data and features.
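The histogram idea can be sketched as follows: continuous feature values are bucketed into a fixed number of bins, and split search then runs over bin boundaries instead of raw values. The equal-width binning below is a simplification introduced for illustration; LightGBM's actual binning scheme is more elaborate.

```python
def build_histogram(values, n_bins=4):
    """Bucket continuous feature values into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0       # guard against constant features
    counts = [0] * n_bins
    for v in values:
        # Clamp the top edge so the maximum value falls in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

print(build_histogram([0.0, 1.0, 2.0, 3.0, 4.0], n_bins=4))  # → [1, 1, 1, 2]
```

With, say, 255 bins per feature, the split search cost drops from O(#values) to O(#bins) per feature, which is the source of the speed-up mentioned above.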
CatBoost. Yandex developed CatBoost, an open-source software library that provides an innovative categorical feature processing algorithm and a gradient boosting framework that implements ordered boosting, a permutation-based alternative to traditional algorithms. CatBoost uses one-hot encoding and builds symmetric trees that handle categorical features and help reduce prediction times. LightGBM uses GOSS to reduce complexity, while the CatBoost algorithm introduces Minimal Variance Sampling (MVS), a weighted version of stochastic gradient boosting sampling used to regularize boosting models. Using MVS reduces the number of examples required for each boosting iteration and greatly improves the quality of the model, making the model more general and less likely to overfit (Dorogush et al., 2018; Prokhorenkova et al., 2018).
Another important concept of CatBoost is the use of Oblivious Decision Trees (ODTs) in the process of building decision trees, so that a set of ODTs is built. Since an ODT is a complete binary tree, an ODT with n levels has 2^n leaves, and the same splitting criterion is applied to all the non-leaf nodes of a given level (Hancock and Khoshgoftaar, 2020; Prokhorenkova et al., 2018). According to Prokhorenkova et al. (2018), ODTs are balanced, speed up execution, and are less prone to overfitting.
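Because every level of an oblivious tree applies the same split, evaluation reduces to n comparisons that form a binary index into the 2^n leaves. The following sketch shows this under the assumption of a simple (feature, threshold) split per level; the split values and leaf labels are illustrative only.

```python
def odt_predict(x, splits, leaves):
    """Evaluate an oblivious decision tree.

    splits: one (feature_index, threshold) pair per level (shared by the level).
    leaves: 2**len(splits) leaf values, indexed by the comparison bits.
    """
    index = 0
    for feature, threshold in splits:
        # Each level contributes one bit to the leaf index.
        index = (index << 1) | (1 if x[feature] > threshold else 0)
    return leaves[index]

splits = [(0, 0.5), (1, 2.0)]        # 2 levels -> 4 leaves
leaves = ["A", "B", "C", "D"]
print(odt_predict([1.0, 3.0], splits, leaves))   # both comparisons true → "D"
```

This index arithmetic is what makes ODT evaluation fast and cache-friendly compared with pointer-chasing through an unbalanced tree.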

Evaluation metrics and scenarios
Most supervised classification algorithms focus on binary or multi-class classification; in these cases, classical classification metrics are adequate, e.g., accuracy, F1 score, precision, recall, etc. (Sureda Riera et al., 2020). However, when working with datasets in which there are several labels for each observation, it is necessary to complement the classical metrics, since the notion of a partially correct prediction of the various labels that make up each observation is introduced (Cheng and Hüllermeier, 2009; Gouk et al., 2016; Read et al., 2011; Zhang and Zhou, 2014).
Also, according to the work of Antunes and Vieira (2015), the metrics must make sense for the model being evaluated and the scenario to which the model is applied; for this reason, their recommendations are followed and four different types of scenarios are defined in which each algorithm/model combination is evaluated, applying the recommended metric in each scenario.
In our case, we have chosen to perform several evaluations: first, we have calculated the Accuracy following the exact-match principle (exact prediction of all the labels); the F-measure, Recall, Precision, ROC AUC, Informedness, and Markedness of each of the model/algorithm combinations have been calculated on an overall basis. On the other hand, Accuracy and F-measure have been evaluated for each of the labels individually, as it may be useful to determine how well the model predictions fit each type of attack; additional metrics such as Hamming Loss, Hamming Score, and Jaccard Similarity have been introduced to evaluate the models taking into account possible partially correct predictions.

Metrics
• Accuracy (for each label): This is the measure of accuracy that we will use to assess the model's efficiency in predicting each of the labels that make up an observation; in this case, we apply the classical definition of accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

• Accuracy - Exact Match (EMR): This is the measure of accuracy that we will use to evaluate the global model performance; in our case, the exact prediction of all the labels that compose an observation. We simply ignore partially correct predictions (considering them incorrect) and extend the concept of Accuracy used for single-label prediction to the multi-label case.

EMR = Number of records with exact label match / Total number of records
• Precision: In our case, to account for label imbalance, we compute the proportion of correctly predicted labels over the total predicted labels, averaged over all instances and weighted by support (the total number of instances for each label).

Precision = TP / (TP + FP)

Precision_weighted = Σ_classes (weight of class × precision of class)
• Recall: This is the percentage of true positives correctly identified, averaged across all instances and support-weighted.

Recall = TP / (TP + FN)

Recall_weighted = Σ_classes (weight of class × recall of class)
• F-measure (for each label): A measure of the accuracy of a test that takes into account both precision and recall. Also known as F1-Score or F-Score, this value is the balanced harmonic mean of two metrics: Precision (P) and Recall (R) (Bermejo Higuera, 2013; Díaz and Bermejo, 2013; Van Rijsbergen, 1979).

F-measure = 2 × (precision × recall) / (precision + recall)
• F-measure (averaged): F-measure score averaged across all instances and support-weighted.

F-measure_weighted = Σ_classes (weight of class × F-measure of class)

• Informedness: A measure of how much information the system provides about positive and negative labels, i.e., how informed a predictor is for the desired condition, as opposed to chance, averaged across all instances and support-weighted.

Informedness = Recall + Specificity − 1

Informedness_weighted = Σ_classes (weight of class × Informedness of class)

• Markedness: A measure of the confidence in the system's positive and negative predictions; it quantifies how consistently the outcome includes a predictor variable as a marker, that is, how the labeled condition for a given predictor compares to chance, averaged across all instances and support-weighted.

Markedness = TP / (TP + FP) + TN / (TN + FN) − 1

• ROC AUC: A bidimensional metric of the area under the full ROC curve, which ranges from 0 (100% inaccurate predictions) to 1 (100% accurate predictions), reflecting the degree of separability and showing the capacity of a particular model to distinguish among classes (Sureda Riera et al., 2020; Swets, 1996). In our case, ROC AUC is averaged across all instances and support-weighted.

• Hamming Loss: The proportion of incorrectly predicted labels over the total number of labels. In multi-label classification, the Hamming loss is calculated as the Hamming distance between the true and the predicted label sets. Its value ranges from 0 to 1; the lower the value, the better the performance.

Hamming Loss = (1 / (N × L)) × Σ_i |Y_i Δ Z_i|

where Δ represents the symmetric difference between the true (Y_i) and predicted (Z_i) label sets (Schapire and Singer, 2000; Tsoumakas and Katakis, 2009).
• Jaccard Similarity: It measures the degree of similarity between two sets by examining the proportion of correctly predicted positive labels in a potentially positive set (expected positive and real positive).

J(T, P) = |T ∩ P| / |T ∪ P|
where T and P are true labels and predicted labels respectively.
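To make the multi-label metrics above concrete, the following sketch computes the exact-match ratio, the Hamming loss (via the symmetric difference), and the Jaccard similarity averaged over instances for a toy set of predictions. The CAPEC label sets are illustrative only, not taken from the dataset.

```python
N_LABELS = 13  # number of labels per observation in SR-BH 2020

def emr(y_true, y_pred):
    """Exact-match ratio: a prediction counts only if all labels match."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred, n_labels):
    """Normalized symmetric difference between true and predicted label sets."""
    total = sum(len(t ^ p) for t, p in zip(y_true, y_pred))  # ^ = symmetric difference
    return total / (len(y_true) * n_labels)

def mean_jaccard(y_true, y_pred):
    """|T ∩ P| / |T ∪ P| per instance, averaged (two empty sets count as 1.0)."""
    scores = [len(t & p) / len(t | p) if (t or p) else 1.0
              for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)

y_true = [{"CAPEC-66", "CAPEC-272"}, {"CAPEC-153"}, set()]
y_pred = [{"CAPEC-66"},              {"CAPEC-153"}, set()]

print(emr(y_true, y_pred))                      # 2 of 3 exact matches
print(hamming_loss(y_true, y_pred, N_LABELS))   # 1 wrong label slot out of 39
print(mean_jaccard(y_true, y_pred))             # (0.5 + 1.0 + 1.0) / 3
```

Note how the first instance is counted as wrong by EMR but only mildly penalized by Hamming loss and Jaccard, which is exactly the partial-correctness behavior these metrics are introduced for.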

Scenarios
Four different scenarios are defined, based on the classification made by Antunes and Vieira (2015):

• Very high critical scenario: Represents the development and evaluation of critical applications with very high security requirements, because they need to provide their customers with a reliable system. Examples of this scenario are internet banking websites, equity share trading, or massive electronic commerce systems. The top priority in this scenario is the elimination of the largest number of attacks, thus accepting the investment of time and resources in remediating reports of non-existent attacks (false positives). In this case, the metric of choice is recall, as it maximizes attack detection.

• Heightened-critical scenario: In this case, the goal is to achieve a balance between the priority of detecting and eliminating the maximum number of attacks and preventing the excessive reporting of false positives, since resources in this type of scenario must be properly managed. Security requirements are high, but lower than in the very high critical scenario. Examples of this scenario are government portals, e-commerce web applications, etc. A metric of choice in this scenario could be the F-measure, but the problem is that it assigns equal importance to precision and recall; Informedness appears to be a better alternative, as it is bias-free and does not have the disadvantages of harmonic averaging.

• Medium-critical scenario: This scenario features less exposed or less critical applications, typically with a limited budget, so the resources available for remediation of reported attacks are limited. For this reason, finding and eliminating as many attacks as possible and saving resources on the remediation of false positives have equal importance. Examples of this type of scenario are web sites where attacks result in lower financial losses, or intranet applications that are less prone to external attacks. The optimal metric for this case is the F-measure.

• Low-critical scenario: This scenario represents non-critical applications that are not very exposed to attacks. They are characterized by small implementation budgets, so the available resources are limited and should be focused on confirmed attacks. The goal is to report as few false positives as possible, increasing confidence in reported attacks. Examples of such scenarios are web portals for small and medium-sized companies. Markedness seems to be the best metric: it gives greater preponderance to precision considerations while still being capable of accounting for the attacks that are left unreported.

Related work
This section details the state of the art related to studies on dataset generation, pattern extraction and feature selection techniques, attack classification and deep learning models.

Datasets
Many datasets have been proposed for the study and evaluation of models and techniques to enable web attack prediction:  ( Raïssi et al., 2007 ).
Although the University of California, Irvine, specifically the archival authority of the KDD Archive, discourages its use (Brugger, 2007; Tavallaee et al., 2009), and it is considered inadequate and obsolete (Mahoney and Chan, 2003; McHugh, 2000), the KDD family (DARPA, KDD CUP 99, and NSL-KDD) is widely used in current intrusion detection system evaluation studies (Siddique et al., 2019). With the generation of the new SR-BH dataset, the state of the art is improved by providing the first dataset, derived from real web traffic data, specifically designed for the training of multi-label web attack prevention and detection models. Krügel et al. (2002) use a character distribution model to characterize the traffic genuinely generated towards the web application. In this work, in contrast to the previous one, a numerical value is assigned to each field of a web request, based on the ASCII value corresponding to each character, in order to train models that can predict the normality or malignity of a web request, as well as the corresponding CAPEC key.

Pattern extraction and features selection
In Kruegel and Vigna (2003), Kruegel and Vigna introduce an anomaly detection system that uses web server log files as input to produce an anomaly score for every web request based on the length and character distribution of the attributes. In the present work, the anomaly score is calculated for each field of interest of a web request extracted from the ModSecurity log. Kozik et al. (2015) detect anomalies in HTTP traffic using a pattern extraction method derived from the character distribution model suggested by Kruegel, Toth, and Kirda, as well as token detection of a web request using text segmentation. Our work takes advantage of the log generated by ModSecurity and returns a numeric value for each field of interest.
Resende and Drummond (2018) select characteristics and profile parameters for anomaly-based intrusion detection methods, using an adaptive approach that relies on a genetic algorithm. In our work, features are selected by generating a histogram based on the mean ASCII value obtained for each field of a web request.
In Tan and Hoai (2021), Tan and Hoai propose the HQTN technique, which transforms the HTTP request into numeric form, focusing on attribute names and values and query strings and using the CityHash hash function, testing their approach on the CSIC dataset. In our approach, we transform the full web request to a numeric value by calculating the mean value of the sum of the ASCII values of all the web request fields, which in principle is much simpler and gives good results. In addition, they work with a binary-label dataset (CSIC 2010), while we work with the SR-BH 2020 dataset, which is specific to multi-label classification.

Attacks classification
Dang and François (2018) use relationship inference between various cybersecurity issue repositories: CAPEC (Common Attack Pattern Enumeration and Classification), CWE (Common Weakness Enumeration), and CVE (Common Vulnerabilities and Exposures). Mac et al. (2018) detect harmful patterns in HTTP/HTTPS traffic using an autoencoder. They worked on the raw web request, collecting the absolute path, the method, and the query parameters thanks to a preprocessing of the data that tokenizes the URL, replacing the characters with their corresponding ASCII codes. In the present work, a numeric value based on the mean of the sum of the ASCII values of all the characters that are part of a field in a web request is calculated. Liang et al. (2017) propose the use of an Autoencoder and a Recurrent Neural Network (RNN) to detect anomalous requests on web servers, tokenizing the URLs to reduce their variability, while a later work (2021) proposes a multi-class classification to report the attack type, comparing a Random Forest (RF), a Multi-Layer Perceptron (MLP), and a Long Short-Term Memory (LSTM) algorithm. Although their proposal obtains very good results, it is important to consider that working with a multi-class classification implies that a request can only be classified under a single type of attack. Our work allows us to properly classify web requests that involve more than one type of simultaneous attack. Zhang and Zhou (2007), based on the K-nearest neighbor (KNN) algorithm, develop a lazy multi-label learning algorithm. Madjarov et al. (2012) used 11 benchmark data sets on which they experimentally compared 12 multi-label learning methods using 16 metrics. Zhang and Zhou (2014) review different multi-label learning algorithms and detail different evaluation metrics. Read et al. (2011) perform multi-label classification using a chain of classifiers. Büyükçakir et al. (2018) propose an online stacked ensemble for multi-label stream classification. Wang et al. (2020) propose a collaboration-based multi-label model for e-commerce fraud detection.

Process overview
The process followed to complete this work can be summarized as follows: • The results obtained are tabulated and sorted by different metrics according to the criticality level of the different scenarios chosen.
A graphical overview of the process can be seen in Fig. 1 .

Dataset description
In this study, our new SR-BH 2020 dataset has been developed and used to experiment with and evaluate the different algorithms and models. The dataset is composed of web requests collected during 12 days of July 2020 by a web server (WordPress) installed on a virtual machine and exposed to the Internet. On this server, ModSecurity version 2.9.2 for Apache, with Core Rule Set (CRS) version 3.3.0, was installed in "Detection only" mode, so that all requests (legitimate and malicious) were recorded in the log generated by ModSecurity without being blocked. Daily, the logs generated by ModSecurity were collected and the virtual machine was restored to a clean state.
Once the web server exposure period was over, the collected logs were manually and semi-automatically processed by one of the authors to review the web request tagging performed by ModSecurity, correcting where necessary the normal/attack assignment of the corresponding web request and ensuring an appropriate CAPEC classification assignment.
The final result is a multi-label dataset aimed especially at web attack detection, composed of 907,814 requests, of which 525,195 are normal requests and 382,619 are anomalous requests, where each record has 24 different features and a set of 13 labels. See Table 1 for detailed information on the number of times a web request is classified under a given CAPEC heading. Note that the total sum of the number of CAPEC classifications in Table 1 is greater than the number of web requests present in the dataset, due to the fact that there are web requests in which more than one type of attack is combined. See Table 2 for details of the number of web requests with multiple CAPEC classifications.
In order to protect the personal data of users accessing the web server, the environment was configured so that all web requests pass through a router interposed between the web server and the Internet connection; in this way, all requests received by the web server appear to originate from the local IP address of the router.

Preprocessing the data, numerical transformation and feature selection
Before training the different machine learning models, a review and preprocessing of the dataset is necessary to remove inconsistent and/or duplicated data, correct errors and, at the same time, adapt the data for numerical encoding so that they are usable by the machine learning models.
The main objective of the data preprocessing and selection of relevant features of the dataset is to allow the different combinations of algorithms and models evaluated to reach the maximum level of performance and efficiency in their predictions, in addition to reducing the computational cost of modeling.
The numerical transformation of the dataset was carried out using a procedure inspired by the work of Kozik et al. (2015), Krügel et al. (2002) and Kruegel and Vigna (2003). In our case, the mean of the sum of the ASCII values of the characters (after a transformation to lowercase) of each field of each web request is calculated: in this way, fields with a high presence of anomalous characters (such as "./../.", present in a typical "path traversal" attack attempt) obtain different mean ASCII values than the same fields of a normal web request. The detailed procedure is provided by Algorithm 1. See Fig. 2 for an example of the proposed numerical coding.
Table 3 compares the mean values of different fields of a normal web request and of one labeled as a combination of "Protocol Manipulation" and "OS Command Injection" attacks. Table 4 details the fields of the web request in which the mean of the sum of the ASCII values differs.
Feature selection involves selecting those input variables that have the strongest relationship with the target variables. Once the mean ASCII values of each feature have been calculated, a histogram is generated for each field, making it possible to identify and eliminate those features that do not provide differential information. As can be seen in graph A of Fig. 3, the cookie_value feature provides useful information for differentiating web requests, since its numerical values are distributed in the 80-90 and 100-110 ranges. However, graph B shows that the do_not_track_value feature provides no useful information, since all web requests have the same value.
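The elimination of constant features such as do_not_track_value can be sketched as follows (the numeric values here are made up for illustration; only the column names come from the paper):

```python
import pandas as pd

# Illustrative sketch: features whose histogram collapses to a single value
# (like do_not_track_value in the paper) carry no information and are dropped.
df = pd.DataFrame({
    "cookie_value": [85.2, 103.7, 88.1, 106.4],       # spread over two ranges
    "do_not_track_value": [49.0, 49.0, 49.0, 49.0],   # constant for all rows
})

# Identify columns with a single unique value and remove them.
constant_cols = [c for c in df.columns if df[c].nunique() <= 1]
df = df.drop(columns=constant_cols)
print(constant_cols)
```
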
Once the features that do not provide useful information have been removed, an approximation to the normal distribution of each remaining feature is made by applying the natural logarithm to each of its values, subsequently standardizing the values using the StandardScaler class of the scikit-learn library (Pedregosa et al., 2011).
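This normalization step can be sketched as follows (the input values are illustrative ASCII means, not real dataset rows):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Sketch of the normalization step: natural logarithm to approximate a normal
# distribution, then zero-mean/unit-variance scaling with StandardScaler.
X = np.array([[95.7], [74.2], [101.3], [88.0]])  # illustrative ASCII means

X_log = np.log(X)                 # valid: ASCII means are strictly positive
scaler = StandardScaler()
X_std = scaler.fit_transform(X_log)

print(X_std.mean(axis=0))  # approximately 0 after standardization
print(X_std.std(axis=0))   # approximately 1 after standardization
```
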
Finally, using the random forest classification algorithm and the RFECV class (Guyon et al., 2002) of the scikit-learn library (Pedregosa et al., 2011), we performed recursive feature elimination with cross-validation and selected the final number of features.
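A minimal sketch of this elimination step, run here on synthetic data rather than the encoded SR-BH 2020 features (the sizes and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the encoded dataset: 10 features, 4 informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=1,   # remove one feature per iteration
    cv=3,     # 3-fold cross-validation to score each feature subset
)
selector.fit(X, y)
print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask of the selected features
```
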
When working with a dataset composed of real data, it is common to have imbalanced classes; this unequal representation of the classes in a dataset can affect the different evaluation metrics of the learning models. Although there are several methods to generate synthetic data that promote the equal representation of the different classes in a dataset, such as SMOTE (Chawla et al., 2002) and MLSMOTE (Charte et al., 2015), we have chosen to work only with real data and to evaluate the algorithms and models with metrics that take into consideration the different percentages of representation of the classes in the dataset: Accuracy, Precision, Recall, F-Score, Hamming Loss, Hamming Score, Jaccard Similarity and ROC AUC. The selected features and labels are detailed in Table 5.
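Several of these multi-label metrics are available directly in scikit-learn; the following sketch computes them on toy predictions (the paper works with 13 labels, not the 3 shown here):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, hamming_loss,
                             jaccard_score)

# Toy multi-label ground truth and predictions (3 labels, 4 samples).
y_true = np.array([[1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]])

# In the multi-label setting, accuracy_score is the strict exact match ratio
# (EMR): a sample counts only if every label is predicted correctly.
print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred, average="macro"))
print(hamming_loss(y_true, y_pred))                 # lower is better
print(jaccard_score(y_true, y_pred, average="samples"))
```
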

Models
The SR-BH 2020 dataset has a set of 13 labels. The first label indicates whether the web request is considered normal, so it is assumed that if its value is 1 (normal request), the rest of the labels in the set should be 0. Conversely, if the value of the first label is 0 (possible attack), one or more of the remaining labels should have the value 1.
This assumption allows us to establish a division of classifications by phases. Classifications can be established with a single-phase model, in which an attempt is made to predict the entire set of labels independently of the value obtained by the first label, i.e., even if the first label indicates that the request is normal, an attempt will be made to predict the remaining labels in the set. On the other hand, it is possible to generate two-phase prediction models in which the algorithm will only predict the rest of the labels if the value of the first label is 0; if its value is 1 (normal request), all the remaining labels in the set are automatically set to 0. A customized model is also generated, in which the best hyperparameters for the classification of each label are calculated using GridSearchCV with the LightGBM and CatBoost algorithms; the prediction of each label is then performed with the corresponding algorithm adjusted with the calculated hyperparameters.
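The per-label tuning step of the customized model can be sketched as follows; scikit-learn's GradientBoostingClassifier stands in for LightGBM/CatBoost so the example needs only scikit-learn, and the grid values and synthetic data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative per-label tuning: in the customized model this search is
# repeated once for each of the 13 labels, with LightGBM or CatBoost as
# the estimator instead of this stand-in.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}  # illustrative grid
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1")
search.fit(X, y)
print(search.best_params_)   # hyperparameters kept for this label's classifier
```
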
In order to obtain a model that provides the best possible results in predicting the normality of the web request, or the resulting CAPEC classification in the case of a web attack, the two algorithms (LightGBM and CatBoost) have been evaluated using five different models: (iv) a two-phase model in which the first phase detects whether the request is normal or not and, in the case of an anomalous request, the second phase obtains its CAPEC classification using the MultiOutputClassifier class of the sklearn.multioutput module of the scikit-learn library (Pedregosa et al., 2011); (v) a customized model in which the prediction is performed with the specific algorithm (LightGBM or CatBoost) tuned with the hyperparameters calculated for each CAPEC label.
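A minimal sketch of the two-phase idea, with scikit-learn's GradientBoostingClassifier standing in for CatBoost and synthetic multi-label data replacing the SR-BH 2020 features (all names here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multi-label data; a request is "normal" when no attack label is set.
X, Y = make_multilabel_classification(n_samples=300, n_classes=4,
                                      random_state=0)
y_normal = (Y.sum(axis=1) == 0).astype(int)   # 1 = normal, 0 = attack

X_tr, X_te, Y_tr, Y_te, yn_tr, yn_te = train_test_split(
    X, Y, y_normal, random_state=0)

# Phase 1: binary classifier for normal (1) vs. attack (0).
phase1 = GradientBoostingClassifier(random_state=0).fit(X_tr, yn_tr)
# Phase 2: one binary classifier per CAPEC-style label via MultiOutputClassifier.
phase2 = MultiOutputClassifier(
    GradientBoostingClassifier(random_state=0)).fit(X_tr, Y_tr)

pred_normal = phase1.predict(X_te)
Y_pred = phase2.predict(X_te)
Y_pred[pred_normal == 1] = 0   # normal requests get all attack labels forced to 0
print(Y_pred.shape)
```

Forcing the attack labels to zero for requests flagged as normal is what distinguishes the two-phase flow from the single-phase one, which predicts all labels unconditionally.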
Figs. 4, 5, and 6 show the flowcharts for the single-phase, two-phase, and customized models, respectively.
The combination of the designed models and algorithms allows presenting the security specialist with information about the attack(s) the system is suffering, including the CAPEC classification(s), as shown in Fig. 7.

Results and discussion
The results of the algorithm and model combinations have been evaluated according to the metrics and scenarios discussed in Section 2.1.3. One of the main objectives of the present work is to propose the combination of algorithms and models best able to classify, as accurately as possible, the different types of attacks in each possible scenario, following the CAPEC classification. Since a web request can be simultaneously classified under more than one CAPEC key, it is necessary to work with multi-label models. The evaluation has been carried out with data from the SR-BH 2020 dataset, created specifically to allow this type of multi-label classification: 70% of the web requests present in the dataset have been used to train the different algorithms, while the remaining 30% have been used to validate the generated models. Table 6 presents a summary of the different metrics obtained for each algorithm/model combination, ordered by recall score. Table 7 details the F measure, and Table 8 the Accuracy, obtained for each algorithm/model combination in the different CAPEC classifications. As can be seen in Tables 6, 7 and 8, the combination of the CatBoost algorithm and a two-phase model in which the MultiOutputClassifier class of the scikit-learn library is applied obtains the best results in practically all the metrics. This model obtains a strict Accuracy (EMR) of 0.8844, an average F measure of 0.8891, a ROC AUC of 0.9132, and a Hamming Loss of 0.01703.
The two-phase models with the MultiOutputClassifier class are clearly superior to the other models, regardless of the algorithm (LightGBM or CatBoost) used: with LightGBM, the EMR is 0.88095, the average F measure is 0.8862, the ROC AUC is 0.9114 and the Hamming Loss is 0.01754.
See Figs. 8 , 9 for a detail of the metrics obtained by each algorithm/model combination.
Although all algorithm/model combinations obtain good F-Score and Accuracy scores in each CAPEC classification, it becomes evident that the combination of the CatBoost algorithm and the two-phase model with the MultiOutputClassifier class yields the best results, obtaining the highest Accuracy scores in all CAPEC classifications and the highest F-Score in 9 of the 13 CAPEC classifications.
The superiority of the CatBoost algorithm over LightGBM can also be clearly seen, regardless of the model with which they are combined.See Table 9 for a comparison of the best scores obtained by LightGBM and CatBoost in the various metrics evaluated.Note that in the case of the Hamming Loss metric, a lower score is indicative of better performance, as indicated in Section 2.1.3.1 .
Following the work of Antunes and Vieira (2015), which recommends the most appropriate metrics according to the type of scenario, the best algorithm/model combination has been selected based on the recommended metric for each scenario, as detailed in Section 2.1.3.2. Table 10 shows the recommended metric for each of the four scenarios, the top three values obtained for this metric, as well as the algorithm/model combinations that obtained these values.
From the results in Table 10, it is concluded that the combination of the CatBoost algorithm and the two-phase model with the MultiOutputClassifier class obtains the best performance with the recommended metrics in the three most critical scenarios analyzed (Very high critical scenario, Heightened-critical scenario, and Medium-critical scenario). Only in the Low-critical scenario is this algorithm/model combination outperformed by the combination of CatBoost and the two-phase model with the BinaryRelevance class, which obtains the second-best score. From the review of the scores obtained in the different metrics, it can be concluded that the Two-phase MultiOutput CatBoost model is adequate for all the scenarios considered.

Conclusions
In this work we have presented the SR-BH 2020 multi-label dataset, which includes a set of 13 different labels, providing information about the normality of each web request and its possible classification into 12 different CAPEC categories.
A new way to assign a numerical value to the alphanumeric strings and symbols that constitute the different fields of a web request is proposed: calculating the average of the sum of the ASCII values of the characters in each field. This numerical value, computed easily and quickly, allows feature extraction and the training of the different machine learning models.
We have also designed and evaluated different multi-label classification models, using modules and classes from the scikit-learn and scikit-multilearn libraries. Two leading algorithms in the field of machine learning have been tested with these models: LightGBM and CatBoost. The results obtained by our experiments show a clear superiority of the combination of the CatBoost algorithm and the two-phase model with the MultiOutputClassifier module of the scikit-learn library in multi-label classification tasks. In future work, the possibility of executing automatic remediation actions, based on CAPEC attack patterns, could be considered.
Consideration should also be given to generating a new dataset from traffic between SOAP or REST web services, or even from web API communication. Labeling the attacks performed on these services with their CAPEC classification would allow applying the combinations of algorithms and models generated in this work to new datasets, as well as determining the possible generalization of the results to other contexts.
In addition, due to the excellent results obtained in all the scenarios analyzed by the Two-phase MultiOutput CatBoost model, more effort and research could be devoted to the development and improvement of this model to achieve its integration into commercial or open-source web application protection tools that can be used in all types of scenarios with different levels of criticality.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
• ROC AUC: Shows the global efficiency of a classification model at all classification thresholds, by plotting the true positive rate (TPR) versus the false positive rate (FPR).
Jin et al. (2018) apply AutoEncoder and RNN to identify web attacks based on payloads. Instead of using neural networks, we evaluate various combinations of models and novel machine learning algorithms (LightGBM and CatBoost). Pan et al. (2019) detect runtime intrusions by mining call traces in web applications and learning the correct program execution model through a stacked denoising autoencoder; they call their model Robust Software Modeling Tool (RSMT). Truong et al. (2019) propose detecting anomalous HTTP queries using Sum Rule and XGBoost and combining the related results with several stacked denoising autoencoders (SDAE). In our work, we do not use deep learning techniques, but train models with different phases, using LightGBM and CatBoost. Tama et al. (2020) propose an architecture of stacked ensembles whose base learners are other ensemble learners. Tekerek (2021) proposes an anomaly-based web attack detection architecture that relies on a Convolutional Neural Network (CNN). Our work proposes a two-phase model architecture with the CatBoost algorithm. Montes et al. (2021) use deep learning techniques to enhance ModSecurity performance using a two-phase model: first, using

• First, ModSecurity for Apache with Core Rule Set (CRS) version 3.3.0 is installed on a web server exposed to the Internet. ModSecurity is configured in "Detection Only" mode.
• During the exposure period, the logs generated by the ModSecurity activity are collected on a daily basis.
• The logs are reviewed and cleaned manually and semi-automatically. The correct labeling of each log made by ModSecurity is verified, thus ensuring a proper CAPEC classification.
• The reviewed logs are saved in CSV format, resulting in the SR-BH 2020 dataset.
• Each of the input fields of the dataset is numerically encoded by calculating the mean value of the sum of the ASCII codes of the characters in each field.
• The values obtained are normalized and a selection of the relevant features of the dataset is made.
• The performance of different combinations of algorithms and multi-label classification models in predicting the CAPEC classification is evaluated.

Fig. 2. Example of a web request numerical coding.

Fig. 7. Information on attacks, received by the security operator.
• CSIC-2010: Emerging from the work of Perez-Villegas, Alvarez and Torrano-Gimenez (Torrano-Gimenez et al., 2009), this dataset was produced at the Consejo Superior de Investigaciones Científicas (CSIC). Originated from simulated web requests to an e-commerce web application, it is composed of 36,000 normal requests and over 25,000 anomalous ones, marked as either normal or anomalous.
• ECML/PKDD 2007: Generated from real traffic, with parameter values and names replaced by random values in order to anonymize the data, this dataset includes 35,006 normal records and 15,110 records labeled as attacks.
• … features were retrieved from KDD Cup '99, plus 10 additional ones, but not including those features that contain duplicate records (Protić, 2018).
• ISCX: Emerging from the work of Shiravi et al. (2012), this dataset was built by generating simulated traffic for seven days. It contains 11 features and, in addition to the requests being labeled as normal or attack, provides a description of the network traffic.

Table 1
Number of web requests by CAPEC classification.

Table 2
Number of different CAPEC classifications assigned to a web request.

Table 3
Mean field ASCII values of normal and attack web request.

Table 4
Detail of fields with different mean ASCII values.

Table 5
Final set of selected features and labels.

Table 6
Summary of metrics by algorithm and model, ordered by recall score.

Table 9
Best scores obtained for the LightGBM and CatBoost algorithms.

Table 10
Recommended metrics for each scenario and top three algorithm/model combinations.