Detection of SQL Injection Attacks Based on Supervised Machine Learning Algorithms: A Review

In the ever-changing world of cybersecurity, ensuring the integrity of web applications and securing sensitive data is becoming increasingly important. Among the many known vulnerabilities, SQL injection is considered a significant risk with severe consequences. This threat has long motivated researchers to explore various approaches to identifying and detecting SQL injection attacks. Machine learning has drawn particular attention because of its success in many other fields and the limitations of rule-based approaches. This study provides a comprehensive review of recent research that uses supervised learning algorithms. The review indicates that machine learning has considerable potential for identifying and detecting SQL injection attacks.


INTRODUCTION
In the rapidly evolving cybersecurity landscape, securing web applications against vulnerabilities is essential to ensuring the confidentiality and integrity of sensitive data. Among the various security threats, SQL injection poses a serious risk to web applications. According to the Open Web Application Security Project (OWASP), SQL injection is a critical vulnerability and is ranked among the top 10 vulnerabilities (Demilie & Deriba, 2022). Attackers exploit it by injecting malicious SQL code into web applications in order to gain unauthorized access to sensitive data stored in databases (Bharati & Kumar, 2022; Goyal & Matta, 2023). SQL injection can be classified into several types, each exploiting different weaknesses in web applications (Hubskyi et al., 2020; Singh et al., 2015). Researchers have tried to reduce the risk of this threat using static, dynamic and hybrid approaches to identify and detect SQL injection vulnerabilities and attacks (Zhumabekova et al., 2023; Sadeeq & Abdulazeez, 2023). However, because of the limitations of these rule-based approaches, researchers have sought more robust and versatile solutions (Abdulmalik, 2021; Hasan et al., 2019; Nasereddin et al., 2023). The success of machine learning in many fields has led researchers to explore its capabilities in the field of security. Moreover, machine learning approaches have proved to be a good alternative to rule-based approaches for identifying SQL injection attacks (Deriba et al., 2022; Roy et al., 2022; Kunang et al., 2021).

This paper reviews recent research that uses supervised machine learning algorithms to identify, detect and prevent SQL injection attacks. First, a comprehensive review of recent research was conducted. Then the methods, algorithms and accuracy results of each study were extracted and analyzed. According to this review, supervised machine learning algorithms have obtained promising results in identifying and detecting SQL injection attacks. The paper is organized as follows: Section 2 introduces the background of SQL injection, its types and prevention methods, as well as supervised machine learning algorithms. Section 3 explains the research method. Section 4 presents the results and discussion of the review. Finally, Section 5 provides the conclusion.

SQL injection is a security vulnerability in web applications that enables attackers to access sensitive information stored in the application's database by injecting malicious SQL code (Lakhani et al., 2022; T. Zhang & Guo, 2020). This vulnerability emerges when the web application does not handle user input properly. It is generally considered critical because of its severe impact in exposing sensitive data (Brindavathi et al., 2023; Mondal et al., 2022). For this reason, it is listed among the top ten vulnerabilities published by the Open Web Application Security Project (OWASP) (Demilie & Deriba, 2022). Typically, the attacker inserts SQL code into a web form or other input field, and the code is then executed by the backend database, as shown in Fig. 1 (Roy et al., 2022). However, the attack may also originate from cookies, server variables and stored procedures (Padmaja et al., 2022). The malicious SQL code can be executed when the application does not sanitize or validate user input properly (Fidalgo et al., 2020; Johny et al., 2021). A successful SQL injection attack usually enables the attacker to access sensitive data, modify or delete data, or even gain access to the underlying system, all of which are severe consequences (Sivasangari et al., 2021). To prevent SQL injection attacks, user input must be checked and validated, and other security measures such as firewalls and access controls should be implemented. In addition, web application developers should follow secure coding practices and stay up to date with the latest security vulnerabilities and patches (Jemal et al., 2020; Tasevski & Jakimoski, 2020). The main types of SQL injection attacks are described below (Saran et al., 2022; Azman et al., 2021).

Classic SQL Injection
This is considered the most common type of SQL injection attack. It works by injecting malicious SQL code into a vulnerable SQL statement. The vulnerability usually arises when the web application does not validate user input properly, which enables the attacker to enter SQL commands into input fields such as login and search boxes (Sommervoll et al., 2023). To exploit this vulnerability, attackers may use several techniques, such as using a single quote character (') to append their malicious code to the end of the original SQL statement. If the injected condition always evaluates to true, the query returns information from the database and thereby bypasses the authentication process. The consequences of classic SQL injection attacks may include data loss or corruption and unauthorized access to sensitive information. Therefore, preventing classic SQL injection attacks requires validating user input and using parametrized queries (Azman et al., 2021; Sommervoll et al., 2023).
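The always-true condition and the parametrized-query defence described above can be illustrated with a minimal sketch. It assumes an in-memory SQLite database; the table `users`, the sample credentials and the function names are hypothetical and are not taken from any of the reviewed papers.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with one hypothetical user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")
    return conn


def login_vulnerable(conn, username, password):
    # UNSAFE: user input is concatenated directly into the SQL text,
    # so a quote character in the input can change the query's logic.
    query = (
        "SELECT COUNT(*) FROM users WHERE username = '" + username
        + "' AND password = '" + password + "'"
    )
    return conn.execute(query).fetchone()[0] > 0


def login_parametrized(conn, username, password):
    # SAFE: placeholders keep the input as data, never as SQL code.
    query = "SELECT COUNT(*) FROM users WHERE username = ? AND password = ?"
    return conn.execute(query, (username, password)).fetchone()[0] > 0


conn = make_db()
payload = "' OR '1'='1"  # makes the WHERE clause always true when concatenated

print(login_vulnerable(conn, "alice", payload))     # True: authentication bypassed
print(login_parametrized(conn, "alice", payload))   # False: payload treated as a literal password
print(login_parametrized(conn, "alice", "s3cret"))  # True: legitimate login still works
```

In the vulnerable version the final query text ends with `OR '1'='1'`, which is true for every row, so the password check is effectively skipped.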

Blind SQL Injection
This type is called "blind" because no feedback about the query result is returned to the attacker. Typically, the attacker tries to extract sensitive information or modify database contents by exploiting a vulnerability in the web application. The application behaves differently depending on whether the condition in the injected SQL query is true or false. Therefore, by observing the behavior of the application, attackers can determine whether the condition is true or false (Jemal et al., 2020). They can then construct more complex queries based on the information obtained. This type of attack is usually more difficult to detect and mitigate than other SQL injection attacks because no direct feedback is received from the database. To mitigate the risk of this attack, prepared statements and input validation techniques should be used, and strict access controls should be implemented (Erdődi et al., 2021; Jemal et al., 2020).

Error-Based SQL Injection
This type of attack tries to extract useful information from the database using the error messages returned after malformed SQL queries are executed. Typically, this information includes the database structure, table names and, in some cases, usernames and passwords. To prevent this type of attack, developers should use parametrized queries and sanitize user input before sending it to the database (Crespo-Martínez et al., 2023; Tasevski & Jakimoski, 2020; Mondal et al., 2022).

Union-Based SQL Injection
In this type of attack, the attacker uses the UNION operator to combine the results of two or more SELECT statements into a single result set. Typically, the attack involves injecting malicious SQL code into input fields such as login forms or contact forms in order to change the behavior of the application or obtain useful information from the database (Mondal et al., 2022). Prevention of this type of SQL injection attack involves the use of prepared statements, which separate the user input from the SQL code and so do not allow any additional code to be injected. Additionally, validation is necessary to constrain what users can enter (Sommervoll et al., 2023; Abdulmalik, 2021).
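A small sketch may help show how a UNION payload appends a second SELECT to the application's query, and why a prepared statement neutralizes it. The example assumes an in-memory SQLite database; the tables `products` and `users` and all function names are hypothetical.

```python
import sqlite3


def make_db():
    """Build a throwaway database with a public table and a sensitive one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (name TEXT, price TEXT)")
    conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
    conn.execute("INSERT INTO products VALUES ('pen', '1.50')")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")
    return conn


def search_vulnerable(conn, term):
    # UNSAFE: the search term becomes part of the SQL text.
    query = "SELECT name, price FROM products WHERE name = '" + term + "'"
    return conn.execute(query).fetchall()


def search_prepared(conn, term):
    # SAFE: the term is bound to a placeholder and compared as a plain string.
    query = "SELECT name, price FROM products WHERE name = ?"
    return conn.execute(query, (term,)).fetchall()


# The payload closes the string literal, appends a second SELECT with the
# same number of columns, and comments out the trailing quote with "--".
payload = "x' UNION SELECT username, password FROM users --"

conn = make_db()
print(search_vulnerable(conn, payload))  # rows from the users table leak out
print(search_prepared(conn, payload))    # no product has that literal name
```

The attack only works because both SELECT statements return two columns; the prepared version never parses the payload as SQL at all.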

Time-Based SQL Injection
The idea behind this type of attack is to infer information about the database structure from the time delay of the database response. Typically, the attacker measures the response time for each injected malicious SQL statement and then analyzes these response times to extract sensitive information from the database. One method of prevention against this attack is to perform regular security assessments to identify and mitigate vulnerabilities (Azman et al., 2021; Erdődi et al., 2021).

Out-of-Band SQL Injection
This type is called "out-of-band" because it does not use the normal channel to retrieve data. Instead, it uses HTTP or DNS requests to obtain data from the database. The method is usable when the web application allows functions that make HTTP requests or send emails. When the application is exploited, the result of the malicious SQL code is received on a different channel. This attack is typically more difficult to detect than in-band attacks because the application's own responses show no sign of exploitation. However, there are ways to protect against this attack, such as monitoring network traffic for suspicious requests (Azman et al., 2021; Johny et al., 2021; Pinzon et al., 2013; McIvera et al., 2017).

Second-Order SQL Injection
This type of SQL injection is also known as persistent or stored SQL injection. It typically involves two steps. First, the user input is saved in the database. Second, possibly at a later time, the saved input is used in an SQL query to retrieve or modify data in the database. This attack usually happens when the application allows the user to store data in the database, which lets the attacker store malicious SQL code. Another way is for the attacker to inject malicious SQL code into a stored procedure, so that the malicious code is executed every time the stored procedure is called. To prevent this type of attack, developers should use parametrized queries and sanitize user input before storing it in the database (Johny et al., 2021; Tasevski & Jakimoski, 2020).
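The two steps of a second-order attack can be sketched as follows. This is an illustrative example only, assuming an in-memory SQLite database; the table `users`, the account names and the helper functions are hypothetical. Note that the payload is stored safely in step one and only becomes dangerous when it is later concatenated into a query.

```python
import sqlite3


def make_db():
    """Create a throwaway database that already contains an admin account."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('admin', 'admin-pw')")
    return conn


def register(conn, username, password):
    # Step 1: the input is stored verbatim. The insert itself is parametrized,
    # so nothing is injected at this point.
    cur = conn.execute("INSERT INTO users VALUES (?, ?)", (username, password))
    return cur.lastrowid


def stored_username(conn, rowid):
    row = conn.execute("SELECT username FROM users WHERE rowid = ?", (rowid,))
    return row.fetchone()[0]


def change_password_vulnerable(conn, rowid, new_password):
    # Step 2 (UNSAFE): the stored value is trusted and concatenated into SQL.
    name = stored_username(conn, rowid)
    conn.execute(
        "UPDATE users SET password = '" + new_password
        + "' WHERE username = '" + name + "'"
    )


def change_password_parametrized(conn, rowid, new_password):
    # Step 2 (SAFE): the stored value is bound as data.
    name = stored_username(conn, rowid)
    conn.execute(
        "UPDATE users SET password = ? WHERE username = ?", (new_password, name)
    )


def password_of(conn, username):
    row = conn.execute("SELECT password FROM users WHERE username = ?", (username,))
    return row.fetchone()[0]


conn = make_db()
attacker_id = register(conn, "admin'--", "x")  # payload hidden in the username
change_password_vulnerable(conn, attacker_id, "hacked")
print(password_of(conn, "admin"))  # the admin password was overwritten
```

With `change_password_parametrized`, the same stored username only ever matches the attacker's own row, so the admin account is left untouched.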

Machine Learning
Researchers have used several approaches to detect SQL injection attacks. The first approach is static analysis, which relies on validating user input to identify syntactic and grammatical errors. The downside of this approach is that it cannot detect malicious SQL code when the syntax is correct (Abdulmalik, 2021; Saleem et al., 2020; Hassan et al., 2022). The second approach is dynamic analysis, which scans and compares the web application's responses to the queries sent; its limitation is that it can only identify predefined vulnerabilities (Singh et al., 2015; Abdulmalik, 2021). The third approach is combined analysis, which draws on both static and dynamic analysis techniques to detect SQL injection attacks. All of these approaches are rule-based, meaning they cannot detect attacks that are not covered by the rules (Abdulmalik, 2021; Nasereddin et al., 2023). For this reason, there was a strong need for a more robust and reliable approach. The success of machine learning in a variety of fields has motivated many researchers to explore its capabilities in detecting SQL injection attacks (Falor et al., 2022; Ashlam et al., 2022). Machine learning-based approaches are considered a replacement for rule-based ones because they can detect new types of SQL injection attack (McIvera et al., 2017; Singh et al., 2015; Zolanvari et al., 2019). There are three types of machine learning: supervised, unsupervised and reinforcement learning (Salih & Abdulazeez, 2021; Abdullah et al., 2021). Supervised algorithms have proven effective at learning from large, annotated training datasets. Some supervised learning algorithms that are useful for identifying and detecting SQL injection attacks are described below (Praveen et al., 2022; Islam et al., 2019; Kunang et al., 2021; Zolanvari et al., 2019).

Naive Bayes
This algorithm has been used to detect SQL injection attacks. It is based on Bayes' theorem, which relies on conditional probability. It is simple and fast, and it is mainly used in text classification (R. Gupta et al., 2020; Pinzon et al., 2013).
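As an illustration of how Bayes' theorem can be applied to input strings, the following is a minimal from-scratch sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. It is not the implementation used in any of the reviewed papers; the tokenizer, the eight training strings and the labels are invented for demonstration, and a real detector would need a far larger labeled dataset.

```python
import math
import re
from collections import Counter

# Tiny hypothetical training set: (input string, label).
SAMPLES = [
    ("alice", "benign"),
    ("bob smith", "benign"),
    ("blue pen", "benign"),
    ("hello world 42", "benign"),
    ("' OR '1'='1", "malicious"),
    ("x' UNION SELECT username, password FROM users --", "malicious"),
    ("1; DROP TABLE users", "malicious"),
    ("admin'--", "malicious"),
]


def tokenize(text):
    # Words, numbers, and each punctuation character as its own token.
    return re.findall(r"[a-z_]+|\d+|[^\sa-z_\d]", text.lower())


def train(samples):
    class_docs = Counter()  # number of training strings per class
    token_counts = {}       # per-class token frequencies
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        counts = token_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return class_docs, token_counts, vocab


def predict(model, text):
    class_docs, token_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # log P(class) + sum of log P(token | class), Laplace-smoothed.
        score = math.log(class_docs[label] / total_docs)
        total_tokens = sum(token_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore tokens never seen in training
            p = (token_counts[label][tok] + 1) / (total_tokens + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(SAMPLES)
print(predict(model, "' OR 'a'='a"))
print(predict(model, "bob smith"))
```

The classifier works in log space to avoid numerical underflow, and the "naive" assumption is that tokens are conditionally independent given the class.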

SVM
Support Vector Machine (SVM) is used mainly for classification problems. The goal of this algorithm is to separate the data into two groups by finding the best separating boundary, called a hyperplane, which maximizes the margin between the two groups (V. Gupta et al., 2022; R. Gupta et al., 2020).
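For reference, the margin-maximization idea is commonly written as the optimization problem below. This is the standard hard-margin formulation for linearly separable data; the symbols (weight vector w, bias b, feature vectors x_i and labels y_i in {-1, +1}) are introduced here for illustration and are not taken from the reviewed papers.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, n
```

The hyperplane is the set of points where w^T x + b = 0, and the margin between the two groups equals 2 / ||w||, so minimizing ||w|| is equivalent to maximizing the margin.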

Logistic Regression
It is a predictive modeling technique that determines the relationship between a dependent variable and one or more independent variables. It is a popular prediction model because it is fast and simple (R. Gupta et al., 2020; V. Gupta et al., 2022).
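To make the relationship between the dependent and independent variables concrete, the sketch below fits a logistic regression model by batch gradient descent. It is a toy illustration under stated assumptions: the two hand-crafted features (number of single quotes and number of SQL-like keywords), the keyword list, the training strings and the hyperparameters are all hypothetical and are not drawn from the reviewed studies.

```python
import math
import re

# Hypothetical keyword list used only for this illustration.
KEYWORDS = {"select", "union", "or", "drop", "--"}

# Tiny invented training set: label 1 = malicious, 0 = benign.
SAMPLES = [
    ("alice", 0),
    ("blue pen", 0),
    ("o'brien", 0),
    ("salt or pepper", 0),
    ("' OR '1'='1", 1),
    ("x' UNION SELECT password FROM users --", 1),
    ("1' OR 1=1 --", 1),
    ("'; DROP TABLE users --", 1),
]


def features(text):
    # Independent variables: [count of single quotes, count of keyword tokens].
    tokens = re.findall(r"[a-z]+|--|'", text.lower())
    return [tokens.count("'"), sum(1 for t in tokens if t in KEYWORDS)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, lr=0.5, epochs=2000):
    X = [features(text) for text, _ in samples]
    y = [label for _, label in samples]
    w, b, n = [0.0, 0.0], 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] * xi[0] + w[1] * xi[1] + b) - yi
            grad_w[0] += err * xi[0]
            grad_w[1] += err * xi[1]
            grad_b += err
        # Gradient step on the mean log-loss.
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b


def predict(model, text):
    w, b = model
    x = features(text)
    return 1 if sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= 0.5 else 0


model = train(SAMPLES)
print(predict(model, "' OR 'a'='a"))
print(predict(model, "bob"))
```

The benign samples "o'brien" and "salt or pepper" are included deliberately: each triggers one feature, which shows why a single quote or keyword alone should not be enough to flag an input.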

Decision Tree
Decision tree is one of the most widely used algorithms in machine learning. It can be used for both classification and regression. It is a good way to decide between various actions. However, one of the most obvious challenges with this algorithm is overfitting, which can cause errors in the final decisions (R. Gupta et al., 2020; V. Gupta et al., 2022; Sadeeq et al., 2022).

Random Forest
Random Forest (RF) is a supervised learning algorithm for solving regression and classification problems. It forms subsets of the data and trains a decision tree on each, which mitigates the overfitting issues present in a single decision tree (R. Gupta et al., 2020; Islam et al., 2019).

METHOD
First, a literature review was conducted using the most popular digital libraries: Science Direct, IEEE Xplore, Springer and Scopus. The aim of the study is to review papers in these databases that discuss the identification, detection and prevention of SQL injection attacks. The review covered papers published from 2018 to 2023. The selection process included removing duplicates and reviewing the most relevant papers.

RESULT AND DISCUSSION
In this work, we reviewed and compared studies that used supervised machine learning algorithms to identify and detect SQL injection attacks. A summary of the reviewed papers is presented in Table 1, which lists the algorithms used, the research method and the results in terms of accuracy.

Natarajan et al. used Naïve Bayes, logistic regression, CNN and random forest algorithms. They used two datasets, one for training and the other for validation and testing, and obtained 99.29% accuracy with CNN (Natarajan et al., 2022). Ibrohim and Suryani used only two algorithms, SVM and Naïve Bayes; SVM performed better than Naïve Bayes, with 93.98% accuracy (Ibrohim & Suryani, 2023). Roy et al. used a Kaggle dataset to detect SQL injection attacks with a variety of machine learning algorithms such as logistic regression and Naïve Bayes. The results showed that Naïve Bayes was the best model, with 98.3% accuracy (Roy et al., 2022). Deriba et al. developed a comprehensive framework for SQL injection detection and prevention using a hybrid approach and machine learning techniques. Other models such as ANN and SVM were tested as well. According to the results, the best-performing model was the hybrid approach, with 99.2% accuracy (Deriba et al., 2022). Krishnan et al. tested various machine learning models for identifying and detecting SQL injection attacks, including SVM, CNN, Naïve Bayes and logistic regression. The best-performing model was CNN, with 97% accuracy (Krishnan et al., 2021). Gandhi et al. compared various types of machine learning algorithms for detecting SQL injection attacks and proposed a hybrid CNN-BiLSTM model with an accuracy of 98% (Gandhi et al., 2021). Ahmed and Uddin tested and compared a variety of supervised learning algorithms such as SVM, Naïve Bayes, KNN, random forest and decision tree, together with Natural Language Processing (NLP), and obtained 98.15% accuracy with random forest and NLP (Ahmed & Uddin, 2020). Tang et al. used SVM and the neural network algorithms LSTM and CNN to detect SQL injection attacks and obtained 99.85% accuracy with LSTM (Tang et al., 2020). Tripathy et al. trained a variety of supervised learning models such as decision trees and random forest on the dataset. The random forest classifier gave the best results, with 99.8% accuracy (Tripathy et al., 2020). Hasan and Tarique created datasets containing malicious SQL syntax. They tested and compared many machine learning algorithms, including SVM, ensemble bagged and boosted trees, and cubic KNN. The results showed that the best-performing model was ensemble bagged and boosted trees, with 93.8% accuracy (Hasan et al., 2019). Xie et al. used Elastic-Pooling CNN (EP CNN) to detect SQL injection attacks and compared its results with those of other methods. EP CNN achieved an accuracy of 99.98% (Xie et al., 2019). Luo et al. used network traffic to extract SQL injection payloads. The authors used a CNN-based model for their experiment, which achieved an accuracy of 99.5% (Luo et al., 2019). Li et al. used offline and online training stages. They tested and evaluated various methods such as KNN, adaptive random forest (ADF) and SVM. The results showed that ADF was the model with the highest accuracy, at 98% (Li et al., 2019). Zhang used several machine learning models such as SVM, CNN, MLP and LSTM. The evaluation showed that CNN outperformed the other models, with an accuracy of 95.4% (K. Zhang, 2019). Ross et al. created a system with three phases: creating traffic, capturing data and data pre-processing. They tested various models such as ANN, random forest and SVM, and the best result was 95.7% accuracy (Ross et al., 2018).

From the reviewed papers, it can be seen that CNN, SVM and random forest were the most used supervised machine learning algorithms for detecting SQL injection, as shown in Figure 2. Furthermore, CNN and random forest achieved the best accuracy among the algorithms.

CONCLUSION
SQL injection poses a significant threat to the security of web applications and their sensitive information. Much research has been carried out to identify and detect this threat and to help protect web applications from such attacks. Machine learning has proved successful in reducing the risk of this threat. This study has reviewed many studies that used various supervised machine learning algorithms to detect this type of attack. The review showed that some algorithms, such as CNN and random forest, have achieved promising results in terms of accuracy.

Fig. 2. The most used algorithms in detection of SQL injection.