A novel approach for software defect prediction using CNN and GRU based on SMOTE Tomek method

Software defect prediction (SDP) plays a vital role in enhancing the quality of software projects and reducing maintenance-based risks by detecting defective software components. SDP uses historical defect data to model the relationship between software metrics and defects. Several prediction models based on machine learning (ML) and deep learning (DL) have been developed to recognize defective software modules, and many methodologies and frameworks have been presented. Class imbalance is one of the most challenging problems these models face in binary classification. When the class distribution is imbalanced, accuracy may be high while the model fails to recognize instances of the minority class, leading to weak classification. So far, few studies have addressed the problem of class imbalance in SDP. In this study, a data sampling method is introduced to address the class imbalance problem and improve the performance of ML models in SDP. The proposed approach combines a convolutional neural network (CNN) and a gated recurrent unit (GRU) with the synthetic minority oversampling technique plus Tomek links (SMOTE Tomek) to predict software defects. To establish the efficiency of the proposed models, experiments were conducted on benchmark datasets obtained from the PROMISE repository. The experimental results were compared and evaluated in terms of accuracy, precision, recall, F-measure, Matthews correlation coefficient (MCC), the area under the ROC curve (AUC), the area under the precision-recall curve (AUCPR), and mean square error (MSE). The results showed that the proposed models predict software defects more effectively on the balanced datasets than on the original datasets, with an improvement of up to 19% for the CNN model and 24% for the GRU model in terms of AUC.
We compared our proposed approach with existing SDP approaches based on several standard performance measures. The comparison results demonstrated that the proposed approach significantly outperforms existing state-of-the-art SDP approaches on most datasets.


Introduction
Locating source code defects is usually difficult because of the very large code bases of software projects. The importance and challenges of defect prediction have made it an active research area in software engineering (Dam et al., 2018). Defects in software are often challenging to detect or identify, and developers spend significant time locating and fixing them. The software life cycle includes many activities to identify source code defects, such as design reviews, code inspections, integration tests, functional tests, and unit tests (Tong et al., 2018). Early detection of a defect in software projects during the development phase helps allocate testing resources reasonably, determine the testing priority of different software modules, and improve the effectiveness of the software development process (Kumar & Singh, 2021). SDP is a process for predicting source code defects using tools or techniques based on historical data. SDP approaches can be divided into within-project defect prediction (WPDP), cross-project defect prediction (CPDP) for a similar dataset, and cross-project defect prediction (CPDP) for a heterogeneous dataset (Kalaivani & Beena, 2018; Li et al., 2018).
In this study, we develop our models based on the within-project defect prediction (WPDP) approach. In the WPDP approach, a prediction model is built from historical data collected from a software project and is used to predict defects in the same project. WPDP performs best if there is enough historical data to train the model (Omri et al., 2020). Previous studies have attempted to build accurate SDP models in two ways: the first is to manually design new features or new sets of features that represent defects more effectively, while the second is to apply new and improved ML-based classifiers. Several models have been proposed for SDP based on the second approach (ML-based classifiers). However, there is still a need to develop accurate defect detection models and robust software metrics to distinguish between defective and non-defective software modules. Recent studies leverage manually designed software metrics, such as Halstead, McCabe, CK, and MOOD features, to build classifiers.
Recently, DL algorithms have been adopted to improve research tasks in software engineering, especially in SDP (Liang et al., 2019; Omri et al., 2020). DL algorithms differ from classical artificial neural networks in one critical aspect: they contain many hidden layers (Ferenc et al., 2020; Koay et al., 2022). DL is a type of ML that allows computational models consisting of multiple processing layers to learn data representations with various levels of abstraction. DL architectures have been widely applied in many fields to solve detection, classification, and prediction problems (Zhu et al., 2020). DL has drawn increasing attention because of its robust feature learning capability and has been successfully used in many domains, such as speech recognition and image classification. CNN and GRU models are among the most popular DL architectures; the GRU in particular was designed to mitigate the problems of long-term dependencies and vanishing gradients. These models can recognize longer sequences of time series data to provide high predictive performance in SDP (Tong et al., 2018).
Unfortunately, SDP studies face a big challenge: the class imbalance problem. A training data set is imbalanced when its classes are unevenly distributed. In SDP, the class imbalance problem means that the number of non-defective modules (majority class) is much larger than the number of defective modules (minority class). In binary classification, imbalanced classes bias performance towards the majority class. Most ML techniques predict better when the number of instances in each class is roughly equal. This problem severely hinders the efficiency of these models and produces imbalanced false-positive and false-negative results (Lango & Stefanowski, 2018). This study selects imbalanced datasets from the public PROMISE repository for experimental purposes (Chen et al., 2015; Deng et al., 2020a; Phan & Nguyen, 2017). Several experiments in previous studies (Deng et al., 2020a; Khuat & Le, 2020; Kumar & Sathyanarayana, 2015; Miholca et al., 2018) were conducted on these datasets using many ML models; most of the results were very poor because of the class imbalance problem. Very few of these studies are based on CNN and GRU models, and, to our knowledge, no experiment in the literature uses CNN and GRU combined with SMOTE Tomek.
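The effect described above, where accuracy stays high while the minority class is missed, can be illustrated with a small worked example. The sketch below is not taken from the paper's experiments: the 95/5 class split and the always-predict-non-defective classifier are hypothetical, chosen only to show how accuracy, recall, and MCC diverge on imbalanced data.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = defective."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    # Share of truly defective modules that the model found.
    return tp / (tp + fn) if (tp + fn) else 0.0

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical project: 95 non-defective modules (0) and 5 defective ones (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that labels every module as non-defective.
y_pred = [0] * 100

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn), recall(tp, fn), mcc(tp, tn, fp, fn))
```

In this hypothetical case the classifier reaches an accuracy of 0.95 even though its recall and MCC are both 0, which is why measures such as recall, MCC, and AUC are reported alongside accuracy.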
To bridge these gaps, this study applies data balancing methods to address the problem of class imbalance and investigates their impact on the performance of ML models in detecting software defects. Firstly, we apply data balancing methods to balance the training set. Secondly, we train and test the proposed models using the balanced training set, and finally, we evaluate the results based on many performance measures. The main contributions of our study are as follows: (i) We propose a novel approach that combines CNN and GRU with the SMOTE Tomek method to predict software defects. (ii) We evaluate the performance of the proposed approach and compare it with a traditional ML model (RF) as the baseline and with existing approaches used in SDP. (iii) We show that the performance of ML models in SDP can be significantly improved by balancing the data set with data balancing methods.
The structure of this paper is organized as follows. Section 2 presents a discussion on related work. Section 3 presents background on software defect prediction, convolutional neural networks, and gated recurrent unit. Section 4 presents the hypothesis and research questions. Section 5 presents the motivation for our proposed work. After that, our research methodology is presented in Section 6. Section 7 presents the experimental results and discussion. Section 8 presents the implication of the findings. Section 9 presents threats to validity, followed by conclusions in the last section (Section 10).

Related work
The prediction of defects in software systems is significant, and there is great interest in developing novel high-performance software defect predictors. SDP models aim to improve the quality of software application systems (Khuat & Le, 2020). Many models have been constructed to recognize the defects in software modules using artificial intelligence and statistical methods (Cao, 2020; Dam et al., 2018; Deng et al., 2020a; Qiu et al., 2019; Pan et al., 2019; Tong et al., 2020; Munir et al., 2021). Tong et al. (2018) proposed a novel approach for SDP using deep representations combined with a two-stage ensemble to address the class imbalance problem. The experiments were performed on 12 NASA datasets, and results were evaluated based on F-measure, AUC, and MCC. The experimental results showed that (i) deep representations are promising for SDP, (ii) the two-stage ensemble is more effective for addressing the class imbalance problem in SDP compared with classic ensemble learning methods, and (iii) the proposed approach is significantly effective for SDP. Liang et al. (2019) proposed Seml, a novel framework that combines word embedding and LSTM for SDP. The model was evaluated on eight open-source projects. The experimental results showed that Seml outperforms three state-of-the-art defect prediction approaches on most datasets for both within-project and cross-project defect prediction. Ferenc et al. (2020) proposed a methodology for adapting DNNs to bug prediction. The methodology was applied to a large bug dataset (containing 8780 bugged and 38,838 not bugged Java classes). The results demonstrate that DL with static metrics can indeed boost prediction accuracy. Zhu et al. (2020) proposed a novel just-in-time defect prediction model named DAECNN-JDP based on a denoising autoencoder and CNN. The model was evaluated on six large open-source projects and compared with 11 baseline models. 
The experimental results showed that the proposed model outperforms these baseline models. Deng et al. (2020a) proposed a novel LSTM method to perform SDP; their method can automatically learn semantic and contextual information from the program's ASTs. The experiment was performed on several open-source projects, showing that the proposed LSTM method is superior to the state-of-the-art methods. Khuat and Le (2020) conducted an empirical study to evaluate the importance of sampling techniques in SDP. The experimental results indicated the positive effects of combining sampling techniques with ensemble learning models. This method addressed the class imbalance problem and achieved high prediction accuracy. Miholca et al. (2018) presented a supervised classification approach named HyGRAR, a nonlinear hybrid model that combines gradual relational association rule mining and artificial neural networks to predict software defects. The experiments were conducted using ten open-source datasets. The experimental results showed that the proposed classifier performs well and outperforms most of the previously proposed classifiers. This method achieved high prediction accuracy. Kumar and Satyanarayana (2015) developed a hybrid neural network model with object-oriented and CK metrics for software fault prediction. An adaptive genetic algorithm was used for ANN optimization. The proposed model was tested with PROMISE data sets. The experimental results showed better performance compared to major existing schemes. Qiu et al. (2019) proposed a novel approach using the transfer CNN model to mine the transferable semantic features for CPDP tasks. The experiments were conducted on ten benchmark projects with 90 pairs of CPDP tasks. Their results showed that the proposed model is superior to the reference methods. Pan et al. 
(2019) proposed an improved CNN model for WPDP and compared the experimental results with those of existing CNN studies. An experiment was performed using a 30-repetition holdout validation and a 10 × 10 cross-validation. Their results showed that the CNN model significantly outperformed the state-of-the-art ML models for WPDP. Tong et al. (2020) proposed a novel credibility-based imbalance-boosting method to address the class imbalance problem in software defect proneness prediction. Experiments were performed on datasets obtained from the NASA and PROMISE repositories. The proposed method was compared with several approaches. The experimental results showed that the proposed method is a more promising alternative for addressing the class imbalance problem than previous methods. Munir et al. (2021) proposed a new framework based on GRU and LSTM for SDP. The experiments were evaluated on 119,989 C/C++ programs in Code4Bench. The proposed method was compared with several approaches. The experimental results demonstrated that the proposed method performs better in terms of recall, precision, accuracy, and F1. Li et al. (2017) proposed a framework based on the programs' abstract syntax trees called Defect Prediction via CNN. The model was evaluated on seven open-source projects in terms of F-measure. The experimental results showed that the model improves on the state-of-the-art method by 12% on average. Kukkar et al. (2019) proposed a novel DL model for multiclass severity classification, called bug severity classification using a CNN and Random Forest with Boosting, based on five open-source projects. Their results indicate that the proposed model enhances the performance of bug severity classification over state-of-the-art techniques. Pandey et al. (2020) proposed a new method using deep representation and ensemble learning (BPDET) for software bug prediction; ensemble learning was applied to address the class imbalance problem. 
An experiment was performed on 12 data sets from the PROMISE repository. The experimental results showed that the proposed method outperformed existing state-of-the-art techniques. This method addressed the class imbalance problem and achieved high prediction accuracy. Yang et al. (2018) proposed an ANN model that uses automated parameter tuning techniques to optimize defect prediction models. The model was evaluated on 30 datasets downloaded from the Tera-PROMISE Repository. Their results showed that the performance of the proposed model improved after tuning parameter settings. The authors suggested that researchers who select ANNs for training models in future defect prediction studies should tune parameter settings with Caret instead of using suboptimal default settings. Zhao et al. (2019) proposed a novel SDP model called Siamese parallel fully-connected networks (SPFCNN), combining the advantages of Siamese networks and DL. The authors compared the proposed model with the state-of-the-art SDP models using six datasets from the NASA repository. The experimental results showed that the proposed model achieves significantly higher performance than the benchmarked SDP approaches. Farid et al. (2021) proposed a hybrid model using bidirectional long short-term memory and CNN to predict software defects. The proposed model was evaluated using seven open-source Java projects from the PROMISE datasets. Their results showed that the proposed model is accurate for predicting software defects. Fan et al. (2019) presented an SDP framework via an attention-based RNN. The models were evaluated on an open-source Apache Java project, using F1-measure and AUC. The experimental results demonstrated that the proposed model improves the F1-measure by 14% and AUC by 7% compared with the state-of-the-art methods. Majd et al. 
(2020) proposed SLDeep, a technique for statement-level SDP that uses LSTM as the learning model, based on more than 100,000 C/C++ projects. The results showed that the proposed model seems effective at statement-level SDP and can be adopted. Feng et al. (2021) investigated the role of SMOTE-based and stable SMOTE-based oversampling techniques in improving SDP. The approach was evaluated using four common classifiers across 26 datasets from the PROMISE Repository. This method addressed the class imbalance problem and achieved high prediction accuracy. The experimental analysis showed that the performance of stable SMOTE-based oversampling techniques is more stable and better than that of SMOTE-based oversampling techniques.
After reviewing previous studies in SDP, we noticed that most proposed methods ignore the class imbalance problem. The studies that did address the class imbalance problem (Feng et al., 2021; Khuat & Le, 2020; Pandey et al., 2020; Tong et al., 2018, 2020) point out that data balancing methods are essential in improving SDP accuracy. The primary point from these recent studies is that ML combined with data balancing methods can improve prediction accuracy. Therefore, this work aims to address the class imbalance problem to enhance the efficiency of the proposed models.

Background
This section briefly introduces software defect prediction, convolutional neural networks, and gated recurrent unit.

Software defect prediction
Software defect prediction (SDP) is one of the most popular research areas in software engineering and a vital activity during software development and maintenance (Omri et al., 2020). The goal of SDP is to improve software quality and reduce the cost of software development by identifying and fixing defects early in the development cycle (Ferenc et al., 2020). Software defects are errors or bugs in software code that can cause the software to behave unexpectedly or unintentionally. These defects can result in software crashes, security vulnerabilities, data loss, and other negative consequences (Tong et al., 2018). Identifying and fixing defects early in the development process can save time and money by avoiding costly rework and reducing the risk of software failures. Bug reports are basic software development tools for describing software defects, especially in open-source software (Pandey et al., 2020).
To ensure software quality, many projects use bug reports to gather and record reported bugs. Defects are classified into two classes: intrinsic bugs, which are introduced by one or more specific changes to the source code, and extrinsic bugs, which are introduced by changes not recorded in the version control system. Several techniques are used in SDP, including statistical models, ML algorithms, and data mining techniques. These techniques use historical data on software defects, such as defect reports and code changes, to predict the likelihood of future defects (Kalaivani & Beena, 2018). Based on the type of data and the context of the prediction, SDP can be categorized into the following types:
I. Within-project defect prediction (WPDP): This approach uses historical data to predict defects within a single project. The prediction models are trained on data from the same project, such as source code metrics, bug reports, and code reviews.
II. Cross-project defect prediction (CPDP) for a similar dataset: This approach predicts defects in a new project using historical data from similar projects. The prediction models are trained on data from one or more similar projects and then applied to the new project.
III. Cross-project defect prediction (CPDP) for a heterogeneous dataset: This approach predicts defects in a new project using historical data from projects that differ in their development context or characteristics. The prediction models are trained on data from one or more heterogeneous projects and then applied to the new project.
Each of these SDP approaches has its advantages and limitations. WPDP is usually more accurate since it is based on the specific context of the predicted project, but it requires a significant amount of historical data from the same project. CPDP for a similar dataset can be useful when there is not enough data for WPDP, but it assumes that the new project has a similar development context to the projects used for training. CPDP for a heterogeneous dataset can be challenging since the development contexts of the projects used for training and the new project may differ significantly. Still, it can be useful when there is insufficient data for WPDP or for CPDP with a similar dataset (Omri et al., 2020).

Convolutional neural network
A convolutional neural network (CNN) is a feedforward neural network that processes data with a known, grid-like topology. It can be used for both supervised learning and unsupervised learning. CNN was mainly designed for image processing but has also achieved tremendous success in other practical applications, including speech recognition and natural language processing (Cao et al., 2020; Zhu et al., 2020). Our CNN model is inspired by the typical CNN architecture used in image classification and consists of a feature extraction part and a classification part, as shown in Fig. 1. These parts consist of convolution, batch normalization, and max pooling layers, which constitute the hidden layers of the architecture. The convolutional layer performs convolution operations based on the specified filter and kernel parameters and computes the weighted outputs passed to the next layer, while the max pooling layer reduces the dimension of the feature space. Batch normalization is used to mitigate the effect of different input distributions for each training mini-batch to improve training. Activation functions enable the CNN model to be trained quickly and accurately (Phan & Nguyen, 2017). Many activation functions are used in CNNs, such as Sigmoid, Rectified Linear Unit (ReLU), and hyperbolic tangent (Tanh) (Li et al., 2017; Pan et al., 2019). In this model, we used two activation functions, the ReLU function for the input and hidden layers and the Sigmoid function for the output layer, as shown in the equations below.
$$ f(x) = \max(0, x) $$
$$ \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad x = \sum_{i} W_i X_i + b $$
where X_i is the input, W_i is the weight of the input, e is Euler's number (≈ 2.718), and b is the bias.
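As an illustration of how the two activation functions named above act on a weighted input, the following is a minimal sketch in plain Python. The helper `neuron` and the example weights are hypothetical and are not part of the authors' model; the sketch only shows the standard definitions of ReLU and Sigmoid applied to a weighted sum plus bias.

```python
import math

def relu(x):
    """Rectified Linear Unit: passes positive values, clamps negatives to 0."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias, activation):
    """Hypothetical single unit: activation(sum_i W_i * X_i + b)."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(pre_activation)

# Illustrative values only (not taken from the paper).
hidden = neuron([1.0, 2.0], [0.5, -1.0], 0.5, relu)   # weighted sum is -1.0
output = neuron([hidden, 1.0], [2.0, 0.0], 0.0, sigmoid)
print(hidden, output)
```

Because the sigmoid output lies between 0 and 1, it can be read as the predicted probability that a module is defective, which is why it suits the output layer of a binary classifier.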

Gated recurrent unit
A gated recurrent unit (GRU) network is an optimized structure of the recurrent neural network (RNN). When the input sequence is too long, an RNN cannot reliably capture long-term nonlinear relationships, because training suffers from vanishing and exploding gradients. Many optimization theories and improved algorithms have been introduced to solve this problem, such as GRU networks, long short-term memory (LSTM) networks, bidirectional long short-term memory, echo state networks, and independent RNN (Cao, 2020). The GRU network aims to solve the long-term dependence and vanishing gradient problems of RNN. A GRU network is similar to an LSTM network with a forget gate but has fewer parameters than LSTM. The GRU network uses the update and reset gates to optimize the learning mechanism, as shown in Fig. 2. The update gate helps the model determine how much of the past information (from previous time steps) needs to be passed along to the future, and the reset gate helps the model decide how much of the past information to forget (Li et al., 2019). The update gate in the GRU network is calculated as shown in the equation below.
$$ z(t) = \sigma\left(W(z) \cdot [h(t-1), x(t)]\right) $$
Here, z(t) is the update gate function, h(t − 1) is the output of the previous neuron, x(t) is the input of the current neuron, W(z) represents the weight of the update gate, and σ represents the sigmoid function. The reset gate in the GRU network is calculated as shown in the equation below.
$$ r(t) = \sigma\left(W(r) \cdot [h(t-1), x(t)]\right) $$
r(t) is the reset gate function, h(t − 1) represents the output of the previous neuron, x(t) is the input of the current neuron, W(r) represents the weight of the reset gate, and σ is the sigmoid function. The output value of the GRU hidden layer is shown in the equation below.
h(t) is the output value to be determined in this neuron, h(t − 1) is the output of the previous neuron, x(t) represents the input of the current neuron, Wh represents the weight of the update gate and tanh () is the hyperbolic tangent function. Rt is used to control how much
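To make the gating mechanism concrete, the following is a minimal sketch of a single GRU time step in plain Python. It assumes the commonly used GRU formulation with biases omitted, and it assumes the convention in which the update gate weights the previous state, h(t) = z(t) * h(t − 1) + (1 − z(t)) * candidate; some references swap the roles of z(t) and 1 − z(t). The function names and weight shapes are hypothetical and may differ from the authors' implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(h_prev, x, W_z, W_r, W_h):
    """One GRU time step (sketch, biases omitted).

    Each weight matrix has len(h_prev) rows and len(h_prev) + len(x)
    columns, because it acts on the concatenation [h_prev, x].
    """
    concat = h_prev + x
    z = [sigmoid(a) for a in matvec(W_z, concat)]   # update gate
    r = [sigmoid(a) for a in matvec(W_r, concat)]   # reset gate
    # The reset gate scales how much of the previous state enters the candidate.
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
    candidate = [math.tanh(a) for a in matvec(W_h, gated)]
    # Assumed convention: z keeps the old state, (1 - z) admits the candidate.
    return [zi * hp + (1.0 - zi) * c
            for zi, hp, c in zip(z, h_prev, candidate)]

# Illustrative call with all-zero weights: both gates equal 0.5 and the
# candidate is 0, so the new state is half of the previous state.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
h_next = gru_step([1.0, -2.0], [0.3], zeros, zeros, zeros)
print(h_next)
```

In a trained network the three weight matrices are learned, so the gates vary per time step and per hidden unit rather than staying fixed at 0.5 as in this toy call.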

Hypothesis and research questions
In this section, we will discuss the hypothesis and motivation along with research questions and then mention some existing studies addressing the class imbalance problem.
Our hypothesis in this study is that if data balancing methods are applied to balance the original data sets, the classification performance of the proposed models in SDP will improve. To investigate this hypothesis, we used a paired t-test to determine whether there was a statistically significant difference between the performance of our models on the original and balanced datasets. The formula for the paired t-test is shown in Eq. 7 below. To statistically test the impact of data balancing methods on the performance of ML algorithms, the hypotheses are formed as follows: H0: There is no difference in the accuracy of the models with and without data balancing methods. H1: There is a difference in the accuracy of the models with and without data balancing methods.
$$ t = \frac{m}{s / \sqrt{n}} \qquad (7) $$
where m is the mean of the paired differences, n is the sample size (i.e., number of pairs), and s is the standard deviation of the differences. Based on our hypothesis, this study aims to understand the impact of data balancing methods on the performance of ML algorithms in SDP. In particular, we aim to address the following research questions.

RQ1: Do data balancing methods improve the accuracy of ML models in SDP?
This RQ investigates whether data balancing methods improve the accuracy of ML models in SDP.

RQ2: Does the proposed approach outperform the state-of-the-art approaches in SDP?
This RQ investigates the performance of the proposed approach in SDP compared to the state-of-the-art approaches.
The motivation for the above research questions relates to the importance of applying data balancing methods in SDP studies. According to the latest research on SDP, applying data balancing methods when predicting software defects with ML algorithms is important to ensure that the model is not biased toward the majority class and can accurately predict both defective and non-defective modules. Studies in SDP that applied data balancing methods to address the class imbalance problem revealed that these methods play an important role in improving the accuracy of ML models. Tong et al. (2018, 2020) show that the two-stage ensemble and credibility-based imbalance-boosting are more effective methods for addressing the class imbalance problem in SDP than classic ensemble learning methods. Khuat and Le (2020) show that combining sampling techniques with ensemble learning models has positive effects on the accuracy of SDP. Pandey et al. (2020) show that combining deep representation with ensemble learning enhances the accuracy of software bug prediction. Feng et al. (2021) show that stable SMOTE-based oversampling techniques are more stable and better than SMOTE-based oversampling techniques.
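The paired t-test used earlier in this section to test H0 can be sketched in a few lines. This is an illustrative sketch only: the function name `paired_t_statistic` and the AUC values are hypothetical and are not results from the paper. It computes the statistic as the mean of the paired differences divided by their standard error, assuming the sample standard deviation (n − 1 in the denominator) and at least two pairs with non-identical differences.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Paired t statistic: t = m / (s / sqrt(n)).

    m is the mean of the paired differences, s their sample standard
    deviation, and n the number of pairs.
    """
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    m = statistics.mean(diffs)
    s = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return m / (s / math.sqrt(n))

# Hypothetical AUC values for six projects (illustrative, not from the paper).
auc_original = [0.60, 0.65, 0.58, 0.70, 0.62, 0.66]
auc_balanced = [0.75, 0.78, 0.70, 0.82, 0.74, 0.80]

t_stat = paired_t_statistic(auc_original, auc_balanced)
print(t_stat)
```

The resulting statistic is compared against the t distribution with n − 1 degrees of freedom; H0 is rejected when the corresponding p-value falls below the chosen significance level.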

Motivation
According to the literature, existing SDP studies suffer from the class imbalance problem. So, several reasons motivate us to apply data balancing methods in predicting software defects using ML algorithms: i. Improve the performance of the ML models: Imbalanced datasets can lead to biased ML models that perform poorly on the minority class. Data balancing methods such as the synthetic minority oversampling technique plus the Tomek link (SMOTE Tomek) can help improve the performance of the ML model on the minority class. ii. Better feature representation: Balancing the dataset can help the model learn better feature representations for the minority class. This can lead to better discrimination between defective and non-defective samples and improved model performance. iii. Reduce overfitting: Imbalanced datasets can lead to overfitting, where the model learns to over-emphasize the majority class and ignore the minority class. Data balancing methods can help reduce overfitting by balancing the dataset and making it easier for the model to learn from the minority class samples.

Proposed methodology
In this section, we present our proposed methodology for SDP using DL models (CNN and GRU) combined with a data sampling method (SMOTE Tomek). We acquired the datasets from the PROMISE repository. We applied data pre-processing techniques to deal with problems such as noise and unwanted outliers, missing values, feature type conversion, and normalization. We also used feature selection techniques to choose the features most relevant to the target class. The datasets were split into training and testing sets to train and test the proposed models, and data balancing methods were applied to balance the training set. Finally, we built and evaluated our models based on many standard performance measures. The steps described below cover the benchmark datasets, the software metrics, data pre-processing, feature selection, dataset balancing, and the performance measures used. Figure 3 illustrates the whole workflow of the proposed SDP methodology, where each step is described in the following sections.
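The balancing step in this workflow can be sketched as follows. This is a simplified pure-Python illustration of the idea behind SMOTE Tomek, not the authors' implementation: the function names, the two-feature toy data, and the choice to drop both members of each Tomek link are assumptions made for the example. It also assumes binary labels and at least two minority samples. In practice a library implementation such as `SMOTETomek` from imbalanced-learn would normally be used; this section does not state which implementation the authors relied on.

```python
import math
import random

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(idx, points, candidates, k=1):
    """Indices of the k candidates closest to points[idx], excluding idx."""
    others = [j for j in candidates if j != idx]
    return sorted(others, key=lambda j: distance(points[idx], points[j]))[:k]

def smote(X, y, minority=1, k=3, seed=0):
    """Oversample the minority class until both classes have equal size.

    Each synthetic sample lies on the segment between a minority sample
    and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(min_idx)) - len(min_idx)
    X_new, y_new = list(X), list(y)
    for _ in range(max(0, n_needed)):
        i = rng.choice(min_idx)
        j = rng.choice(nearest(i, X, min_idx, k))
        gap = rng.random()
        X_new.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_new.append(minority)
    return X_new, y_new

def remove_tomek_links(X, y):
    """Drop both members of every Tomek link, i.e. every pair of
    opposite-class samples that are each other's nearest neighbour."""
    all_idx = range(len(X))
    nn = [nearest(i, X, all_idx)[0] for i in all_idx]
    drop = {i for i in all_idx if y[i] != y[nn[i]] and nn[nn[i]] == i}
    keep = [i for i in all_idx if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical training split: 8 non-defective (0) and 3 defective (1) modules.
X_train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
           [0.0, 0.4], [0.4, 0.0], [0.5, 0.5], [0.2, 0.4],
           [2.0, 2.0], [2.1, 2.2], [2.2, 1.9]]
y_train = [0] * 8 + [1] * 3

# Balance only the training set; the test set keeps its original distribution.
X_bal, y_bal = remove_tomek_links(*smote(X_train, y_train))
print(y_bal.count(0), y_bal.count(1))
```

Resampling only the training split keeps synthetic samples out of the test data, so the reported measures still reflect performance on the original, imbalanced distribution.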

Benchmark datasets and software metrics
To verify the validity of the proposed approach, we selected six open-source Java projects from the PROMISE dataset (www.kaggle.com.datasets.nazgolnikravesh.software-defectprediction-dataset). The source code of all six projects and the corresponding PROMISE data are public (Deng et al., 2020a; Farid et al., 2021; Xia et al., 2016). These projects cover applications such as XML parsers, text search engine libraries, and data transport adapters, and they provide traditional static metrics for each Java file. The projects were selected based on their degree of data imbalance. To guarantee the generality of the evaluation results, the experimental datasets consist of projects with different sizes and defect rates: across the six projects, the number of instances ranges from 205 to 965, and the defect rate ranges from 2.23% to 92.19%. We selected these datasets for three reasons: (i) the PROMISE datasets are derived from common platform data, are publicly available for different domains, and are considered baseline datasets for SDP studies; (ii) the datasets are freely available (open source) and have public properties, so researchers can use them directly to verify, compare, and iterate on their studies; and (iii) the datasets are imbalanced, allowing us to apply and assess our proposed method for addressing the class imbalance problem (Feng et al., 2021; Khuat & Le, 2020). Table 1 shows the essential information of the selected projects, including project name, project version, number of instances, and defect rate (the percentage of defective instances). Software metrics play a vital role in building a prediction model to improve software quality by predicting as many software defects as possible. Software metrics in the context of SDP are considered independent variables. 
Many previous researchers have pointed out that there is a relationship between software metrics and defect predictions (Kumar & Singh, 2021). Generally, the software metrics used in SDP can be divided into  static code metrics and process metrics. Static code metrics represent how the source code is complex and include information about the software codes depending on the type of coding; process metrics represent the complex development process from some values such as developer count, time, effort, and cost . In 1976, McCabe released the first static code metrics standard; in 1977, Halstead developed a new metric standard. Some practitioners use this metric as an indicator of defect-proneness level (Öztürk, 2017). The primary studies use software metrics as independent variables for measuring the quality of software modules. Several researchers used McCabe and Halstead metrics as independent variables in SDP. This study relies on the McCabe and Halstead metrics as independent variables. Table 2 shows the traditional static code metrics contained in the PROMISE repository, and for the descriptions, the readers are referred to (Xia et al., 2016).

Data pre-processing and features selection
Table 2 List of 20 traditional static metrics of PROMISE. Descriptions were given in (Xia et al., 2016)

| Attribute | Description |
| --- | --- |
| dit | The maximum distance from a given class to the root of an inheritance tree |
| noc | Number of children of a given class in an inheritance tree |
| cbo | Number of classes that are coupled to a given class |
| rfc | Number of distinct methods invoked by code in a given class |
| lcom | Number of method pairs in a class that do not share access to any class attributes |
| lcom3 | Another variant of the lcom metric, proposed by Henderson-Sellers |
| npm | Number of public methods in a given class |
| loc | Number of lines of code in a given class |
| dam | The ratio of the number of private/protected attributes to the total number of attributes in a given class |
| moa | Number of attributes in a given class that are of user-defined types |
| mfa | Number of methods inherited by a given class divided by the total number of methods that can be accessed by the member methods of the given class |
| cam | The ratio of the sum of the number of different parameter types of every method in a given class to the product of the number of methods in the given class and the number of different method parameter types in the whole class |
| ic | Number of parent classes that a given class is coupled to |
| cbm | Total number of new or overwritten methods that all inherited methods in a given class are coupled to |
| amc | The average size of methods in a given class |
| ca | Afferent coupling, which measures the number of classes that depend on a given class |
| ce | Efferent coupling, which measures the number of classes that a given class depends on |
| max_cc | The maximum McCabe's cyclomatic complexity (CC) score of methods in a given class |
| avg_cc | The arithmetic mean of McCabe's cyclomatic complexity (CC) scores of methods in a given class |

Pre-processing the collected data is one of the critical stages before constructing the model. To generate a good model, data quality needs to be considered. Data pre-processing is a group of techniques applied to the data to remove noise and unwanted outliers, deal with missing values, convert feature types, and so on, in order to improve data quality before building the model (Farid et al., 2021; Miholca et al., 2018; Zhao et al., 2019). In addition, normalization is necessary to convert the values into scaled values (scaling numeric variables to the range of 0 to 1) to increase the model's efficiency. Min-Max normalization is a simple and easy-to-implement technique. It preserves the shape of the original distribution of the data because it scales and shifts the data without changing the relative position of the values. Normalizing the data using Min-Max normalization can improve the convergence of ML models: it reduces the data's range and makes it easier for the optimization algorithms to find the optimal weights. Further, it reduces the impact of outliers by scaling the data to a fixed range (Qiao et al., 2020). Therefore, the data set was normalized using Min-Max normalization. The formula for calculating the normalized score is given in (8).
Feature selection is crucial in selecting the most discriminative features from the list using appropriate feature selection methods (Agarwal & Tomar, 2014; Li et al., 2018). The goal of feature selection is to choose the features more relevant to the target class from high-dimensional features and remove the redundant and uncorrelated features (Shippey et al., 2019; Zhao et al., 2018). Feature selection methods fall into three categories: filter methods, wrapper methods, and embedded methods. Each method has rules for selecting the most relevant features as independent variables for training ML models (Jain & Saha, 2021). In this study, our models were based on embedded methods because these methods fit well with ML models.

$$x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{8}$$

where max(x) and min(x) represent the maximum and minimum value of the attribute x, respectively.
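As an illustration of the Min-Max normalization in (8), the following is a minimal sketch in plain Python. The function name `min_max_normalize`, the handling of constant columns, and the sample `loc` values are assumptions made for this example; they are not taken from the study's implementation.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale one numeric attribute to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: mapping every value to 0.0 is a choice of this
        # sketch, made only to avoid a division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical "loc" values for five classes (illustrative, not from PROMISE).
loc = [10.0, 55.0, 120.0, 300.0, 965.0]
loc_scaled = min_max_normalize(loc)
```

The smallest value maps to 0, the largest to 1, and the ordering of the values is unchanged. In practice the minimum and maximum are usually taken from the training set and reused for the test set; scikit-learn's `MinMaxScaler` offers this behaviour through `fit` and `transform`.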

Class imbalance and sampling techniques
Class imbalance in classification models refers to situations where the number of examples of one class is much smaller than that of the others (Bashir et al., 2018). If the model is trained on imbalanced datasets, the prediction results will be biased towards the majority class, so imbalanced data often leads to the misclassification of cases in the minority class. The datasets used in our study suffer from this problem, which is common in SDP studies (Chen et al., 2015; Deng et al., 2020a; Phan & Nguyen, 2017). The reference datasets are not evenly distributed and do not reflect the actual distribution of learning instances (the number of defective cases is smaller than the number of non-defective cases), as shown in Table 1. We manage this problem by modifying the original datasets to increase the realism of the data (Öztürk, 2017). Several data balancing methods have been developed to overcome the imbalanced classes problem; these include subset methods, cost-sensitive learning, algorithm-level implementations, ensemble learning, feature selection methods, clustering methods, optimization methods, and data sampling techniques. Each method can be useful in different contexts, depending on the problem being addressed. Data sampling techniques are the most commonly known methods for addressing imbalanced class distributions in datasets. They are prevalent in software defect prediction studies because they are easy to employ and independent of the classifier (i.e., they can be applied to any prediction model) (Deng et al., 2020b; Tong et al., 2020). Data sampling techniques modify the dataset to be processed in order to obtain a representative sample of the data: they adjust the prior distribution of the majority and minority classes in the training data to obtain a balanced class distribution.
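The bias towards the majority class described in this section can be made concrete with a short sketch. It computes the accuracy and recall of a trivial classifier that labels every module as non-defective. The function name and the instance count of 1000 are hypothetical; only the 2.23% defect rate is taken from the text above.

```python
from typing import Tuple


def majority_baseline_scores(n_instances: int, defect_rate: float) -> Tuple[float, float]:
    """Accuracy and recall of a classifier that predicts 'non-defective' for every module."""
    n_defective = round(n_instances * defect_rate)
    n_clean = n_instances - n_defective
    tp, fn = 0, n_defective  # no defective module is ever flagged
    tn, fp = n_clean, 0      # every clean module is labelled correctly
    accuracy = (tp + tn) / n_instances
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical project with 1000 instances and the lowest defect rate reported above.
accuracy, recall = majority_baseline_scores(1000, 0.0223)
```

With these inputs the accuracy is close to 98% although no defective module is detected (recall of 0), which is why accuracy alone is not a sufficient measure on imbalanced datasets.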
Data sampling techniques can be an effective way to reduce the computational burden of analyzing large datasets. They can also help ensure that the analysis results apply to the broader population. Data sampling techniques can be divided into oversampling and under-sampling (Feng et al., 2021). Oversampling techniques add instances of the minority class to the dataset, while under-sampling techniques eliminate samples of the majority class to obtain a balanced dataset (Khuat & Le, 2020). SMOTE Tomek is a hybrid method, applied here using the imbalanced-learn library, that combines the synthetic minority oversampling technique (SMOTE) for oversampling with Tomek links for under-sampling (Elhassan & Aljurf, 2016; Jonathan et al., 2020; Swana et al., 2022). This study used the SMOTE Tomek method to address the class imbalance problem. Figure 4 shows the distribution of learning instances over the original and balanced data sets.
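To show what the two stages of SMOTE Tomek do, the sketch below re-implements them in plain Python for a binary problem. It is a simplified illustration rather than the imbalanced-learn implementation used in the study: the function names, the default of k = 2 neighbours, the decision to remove both members of each Tomek link, and the toy data are all assumptions of this example.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def smote(X: List[Point], y: List[int], minority: int = 1, k: int = 2, seed: int = 0):
    """Oversample: add synthetic minority points until both classes have equal size.

    Assumes a binary problem with at least two minority instances.
    """
    rng = random.Random(seed)
    X_out, y_out = list(X), list(y)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(minority_idx)) - len(minority_idx)
    for _ in range(max(0, n_needed)):
        i = rng.choice(minority_idx)
        # k nearest minority neighbours of the chosen original point
        neighbours = sorted(
            (j for j in minority_idx if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between the two points
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_out.append(minority)
    return X_out, y_out


def remove_tomek_links(X: List[Point], y: List[int]):
    """Under-sample: drop pairs of mutual nearest neighbours with different labels."""
    n = len(X)
    nearest = [
        min((j for j in range(n) if j != i), key=lambda j: math.dist(X[i], X[j]))
        for i in range(n)
    ]
    linked = {i for i in range(n) if nearest[nearest[i]] == i and y[i] != y[nearest[i]]}
    keep = [i for i in range(n) if i not in linked]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_tomek(X: List[Point], y: List[int], minority: int = 1, k: int = 2, seed: int = 0):
    """SMOTE oversampling followed by Tomek-link cleaning."""
    X_over, y_over = smote(X, y, minority, k, seed)
    return remove_tomek_links(X_over, y_over)


# Toy data (hypothetical): six non-defective (0) and three defective (1) modules.
X_toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.2, 0.2), (0.1, 0.2), (0.3, 0.1),
         (1.0, 1.0), (1.1, 1.0), (0.9, 1.1)]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_bal, y_bal = smote_tomek(X_toy, y_toy)
```

With imbalanced-learn, the equivalent call is `SMOTETomek(random_state=0).fit_resample(X, y)` from `imblearn.combine`. Resampling is normally applied to the training set only, so that the test set keeps its original class distribution.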

Models building and evaluation
Most studies of SDP divide the data into two sets: a training set and a test set. The training set is used to train the model, whereas the test set is used to evaluate the performance of the defect prediction model. Once a defect prediction model is built, its implementation must be considered (Nehéz & Khleel, 2022). We built our models using Keras as a high-level API based on TensorFlow. The training set comprises 80% of each dataset (random selection), while the test set comprises the remaining 20%; each model was developed separately with different parameters, as shown in Table 3. We evaluate the performance of our proposed models based on standard performance measures such as confusion matrices, MCC, AUC, and AUCPR, with MSE as the loss function. MCC is used for model evaluation by describing the correlation between the predicted and actual values (Chen et al., 2015). AUC is the area under the ROC curve, which plots the false positive rate on the x-axis and the true positive rate on the y-axis over all possible classification thresholds (Pandey et al., 2020). AUCPR is the area under the curve that plots precision versus recall, a single-number summary of the information in the precision-recall curve. MSE is a metric that measures the amount of error in the model; it is the average squared difference between the actual and predicted values (Nehéz & Khleel, 2022). A confusion matrix is a table used to measure the performance of a model by summarizing the results of the testing algorithm. It reports the True Positives (T.P.), False Positives (F.P.), True Negatives (T.N.), and False Negatives (F.N.) (Koay et al., 2022; Napierala & Stefanowski, 2012), as shown in Table 4. MSE is calculated as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(x(i) - y(i)\right)^{2}$$

where n is the number of observations, x(i) is the actual value, and y(i) is the observed or predicted value for the i-th observation.
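The performance measures described above can be sketched in plain Python as follows. This is an illustrative implementation using the standard textbook definitions; the function names, the 0.5 decision threshold, and the sample labels and scores are assumptions of this example and are not taken from the study.

```python
import math
from typing import Dict, List, Sequence


def safe_div(a: float, b: float) -> float:
    """Return a / b, or 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, F-measure and MCC from the confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": safe_div(2 * precision * recall, precision + recall),
        "mcc": safe_div(tp * tn - fp * fn,
                        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }


def mse(y_true: Sequence[float], y_score: Sequence[float]) -> float:
    """Mean of the squared differences between actual and predicted values."""
    return sum((x - y) ** 2 for x, y in zip(y_true, y_score)) / len(y_true)


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half; assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test labels (1 = defective) and predicted probabilities.
y_true: List[int] = [1, 1, 1, 0, 0, 0, 0, 0]
y_score: List[float] = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed threshold of 0.5

report = confusion_metrics(y_true, y_pred)
report["mse"] = mse(y_true, y_score)
report["auc"] = auc(y_true, y_score)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, and AUC is computed from the scores rather than from thresholded predictions, which makes both more informative on imbalanced data. Library equivalents such as scikit-learn's `matthews_corrcoef` and `roc_auc_score` would normally be used in practice.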

Experimental results and discussion
In this section, we evaluate the efficiency of our proposed models. The experiments were implemented in Python, and for each project, data from the same project were used for both training and testing. The study considered six open-source datasets for empirical analysis using CNN and GRU. As part of our experimental analysis, we employed a traditional ML algorithm, random forest (RF), as a baseline model and compared it with our proposed models.
To answer research question RQ1, the performance of the prediction models is reported in Tables 5 to 12 and Figs. 5 to 21. Table 12 presents the statistical analysis results (paired t-test) of the proposed models on the original and balanced datasets in terms of mean, standard deviation (STD), min, max, and P value. The mean values of the CNN model are 0.90 on the original datasets and 0.92 on the balanced datasets, while the mean values of the GRU model are 0.89 on the original datasets and 0.91 on the balanced datasets.
Figures 6 to 13 show the training and validation accuracy and the training and validation loss of the models on the original and balanced datasets. Figure 6 shows the accuracy values of the CNN model on the original data sets. The accuracy values are 0.83 on the ant data set, 0.82 on the camel data set, 0.90 on the ivy data set, 0.96 on the jedit data set, 0.95 on the log4j data set, and 0.94 on the xerces data set. Figure 7 shows the accuracy values of the CNN model on the balanced data sets. The accuracy values are 0.85 on the ant data set, 0.84 on the camel data set, 0.95 on the ivy data set, 0.97 on the jedit data set, 0.97 on the log4j data set, and 0.95 on the xerces data set. Figure 8 shows the accuracy values of the GRU model on the original data sets. The accuracy values are 0.81 on the ant data set, 0.79 on the camel data set, 0.92 on the ivy data set, 0.97 on the jedit data set, 0.95 on the log4j data set, and 0.91 on the xerces data set. Figure 9 shows the accuracy values of the GRU model on the balanced data sets. The accuracy values are 0.83 on the ant data set, 0.82 on the camel data set, 0.95 on the ivy data set, 0.99 on the jedit data set, 0.96 on the log4j data set, and 0.93 on the xerces data set.
Figure 10 shows the loss values of the CNN model on the original data sets. The loss values are 0.131 on the ant data set, 0.136 on the camel data set, 0.086 on the ivy data set, 0.037 on the jedit data set, 0.048 on the log4j data set, and 0.049 on the xerces data set. Figure 11 shows the loss values of the CNN model on the balanced data sets. The loss values are 0.117 on the ant data set, 0.132 on the camel data set, 0.051 on the ivy data set, 0.027 on the jedit data set, 0.028 on the log4j data set, and 0.043 on the xerces data set. Figure 12 shows the loss values of the GRU model on the original data sets. The loss values are 0.152 on the ant data set, 0.146 on the camel data set, 0.076 on the ivy data set, 0.028 on the jedit data set, 0.048 on the log4j data set, and 0.090 on the xerces data set. Figure 13 shows the loss values of the GRU model on the balanced data sets. The loss values are 0.130 on the ant data set, 0.144 on the camel data set, 0.055 on the ivy data set, 0.026 on the jedit data set, 0.073 on the log4j data set, and 0.064 on the xerces data set. As shown in the figures, the training and validation accuracy increases and the loss decreases with increasing epochs. Given the high accuracy and low loss obtained by the proposed models, we note that the models are well trained and validated.
Figures 14 to 17 show the ROC curves of the models on the original and balanced datasets. Figure 14 shows the AUC values of the CNN model on the original data sets. The best AUC obtained is 95% on the xerces data set, while the worst AUC is 46% on the log4j data set. Figure 15 shows the AUC values of the CNN model on the balanced data sets. The best AUC obtained is 99% on the log4j and xerces data sets, while the worst AUC is 90% on the camel data set. Figure 16 shows the AUC values of the GRU model on the original data sets. The best AUC obtained is 93% on the jedit data set, while the worst AUC is 29% on the log4j data set. Figure 17 shows the AUC values of the GRU model on the balanced data sets.
The best AUC obtained is 100% on the jedit data set, while the worst AUC is 87% on the camel data set. Figures 18, 19, 20, 21 below show the AUCPR of the models on the original and balanced datasets. Figure 18 shows the AUCPR values of the CNN model on the original data sets. The best AUCPR obtained is 98% on the xerces data set, while the worst AUCPR is 7% on the jedit data set. Figure 19 shows the AUCPR values of the CNN model on the balanced data sets. The best AUCPR obtained is 99% on the log4j and xerces data sets, while the worst AUCPR is 88% on the jedit data set. Figure 20 shows the AUCPR values of the GRU model on the original data sets. The best AUCPR obtained is 93% on the log4j data set, while the worst AUCPR is 24% on the jedit data set. Figure 21 shows the AUCPR values of the GRU model on the balanced data sets. The best AUCPR obtained is 100% on the jedit and jedit data sets, while the worst AUCPR is 84% on the camel data set.
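The paired t-test summarized in Table 12 can be illustrated with a short sketch. The text does not state which per-dataset values were fed into the test, so this example assumes the per-dataset accuracy values of the CNN model reported for Figs. 6 and 7; the function name is hypothetical, and the sketch stops at the t statistic without computing a P value.

```python
import math
import statistics
from typing import List, Tuple

# CNN accuracy per data set (ant, camel, ivy, jedit, log4j, xerces), as reported
# in the text for the original (Fig. 6) and balanced (Fig. 7) data sets.
cnn_original = [0.83, 0.82, 0.90, 0.96, 0.95, 0.94]
cnn_balanced = [0.85, 0.84, 0.95, 0.97, 0.97, 0.95]


def paired_t_statistic(before: List[float], after: List[float]) -> Tuple[float, int]:
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for b, a in zip(before, after)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_value = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1


mean_original = statistics.mean(cnn_original)
mean_balanced = statistics.mean(cnn_balanced)
t_value, dof = paired_t_statistic(cnn_original, cnn_balanced)
```

The two means round to the 0.90 and 0.92 reported for the CNN model. Turning the t statistic into a P value requires the t distribution with the returned degrees of freedom, for example through `scipy.stats.ttest_rel`.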
Comparing the results obtained by the proposed models on the original datasets with those obtained on the balanced datasets, as shown in the tables and figures, we note that the models achieved higher scores on the balanced datasets. This improvement indicates that the proposed models performed well and that data balancing methods play an important role in improving the models' accuracy.
To answer research question RQ2, we compared the results produced using our models with those obtained using the baseline model (RF) based on six performance measures: accuracy, precision, recall, F-measure, MCC, and AUC. Table 13 compares our models with the baseline model (RF). According to Table 13, our models outperform the baseline model on some datasets. We also compared the results produced using our models with those obtained in previous studies based on six performance measures, including accuracy. Table 14 compares the performance measures obtained by our models with the performance values reported in previous studies. The best values are indicated in bold text, and "-" indicates approaches that did not provide results for a particular data set. According to Table 14, some of the results in the previous studies are better than ours. Still, in most cases, our models outperform the state-of-the-art approaches and provide better predictive performance.

The implication of the findings
The findings have implications for researchers. Researchers are interested in quantitatively understanding the effectiveness and efficiency of applying data balancing methods with ML techniques in SDP. Furthermore, the formers are concerned about the qualitative perspective

Threats to validity
This section discusses the threats to our study's validity and experiment limitations and how we mitigate them. It is vital to assess the threats to validity, such as construct, internal, external, and experiment limitations, particularly constraints on the search process and deviations from the standard practice.
Construct validity concerns the study's design and its ability to reflect the actual goal of the research. To avoid threats in the study design and to ensure that the researched area is relevant to the study goal, we cross-checked the research questions and adjusted them several times. Besides, the metrics considered may be a threat to our study. We only adopt static code metrics to predict defects, so we cannot claim that our conclusions generalize to other metrics. However, many previous studies have also widely adopted static code metrics (Chen et al., 2015; Feng et al., 2021). Another threat is the construction of the ML models. We considered several aspects that could have influenced the study, i.e., data pre-processing, which features to consider, how to train the models, etc. However, the procedures followed in this respect are precise enough to ensure the study's validity.
Threats to internal validity are related to the correctness of the experiment's outcome or the study's process. The main threat to internal validity is the datasets. The reference datasets are imbalanced and do not reflect the actual distribution of defective and non-defective classes. We manage this threat by modifying the original datasets to increase the realism of the data in terms of the actual presence of defects in the software system. The distribution of the dataset is modified by applying two data sampling techniques. Another threat is that most of our datasets have a small number of defects. This small number of defects makes it challenging to generate statistically significant results. We tried to minimize that threat by applying standard performance measures for SDP; however, we acknowledge that several statistical tests (Arcuri & Briand, 2014) can be used to verify the statistical significance of our conclusions. Therefore, we plan to conduct more statistical tests in our future work.
External validity relates to the study's generalizability to a broader range of applications. We tried to select and gather different types of datasets from various projects of the PROMISE repository to test our experiment. Our criteria in project selection were based on the ratio of defects. So, we chose projects with a high and low percentage of defects (projects with imbalanced classes) to help us apply data balancing methods. We selected six open-source Java projects of the PROMISE dataset as our evaluation datasets. However, we cannot declare that our results can be generalized. Future replication is necessary to confirm the generalizability of our findings in this study.
The limitations of the experiments are summarized as follows. First, the datasets used in our experiments are limited to only six open-source Java projects. Second, our findings may not be sufficient for generalization.

Conclusion
Various ML and DL techniques have recently been used to build SDP models. Software defects significantly impact the software development life cycle, and defect prevention plays a vital role in software quality assurance and effectively supports software maintenance. SDP is a process of generating models or tools to predict software defects based on historical data. Early defect prediction helps prioritize and optimize effort and costs for inspection and testing. Historical software metrics that indicate defective data are primary inputs to the models. To improve the existing state-of-the-art approaches to predicting software defects, we proposed a novel approach based on CNN and GRU combined with SMOTE Tomek to predict defects in the source code. The data sampling method (SMOTE Tomek) was used to address the class imbalance problem. To evaluate the effectiveness of the proposed models, we performed a series of experiments on six public software defect datasets. The results were compared with random forest (RF) as a baseline model. We found that the proposed models on the balanced datasets achieved an average precision of 90% for the CNN model and 92% for the GRU model, compared with 62% for the RF model. That is, the proposed models on the balanced datasets improve the average precision by 28 and 30 percentage points, respectively, compared to the RF model, which indicates that the proposed models outperform the baseline model. The average Accuracy, Precision, Recall, F-Measure, and MCC of the proposed models on the original datasets were 90%, 64%, 48%, 52%, and 32%, respectively, for the CNN model and 89%, 58%, 49%, 51%, and 28%, respectively, for the GRU model. In comparison, the average Accuracy, Precision, Recall, F-Measure, and MCC of the proposed models on the balanced datasets were 92%, 90%, 94%, 92%, and 84%, respectively, for the CNN model and 91%, 92%, 91%, 91%, and 82%, respectively, for the GRU model. The experimental results demonstrate that the proposed models perform better on the balanced datasets and that combining the CNN and GRU models with the SMOTE Tomek method has a positive effect on SDP performance for datasets with imbalanced class distributions. Our approach is therefore a promising alternative for addressing the problem of class imbalance in SDP compared with previous methods. The robustness and accuracy of our proposed approach will be evaluated on various datasets in our future work.

Table 14 Comparison of the proposed models with other existing approaches

Declarations
Competing interests The authors declared that they have no competing interests in this work. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Ethics approval and Consent to participate Not applicable.

Consent for publication Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.