Software Defect Prediction Based on Stacked Contractive Autoencoder and Multi-Objective Optimization

Abstract: Software defect prediction plays an important role in software quality assurance. However, the performance of a prediction model is susceptible to irrelevant and redundant features. In addition, previous studies mostly regard software defect prediction as a single-objective optimization problem, and multi-objective software defect prediction has not been thoroughly investigated. To address these two issues, we propose the following solutions in this paper: (1) We leverage an advanced deep neural network, the Stacked Contractive AutoEncoder (SCAE), to extract robust deep semantic features from the original defect features; these features have stronger discrimination capacity for different classes (defective or non-defective). (2) We propose a novel multi-objective defect prediction model named SMONGE that utilizes the multi-objective NSGA-II algorithm to optimize an advanced neural network, the Extreme Learning Machine (ELM), based on Pareto optimal solutions, according to the features extracted by SCAE. We mainly consider two objectives. One objective is to maximize the performance of the ELM, which refers to the benefit of the SMONGE model. The other objective is to minimize the output weight norm of the ELM, which is related to the cost of the SMONGE model. We compare the SCAE with six state-of-the-art feature extraction methods and compare the SMONGE model with multiple baseline models, which contain four classic defect predictors and the MONGE model without SCAE, across 20 open source software projects. The experimental results verify the superiority of SCAE and SMONGE on seven evaluation metrics.

Introduction
Software defect prediction can help testers to detect potentially defective modules by reasonably allocating limited resources [Tantithamthavorn, McIntosh, Hassan et al. (2016)]. When building software defect datasets, researchers inspect software defect modules by designing many software features based on the software development process or code complexity, so these defect datasets may be high dimensional [Xu, Liu, Luo et al. (2018)]. However, not all features are helpful to the performance of the defect prediction model, since the datasets may contain irrelevant and redundant features. Jiarpakdee et al. [Jiarpakdee, Tantithamthavorn, Ihara et al. (2016)] demonstrate that 10%-67% of the features in 101 open source defect datasets are irrelevant or redundant, and that these features seriously degrade the prediction performance and increase the training time of the model. Therefore, it is necessary to conduct feature selection or extraction for defect datasets in software defect prediction. For this reason, some feature selection or extraction methods have been proposed to solve the high-dimensional problem of software defect datasets by removing irrelevant and redundant features [Kondo, Bezemer, Kamei et al. (2019); Xu, Liu, Yang et al. (2016)]. Feature selection techniques reduce the number of features by selecting an optimal, representative and important feature subset, while feature extraction techniques decrease the number of features by constructing new, combined features from the original features [Kondo, Bezemer, Kamei et al. (2019)]. At present, most previous studies mainly leverage feature selection techniques for defect prediction, while feature extraction techniques have not been thoroughly investigated in software defect prediction. Because feature selection techniques directly remove some features, which leads to the loss of some feature information, we adopt a feature extraction technique in this paper.
For feature extraction, most researchers use Principal Component Analysis (PCA) [Kondo, Bezemer, Kamei et al. (2019)] to conduct software defect prediction. Traditional features extracted by PCA focus on the statistical characteristics of software modules and are easily affected by imbalanced data, so the inherent structural information hidden behind the original defect features may not be fully represented. Currently, deep learning techniques have been successfully applied in many fields by constructing a deep network architecture to automatically learn deep semantic feature representations, such as speech recognition [Mohamed, Dahl and Hinton (2012)], image classification [Krizhevsky, Sutskever and Hinton (2012)], traffic sign classification [Zhang, Wang, Lu et al. (2019)], and concentration prediction of PM10 [Oh, Song, Kim et al. (2019)]. Previous studies [Guo, Cheng and Cleland-Huang (2017); Wang, Liu and Tan (2016)] have verified that deep semantic features have stronger discrimination capacity for different classes (defective or non-defective). For these reasons, we leverage an advanced deep neural network, the Stacked Contractive AutoEncoder (SCAE) [Rifai, Vincent, Muller et al. (2011); Ning, Chen, Tie et al. (2018)], to extract robust deep semantic features from the original defect features. On the one hand, SCAE adopts the Frobenius norm of the Jacobian matrix as its regularization penalty term, which enhances the local invariance and robustness of the encoded representation. On the other hand, the unsupervised deep network SCAE is stacked from several unsupervised contractive autoencoders (CAE), and the hidden layer of each subnetwork serves as the input layer of the next subnetwork, thereby further improving the robustness and discrimination capacity of the deep feature representation.
The SCAE can not only prevent the deep network from overfitting but also effectively provide a deep combination of basic features with its excellent nonlinear mapping capability.
Currently, most previous studies utilize classical machine learning methods to build defect prediction models, but these traditional methods have some inevitable flaws. For instance, traditional machine learning methods usually require complex feature engineering, their prediction performance is often not good enough, and their adaptability and migration capacity are not strong enough [Wang, Liu and Tan (2016)]. Based on the above analysis, we adopt an advanced neural network, the Extreme Learning Machine (ELM), to construct a defect prediction model in this paper, using the robust deep semantic features extracted by the SCAE. The ELM has obvious advantages in classification, including strong discrimination capacity, good generalization performance and fast training speed [Huang, Zhu and Siew (2006)]. At present, search-based software engineering has become a research hotspot in the field of software engineering because it can provide automated or semi-automated solutions for software engineering problems with large-scale, complex problem spaces, which may involve multiple competing or even conflicting objectives [Ni, Chen, Wu et al. (2019)]. Prior studies mostly treat software defect prediction as a single-objective optimization problem, and multi-objective software defect prediction has not been thoroughly investigated. In this paper, we propose a novel multi-objective defect prediction model named SMONGE, which leverages the multi-objective NSGA-II algorithm to optimize the number of hidden neurons and the output weight norm of the ELM based on Pareto optimal solutions, according to the features extracted by SCAE. We mainly consider two objectives. One objective is to maximize the performance of the constructed defect prediction model, which refers to the benefit of the prediction model. The other objective is to minimize the output weight norm, which is related to the cost of the prediction model.
Therefore, we need to make a compromise between these two contradictory objectives. The main contributions of this paper are as follows: (1) We leverage an advanced deep neural network, the Stacked Contractive AutoEncoder (SCAE), to extract robust deep semantic features from the original defect features; these features have stronger discrimination capacity for different classes (defective or non-defective).
(2) Motivated by the idea of search-based software engineering, we propose a novel multi-objective defect prediction model named SMONGE that utilizes the multi-objective NSGA-II algorithm to optimize two objectives of the advanced ELM predictor based on Pareto optimal solutions. One objective is to maximize the model performance, which refers to the benefit of the prediction model. The other objective is to minimize the output weight norm, which is related to the cost of the prediction model. To the best of our knowledge, this is the first time that the multi-objective NSGA-II algorithm has been used to optimize the advanced neural network ELM.
(3) To verify the performance of SCAE and SMONGE, we conduct extensive experiments on feature extraction and defect prediction across 20 software defect projects from large open source datasets. We compare the SCAE with six state-of-the-art feature extraction methods, and compare the SMONGE model with multiple baseline models, which contain four classic defect predictors and the MONGE model without SCAE. The experimental results demonstrate the effectiveness of SCAE and SMONGE on seven evaluation metrics.
The remainder of this paper is organized as follows. Section 2 describes the background and related work. Section 3 details feature extraction based on SCAE. Section 4 details the proposed SMONGE model. Section 5 shows the experimental setup, including benchmark datasets, evaluation metrics and baseline models. Section 6 evaluates the performance of SCAE and SMONGE. Section 7 introduces the threats to validity. We conclude this paper and describe future work in Section 8.

Background and related work
In this section, we introduce the typical software defect prediction models, feature selection and extraction methods for software defect prediction, and the application of deep learning techniques in software engineering.

Software defect prediction
Software defect prediction is a research hotspot in the software engineering domain; it identifies potentially defective modules in advance by constructing an effective prediction model, so that more testing resources can be allocated to these modules [Tantithamthavorn, McIntosh, Hassan et al. (2017, 2016)]. The granularity of the modules can be component, file, class or code change [Yasutaka, Takafumi, Shane et al. (2016)]. Existing software defect prediction methods focus on how to use machine learning methods to construct effective defect prediction models [Ren, Qin, Ma et al. (2014); Chen and Ma (2015); Lu, Kocaguneli and Cukic (2014)]. Chen et al. [Chen and Ma (2015)] conduct extensive empirical studies using six regression algorithms and find that decision tree regression achieves the best performance. Lu et al. [Lu, Kocaguneli and Cukic (2014)] use an active learning method to construct a defect prediction model, which significantly improves the prediction effect. Nam et al. [Nam, Pan and Kim (2013)] successfully apply the Transfer Component Analysis (TCA) technique to software defect prediction. Abaei et al. [Abaei, Rezaei and Selamat (2013)] propose a self-organizing map (SOM) prediction model with a threshold, which can help testers mark modules without experts. Previous studies mostly regard software defect prediction as a single-objective optimization problem, and multi-objective software defect prediction has not been thoroughly investigated. As far as we know, only the MOFES method proposed by Ni et al. [Ni, Chen, Wu et al. (2019)] considers multi-objective optimization in software defect prediction, but their study only treats feature selection as a multi-objective optimization problem. Different from the study of Ni et al. [Ni, Chen, Wu et al. (2019)], we regard the optimization of the defect prediction model itself as a multi-objective optimization problem, and our model leverages the NSGA-II algorithm to optimize two objectives of the advanced ELM predictor.

Feature selection and extraction for software defect prediction
Recently, feature selection and extraction techniques have been applied to software defect prediction to eliminate irrelevant and redundant features in the defect datasets [Kondo, Bezemer, Kamei et al. (2019); Xu, Liu, Yang et al. (2016)]. Feature selection methods reduce the number of features in a model by selecting an optimal representative feature subset, while feature extraction methods decrease the number of features by constructing new, combined features from the original features [Kondo, Bezemer, Kamei et al. (2019)]. Feature selection methods are mainly divided into two types: filter-based feature ranking methods and wrapper-based feature subset selection methods [Majdi and Seyedali (2017)]. Most previous studies mainly use feature selection techniques for defect prediction, while feature extraction techniques have not been thoroughly investigated in software defect prediction. Previous studies have applied many feature selection techniques to software defect prediction [Liu, Miao and Zhang (2014); Khoshgoftaar, Gao and Napolitano (2012); Gao, Khoshgoftaar and Wang (2011)]. Liu et al. [Liu, Miao and Zhang (2014)] propose three new cost-sensitive feature selection methods, namely Cost-Sensitive Variance Score (CSVS), Cost-Sensitive Laplacian Score (CSLS), and Cost-Sensitive Constraint Score (CSCS), which incorporate cost information into traditional feature selection methods. Khoshgoftaar et al. [Khoshgoftaar, Gao and Napolitano (2012)] compare seven filter-based feature ranking techniques (e.g., information gain (IG) and gain ratio (GR)) on sixteen defect datasets. Gao et al. [Gao, Khoshgoftaar and Wang (2011)] verify the performance of a hybrid feature selection framework based on seven filter-based methods and three feature subset search methods, and the experimental results show that the reduced feature sets do not adversely affect the performance of the prediction model in most cases. Xu et al. [Xu, Liu, Yang et al. (2016)] investigate the impact of 32 feature selection techniques on software defect prediction, and the experimental results verify that these techniques have significant performance differences on each dataset. Ni et al. [Ni, Chen, Wu et al. (2019)] use five different multi-objective optimization algorithms (i.e., MOCell, SPEA2, NSGA-II, PAES and SMSEMOA) to conduct feature selection, and the experimental results verify the effectiveness of the multi-objective feature selection algorithms. For feature extraction techniques, most researchers use principal component analysis (PCA) [Kondo, Bezemer, Kamei et al. (2019)] to conduct feature extraction in software defect prediction. Marco et al. [Marco, Michele and Romain (2010)] adopt PCA to conduct class-level defect prediction, which can avoid the problem of multicollinearity among the independent variables. Rathore et al. [Rathore and Gupta (2014)] compare PCA with feature selection techniques; the experimental results show that PCA is one of the best-performing techniques. Different from previous studies, we leverage an advanced deep neural network, the stacked contractive autoencoder (SCAE), to effectively learn a robust deep semantic feature representation from the original defect features, which has stronger discrimination capacity for different classes.

The application of deep learning techniques in software engineering
Recently, some researchers adopt deep learning techniques to improve various tasks in the field of software engineering [Yang, David and Zhang (2015); Wang, Liu and Tan (2016); Gu, Zhang, Zhang et al. (2016)]. Gu et al. [Gu, Zhang, Zhang et al. (2016)] utilize the RNN encoder-decoder to address the problem of retrieving API call sequences based on the user's natural language query. Wang et al. [Wang, Liu and Tan (2016)] leverage deep belief network (DBN) to learn deep semantic features automatically. The experimental results verify that the deep semantic features-based method outperforms traditional software metrics. Yang et al. [Yang, David and Zhang (2015)] propose a novel just-in-time defect prediction model named Deeper, which can combine initial change features into high-level features by deep belief network (DBN), and then utilize the new high-level features to construct the defect prediction model. Deep learning techniques are also used for software traceability [Guo, Cheng, Cleland-Huang et al. (2017)], test report classification [Wang, Cui, Wang et al. (2017)], link prediction in developer online forums [Xu, Ye, Xing et al. (2016)] and so on.

Feature extraction based on stacked contractive autoencoder
In this paper, after class imbalance processing with SMOTE [Chawla, Bowyer, Hall et al. (2002)] and data normalization with the min-max method [Witten, Frank and Hall (2011)], we utilize an advanced unsupervised deep neural network, the stacked contractive autoencoder (SCAE), to extract robust deep semantic features from the original defect features with its nonlinear mapping capability, which can properly characterize complex data structures and increase the probability of linear separability of the data. SCAE is a variant of the regularized autoencoder, which adopts the Frobenius norm of the Jacobian matrix of the encoder activations as the regularization penalty term, so as to form a localized space contraction and yield robust features on the activation layer. In addition, SCAE regards the hidden layer of each subnetwork as the input layer of the next subnetwork, which further enhances the robustness and discrimination capacity of the deep feature representation. The training process of SCAE is as follows. A basic autoencoder subnetwork consists of two parts: an encoder and a decoder. The encoder f(x) outputs the representation h ∈ R^{d_h} after feature extraction, while the decoder g(h) reconstructs the original input x ∈ R^{d_x} from the encoder output h, producing the reconstruction r by minimizing the cost function. The encoder f(x) and decoder g(h) are both mapping functions with nonlinear activation functions, as shown in Eqs. (1) and (2):

h = f(x) = s_f(Wx + b_h), (1)
r = g(h) = s_g(W'h + b_r), (2)

where s_f and s_g represent the nonlinear activation functions of the encoder and decoder, respectively. We adopt the sigmoid s(x) = 1/(1 + e^(-x)) as the nonlinear activation function in this paper. W represents the (d_x × d_h)-dimensional weight matrix from the input layer to the hidden layer, and W' represents the (d_h × d_x)-dimensional weight matrix from the hidden layer to the reconstruction layer. b_h ∈ R^{d_h} and b_r ∈ R^{d_x} denote the bias vectors of the encoder and decoder, respectively. The parameters of the autoencoder are θ = {W, W', b_h, b_r}.
In order to improve robustness to small perturbations around the training points and learn a mapping with a stronger contraction effect on the training instances, we introduce a penalty term that penalizes the sensitivity of the encoder mapping f(x) to the input x, in the form of the Frobenius norm of its Jacobian matrix. This sensitivity penalty is the sum of squares of all partial derivatives of the extracted features with respect to the input dimensions, as shown in Eq. (3):

||J_f(x)||_F^2 = Σ_{i,j} (∂h_j(x)/∂x_i)^2. (3)

Assume the training set is D_tr. We learn the parameters of the SCAE by minimizing the reconstruction error and penalizing the gradient. The entire loss function of SCAE is as follows:

J_SCAE = Σ_{x ∈ D_tr} [L(x, g(f(x))) + λ ||J_f(x)||_F^2], (4)

where L is the reconstruction error in the form of the cross-entropy loss (a nonlinear error), λ is a hyperparameter that controls the intensity of the regularization, n represents the number of classes, y denotes the true classification value and a denotes the predicted value. Multiple contractive autoencoders can be stacked to construct an unsupervised deep neural network SCAE with more than one hidden layer. The schematic diagram of SCAE is shown in Fig. 1, in which the output of the previous hidden layer is the input of the next hidden layer. In this paper, we train an SCAE with four contractive autoencoders to extract and reconstruct the defect features: the output of the hidden layer of the first contractive autoencoder is extracted as the first-order feature representation, the first-order feature representation is then regarded as the input of the hidden layer of the second contractive autoencoder, and the same strategy is used for the subsequent contractive autoencoders. Based on this strategy, the SCAE learns the first-order, second-order, third-order and fourth-order feature representations from the original defect features.
Through this continuous stacking process, the SCAE can extract more robust and abstract deep semantic features from the original defect features than a single contractive autoencoder. In addition, since the SCAE is an unsupervised model, it can not only prevent the training network from overfitting when the number of labeled defect instances is relatively small, but also effectively achieve a deep combination of defect features with its nonlinear mapping capacity.
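To make Eqs. (1)-(3) concrete: for a sigmoid encoder the Jacobian entries are ∂h_j/∂x_i = h_j(1 − h_j)W_ij, so the contractive penalty has a simple closed form. Below is a minimal NumPy sketch of one CAE forward pass and its penalty; the names are illustrative only, not the implementation used in our experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cae_forward(x, W, b_h, W_p, b_r):
    """One contractive-autoencoder pass. W is (d_x x d_h) as in the text;
    W_p plays the role of W' in Eq. (2)."""
    h = sigmoid(W.T @ x + b_h)    # encoder f(x), Eq. (1)
    r = sigmoid(W_p.T @ h + b_r)  # decoder g(h), Eq. (2)
    return h, r

def contractive_penalty(h, W):
    """||J_f(x)||_F^2 for a sigmoid encoder, Eq. (3):
    sum_j (h_j (1 - h_j))^2 * sum_i W_ij^2."""
    dh = (h * (1.0 - h)) ** 2            # squared sigmoid derivative per hidden unit
    col_norms = np.sum(W ** 2, axis=0)   # squared L2 norm of each weight column
    return float(dh @ col_norms)
```

The per-example loss of Eq. (4) would then be the reconstruction cross-entropy between x and r plus λ times this penalty.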

The proposed multi-objective SMONGE model
In this section, we propose a novel multi-objective defect prediction model called SMONGE that leverages the multi-objective NSGA-II algorithm to optimize the number of hidden neurons and the output weight norm of the extreme learning machine (ELM). We first describe the training process of ELM, and then present the multi-objective optimization problem and our multi-objective SMONGE model.

Extreme learning machine
Different from the traditional single hidden-layer feedforward neural network (SLFN), in an ELM the connection weights between the input layer and the hidden layer and the biases of the hidden layer are assigned randomly and need not be adjusted afterwards. The connection weights between the hidden layer and the output layer do not need to be tuned iteratively through back propagation of the network error; they can be determined in one step by solving a linear model [Huang, Zhou, Ding et al. (2012)]. In addition, ELM has obvious advantages in classification, including strong classification capacity, fast training speed and easy parameter adjustment. The network structure of ELM is shown in Fig. 2.
For N training instances (x_j, l_j), the network output with K hidden neurons is

o_j = Σ_{i=1}^{K} β_i g(w_i · x_j + b_i), j = 1, 2, ..., N, (5)

where g(·) denotes the activation function, w_i = [w_{i1}, w_{i2}, ..., w_{in}]^T (i = 1, 2, ..., K) represents the input weight vector between the input neurons and the ith hidden neuron, b_i denotes the bias of the ith hidden neuron, β_i = [β_{i1}, β_{i2}, ..., β_{im}]^T denotes the output weight vector between the ith hidden neuron and the output neurons, and o_j = [o_{j1}, o_{j2}, ..., o_{jm}]^T represents the network output value. The learning goal of the SLFN is to minimize the output error, which can be expressed as follows:

Σ_{j=1}^{N} ||o_j − l_j|| = 0. (6)

Eq. (6) can be satisfied with zero error if there exist suitable β_i, w_i and b_i, which has been proved in Huang et al. [Huang, Chen and Siew (2006)]. Therefore, Eq. (5) can be rewritten as Eq. (7):

Σ_{i=1}^{K} β_i g(w_i · x_j + b_i) = l_j, j = 1, 2, ..., N. (7)

Eq. (7) can be transformed into matrix form, as shown in Eq. (8):

Hβ = L, (8)

where H is the output matrix of the hidden neurons, β is the output weight matrix, and L is the expected output. H, β and L can be expressed respectively as follows:

H = [g(w_1 · x_1 + b_1) ... g(w_K · x_1 + b_K); ...; g(w_1 · x_N + b_1) ... g(w_K · x_N + b_K)] (N × K), (9)
β = [β_1^T; ...; β_K^T] (K × m), (10)
L = [l_1^T; ...; l_N^T] (N × m). (11)

The output weight β can be computed by solving the linear least squares problem:

β = H^+ L, (12)

where H^+ represents the Moore-Penrose generalized inverse of the matrix H.
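The training procedure above translates almost directly into code: sample the input weights and biases once, form the hidden-layer output, and solve for the output weights with the pseudo-inverse. A minimal NumPy sketch with illustrative names, not the implementation used in our experiments:

```python
import numpy as np

def elm_train(X, L, K, seed=0):
    """Train an ELM with K hidden neurons: random input weights w_i and
    biases b_i (never tuned), then the output weights via beta = H^+ L."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], K))  # input weights
    b = rng.uniform(-1.0, 1.0, size=K)                # hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # hidden output, sigmoid g(.)
    beta = np.linalg.pinv(H) @ L                      # Moore-Penrose solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                                   # network output
```

With K at least as large as the number of training instances, H almost surely has full row rank, so the training error can typically be driven to (numerically) zero.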

Multi-objective NSGA-II optimization based extreme learning machine
In this paper, we adopt an extreme learning machine based on multi-objective NSGA-II optimization to construct our defect prediction model, thereby transforming software defect prediction into a multi-objective optimization problem based on Pareto optimal solutions. We mainly consider two objectives. One objective is to maximize the performance (i.e., accuracy) of the constructed defect prediction model, which refers to the benefit of the prediction model. The other objective is to minimize the output weight norm as much as possible, which is related to the cost of the prediction model. These two objectives conflict in most cases. The smaller the output weight norm, the smaller the influence of each feature component, which is equivalent to reducing the number of effective parameters and thus restricting the model space. The simpler the model, the lower the cost, and the less likely it is to overfit. However, the performance of the prediction model may decrease to some extent as the output weight norm decreases, and vice versa. Therefore, we should make a compromise between these two contradictory objectives. In this section, we first give some definitions for multi-objective optimization. Then, we define the multi-objective optimization problem for software defect prediction. Finally, we introduce the extreme learning machine based on multi-objective NSGA-II optimization.

Definitions for multi-objective optimization
We give the following five definitions for multi-objective optimization based on Pareto optimal solutions. Since there are two optimization objectives in this paper, we take two optimization objectives as an example.

Definition 1 (Multi-Objective Optimization Problem)
min F(x) = (f_1(x), f_2(x))^T, s.t. x ∈ Ω,
where x is the decision vector and Ω is the decision space. F(x): Ω → R^2 contains the two objective functions, and R^2 represents the objective space.

Definition 2 (Pareto Dominance) A solution x_a dominates a solution x_b if and only if f_i(x_a) ≤ f_i(x_b) for every objective i ∈ {1, 2} and f_i(x_a) < f_i(x_b) for at least one objective.

Definition 3 (Pareto Optimal Solution and Pareto Optimal Vector) If and only if x* is not dominated by any other solution, the solution x* is called a Pareto optimal solution, and F(x*) is called a Pareto optimal vector.

Definition 4 (Pareto Optimal Set) The Pareto optimal set is composed of all the Pareto optimal solutions.

Definition 5 (Pareto Front) In the objective space, the surface composed of the objective value vectors corresponding to all the Pareto optimal solutions is called the Pareto front.
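The dominance relation underlying these definitions is a one-liner for minimization problems; the hypothetical helper below is the primitive on which the notions of Pareto optimal solution, set and front are built:

```python
def dominates(fa, fb):
    """fa Pareto-dominates fb (minimization): fa is no worse in every
    objective and strictly better in at least one."""
    return all(a <= b for a, b in zip(fa, fb)) and \
           any(a < b for a, b in zip(fa, fb))
```

For example, with two objectives, (1, 2) dominates (2, 2), but (1, 3) and (2, 2) are mutually non-dominated.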

The multi-objective optimization problem
In this paper, we leverage the multi-objective NSGA-II algorithm to optimize the number of hidden neurons and the output weight norm of the ELM. Therefore, an individual of the multi-objective optimization model encodes the number of hidden neurons H and the control parameter λ of the output weight norm. We define the initialized individual and population as follows:

θ_{k,G} = (H_{k,G}, λ_{k,G}), (13)
P_G = {θ_{k,G} | k = 1, 2, ..., NP}, (14)

where θ_{k,G} is the kth individual of the Gth evolution generation and P_G is the population with NP individuals; NP denotes the size of the population. The output weight norm adopts the L2 norm, and λ ∈ (0, 1]. In the multi-objective SMONGE model, the mean square error (MSE) and the control of the output weight norm are regarded as the two computable objectives. For each evolution generation of the SMONGE model, we utilize the parameter vector of each individual to calculate the corresponding output weight β according to Eq. (12). The individual vector is generated by real encoding. The objective function for the training MSE is defined as follows:

f_1(θ) = MSE = (1/N) Σ_{j=1}^{N} ||o_j − l_j||^2. (16)

The other objective is the control of the output weight norm, as shown in Eq. (17):

f_2(θ) = λ Σ_{l=1}^{n_l − 1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (W_{ji}^{(l)})^2, (17)

where s_l denotes the number of neurons in the lth layer, n_l denotes the number of network layers (ELM has three network layers), W_{ji}^{(l)} denotes the parameter between the jth neuron in the (l+1)th layer and the ith neuron in the lth layer, and λ denotes the weight decay parameter. Considering the minimization of the above two objectives, we propose the multi-objective SMONGE model for software defect prediction, which is defined as follows:

min F(θ) = {f_1(θ), f_2(θ)}, s.t. λ ∈ (0, 1]. (18)

The training MSE is used to enhance the classification accuracy, while the control of the output weight norm aims to make the cost of the ELM as low as possible and prevent overfitting.
In the multi-objective SMONGE model, these two contradictory objectives are optimized simultaneously, thereby finding the Pareto optimal solutions.
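Concretely, evaluating one individual amounts to solving the ELM output weights and scoring the two objectives. The NumPy sketch below uses illustrative names, and the λ-weighted norm of the output weights stands in for the full layer-wise sum; it is an assumption for illustration, not the exact SMONGE fitness code:

```python
import numpy as np

def evaluate_individual(H, L, lam):
    """Return the two SMONGE-style objectives for one candidate.
    H: hidden-layer output of the ELM on the training set,
    L: expected output, lam: weight-decay parameter, lam in (0, 1]."""
    beta = np.linalg.pinv(H) @ L               # Moore-Penrose solution
    f1 = float(np.mean((H @ beta - L) ** 2))   # training MSE, first objective
    f2 = float(lam * np.sum(beta ** 2))        # weighted L2 output weight norm
    return f1, f2
```

NSGA-II then minimizes the vector (f1, f2) over the population, keeping the non-dominated trade-offs between accuracy and model cost.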

The multi-objective SMONGE model
Since the ELM has strong classification capacity, we adopt ELM for software defect prediction. In order to further improve the prediction capacity of ELM, we utilize the multi-objective NSGA-II algorithm to optimize the number of hidden neurons and the output weight norm of ELM, i.e., the above multi-objective optimization problem. The learning process of the proposed SMONGE model is shown in Algorithm 1. In Algorithm 1, we first randomly initialize the population P_0 = {θ_{k,0} | k = 1, 2, ..., NP} and calculate the fitness values (the multi-objective functions) of the initialized population by Eq. (18) in Steps 1 and 2. We combine NSGA-II and ELM closely to form a multi-objective optimization problem for software defect prediction by minimizing the multi-objective functions. Since ELM is used for software defect prediction in this paper, we need to calculate the output weight β of ELM by Eq. (12) in Step 5. Next, we adopt the generateNewPop() function to produce the new population Q_k by successive selection, crossover and mutation in Step 6, and combine the parent and offspring populations to generate R_k in Step 7. By nondominated sorting of R_k, we obtain a set of classification subsets (all nondominated fronts of R_k) F = (F_1, F_2, ...) in Step 8. We calculate the crowding distance for each F_i (a measure of solution density in the neighborhood), and merge individuals of F_i into the new population P_{k+1} until the population size reaches NP in Steps 11-15. Then, we establish a partial order relationship for F_i and choose the first (NP − |P_{k+1}|) elements of F_i until P_{k+1} is filled in Steps 16-17. After sufficient population evolution, SMONGE meets the termination criteria and converges to stable solutions. Finally, SMONGE returns all Pareto optimal solutions in the current population, thereby obtaining the software defect prediction results.
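The sorting and crowding-distance steps of this walkthrough can be sketched in plain Python. This is a simplified illustration of the standard NSGA-II machinery, not the code used in our experiments:

```python
def dominates(fa, fb):
    """fa Pareto-dominates fb (minimization)."""
    return all(a <= b for a, b in zip(fa, fb)) and \
           any(a < b for a, b in zip(fa, fb))

def fast_nondominated_sort(F):
    """Split objective vectors F (minimization) into fronts F1, F2, ...
    returned as lists of indices into F."""
    n = len(F)
    dominated = [[] for _ in range(n)]  # j in dominated[i]: i dominates j
    counts = [0] * n                    # how many solutions dominate i
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(F[i], F[j]):
                dominated[i].append(j)
            elif dominates(F[j], F[i]):
                counts[i] += 1
        if counts[i] == 0:
            fronts[0].append(i)
    k = 0
    while fronts[k]:
        nxt = []
        for i in fronts[k]:
            for j in dominated[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    nxt.append(j)
        fronts.append(nxt)
        k += 1
    return fronts[:-1]

def crowding_distance(front, F):
    """Crowding distance of each index in one front; boundary
    solutions get infinity so they are always preferred."""
    d = {i: 0.0 for i in front}
    for obj in range(len(F[0])):
        order = sorted(front, key=lambda i: F[i][obj])
        d[order[0]] = d[order[-1]] = float('inf')
        span = F[order[-1]][obj] - F[order[0]][obj] or 1.0
        for a in range(1, len(order) - 1):
            d[order[a]] += (F[order[a + 1]][obj] - F[order[a - 1]][obj]) / span
    return d
```

When a front does not fit entirely into the next population, its members are ranked by descending crowding distance and only the least crowded ones are kept, which preserves the spread of the Pareto front.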
In this process, in order to comprehensively reflect the performance of the defect prediction model, we also implement other prediction metrics, including accuracy, precision, recall, F1, pf, G-measure and MCC.

Experimental setup
In this section, we introduce the experimental setup, including benchmark datasets, evaluation metrics and baseline methods. We conduct the experiments on a 3.6 GHz i7-4790 CPU machine with 8 GB RAM.

Benchmark datasets
To verify the effectiveness of SCAE and SMONGE, we conduct extensive experiments on 20 real software projects (i.e., 5 projects from the NASA data repository and 15 projects from the PROMISE data repository), which are open source and commonly used benchmark datasets in software defect prediction studies [Tantithamthavorn, McIntosh, Hassan et al. (2016); Chen and Ma (2015); Hosseini, Turhan and Gunarathna (2019); Peters, Menzies and Layman (2015)]. The basic attributes of the NASA projects (the first five rows) and the PROMISE projects (the latter fifteen rows) are shown in Tab. 1, including project name, the number of features, the number of instances, the number of defective instances, the number of non-defective instances, the defective ratio, and the imbalance ratio. For the NASA datasets, we can observe that the defect ratio of PC2 is the smallest at 2.15%, and the defect ratio of KC2 is the largest at 20.50%; the imbalance ratio varies from 3.88 to 45.56. Tab. 2 describes the features of the 5 projects from the NASA data repository, tabulating the 20 features common to the 5 projects and the other 19 project-specific features (the symbol ✓ indicates that the project has a certain feature, while the symbol ✘ indicates that it does not). For the PROMISE datasets, we can observe that the defect ratio of jedit-4.3 is the smallest at 2.24%, and the defect ratio of xerces-init is the largest at 47.53%; the imbalance ratio varies from 1.10 to 43.73. Tab. 3 describes all features of the 15 projects from the PROMISE data repository. Each instance in any project contains 20 object-oriented features and a dependent variable that represents the number of defects. For all these software defect projects, we adopt the SMOTE (Synthetic Minority Oversampling Technique) algorithm [Chawla, Bowyer, Hall et al. (2002)] for class imbalance processing and the min-max method [Witten, Frank and Hall (2011)] for data normalization in this paper.
In addition, we perform 10 times 10-fold cross-validation to evaluate the performance of these models in this paper.
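For reference, both preprocessing steps are small. The pure-NumPy sketch below shows min-max scaling and the core SMOTE interpolation idea; it is illustrative, not the exact SMOTE variant or parameter settings used in the experiments:

```python
import numpy as np

def min_max(X):
    """Min-max normalization of each feature column to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority instances by interpolating
    between a minority instance and one of its k nearest minority
    neighbors (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.sum((X_min - X_min[i]) ** 2, axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the instance itself
        j = rng.choice(nn)
        gap = rng.random()                   # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)
```

Because each synthetic instance is a convex combination of two real minority instances, it always lies within the per-feature range of the minority class.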

Evaluation metrics
In this paper, we adopt seven widely used evaluation metrics: accuracy, precision, recall, F1, pf, G-measure and MCC.

Accuracy: The ratio of correctly predicted files (defective and non-defective) to all files.

Precision: The ratio of correctly predicted defective files to all files predicted to be defective.

Recall: The ratio of correctly predicted defective files to all actually defective files.

F1: The harmonic mean of precision and recall.

pf (probability of false alarm): The ratio of non-defective files incorrectly predicted as defective to all non-defective files.

G-measure: The harmonic mean of recall and (1 − pf).

MCC (Matthews correlation coefficient): The correlation between the actual and predicted outputs, which provides a comprehensive evaluation by considering TP, TN, FP and FN.
Except for pf, the larger the values of these metrics, the better the prediction performance.
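All seven metrics can be computed from a binary confusion matrix. The sketch below uses the definitions standard in defect prediction studies (taking G-measure as the harmonic mean of recall and 1 − pf, an assumption where the text does not spell the formula out):

```python
import math

def defect_metrics(tp, fp, tn, fn):
    """Seven evaluation metrics from a binary confusion matrix
    (defective = positive class)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0    # a.k.a. pd
    pf = fp / (fp + tn) if fp + tn else 0.0        # false alarm rate
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g = (2 * recall * (1 - pf) / (recall + 1 - pf)
         if recall + 1 - pf else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return dict(accuracy=acc, precision=precision, recall=recall,
                F1=f1, pf=pf, G_measure=g, MCC=mcc)
```

For example, a predictor with TP=50, FP=10, TN=30, FN=10 has accuracy 0.80 and pf 0.25.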

Baseline methods
To verify the performance of SCAE and SMONGE, we conduct extensive experiments for feature extraction and software defect prediction.
For feature extraction, we compare the deep neural network SCAE with six state-of-the-art feature extraction methods, including Maximally Collapsing Metric Learning (MCML) [Globerson and Roweis (2005)], Stochastic Neighbor Embedding (SNE) [Parviainen (2016)], Manifold Charting (MC) [Saini, Rambli, Sulaiman et al. (2013)], Locality Preserving Projection (LPP) [Lu, Wang, Zou et al. (2018)], Locally Linear Embedding (LLE) [Ji, Liu, Cao et al. (2017)] and Locally Linear Coordination (LLC) [Huang, Wang, Xu et al. (2009)].

In addition, from Tab. 6, we can find that our method SCAE is not ideal in terms of pf, but it is better than MCML, MC and LLE (the smaller the pf, the better the performance). Fig. 3 depicts the box-plots of four metrics for SCAE and the six feature extraction methods across all 20 projects. From Figs. 3(a), 3(c) and 3(d), we can find that the median values achieved by SCAE are higher than those achieved by the six feature extraction methods in terms of F1, G-measure and MCC, respectively, which fully demonstrates the superiority of SCAE; these observations are consistent with those in Tabs. 5, 7 and 8. Moreover, the lowest F1, G-measure and MCC values achieved by SCAE are higher than the median values achieved by MCML, MC, LLE and LLC, respectively.

Conclusion:
Our method SCAE outperforms the six state-of-the-art feature extraction methods in terms of F1, G-measure and MCC, yielding average performance improvements of 13.25%, 21.99% and 43.39%, respectively, compared with the six feature extraction methods across all 20 projects.
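For readers unfamiliar with the contractive autoencoder that underlies SCAE, the following minimal numpy sketch computes the contractive penalty of a single sigmoid encoder layer under the standard formulation (the Frobenius norm of the encoder's Jacobian); the paper's actual SCAE architecture and hyperparameters are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(W, b, X):
    """Penalty ||dh/dx||_F^2 of a sigmoid encoder h = sigmoid(X W^T + b),
    averaged over the batch.  Each layer of a stacked contractive
    autoencoder adds this term to its reconstruction loss, which
    encourages features that are robust to small input perturbations."""
    H = sigmoid(X @ W.T + b)            # hidden activations, (n, hidden)
    dh = (H * (1.0 - H)) ** 2           # squared sigmoid derivative
    w2 = np.sum(W ** 2, axis=1)         # squared row norms of W, (hidden,)
    return float(np.mean(dh @ w2))      # mean Frobenius norm over batch
```

The penalty is zero only when the encoder is locally constant (e.g., W = 0), so minimizing it alongside the reconstruction error trades fidelity for robustness of the extracted features.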

RQ2: How about the prediction performance of the proposed multi-objective SMONGE model compared to four classic defect predictors with the same feature extraction method SCAE?
Our multi-objective SMONGE model combines the feature extraction method SCAE and the ELM optimized by the multi-objective NSGAII algorithm. Since we adopt the ELM optimized by the multi-objective NSGAII algorithm as the defect predictor in this paper, this question is designed to evaluate the effectiveness of the SMONGE model compared with four classic defect predictors with the same feature extraction method SCAE, namely SDT, SKNN, SNB and SSVM. Tabs. 9-12 show the F1, pf, G-measure and MCC of the SMONGE model and the four classic predictors across all 20 projects, respectively. Note that the best value for each project is in bold font.

From Tabs. 9, 11 and 12, we can find that the SMONGE model achieves the best average performance in terms of F1, G-measure and MCC (except for SSVM) across all 20 projects. More specifically, the average F1 (0.8088) of SMONGE achieves improvements between 3.75% (over SKNN) and 19.88% (over SNB), with an average improvement of 9.09%, and the average G-measure (0.7675) of SMONGE yields improvements between 0.67% (over SDT) and 16.87% (over SNB), with an average improvement of 4.84%, compared with the four classic predictors with the same feature extraction method SCAE. In terms of MCC, the SMONGE model gains an average improvement of 14.79% compared with the four classic predictors, and it is only 0.2% worse than SSVM. Moreover, from Tab. 10, SMONGE is not the best predictor in terms of pf, but it is only worse than SDT and SSVM.

Fig. 4 shows the box-plots of four metrics for our SMONGE model and the four classic predictors with the same feature extraction method SCAE across all 20 projects. In Figs. 4(a), 4(c) and 4(d), the median value of SMONGE is higher than those of the four predictors in terms of F1, G-measure and MCC (except for SSVM in terms of G-measure and MCC), respectively. In particular, the median value of SMONGE is higher than the maximum value of SNB in terms of F1, G-measure and MCC.
In addition, we can observe from Fig. 4(b) that the median pf of SMONGE is only higher than those of SKNN and SNB, which also shows that SMONGE is not good enough in terms of pf.

Conclusion:
Our multi-objective SMONGE model performs better than the four classic predictors in terms of F1, G-measure and MCC (except for SSVM) on average. SMONGE achieves average performance improvements of 9.09%, 4.84% and 14.79% compared with the four classic defect predictors across all 20 projects in terms of F1, G-measure and MCC.

For the proposed multi-objective SMONGE model, the model utilizes the multi-objective NSGAII algorithm to optimize two objectives of the advanced ELM predictor based on state-of-the-art Pareto optimal solutions. One objective is to maximize the model performance, which refers to the benefit of the prediction model. The other objective is to minimize the output weight norm, which is related to the cost of the prediction model. Therefore, we show the best classification accuracy and the minimum output weight norm of ELM obtained by the multi-objective SMONGE model on each project in Tab. 13.
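To make the two objectives concrete, the following minimal numpy sketch trains a basic ELM (random hidden layer, closed-form output weights) and returns both quantities. This is an illustration only, not the authors' implementation; the ridge parameter and the choice of what NSGA-II would search over (e.g., hidden size, regularization, random seed) are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_objectives(X, y, n_hidden, ridge=1e-3, seed=0):
    """Train a basic ELM and return the two SMONGE-style objectives:
    classification accuracy (to maximize) and the output weight norm
    ||beta|| (to minimize).  The random hidden layer is fixed once;
    a multi-objective search such as NSGA-II would evaluate many
    candidate configurations and keep the Pareto-optimal ones."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_hidden, X.shape[1]))  # random input weights
    b = rng.standard_normal(n_hidden)                # random biases
    H = sigmoid(X @ W.T + b)                         # hidden activations
    t = np.where(y == 1, 1.0, -1.0)                  # +/-1 targets
    # ridge-regularized least squares for the output weights beta
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ t)
    acc = float(np.mean((H @ beta > 0) == (y == 1)))
    return acc, float(np.linalg.norm(beta))
```

A solution with slightly lower accuracy but a much smaller ||beta|| may survive on the Pareto front, which is exactly the benefit/cost trade-off the SMONGE model formalizes.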

Threats to validity
In this section, we introduce the potential threats to validity of our method, including internal validity, external validity and construct validity.

Internal validity
Internal validity mainly concerns uncontrolled internal factors that may affect our experimental results, such as errors in the experiments. We checked the entire experimental process carefully, but there may still be errors that we did not notice.

External validity
External validity concerns whether our experimental results can be generalized to other software subjects. To guarantee the representativeness of the software subjects used in this paper, we use 15 projects from the PROMISE data repository and 5 projects from the NASA data repository, which are commonly used projects in previous software defect prediction studies [Tantithamthavorn, McIntosh, Hassan et al. (2016); Chen and Ma (2015); Hosseini, Turhan and Gunarathna (2019); Peters, Menzies and Layman (2015)]. Moreover, these software projects belong to different application fields and cover a long time span.

Construct validity
Construct validity is related to whether the evaluation metrics used in our study reflect the real-world situation. To minimize this threat, we use seven evaluation metrics, including accuracy, precision, recall, F1, pf, G-measure and MCC, which have been widely used in recent software defect prediction studies [Kondo, Bezemer, Kamei et al. (2019); Nam, Pan and Kim (2013); He, Shu, Yang et al. (2012); Herbold, Trautsch and Grabowski (2018); Zhu, Zhang, Ying et al. (2020)], so we believe that the construct validity should be acceptable.

Conclusion
In this work, we apply an advanced feature extraction method and a novel multi-objective optimization model to software defect prediction. First, we utilize an advanced deep neural network, SCAE, to extract robust deep semantic features, which have stronger discrimination capacity for different classes. Second, we propose a novel multi-objective defect prediction model called SMONGE, which leverages the multi-objective NSGAII algorithm to optimize two objectives of the advanced ELM predictor based on state-of-the-art Pareto optimal solutions. One objective is to maximize the model performance, which refers to the benefit of the prediction model. The other objective is to minimize the output weight norm, which is related to the cost of the prediction model. We conduct extensive experiments for feature extraction and defect prediction across 20 software defect projects from large open source datasets, and the experimental results verify the effectiveness of SCAE and SMONGE.

In future work, to verify the generalization capability and practicability of SCAE and SMONGE, we will evaluate SCAE and SMONGE on more open source and commercial projects. In addition, we plan to leverage the multi-objective NSGAII algorithm to optimize more classifiers in software defect prediction.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.