A compositional model for effort-aware Just-In-Time defect prediction on Android apps

Android apps play important roles in daily life and work. To meet new user requirements, apps are updated frequently, which involves a large number of code commits. Previous studies proposed applying Just-in-Time (JIT) defect prediction to apps to identify in a timely manner whether new code commits introduce defects, aiming to assure app quality. In general, high-quality features help improve classification performance. In addition, the number of defective commit instances is much smaller than that of clean ones, that is, the defect data is class imbalanced. In this study, a novel compositional model, called KPIDL, is proposed to conduct the JIT defect prediction task for Android apps. More specifically, KPIDL first exploits a feature learning technique to preprocess the original data to obtain a better feature representation, and then introduces a state-of-the-art cost-sensitive cross-entropy loss function into the deep neural network to alleviate the class imbalance issue by considering the prior probability of the two classes. The experiments were conducted on a benchmark defect dataset consisting of 15 Android apps. The experimental results show that the proposed KPIDL model performs significantly better than 25 comparative methods in terms of two effort-aware performance indicators in most cases.

techniques, such as machine learning methods, to predict defective code regions, which attract many researchers to contribute to it for software quality assurance [2,3].
Recently, apps, especially Android apps, have become popular. They have the advantage that once developers release an update, users can download it immediately from App Stores [4]. Apps are usually updated frequently to meet the functional requirements of users. Such frequent updates inevitably introduce defects into new versions of apps, which harms app quality. Thus, the early detection of defects is an urgent issue during the development and maintenance of apps.
Many previous studies [5,6] on defect prediction worked at the class level, which lacks immediate feedback on defect-prone code. To overcome this drawback, researchers proposed Just-In-Time (JIT) defect prediction, which works at the commit level [7,8]. JIT defect prediction aims to identify whether a new code commit introduces defects, which offers timely feedback so that developers can detect defects as early as possible. Given this advantage, JIT defect prediction is appropriate for software that is updated frequently (such as apps) and therefore involves a large number of code commits. If a new commit instance introduces defects into the app, the instance is regarded as defective; otherwise, it is regarded as clean. Catolino et al. [9,10] made the first attempt to build JIT defect prediction models for Android apps based on features derived from the commit information.
In general, the quality of the feature representation largely determines the classification performance. Thus, learning a high-level feature representation has the potential to improve the performance of the defect prediction model. One solution for this purpose is to make full use of feature engineering techniques, such as the typical Principal Component Analysis (PCA) technique. PCA learns linear combinations of the original features in a new space [11,12]. However, PCA assumes that the data are linearly separable and follow a Gaussian distribution, which is impractical for real-world data that have complicated structures and are difficult to simplify into a linear subspace [13]. To overcome this shortcoming, we employ the non-linear enhanced version of PCA, that is, the Kernel-based PCA (KPCA) technique [14]. It introduces a non-linear kernel function to map the raw features of the commit instances into a latent high-dimensional space, in which the ability to handle complicated structures is strengthened and the features can be operated on linearly. In addition, we can select a subset of the mapped features in the new space to mitigate the negative impact of noisy data. Previous studies have already indicated that KPCA is superior to PCA for software engineering tasks [15][16][17].
Moreover, in the defect data, the number of defective commit instances is much smaller than that of clean ones. In other words, the defect data is class imbalanced. Since class imbalance usually deteriorates the performance of the classification model [17], it is crucial to deal with this issue to improve performance. In this study, we propose a novel Imbalanced Deep Learning method, IDL for short, to address the class imbalance issue. More specifically, as deep learning methods usually show excellent performance on classification tasks, we use a Deep Neural Network (DNN) model to construct an effective classifier on the mapped defect data of Android apps to identify the defective commit instances. However, since the traditional cross-entropy loss function in the DNN model assumes that the losses of the two classes have the same impact on the total loss, it is not suitable for imbalanced defect data. To overcome this drawback, we introduce a novel cost-sensitive cross-entropy (CSCE) loss function into the DNN. This loss function takes the prior probability of the two classes into consideration when calculating the cross-entropy loss [18]. In other words, it uses a weighting technique to compensate for the class imbalance between the defective and clean commit instances.
In this study, we propose a novel JIT defect prediction model, called KPIDL, which is a compositional framework that incorporates the abovementioned two methods, that is, KPCA for feature representation learning and IDL for classification model construction. More specifically, KPIDL first exploits the KPCA technique to transform the original data into a new space to learn more representative features towards commit instances. Then, with the transformed features of commit instances, KPIDL employs our proposed IDL method to construct a classification model for handling the inherent class imbalanced problem of the defect data.
The confusion matrix-based performance indicators used in previous defect prediction studies, such as Precision, Recall, and F-measure, assume that developers always have enough testing effort available, which is unrealistic in real-world app quality assurance activities. The reason is that apps iterate quickly with short development cycles, so only limited testing resources are available for code inspection. Thus, the classical performance indicators are inadequate in this setting. Arisholm et al. [19] proposed taking code inspection effort into account for performance evaluation, termed effort-aware indicators, for defect prediction on traditional software projects, which is more appropriate for practical activities. Given these advantages, in this study, we focus on the effort-aware performance evaluation of our proposed JIT defect prediction method for Android apps.
To evaluate the prediction performance of our proposed JIT defect prediction framework KPIDL, in this study, we conduct experiments on a benchmark dataset that includes 15 Android apps and employ Effort-Aware Recall (EARecall) and Effort-Aware F-measure (EAF-measure) as performance indicators. Across the 15
This study extends our previous study published as a conference paper [20]. The four main differences between the two studies are highlighted as follows: (1) We take feature learning into account to reduce the impact of noisy data and obtain a high-quality feature representation to improve the performance of our JIT defect prediction model, which is ignored in the conference version; (2) We replace the original performance measurement with the effort-aware indicators to evaluate our proposed method, which better simulates practical activities in which the code inspection effort is always limited; (3) We add 5 methods for comparison to verify the effectiveness of our JIT defect prediction framework for Android apps; (4) We extend the data by adding 4 real-world apps to enhance the generalisation of our model and make a more in-depth analysis of our experimental results.
The main contributions of this study are summarised as follows: (1) To the best of our knowledge, we are the first to consider both the feature learning and class imbalance issue simultaneously for JIT defect prediction on Android apps.
The remainder of this study is organised as follows. The next section introduces the related work. Section 3 describes our KPIDL model in detail. We introduce the experimental setup and illustrate the performance evaluation in Sections 4 and 5, respectively. The threats to the validity of our study are discussed in Section 6. Finally, we conclude this study and present future work in Section 7.

| Feature learning in defect prediction
The process of feature representation learning usually applies some feature engineering methods to preprocess the raw feature set, making the new feature set better reveal the characteristics of the defect data. The widely used feature engineering methods include the feature selection methods and feature extraction methods [21]. The former ones select a part of features to replace the original feature set. The classic methods include filter-based feature ranking methods and wrapper-based feature subset selection methods. The latter ones transform the raw features into a new space in which the new feature form can well represent the defect data. The classic methods include the PCA and its kernel version KPCA.
Liu et al. [22] proposed a clustering and ranking-based feature selection method FECAR that first applied the symmetric uncertainty measure to cluster the original features into multiple groups and then used three relevance measures to obtain more relevant features from each group. The experimental results on Eclipse and NASA datasets demonstrated that FECAR was more effective in choosing the relevant features. Xu et al. [21] empirically assessed the impact of 32 feature selection techniques on defect prediction performance. Their experiments on the three datasets showed that the principal component analysis technique performed the worst among all the methods. Shivaji et al. [23] assessed the impact of filter-based and wrapper-based feature preprocessing techniques on the performance of defect prediction. Their experiments on 11 projects showed that using only 10% of the original features could still improve the defect prediction. Chen et al. [24] proposed a method MOFES that regarded the feature selection as a multi-objective optimisation issue, aiming at minimising the feature number and maximising the prediction performance simultaneously. They conducted experiments on PROMISE dataset and the results indicated that MOFES selected fewer features but obtained better prediction performance. Ghotra et al. [25] empirically investigated the impact of 30 feature selection methods applied to 21 classification models on the performance of defect prediction. They conducted experiments on NASA and PROMISE datasets and found that the correlation-based feature subset selection method with the best-first search strategy was more suitable for the defect prediction task compared with other baseline methods. Ni et al. [26] developed a novel cluster-based method FeSCH that used a density-based clustering technique and then proposed three different heuristic ranking strategies to select the useful features for cross-project defect prediction. Their experimental results on ReLink and AEEEM datasets showed the effectiveness of FeSCH.
Different from the above studies that took feature preprocess into account for defect prediction in traditional software projects, we simultaneously consider both feature learning and class imbalanced learning for JIT defect prediction on Android apps.

| Class imbalanced learning in defect prediction
Most defect data have an inherent class imbalance property, that is, the defective instances are usually far fewer than the non-defective ones. The purpose of class imbalanced learning methods is to address the adverse impacts of this issue on the classification models. The widely used class imbalanced learning methods include sampling-based methods, ensemble-based methods, and cost-sensitive-based methods [27]. Sampling-based methods make the number of instances in the two classes the same by adding or removing some instances. Ensemble-based methods combine several weak classifiers to obtain a better and more comprehensive classifier with strong ability. Cost-sensitive-based methods introduce the concept of misclassification cost to minimise the misclassification errors.
Liu et al. [28] proposed to apply the cost information in both feature selection and classification phases and developed three novel cost-sensitive feature selection methods for software defect prediction. They conducted experiments on NASA dataset and the results showed that the proposed cost-sensitive feature selection algorithms obtained promising prediction performances compared with the traditional methods that only considered cost information in classification. Siers et al. [29] proposed two decision tree-based cost-sensitive methods to minimise the classification cost for software defect prediction. They conducted experiments on six projects and the results illustrated that their techniques performed better than six comparative methods. Bennin et al. [30] empirically explored the statistical and practical effects of six sampling methods on five classification models for the defect prediction task. They conducted experiments on 10 open source projects and the results showed that the investigated sampling methods had significant and practical effects in terms of performance indicators Pd, Pf, and G-mean but had no impact on AUC. Bennin et al. [31] empirically assessed the impacts of the percentage of fault-prone modules on seven sampling methods applied to five classification models for defect prediction. They conducted experiments on 10 static metric projects and the results demonstrated that the performance of these classification models could be largely impacted by this parameter except for AUC. Tantithamthavorn et al. [32] empirically investigated the impact of four popularly used sampling-based class rebalancing techniques on the performance of 10 widely used indicators and explored the impact of these sampling methods on the interpretation of defect prediction models. Their large-scale experiments with 101 systems showed that sampling-based class rebalancing methods were not helpful to interpret the defect prediction models. Bennin et al. [33] proposed a novel synthetic oversampling approach MAHAKIL that treated two different sub-classes as parents and produced a new sample that inherited traits from both parents to enhance the diversity of data distribution. They conducted experiments on PROMISE dataset and the results indicated that MAHAKIL was superior to the baseline methods.
Different from the above studies that focussed on relieving the class imbalanced issue with machine learning-based approaches for traditional software defect prediction, we design a deep learning model incorporating a novel cost-sensitive loss function to deal with this issue for JIT defect prediction on Android apps.

| JIT defect prediction for traditional software
Kamei et al. [7] applied some factors derived from characteristics of the software changes to predict JIT defects. They conducted a large-scale empirical study on 11 projects from different domains and the experimental results showed that their method effectively detected the most risky changes. Fukushima et al. [34] employed a JIT cross-project model to alleviate the issues of demand for a large amount of training data during the development stages. Their results on 11 open source projects showed that ensemble learning methods using historical data achieved better performance. Kamei et al. [35] explored a cross-project model for JIT defect prediction on 11 projects and the results showed that the cross-project model provided an alternative way for projects with limited historical data. McIntosh et al. [36] conducted experiments to address whether the important properties of fix-inducing changes were consistent with system evolution on two software systems with 37,524 changes. Their results showed that fluctuations derived from the system evolution impacted on the consistency. Yang et al. [37] proposed a method, called TLEL, which combined ensemble techniques with random forests to predict JIT defects. They conducted experiments on six open source projects with two indicators and the results showed that TLEL could discover over 70% of defects by reviewing only 20% of the code lines. Pascarella et al. [38] explored to what extent the commits were defective and proposed a fine-grained model for JIT defect prediction. Their experiments on 10 open source projects showed that their method could obtain better prediction performance in terms of the AUC indicator. Cabral et al. [39] conducted the first study to take class imbalance into consideration for JIT defect prediction. The experimental results on 10 projects from GitHub repository showed that their method performed better than baseline methods in terms of the g-mean indicator. Kondo et al. [40] defined the context metrics to perform the JIT defect prediction task. They conducted experiments on six open source projects and the results showed that the composites of two extended context metrics performed significantly better than those of the other metrics in terms of MCC and AUC indicators.
Different from the above studies that only conducted experiments on traditional software, in this study, we focus on JIT defect prediction for Android apps.

| Defect prediction for android apps
Ortu et al. [41] analysed defect characteristics from the logs of traditional software and mobile apps using natural language text classification techniques. Their experimental results showed that the High-Priority and Low-Priority defects in the domains of traditional and mobile software were different. Khomh et al. [42] proposed three metrics to capture the patterns of failure occurrences for defect prediction. They conducted experiments on 18 versions of an enterprise app and the results showed that these metrics predicted defects in a shorter time. Scandariato et al. [43] employed a Support Vector Machine (SVM) model to identify and analyse vulnerable components of apps using source code metrics. They analysed a popular application in the Android Market and the results showed that their model achieved higher accuracy and precision. Kaur et al. [44] compared the defect prediction performance using static code-based metrics and process-based metrics. They conducted experiments on an open source app with seven machine learning methods and the results showed that process metrics-based models achieved better performance for defect prediction on apps. Ricky et al. [45] proposed an SVM method to predict defects on apps. Their experimental results on five datasets showed that their SVM method achieved better performance than that of the decision trees. Malhotra et al. [46] proposed a framework for identifying defective classes using object-oriented metrics. They conducted experiments on seven widely used Android apps and the results showed that there existed performance differences among 18 classification models. Kaur et al. [47] explored process metrics for defect prediction on an open source app. Their experimental results showed that the models with process metrics achieved better performance than those with code metrics.
Since the timely feedback is a characteristic of JIT defect prediction, it is especially suitable for frequently updated apps. However, there are few studies for identifying JIT defects on apps. Catolino et al. [9,10] took the first attempt to explore the JIT defect prediction for apps and then compared the impacts of multiple machine learning methods and ensemble learning techniques. They extracted the defect data of apps from the COMMIT GURU platform as a benchmark dataset and their experimental results showed that Naive Bayes performed significantly better than other classifiers.
Different from most previous studies that focussed on the traditional mobile software at the class level, in this work, we study JIT defect prediction at the code change or commit level. Different from [9,10] that explored the traditional machine learning classifiers for JIT defect prediction on apps, we propose a novel method to learn effective feature representation for this task.

| Deep learning in defect prediction
Yang et al. [8] proposed Deeper, which employed a deep belief network for defect prediction. Their experimental results on six projects with 137,417 changes showed significantly better performance than those of Kamei et al.'s approach on most projects. Li et al. [48] proposed a method that extracted features from ASTs and then employed Convolutional Neural Networks (CNNs) for feature representation learning. They conducted experiments on seven open source projects and the results showed significant performance improvement of their method. Phan et al. [49] learned semantic features employing directed graph-based CNNs for defect prediction. They conducted experiments on four projects and the results showed the significant superiority compared with the six baseline methods. Manjula et al. [50] proposed a novel method that employed the genetic algorithm and the deep neural network for feature representation learning and classification. Their experimental results on PROMISE dataset showed better accuracy than comparative methods. Xu et al. [51] proposed a method, called LDFR, which employed the deep neural network with a cross-entropy loss function for predicting defective modules. They conducted experiments on 27 project versions and the results showed that LDFR presented significant superiority.
Different from the above studies that used deep learning techniques for defect prediction on traditional software projects, we make the first attempt to introduce the deep learning technique into JIT defect prediction on Android apps by considering both feature representation learning and class imbalance learning.
Figure 1 depicts the overview of our proposed KPIDL model, which mainly includes two stages. The first stage is the KPCA-based feature representation learning process and the second stage is the IDL-based classification model construction process. The details of the two stages are described as follows:

| Feature representation learning with KPCA
In the first stage, we use the KPCA method to transform the raw features of commit instances, via a kernel function-based mapping, into a latent space in which more representative features can be learned.
Assume that the commit instance set of apps is defined as $S = \{(X, Y) \mid X \in \mathbb{R}^{n \times m}, Y \in \mathbb{R}^{n}\}$, where $x_i = [x_{i1}, x_{i2}, \dots, x_{im}]$ is the feature set of the $i$-th commit instance, and $Y = \{y_i \mid i = 1, 2, \dots, n\}$ is the corresponding label set. The goal of feature learning is to obtain a high-quality feature representation that is better able to promote the classification performance. For this purpose, in this study, we first use the Kernel-based Principal Component Analysis (KPCA) technique to acquire more representative features for each commit instance. KPCA employs the non-linear mapping function $\Phi$ to transform the original features into a new feature space $F$ [2,17,52,53]. Assume that the centralised projection point of $x_i$ is defined as $\Phi(x_i)$; the covariance matrix $C$ is formulised as follows:
$$C = \frac{1}{n} \sum_{i=1}^{n} \Phi(x_i)\,\Phi(x_i)^{T} \tag{1}$$
We execute the linear principal component analysis transformation by diagonalising the covariance matrix $C$, which amounts to solving the eigenvalue problem in the eigenspace $F$ and is formulised as follows:
$$\lambda V = C V \tag{2}$$
where $\lambda$ denotes the eigenvalues ($\lambda \geq 0$) and $V$ denotes the corresponding eigenvectors of the covariance matrix $C$.
Since all eigenvectors $V$ lie in the span of the centralised projection points $\Phi(x_1)$, $\Phi(x_2)$, …, and $\Phi(x_n)$, we multiply both sides of Equation (2) by $\Phi(x_l)^{T}$ ($l = 1, 2, \dots, n$), which is formulised as follows:
$$\lambda\,\Phi(x_l)^{T} V = \Phi(x_l)^{T} C V \tag{3}$$
Then, the eigenvector $V$ is formulised as follows:
$$V = \sum_{j=1}^{n} \alpha_j\,\Phi(x_j) \tag{4}$$
where $\alpha_j$ is the coefficient of $\Phi(x_j)$ in this linear expansion.
Since it is unrealistic to appoint a specific form of $\Phi$, we introduce the kernel function $\kappa(x_i, x_j)$, which is formulised as follows:
$$\kappa(x_i, x_j) = \Phi(x_i)^{T}\,\Phi(x_j) \tag{5}$$
By incorporating Equations (2), (4), and (5), we can obtain the following formula:
$$n\lambda K \alpha = K^{2} \alpha \;\;\Rightarrow\;\; n\lambda \alpha = K \alpha \tag{6}$$
where $K$ is the kernel matrix (of size $n \times n$) corresponding to $\kappa$ and $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_n]^{T}$ is the coefficient vector. For a given commit instance $x_{ins}$, we extract the non-linear projection onto the $k$-th kernel component, which is formulised as follows:
$$(V^{k})^{T}\,\Phi(x_{ins}) = \sum_{i=1}^{n} \alpha_i^{k}\,\kappa(x_i, x_{ins}) \tag{7}$$
To perform the non-linear mapping, in this study, we apply the Gaussian Radial Basis Function (RBF) as the basic kernel function, which is formulised as follows:
$$\kappa(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\delta^{2}}\right) \tag{8}$$
where $\lVert \cdot \rVert$ and $2\delta^{2}$ signify the $L_2$ norm and the width of the RBF function, respectively. After employing the KPCA technique on our defect data, we obtain the transformed data set $S' = \{(X', Y) \mid X' \in \mathbb{R}^{n \times p}, Y \in \mathbb{R}^{n}\}$, where $p$ is the dimension of the new feature space.
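The KPCA mapping described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `rbf_kernel` and `kpca_fit_transform`, the random toy data, and the values chosen for `p` and `delta` are all assumptions made for illustration. The sketch centres the kernel matrix, solves the eigenvalue problem on it, and projects the training instances onto the top `p` kernel components.

```python
import numpy as np

def rbf_kernel(A, B, delta):
    """Gaussian RBF kernel: exp(-||a - b||^2 / (2 * delta^2))."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * delta ** 2))

def kpca_fit_transform(X, p, delta=1.0):
    """Project the n training instances in X onto the top-p kernel components."""
    n = X.shape[0]
    K = rbf_kernel(X, X, delta)
    # Centre the kernel matrix, which corresponds to centring Phi(x) in feature space.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigen-decompose the centred kernel matrix; eigh returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:p]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale the coefficients so that each feature-space eigenvector has unit length.
    alphas = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))
    # Projection of x_i onto component k: sum_j alpha_j^k * kappa(x_j, x_i).
    return Kc @ alphas

# Hypothetical toy data: 50 commit instances described by 14 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 14))
X_new = kpca_fit_transform(X, p=5, delta=3.0)
print(X_new.shape)  # (50, 5)
```

Choosing `p` smaller than the number of instances keeps only the leading components, which is one way to realise the idea of selecting a part of the mapped features to reduce the influence of noise.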

| Classification model construction with IDL
In the second stage, we propose a novel deep learning-based method to build the classification model, which has the ability to deal with the class imbalanced issue. As our KPIDL framework incorporates an improved cross-entropy loss function into a DNN method, we first introduce the original cross-entropy loss function, then detail its improved version, that is, the CSCE loss function, and last illustrate the DNN method.

| Cross-entropy loss function
The goal of label classification is to make the output labels of the commit instances generated by an unknown learning function $f(x)$ as close as possible to the real labels. Here, we define a generalised model $f(x \mid \theta)$ to obtain the output, where $\theta$ is the parameter set of the model. The parameter $\theta$ can be estimated by minimising the cross-entropy loss function, which is a convex function. The formula is defined as follows:
$$\ell(\theta) = -\sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \tag{9}$$
where $\hat{y}_i = f(x_i \mid \theta)$ is the output of the model for the $i$-th commit instance. When $y_i = 0$, the $i$-th term of $\ell(\theta)$ is equal to $-\log(1 - \hat{y}_i)$, and when $y_i = 1$, it is equal to $-\log(\hat{y}_i)$. As $\hat{y}_i$ tends to $y_i$, the loss decreases logarithmically [18].
On balanced data, the losses from $-\log(\hat{y}_i)$ and $-\log(1 - \hat{y}_i)$ each account for half of the total loss for a specific model output $\hat{y}_i$. However, on imbalanced data, the loss from the instances in the majority class has a larger impact on the total loss $\ell(\theta)$. The reason is that the total loss is calculated without regard to which class the instances causing the loss belong to.
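A small worked example may make the imbalance effect concrete. The sketch below assumes a hypothetical set of 10 defective and 90 clean commits that an untrained model scores uniformly at 0.5, and splits the unweighted cross-entropy loss by class. The helper name `cross_entropy_terms` and the data are illustrative only.

```python
import math

def cross_entropy_terms(y_true, y_pred):
    """Return the summed loss contributed by defective (y=1) and clean (y=0) instances."""
    loss_defective = sum(-math.log(p) for y, p in zip(y_true, y_pred) if y == 1)
    loss_clean = sum(-math.log(1.0 - p) for y, p in zip(y_true, y_pred) if y == 0)
    return loss_defective, loss_clean

# Hypothetical data: 10 defective and 90 clean commits, all scored 0.5.
y_true = [1] * 10 + [0] * 90
y_pred = [0.5] * 100
loss_defective, loss_clean = cross_entropy_terms(y_true, y_pred)
clean_share = loss_clean / (loss_defective + loss_clean)
print(f"clean share of total loss: {clean_share:.2f}")  # 0.90
```

Because every instance contributes the same loss here, the majority class accounts for 90% of the total, so a parameter update driven by this loss is dominated by the clean commits.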

| Cost-sensitive cross-entropy loss function
From the above analysis, it is found that the traditional cross-entropy loss function does not work well on imbalanced data. To alleviate the class imbalance issue, the key point lies in assigning weights to the two kinds of losses, that is, $-y_i \log(\hat{y}_i)$ and $-(1 - y_i) \log(1 - \hat{y}_i)$.
Since the prior probability ratio, such as the ratio of the number of defective commit instances to the total number of commit instances, is helpful to achieve a balance between different classes, in this work, we introduce this term into the cross-entropy loss function, called the CSCE loss function, to compensate for the imbalance of the commit defect data, which is formulised as follows:
$$\ell(\theta) = -\sum_{i=1}^{N} \left[ (1 - \lambda)\, y_i \log(\hat{y}_i) + \lambda\, (1 - y_i) \log(1 - \hat{y}_i) \right] \tag{10}$$
where $\lambda = M/N$ is the percentage of defective instances (not the eigenvalue used in the KPCA stage), $M$ is the number of commit instances with the defective label ($y_i = 1$), and $N$ is the total number of the commit instances. The previous study [18] proved that the CSCE loss rate was almost constant when the prior probability was taken into account. This leads to a balance between the two different classes.
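The weighting idea can be sketched in a few lines of Python. This is a sketch under an explicit assumption: the loss on defective instances is weighted by (1 - lambda) and the loss on clean instances by lambda, where lambda = M/N is the defective ratio. That is one natural way to use the prior probability to compensate for the imbalance, and the exact weighting in [18] may differ. The function name `csce_loss` and the toy data are hypothetical.

```python
import math

def csce_loss(y_true, y_pred, eps=1e-12):
    """Cost-sensitive cross-entropy with prior-based weights (assumed form).

    Returns (total loss, loss from defective instances, loss from clean instances).
    """
    lam = sum(y_true) / len(y_true)  # lambda = M / N, the defective ratio
    loss_defective = 0.0
    loss_clean = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        if y == 1:
            loss_defective += -(1.0 - lam) * math.log(p)
        else:
            loss_clean += -lam * math.log(1.0 - p)
    return loss_defective + loss_clean, loss_defective, loss_clean

# Hypothetical data: 10 defective and 90 clean commits, all scored 0.5.
y_true = [1] * 10 + [0] * 90
y_pred = [0.5] * 100
total, loss_defective, loss_clean = csce_loss(y_true, y_pred)
print(f"defective: {loss_defective:.3f}, clean: {loss_clean:.3f}")
```

With these weights the 10 defective and 90 clean commits contribute equally to the total loss for a uniform model output, which is the balance the weighting is meant to achieve.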

| Deep neural network
Deep Neural Network consists of three kinds of network layers, including the input layer, the hidden layer, and the output layer [54]. In general, the first layer is the input layer with many units, which receives the input feature vectors. The last layer is the output layer, which outputs the results generated by the DNN model. The hidden layer consists of one or more layers. Different from the basic multi-layer perceptron that only has one unit in the output layer, DNN extends the network structure with many hidden layers and the output layer with one or more units, which improves the ability of representation learning. In DNN structure, the network units between layers are fully connected and the network units in the same layer are not connected. There are two main steps in the training process of DNN. The first step is the forward propagation, in which each layer takes the original vectors, weighted coefficient matrix, and bias vectors as inputs, and then outputs the results of the linear operation.
In the second step, the back propagation algorithm is applied to optimise the model parameters in each layer. The aim is to make the model output values as close as possible to the real labels. Given a set of commit instances in apps, we input the feature vectors of these mapped instances into the first layer of the DNN. After processing by the hidden layers and the output layer, the model calculates the total loss between the predicted labels and the true labels of the commit instances using the CSCE loss function. Then, back propagation is employed to obtain the optimal parameters. These two steps are repeated until the total loss reaches a certain threshold. The training procedure of the DNN is illustrated in Figure 2.
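The two training steps can be sketched with a small NumPy network. Everything specific here is an assumption for illustration rather than the configuration used in the study: one hidden layer with tanh units, a sigmoid output, full-batch gradient descent, a loss averaged over instances, the class weights (1 - lambda) for defective and lambda for clean instances, and the hypothetical names `train_idl` and `forward`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward propagation: linear operation plus activation in each layer."""
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)
    y_hat = sigmoid(H @ W2 + b2).ravel()
    return H, y_hat

def train_idl(X, y, hidden=8, lr=0.5, max_epochs=500, loss_threshold=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    lam = y.mean()                        # defective ratio M / N
    w = np.where(y == 1, 1.0 - lam, lam)  # assumed per-instance class weight
    params = [rng.normal(scale=0.5, size=(m, hidden)), np.zeros(hidden),
              rng.normal(scale=0.5, size=(hidden, 1)), np.zeros(1)]
    history = []
    for _ in range(max_epochs):
        H, y_hat = forward(X, params)
        y_hat = np.clip(y_hat, 1e-12, 1.0 - 1e-12)
        # Weighted cross-entropy, averaged over the instances.
        loss = -np.mean(w * (y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))
        history.append(loss)
        if loss < loss_threshold:         # stop once the loss reaches the threshold
            break
        # Back propagation: gradient of the weighted loss w.r.t. the output logit.
        dz2 = (w * (y_hat - y) / n)[:, None]
        dH = dz2 @ params[2].T
        dz1 = dH * (1.0 - H ** 2)
        grads = [X.T @ dz1, dz1.sum(axis=0), H.T @ dz2, dz2.sum(axis=0)]
        for p, g in zip(params, grads):
            p -= lr * g                   # in-place gradient descent update
    return params, history

# Hypothetical toy data: 160 clean and 40 defective instances with 5 features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(160, 5)), rng.normal(1.5, 1.0, size=(40, 5))])
y = np.concatenate([np.zeros(160), np.ones(40)])
params, history = train_idl(X, y)
_, scores = forward(X, params)
print(f"first loss {history[0]:.3f}, last loss {history[-1]:.3f}")
```

In practice the inputs would be the KPCA-transformed features, and the network depth, learning rate, and stopping threshold would need to be tuned.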

| Research Questions
To evaluate our proposed KPIDL method, in this study, we design the following four Research Questions (RQs).

RQ 1: Is our KPIDL method superior to its variants?
Our proposed KPIDL framework consists of two parts, that is, the KPCA method for feature learning and the IDL method for relieving the class imbalance issue, in which KPCA is the improved version of the original linear feature learning method PCA and IDL is the weighted extension of the original Deep Learning (DL) model. This question is designed to explore whether our KPIDL method is more effective than these technical combinations at enhancing JIT defect prediction performance on Android apps.

RQ 2: Is our KPIDL method superior to sampling-based imbalanced learning methods?
Sampling-based methods relieve the class imbalance issue by adjusting the number of positive and negative samples. This question is designed to explore whether our imbalanced learning method IDL is superior to the sampling-based methods for JIT defect prediction on Android apps.
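As an illustration of how a sampling-based method adjusts the class counts, the sketch below implements plain random oversampling, which duplicates randomly chosen minority instances until both classes have the same size. It is one simple member of this family, chosen here as an example; the function name `random_oversample` and the toy data are hypothetical.

```python
import random

def random_oversample(instances, labels, seed=0):
    """Duplicate random minority-class instances until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        raise ValueError("both classes must contain at least one instance")
    # Draw extra minority indices with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    indices = majority + minority + extra
    rng.shuffle(indices)
    return [instances[i] for i in indices], [labels[i] for i in indices]

# Hypothetical data: 3 defective and 9 clean commits identified by an integer id.
commits = list(range(12))
labels = [1] * 3 + [0] * 9
new_commits, new_labels = random_oversample(commits, labels)
print(new_labels.count(1), new_labels.count(0))  # 9 9
```

Undersampling works in the opposite direction by discarding majority instances, and both approaches change the training data rather than the loss function, in contrast to a cost-sensitive loss.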

RQ 3: Does our KPIDL method perform better than ensemble-based imbalanced learning methods?
Ensemble learning methods deal with the imbalanced data by creating multiple base models and then integrating the predictions of these base models to improve the overall performance. This question is designed to explore whether our IDL method achieves better performance than that of the ensemble learning methods on imbalanced defect data of Android apps.

RQ 4: How effective is our IDL method compared with cost-sensitive-based imbalanced learning methods?
Cost-sensitive-based methods alleviate the class imbalance issue by assigning higher misclassification costs to instances in the minority class and seeking to minimise the high-cost errors. This question is designed to explore whether our IDL method is more effective than the cost-sensitive-based imbalanced learning methods at improving the JIT defect prediction performance on Android apps.

| Benchmark dataset
To evaluate the performance of our IDL method, in this study, we employ a benchmark dataset of 15 Android apps provided by a recent study [10]. Here, we briefly describe these apps. Android Firewall is a powerful firewall app based on Linux iptables, which allows users to control which apps can access the network. Alfresco is a business office app, which ensures that corporate documents are accessed securely. Android Sync is an Android synchronisation manager, which transmits data between the Android device and PC using only USB. Android Wallpaper provides a variety of high-quality wallpapers that are manually checked and sorted. AnySoftKeyboard provides multi-language support and privacy protection for the on-screen keyboard on Android mobile devices. Apg brings email encryption to Android devices for privacy protection. Chat Secure provides users with a secure communication app based on open standards, such as XMPP/Jabber and OTR encryption. Kiwix is a lightweight app, which allows users to read and download files (e.g., Wikipedia, Wiktionary, and TED talks) when no internet connection is available. Own Cloud Android provides a cloud storage platform to synchronise personal private files. Page Turner provides an ebook reader, which can maintain the same reading progress across multiple devices. Notify Reddit allows users to receive their favourite notifications on their Android wearable. Android Universal Image Loader provides synchronous and asynchronous image loading. Observable Scroll View listens for scrolling status and can interact with the Toolbar easily. Applozic Android SDK is an in-app solution that makes real-time chat in apps more convenient. Delta Chat is an email-based instant messaging tool that avoids tracking and central control. As we can see from the above descriptions, these apps come from various domains.
Table 1 summarises the basic information of these apps, including lines of the code (# LOC), the total number of commit instances (# TC), the number of defective instances (# DC), the number of clean instances (# CC), and the ratio of defective instances (% DR). If a new commit instance introduces the defects, this instance is deemed as defective, otherwise, clean. The code lines of these apps are between 9506 and 275,637, which means that these apps have different scales. The commit instances of the apps in the benchmark dataset are characterised by a feature set from different scopes, such as Diffusion, Size, Purpose, History, and Experience. We follow the original work [10] to use the widely used 14 features that have been proved to be the most useful ones to identify defective commit instances in the context of JIT defect prediction for Android apps.

| Performance indicators
Traditional confusion matrix-based indicators, such as Precision, Recall, and F-measure, assume that during the testing process, the effort for reviewing distinct code snippets is the same and that the test resources for code inspection are always sufficient. Nevertheless, in practice test resources are limited, and inspecting different code snippets requires different amounts of effort. To overcome the aforementioned weaknesses, in this study, we evaluate the performance of our proposed KPIDL method with effort-aware indicators that take code review effort into account for a more practical simulation. We regard the sum of the features LA and LD as a proxy for code inspection effort. In addition, the available test resources are set to 20% of all the efforts, following the previous studies [55,56]. Below, we briefly describe how to calculate the effort-aware indicators.
Following the previous studies [2,51], we first train a classification model using our IDL method on the commit features transformed by the KPCA technique, and use it to classify the commit instances in the test set into two groups (i.e., defective and clean). Then, the commit instances in each group are ranked in ascending order of their code inspection efforts. Next, the ranked commit instances are merged into a candidate list in which those predicted to be defective are placed at the front. After that, we simulate practitioners inspecting the candidate instances from the top of the list to the bottom. This inspection continues until the accumulated effort reaches 20% of all efforts, at which point we obtain the following statistics on the inspected commit instances to calculate the effort-aware indicators.
• $N_d$ refers to the total number of defective commit instances in the candidate set.
• $N_i$ refers to the total number of commit instances reviewed when inspecting 20% of the efforts.
• $N_{id}$ refers to the total number of defective commit instances reviewed when inspecting 20% of the efforts.
Based on the above statistics, the first effort-aware indicator is Effort-Aware Recall (EARecall), defined as the ratio of reviewed defective commit instances to all defective commit instances in the candidate set. EARecall is denoted as follows:

$$\mathrm{EARecall} = \frac{N_{id}}{N_d}$$

Effort-Aware Precision (EAPrecision) refers to the ratio of reviewed defective commit instances to all reviewed commit instances, which is denoted as

$$\mathrm{EAPrecision} = \frac{N_{id}}{N_i}$$

The second effort-aware indicator, EAF-measure, resembles the traditional F-measure and is the weighted harmonic average of EARecall and EAPrecision. EAF-measure is denoted as follows:

$$\mathrm{EAF\text{-}measure} = \frac{(1+\theta^2) \times \mathrm{EAPrecision} \times \mathrm{EARecall}}{\theta^2 \times \mathrm{EAPrecision} + \mathrm{EARecall}}$$

where θ is a trade-off parameter and is set as 2 following the previous studies [2, 51, 57].
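The ranking and inspection procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Commit` record, the function name `effort_aware_scores`, and the toy data are hypothetical, and the sketch assumes that inspection stops before any commit that would push the accumulated effort past the 20% budget.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    predicted_defective: bool  # label assigned by the classifier
    actually_defective: bool   # ground-truth label
    effort: int                # inspection effort proxy: LA + LD


def effort_aware_scores(commits, budget=0.2, theta=2.0):
    """Return (EARecall, EAPrecision, EAF-measure) for one test set."""
    # Predicted-defective commits come first; each group is sorted by
    # ascending inspection effort.
    ranked = sorted(commits, key=lambda c: (not c.predicted_defective, c.effort))
    limit = budget * sum(c.effort for c in commits)
    n_d = sum(c.actually_defective for c in commits)

    spent = n_i = n_id = 0
    for c in ranked:
        # Assumption: a commit that would exceed the budget is not inspected.
        if spent + c.effort > limit:
            break
        spent += c.effort
        n_i += 1
        n_id += c.actually_defective

    recall = n_id / n_d if n_d else 0.0
    precision = n_id / n_i if n_i else 0.0
    denom = theta ** 2 * precision + recall
    f_measure = (1 + theta ** 2) * precision * recall / denom if denom else 0.0
    return recall, precision, f_measure


# Toy example with five commits (total effort 200, budget 40).
toy = [
    Commit(True, True, 10),
    Commit(True, False, 20),
    Commit(False, True, 30),
    Commit(False, False, 40),
    Commit(True, True, 100),
]
ea_recall, ea_precision, ea_f = effort_aware_scores(toy)
```

In the toy example only the first two ranked commits fit into the budget, so one of the three defective commits is found after reviewing two commits.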

| Data partition
In this study, we employ the stratified sampling method to generate the training set and test set, ensuring that the two sets have the same ratio of the two kinds of labels. More specifically, for each app, we merge half of the defective commit instances and half of the clean commit instances into the training set and use the remainder as the test set to run our KPIDL method and the comparative methods. After that, we exchange the training set and test set and run these methods again. For each data partition, we thus obtain two results. To reduce the negative impact of the random partition on our experimental results, we repeat this procedure 25 times. Thus, we obtain a total of 25 × 2 = 50 indicator values. In this study, we report the average value and the corresponding standard deviation for each performance indicator.
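As an illustration of this partition scheme, the following sketch generates the 50 train/test index pairs for one app using only the Python standard library. The function names and the fixed seed are hypothetical choices made for the example; the original experiments may have used a library routine instead.

```python
import random


def stratified_halves(labels, rng):
    """Split instance indices into two halves with the same class ratio."""
    half_a, half_b = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        mid = len(idx) // 2
        half_a.extend(idx[:mid])
        half_b.extend(idx[mid:])
    return half_a, half_b


def repeated_swapped_splits(labels, repeats=25, seed=0):
    """Yield (train_indices, test_indices); each partition is used twice."""
    rng = random.Random(seed)
    for _ in range(repeats):
        a, b = stratified_halves(labels, rng)
        yield a, b  # train on the first half, test on the second
        yield b, a  # exchange the training set and the test set


# Toy app with 4 defective (1) and 10 clean (0) commits.
labels = [1] * 4 + [0] * 10
splits = list(repeated_swapped_splits(labels))
```

Each of the 25 random partitions contributes two runs, which gives the 50 indicator values per method and app.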

| Parameter settings
To obtain more representative features with the KPCA technique, in the feature learning stage, we set the kernel parameters of KPCA following the default settings in Scikit-learn library. Also, we specify the transformed data dimension p as 14.
In the model construction stage, we set the structure of the DNN as one input layer and two hidden layers with 32 hidden units, followed by one output layer with one unit. For the hyperparameters, we set the batch size as 16 and the number of iterations as 2000. Moreover, we apply the RMSProp algorithm [58] to optimise our DNN model. In each iteration, we set the learning rate as 0.01 with the decay rate as 0.99. In addition, we employ the exponential moving average model [58] with the decay rate as 0.99 for the learning rate. When calculating the loss, L2 regularisation is applied to reduce overfitting. The training process is automatically terminated once the total loss is less than 0.05.

| Statistic test
In this study, we apply a state-of-the-art method, namely the Scott-Knott Effect Size Difference (SKESD) test [59], to analyse the significant differences between our IDL method and the comparative methods. The original Scott-Knott test uses a cluster analysis algorithm to divide all the methods with significant differences into different groups. However, this test requires normally distributed data and cannot properly handle groups whose significant differences have a negligible effect size. To overcome these two limitations, Tantithamthavorn et al. [59] proposed an improved version, called the SKESD test, which applies a log transformation to preprocess the results of the performance indicator and quantifies the effect size with Cohen's delta. In this study, we perform the SKESD test in two stages to conduct the significance analysis. The process of the SKESD test is demonstrated in Figure 3. In the first stage, we take all the performance indicator values of each method on each app as inputs to the SKESD test and obtain the rank of each method on each app as output. In the second stage, we take the output of the first stage as input and obtain the final rank of each method across all apps. A lower ranking value of a method means that it obtains better performance.
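To make the effect-size step more concrete, the sketch below computes Cohen's effect size between the log-transformed indicator values of two methods and checks whether the difference is negligible. It is only an illustration of that single step, not the full SKESD clustering procedure. The `log1p` transform and the conventional 0.2 threshold for a negligible effect are assumptions made for this example, and the function names are hypothetical.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's effect size between two samples, using the pooled std."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    diff = statistics.mean(a) - statistics.mean(b)
    if pooled_var == 0:
        # Degenerate case: both samples are constant.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled_var)


def negligible_difference(a, b, threshold=0.2):
    """True if the log-transformed samples differ by a negligible effect."""
    log_a = [math.log1p(x) for x in a]  # assumed transform: log(x + 1)
    log_b = [math.log1p(x) for x in b]
    return abs(cohens_d(log_a, log_b)) < threshold
```

Under this reading, two methods whose indicator values differ only by a negligible effect size would be placed in the same group even if a significance test separates them.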

| Answer to RQ1: the prediction performance of our KPIDL method and its variants
Methods: To answer this question, we first treat IDL combined with PCA (PIDL for short) and IDL without any feature preprocessing (i.e., IDL) as baseline methods to investigate how effective IDL is when using non-linear KPCA, linear PCA, and no feature learning. In addition, we compare KPIDL with the baseline methods that combine the traditional DL method with KPCA, PCA, and no feature learning (KPDL, PDL, and DL for short, respectively) to investigate the prediction performance when the class imbalance is not considered. Results: Tables 3 and 4 report the average EARecall and EAF-measure values and the corresponding standard deviations of our KPIDL method and the five comparative variants, respectively. In these tables, the values in bold denote the best performance for each app or the best average value across all apps. Figure 4 visualises the statistic test results of SKESD for our KPIDL method and the five baseline methods in terms of the two effort-aware indicators. Different colours indicate that the methods belong to different groups with significant differences. From these tables and the figure, the following findings can be drawn. First, in terms of EARecall, our KPIDL method obtains better performance on 6 out of 15 apps compared with the five baseline methods. The average EARecall value of our KPIDL method over all apps achieves improvements of 38.7%, 27.4%, 21.5%, 9.0%, and 63.7% compared with those of DL, PDL, KPDL, IDL, and PIDL, respectively. Our KPIDL method obtains the best average EARecall value and achieves an average improvement of 32.1%.
Second, in terms of EAF-measure, our KPIDL method obtains better performance on 9 out of 15 apps compared with the five baseline methods. The average EAF-measure value of our KPIDL method over all apps achieves improvements of 43.1%, 36.9%, 11.8%, 3.7%, and 42.6% compared with those of DL, PDL, KPDL, IDL, and PIDL, respectively. Our KPIDL method obtains the best average EAF-measure value and achieves an average improvement of 27.6%.
Third, our KPIDL method ranks first and has significant differences compared with its five variants in terms of both effort-aware indicators.
Summary: Different from its variants, which take only the feature learning or only the imbalanced learning into consideration, our KPIDL method combines the KPCA technique and the IDL model and therefore has the advantage of learning representative features and dealing with the class imbalance problem simultaneously. Our KPIDL method achieves significantly better performance than its five variants for predicting JIT defects on Android apps.
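As a quick arithmetic check, the average improvements reported for RQ1 agree with the plain mean of the per-baseline improvements listed above. The snippet below reproduces that calculation; the variable and function names are illustrative only.

```python
from statistics import mean

# Per-baseline improvements (%) of KPIDL, as reported for RQ1.
earecall_gains = {"DL": 38.7, "PDL": 27.4, "KPDL": 21.5, "IDL": 9.0, "PIDL": 63.7}
eaf_gains = {"DL": 43.1, "PDL": 36.9, "KPDL": 11.8, "IDL": 3.7, "PIDL": 42.6}


def average_gain(gains):
    """Mean improvement over all baselines, rounded to one decimal place."""
    return round(mean(list(gains.values())), 1)


avg_earecall_gain = average_gain(earecall_gains)  # 32.1
avg_eaf_gain = average_gain(eaf_gains)            # 27.6
```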

| Answer to RQ2: the prediction performance of our KPIDL method and the sampling-based imbalanced learning methods
Methods: To answer this question, we choose six sampling methods as the baseline methods, including Random Over-Sampling (ROS), Random Under-Sampling (RUS), the Synthetic Minority Over-sampling Technique (SMOT), SMOT with Tomek links (SMOTT), SMOT with Borderline samples (SMOTB), and over-sampling using ADAptive SYNthetic sampling (ADASYN). We use random forest as the basic classifier, as it is widely used in software defect prediction tasks [60][61][62][63].
Results: Tables 5 and 6 report the average EARecall and EAF-measure values and the corresponding standard deviations of our KPIDL method and the six comparative sampling-based imbalanced learning methods, respectively. Figure 5 visualises the statistic test results of SKESD for our KPIDL method and the six baseline methods in terms of the two effort-aware indicators. From these tables and the figure, we can draw the following observations. First, in terms of EARecall, our KPIDL method obtains better performance on 9 out of 15 apps compared with the six baseline methods. The average EARecall value of our KPIDL method over all apps achieves improvements of 10.0%, 16.1%, 4.3%, 8.5%, 9.4%, and 4.2% compared with those of ROS, RUS, SMOT, SMOTT, SMOTB, and ADASYN, respectively. Our KPIDL method obtains the best average EARecall value and achieves an average improvement of 8.7%.
Third, our KPIDL method ranks first and has significant differences compared with the six sampling-based imbalanced learning methods in terms of both effort-aware indicators.
Summary: Different from the sampling-based methods, which need to change the distribution of commit instances to balance the defect data, our KPIDL method uses a weighting strategy to deal with the class imbalance issue. To sum up, our KPIDL method performs significantly better than the comparative sampling-based methods for predicting JIT defects on Android apps.

| Answer to RQ3: the prediction performance of our KPIDL method and the ensemble-based imbalanced learning methods
Methods: To answer this question, we choose five ensemble methods for comparison, including Balanced Random Forest (BRF), EasyEnsemble (EasyEn), Bagging (Bag), Balanced Bagging (BBag), and Adaptive Boost (AdaB). Results: Tables 7 and 8 report the average EARecall and EAF-measure values and the corresponding standard deviations of our KPIDL method and the five comparative ensemble-based imbalanced learning methods, respectively. Figure 6 visualises the statistic test results of SKESD for our KPIDL method and the five baseline methods in terms of the two effort-aware indicators. From these tables and the figure, the following findings can be drawn. First, in terms of EARecall, our KPIDL method obtains better performance on 6 out of 15 apps compared with the five baseline methods. The average EARecall value of our KPIDL method over all apps achieves improvements of 21.0%, 10.6%, 2.6%, 31.2%, and 25.9% compared with those of BRF, EasyEn, Bag, BBag, and AdaB, respectively. Our KPIDL method obtains the best average EARecall value and achieves an average improvement of 18.3%.
Second, in terms of EAF-measure, our KPIDL method obtains better performance on 10 out of 15 apps compared with the five baseline methods. The average EAF-measure value of our KPIDL method over all apps achieves improvements of 25.3%, 18.5%, 23.7%, 31.9%, and 34.9% compared with those of BRF, EasyEn, Bag, BBag, and AdaB, respectively. Our KPIDL method obtains the best average EAF-measure value and achieves an average improvement of 26.9%.
Third, our KPIDL method ranks first and has significant differences compared with the five ensemble-based imbalanced learning methods in terms of both effort-aware indicators, except for the Bag method on the EARecall indicator.
Summary: Different from the ensemble-based methods, which combine the outputs of multiple classification models, our KPIDL method uses feature representation learning for performance improvement. In summary, our KPIDL method is more effective in obtaining significantly better performance than ensemble-based methods for predicting JIT defects on Android apps.

| Answer to RQ4: the prediction performance of our KPIDL method and the cost-sensitive-based imbalanced learning methods
Methods: To answer this question, we employ three cost-sensitive-based methods, including the Systematically developed Forest of multiple decision trees (SF) [64], Cost-sensitive-based decision Forest (CF) [29], and Balanced cost-sensitive decision Forest (BF) [29]. To construct the trees, three voting-based strategies are applied to these methods, including cascading-and-Sharing-based Voting (SV) [65], maximally Diversified multiple decision tree-based Voting (DV) [66], and Cost-sensitive Voting (CV) [29]. Combining each cost-sensitive-based method with each voting strategy yields nine baseline methods (e.g., SFCV). [...] results of SKESD for our KPIDL method and the nine baseline methods in terms of the two effort-aware indicators. From these tables and the figure, the following observations can be drawn.
Third, our KPIDL method ranks first and has significant differences compared with the nine cost-sensitive-based imbalanced learning methods in terms of both effort-aware indicators, except for the SFCV method in terms of EAF-measure.
Summary: Different from the above methods which introduce the cost-sensitive strategy into the construction of trees without performing feature transformation, our KPIDL method integrates the weight strategy into feature representation learning. Our KPIDL method significantly outperforms the comparative cost-sensitive-based methods for predicting JIT defects on Android apps.

| Threats to external validity
The generalisation of the experimental results threatens the external validity of this study. We conduct experiments on a publicly available benchmark dataset consisting of 15 Android mobile apps developed in the Java programming language. We need to further explore whether our method is suitable for mobile apps developed in other languages, such as Kotlin. In addition, since we only investigate Android-based mobile apps, it is necessary to investigate iOS-based mobile apps to verify the generalisation of our KPIDL method.

| Threats to internal validity
Potential implementation mistakes in the methods used in our experiments threaten the internal validity of our study. In this study, we carefully implement the KPCA technique, the CSCE loss function, and the DNN structure based on Scikit-learn, TensorFlow, and Python. As we specify multiple parameters empirically, the selection of better parameter settings needs to be explored in the future. As the code of the cost-sensitive-based baseline methods was released by the authors, we carefully integrate it into our experiments. In addition, for the other comparative methods, we implement them based on third-party libraries with the default parameter settings.

| Threats to construct validity
The rationality of the used performance evaluation indicators and statistical test methods threatens the construct validity of our study. In this study, we employ two effort-aware indicators, that is EARecall and EAF-measure, which take the code inspection efforts into consideration when calculating, to evaluate the performance of our KPIDL method for JIT defect prediction on Android apps. In addition, to make our results more convincing, we apply a state-of-the-art statistic test method, that is SKESD, for the significant difference analysis between multiple methods.

| CONCLUSION
In this study, we propose a novel JIT defect prediction model, called KPIDL, for Android apps, which incorporates a feature learning stage and a classification model construction stage. More specifically, the KPCA technique used in the first stage helps to obtain a high-quality feature representation of the defect data. Then, the improved version of the DNN alleviates the class imbalance issue of the defect data by taking the prior probability of the classes into account to compensate for the imbalance between defective and clean commit instances when calculating the total loss. To evaluate the effectiveness of our KPIDL method, we conduct experiments on 15 Android apps and employ two effort-aware indicators for performance evaluation. The experimental results demonstrate that, in terms of each indicator, our KPIDL method performs better than 24 out of 25 comparative methods, including its five variants, six sampling-based, five ensemble-based, and nine cost-sensitive-based methods.
In the future, we plan to collect more data from Android-based apps and iOS-based apps developed in other languages to enhance our experiments. In addition, our method will be adapted to cross-project scenarios for JIT defect prediction on apps.