Detection of code smells using machine learning techniques combined with data-balancing methods

Introduction
Code smells indicate design issues that violate basic design principles such as hierarchy, encapsulation, and abstraction, potentially affecting software quality [1], [2]. Detecting code smells is crucial for guiding the subsequent refactoring process to improve software quality and reduce the risk of software failure [3], [4]. Code smells usually appear during the design or coding phase, owing to developers working under time pressure or to inadequate design or coding solutions [5], [6]. Table 1 lists the four specific code smells that we investigated.
Software metrics are crucial for measuring and enhancing software quality and are used to characterize software engineering products [7]. They have diverse applications, including bug identification, test complexity prediction, code smell detection, and clone prediction. Object-oriented metrics are the most commonly used software metrics [8], [9].

ML models are mathematical techniques that employ historical data to automatically identify intricate patterns and make informed, intelligent decisions [3]. Supervised ML techniques are commonly used for code smell detection [13], [14]. With supervised classification algorithms, the machine can learn the associations between instances and decision labels [10], [15].
Studies on code smell detection have recently gained attention, and researchers have presented many ML-based approaches. For example, Mhawish and Gupta [1] presented an approach for predicting code smells using ML techniques and software metrics. The authors utilized datasets obtained from Fontana et al., and their experimental results showed that ML techniques can significantly facilitate the accurate prediction of code smells. Pecorelli et al. [2] examined five distinct data-balancing methods to alleviate data imbalance and gauge its effect on ML algorithms for code smell detection. Five code smell datasets were used in the experiment. The findings indicate that ML models using the synthetic minority oversampling technique exhibited the most promising performance; this technique effectively addressed the class imbalance problem. Fontana et al. [9] presented an approach for identifying code smells using various ML techniques. The results indicate that all techniques performed satisfactorily; however, the imbalanced data adversely affected the performance of certain models. Cruz et al. [16] assessed seven ML algorithms for identifying four distinct types of code smells, while also analyzing the influence of software metrics on detection. Their experiments found that ML algorithms can perform well in detecting bad code smells and that metrics play a fundamental role in detection. Martins et al. [17] conducted an empirical study to predict change-prone classes using eight ML techniques. Their study involved three distinct training scenarios: object-oriented metrics, code smells, and a fusion of both. The experiments were conducted on a dataset of 32 code smell types and eight object-oriented metrics, and found that some ML algorithms produced the best results under the training scenario combining code smells and object-oriented metrics. Hozano et al. [18] evaluated and compared the effectiveness of six ML algorithms in detecting four different code smells across a sample of 40 developers. The findings revealed that the algorithms performed poorly for the participating developers, indicating sensitivity to both the type of smell and the individual developer; the algorithms were unable to learn effectively from a limited training set. Sharma et al. [19] presented a method for code smell detection based on convolutional and recurrent neural network models; their results showed that detecting code smells with deep learning methods is feasible. Dewangan et al. [20] proposed an approach based on six ML algorithms to predict code smells using four datasets obtained from 74 open-source systems. The approach's effectiveness was assessed using various performance metrics, and two feature selection methods were implemented to improve prediction accuracy; the experimental results showed high prediction accuracy. Jain and Saha [21] proposed a method for code smell detection based on several ML models, evaluated with different performance metrics. Their results demonstrate that, after dimensionality reduction, boosted decision trees and Naive Bayes yielded superior performance compared to the other models.
Our analysis of prior research on code smell detection revealed that most proposed methods overlook the issue of class imbalance. However, the studies that addressed this problem and implemented data-balancing methods [2], [22] emphasized the critical and essential role such methods play in code smell detection.
Data imbalance in a training dataset, where classes are unevenly distributed, hinders the efficiency of ML algorithms and biases their performance towards the majority class [3]. This leads to disproportionate false-positive and false-negative results, making data imbalance one of the most serious problems for ML algorithms [22]. This study uses imbalanced datasets extracted from 74 open-source systems [9]. Consequently, there is a growing impetus to employ data-balancing methods and develop unbiased classifiers that operate effectively on imbalanced code smell datasets.
To our knowledge, only a few studies have applied ML combined with sampling techniques for code smell detection. To address these gaps, our work offers a novel method that aims to achieve the following key objectives and contributions:
• This study introduces a novel method that combines machine learning (ML) with a random oversampling technique to effectively detect code smells.
• This study evaluates the efficiency of the proposed method using various performance measures and compares it with the methods currently employed for detecting code smells.
• We show that the performance of ML models in code smell detection can be significantly improved by balancing the dataset with data-balancing methods.
The paper is structured as follows: Section 2 specifies the research method, Section 3 outlines the results and discussion, and Section 4 presents the conclusion.

Method
Our study proposes a method for training and testing code smell detection models that combines high-performance supervised machine learning algorithms with a random oversampling technique. Fig. 1 illustrates the proposed research process for detecting code smells. The following sections describe the steps taken in this study: dataset description, data pre-processing, feature selection, dataset balancing, classification algorithms, and model building and evaluation.
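As a minimal sketch of this pipeline (not the authors' exact implementation), the steps can be wired together as follows; the file name `god_class.csv` and the label column `is_smelly` are hypothetical placeholders for a CSV export of one of the Fontana et al. datasets:

```python
# Minimal sketch of the pipeline: load one dataset, split 80/20, balance the
# training portion by random oversampling, and fit a first classifier.
# "god_class.csv" and the "is_smelly" column are hypothetical placeholders.
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("god_class.csv")
X, y = df.drop(columns=["is_smelly"]), df["is_smelly"]

# Stratified 80/20 split; oversampling is applied to the training split here,
# a common precaution against leaking duplicated rows into the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_tr, y_tr = RandomOverSampler(random_state=42).fit_resample(X_tr, y_tr)

model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print("hold-out accuracy:", model.score(X_te, y_te))
```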

Dataset Description
To perform the analysis and experiments, our method was implemented using the datasets proposed by Fontana et al. [9], which cover 74 open-source systems of varying sizes and domains sourced from the Qualitas Corpus (QC) [23], as detailed in Table 2. These datasets were selected because metric values can be computed correctly for the included systems and because the data are freely available, allowing researchers to replicate, compare, and evaluate their studies. In the QC systems, metrics are computed at both the class and method levels. The chosen metrics form a standardized set that addresses various aspects of the code, such as size, cohesion, and encapsulation [9]. The metrics computed for all 74 QC systems are displayed in Table 3.

Data Pre-processing and Feature Selection
Before constructing the model, it is essential to pre-process the collected data; to produce an optimal model, careful attention must be paid to data quality [24]. Data pre-processing refers to a collection of procedures used to enhance data quality before model construction. Its primary objectives are removing noise and extraneous outliers, managing missing values, converting feature types, and more [10], [11], [25]. Selecting the most informative features from a list of features through suitable methods is a crucial step commonly referred to as Feature Selection (FS) [26]-[28]. FS aims to identify the features most relevant to the target class within a high-dimensional feature set and to eliminate redundant and uncorrelated features [1], [17], [21], [29]. There are three distinct categories of FS methods: wrapper methods, embedded methods, and filter methods; each has its own rules for selecting the most relevant features as independent variables for training ML models [29]. In this study, our models were based on embedded methods, because these methods perform selection as part of fitting the ML model.
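As an illustration, the following is a sketch of embedded feature selection, reusing `X_tr`, `y_tr`, and `X_te` from the pipeline sketch above; the random forest is one plausible embedded selector, not necessarily the one the study used:

```python
# Embedded feature selection sketch: a random forest ranks the software
# metrics by importance while it fits, and SelectFromModel keeps only the
# metrics whose importance exceeds the mean importance (the default threshold).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42))
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)
print(f"kept {X_tr_sel.shape[1]} of {X_tr.shape[1]} metrics")
```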

Class Imbalance and Sampling Techniques
Class imbalance is a common issue in code smell detection, wherein one class has significantly fewer examples than the others. This is particularly relevant because code smell datasets often consist of a small number of smelly instances and many non-smelly ones [3]. The class imbalance problem can therefore lead to misclassifying cases from the minority class [30]. To address it, various techniques have been proposed, including data sampling methods, boosting-based ensemble methods, bagging-based ensemble methods, cost-sensitive learning approaches, and similar strategies [2]. The dataset used for code smell detection in this study is significantly imbalanced [15]. Specifically, the initial datasets consist of 561 smelly instances and 1,119 non-smelly instances. The first two datasets pertain to code smells at the class level, namely god class (140 smelly and 280 non-smelly instances) and data class (140 smelly and 280 non-smelly instances). The remaining two datasets focus on code smells at the method level, namely feature envy (140 smelly and 280 non-smelly instances) and long method (141 smelly and 279 non-smelly instances). We address the class imbalance issue by enhancing the original datasets to make the data more realistic, mitigating the imbalance with the random oversampling technique, which randomly duplicates examples to enlarge the minority class [2], [17]. Fig. 2 illustrates the distribution of learning instances in all datasets.
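A minimal self-contained sketch of random oversampling for binary labels (equivalent in effect to imbalanced-learn's `RandomOverSampler` used in the pipeline sketch):

```python
# Random oversampling sketch (binary labels): duplicate randomly chosen
# minority-class rows until both classes reach the majority-class count.
import numpy as np

def random_oversample(X, y, seed=42):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# E.g. the god class dataset (140 smelly, 280 non-smelly) becomes 280/280.
```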

Classification Algorithms
This section briefly describes the classification algorithms used to detect and classify code smells in our study.

DT
DT is a supervised ML algorithm used for both regression and classification tasks [9]. DTs operate by segregating instances based on feature values and branching them out. In an ID3 decision tree, all features are initially evaluated as candidates for the root node. The features are then separated by computing their entropy, which measures data homogeneity; entropy values fall within the range 0 to 1 [12], [13]. Mathematically, the entropy of a single feature is

$$E(F) = -\sum_{i=1}^{C} p_i \log_2 p_i$$

where $C$ is the number of outputs, $p_i$ is the probability of occurrence of each output among all outputs, and $F$ is a feature with some data.
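The formula translates directly into code; this is the same quantity scikit-learn computes when a decision tree is fitted with `criterion="entropy"`:

```python
# Entropy of a label vector, as in the formula above: E = -sum(p_i * log2 p_i).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(np.array([0, 0, 1, 1])))  # 1.0: a maximally mixed binary node
print(entropy(np.array([0, 0, 0, 0])))  # 0.0: a pure node
```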

K-NN
K-NN is a basic supervised ML algorithm that examines the K nearest neighbouring objects and selects the class occurring most commonly among them [28]. It is a lazy-learning technique that categorizes elements according to their position in the feature space. The algorithm requires selecting the k closest points, so the first stage is to compute the distance between the input data point and the other points in the training data [29], [31]. Let $x$ be a point with coordinates $(x_1, x_2, \ldots, x_n)$ and $y$ a point with coordinates $(y_1, y_2, \ldots, y_n)$; the distance between these two points is

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
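A sketch of the distance computation and the corresponding scikit-learn classifier; k = 5 is an illustrative choice, not necessarily the study's setting (those are listed in Table 4):

```python
# Euclidean distance from the formula above, and a K-NN classifier that uses
# it internally (Euclidean is scikit-learn's default metric).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)  # X_tr, y_tr as above
```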

SVM
SVM is a widely used, regularized machine learning algorithm employed mainly for classification and regression. SVMs place a margin on both sides of a hyperplane separating two classes, and they optimize the hyperplane to maximize that margin [2], [29]. The general form of the SVM decision function is

$$f(x) = w^{T} x + b$$

where $w$ denotes the weight vector, $x$ the input vector, and $b$ the intercept (bias) term of the hyperplane equation.
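A sketch assuming the variables from the pipeline above; standardizing the metrics matters because SVMs are scale-sensitive, and the linear kernel is only one of several plausible choices:

```python
# Linear SVM sketch: the fitted model learns the w and b of f(x) = w^T x + b.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm = make_pipeline(StandardScaler(),
                    SVC(kernel="linear", probability=True, random_state=42))
svm.fit(X_tr, y_tr)
```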

XGB
XGB is a robust, recently introduced ML algorithm. It is grounded in the principle of gradient boosting and uses parallel tree boosting to predict the target from the consolidated results of numerous weak models, delivering exceptional speed and accuracy. The XGB model takes the form

$$\hat{y}_i = \sum_{t=1}^{T} f_t(x_i), \qquad f_t \in \mathcal{F}, \quad \mathcal{F} = \{\, f(x) = w_{q(x)} \,\}$$

where $\mathcal{F}$ represents the space of classification trees, $w_{q(x)}$ the score assigned to a particular sample, and $\hat{y}_i$ the predicted value generated by the model for sample $x_i$. In addition, $q$ signifies the structure of each tree (mapping a sample to a leaf), $T$ refers to the total number of trees, and each $f_t$ corresponds to an independent tree structure $q$ with its corresponding leaf weights $w$.
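A sketch using the `xgboost` package; the hyperparameter values shown are illustrative assumptions, not the values from Table 4:

```python
# XGBoost sketch: T boosted trees whose leaf scores are summed per sample.
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=200,      # T, the number of trees
                    max_depth=4,
                    learning_rate=0.1,
                    eval_metric="logloss",
                    random_state=42)
xgb.fit(X_tr, y_tr)
```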

MLP
MLP is an artificial neural network composed of multiple layers of interconnected perceptrons, designed to handle intricate data inputs and carry out diverse tasks such as regression or classification. The network uses weighted connections between the nodes of adjacent layers, and the backpropagation algorithm is used to train the model [7]. The output of a single neuron in an MLP is

$$y = \varphi\!\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

where $y$ denotes the output, $n$ the overall number of inputs provided to the neuron, $x_i$ the inputs to the network, $w_i$ the weights of the connections between input and output nodes, $b$ the bias term, and $\varphi$ the transfer function.
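A sketch with scikit-learn's MLP; one hidden layer of 100 units is an illustrative choice (scikit-learn's default), not necessarily the study's configuration from Table 4:

```python
# MLP sketch: a feed-forward network trained with backpropagation; inputs are
# standardized because neural networks converge poorly on unscaled metrics.
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(100,),
                                  max_iter=1000, random_state=42))
mlp.fit(X_tr, y_tr)
```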

Models Building and Evaluation
The proposed models were built and evaluated using 80% of the dataset for training, keeping the remaining 20% for validation. Table 4 outlines the parameters used to create each model independently. We assess the effectiveness of our proposed models using standard evaluation metrics derived from the confusion matrix (accuracy, precision, recall, and f-measure), together with MCC and AUC. MCC is a widely adopted metric for model assessment, capturing the agreement between predicted and actual values through true and false positives and negatives. AUC is a summary of classifier efficacy obtained by plotting the true positive rate against the false positive rate at varying classification thresholds. A confusion matrix is a table that assists in evaluating a classification model by comparing the predicted class labels with the actual class labels of a dataset, as illustrated in Table 5.
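A sketch of this evaluation protocol over the 20% hold-out split, computing the six reported measures for any fitted model from the sketches above (labels assumed encoded as 0/1 with 1 = smelly):

```python
# Evaluate a fitted model on the hold-out split with the six measures
# reported in Tables 6-9.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_pred = model.predict(X_te)
y_prob = model.predict_proba(X_te)[:, 1]   # smelly-class probability for AUC
print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("f-measure:", f1_score(y_te, y_pred))
print("MCC      :", matthews_corrcoef(y_te, y_pred))
print("AUC      :", roc_auc_score(y_te, y_prob))
```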

In rank form, the AUC can be computed as

$$\mathrm{AUC} = \frac{\sum_{i \in P} \mathrm{rank}(x_i) - \frac{n^{+}(n^{+}+1)}{2}}{n^{+}\, n^{-}}$$

where $\sum_{i \in P} \mathrm{rank}(x_i)$ is the sum of the ranks of all positive samples, $n^{+}$ is the number of positive examples, and $n^{-}$ is the number of negative examples.
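The rank formula translates directly into code; with average ranks for ties, the sketch below agrees with scikit-learn's `roc_auc_score`:

```python
# AUC from ranks, as in the formula above (Mann-Whitney form).
import numpy as np
from scipy.stats import rankdata

def auc_from_ranks(y_true, y_score):
    y_true = np.asarray(y_true)
    ranks = rankdata(y_score)               # average ranks; ties handled
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    pos_rank_sum = ranks[y_true == 1].sum()
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```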

Results and Discussion
The experimental setup was implemented in Python, and the training and validation datasets were obtained from the same project. To ensure reliable performance evaluation, the proposed models were trained and tested on large datasets comprising over 6,785,568 lines of source code. Table 6 to Table 9 and Fig. 3 to Fig. 7 show the results.

For the DT model, accuracy values varied from 0.92 to 0.99 on the original datasets and from 0.98 to 1.00 on the balanced datasets. Precision ranged from 0.86 to 1.00 on the original datasets and from 0.97 to 1.00 on the balanced datasets. Recall ranged from 0.89 to 0.96 on the original datasets and was 1.00 on the balanced datasets. F-measure varied from 0.87 to 0.98 on the original datasets and from 0.98 to 1.00 on the balanced datasets. Moreover, MCC ranged from 0.81 to 0.97 on the original datasets and from 0.96 to 1.00 on the balanced datasets, whereas AUC ranged from 0.90 to 0.98 on the original datasets and from 0.98 to 1.00 on the balanced datasets.

For the K-NN model, accuracy varied from 0.86 to 0.92 on the original datasets and from 0.91 to 0.97 on the balanced datasets. Precision varied from 0.75 to 0.97 on the original datasets and from 0.88 to 0.97 on the balanced datasets. Recall varied from 0.70 to 0.91 on the original datasets and from 0.97 to 0.98 on the balanced datasets. F-measure ranged from 0.76 to 0.88 on the original datasets and from 0.92 to 0.98 on the balanced datasets. Furthermore, MCC ranged from 0.66 to 0.81 on the original datasets and from 0.82 to 0.94 on the balanced datasets, and AUC ranged from 0.85 to 0.97 on the original datasets and from 0.93 to 0.98 on the balanced datasets.

For the SVM model, accuracy varied from 0.90 to 0.98 on the original datasets and from 0.96 to 1.00 on the balanced datasets. Precision varied from 0.85 to 0.96 on the original datasets and from 0.94 to 1.00 on the balanced datasets. Recall ranged from 0.85 to 0.96 on the original datasets and from 0.98 to 1.00 on the balanced datasets. F-measure ranged from 0.85 to 0.96 on the original datasets and from 0.97 to 1.00 on the balanced datasets. MCC ranged from 0.78 to 0.94 on the original datasets and from 0.92 to 1.00 on the balanced datasets, and AUC ranged from 0.96 to 0.99 on the original datasets and from 0.97 to 1.00 on the balanced datasets.

For the XGB model, accuracy ranged from 0.95 to 1.00 on the original datasets and from 0.96 to 1.00 on the balanced datasets. Precision ranged from 0.87 to 1.00 on the original datasets and from 0.95 to 1.00 on the balanced datasets. Recall ranged from 0.97 to 1.00 on both the original and balanced datasets. F-measure ranged from 0.93 to 1.00 on the original datasets and from 0.96 to 1.00 on the balanced datasets. Additionally, MCC ranged from 0.89 to 1.00 on the original datasets and from 0.90 to 1.00 on the balanced datasets, whereas AUC ranged from 0.99 to 1.00 on the original datasets and from 0.98 to 1.00 on the balanced datasets.
For the MLP model, accuracy ranged from 0.88 to 0.98 on the original datasets and from 0.96 to 0.98 on the balanced datasets. Precision ranged from 0.87 to 0.97 on the original datasets and from 0.96 to 0.97 on the balanced datasets, while recall ranged from 0.74 to 1.00 on the original datasets and from 0.97 to 1.00 on the balanced datasets. F-measure ranged from 0.80 to 0.96 on the original datasets and from 0.97 to 0.98 on the balanced datasets. Furthermore, MCC ranged from 0.72 to 0.94 on the original datasets and from 0.92 to 0.96 on the balanced datasets. Finally, AUC ranged from 0.90 to 0.99 on the original datasets and from 0.98 to 1.00 on the balanced datasets.
Concerning each type of code smell, the top-performing models attain the following results: the DT model scores 100% accuracy on data class and long method (balanced datasets); the K-NN model achieves 97% accuracy on god class (balanced datasets); the SVM model scores 100% accuracy on long method (balanced datasets); the XGB model achieves 100% accuracy on data class and long method (original and balanced datasets); and the MLP model scores 98% accuracy on data class (original and balanced datasets) and 98% on long method (balanced datasets).

Fig. 3 shows the best accuracy values of the models for all considered code smells on the original and balanced datasets. The best accuracy on the original datasets (god class) is 98%, obtained by the XGB model, while the best accuracy on the balanced datasets (god class) is 98%, obtained by the DT model. The best accuracy on the original datasets (data class) is 100%, obtained by the XGB model, while the best accuracy on the balanced datasets (data class) is 100%, obtained by the DT and XGB models. The best accuracy on the original datasets (long method) is 100%, obtained by the XGB model, while the best accuracy on the balanced datasets (long method) is 100%, obtained by the DT, SVM, and XGB models. The best accuracy on the original datasets (feature envy) is 95%, obtained by the XGB model; the best accuracy on the balanced datasets (feature envy) is 98%, obtained by the DT and XGB models.

Fig. 4 exhibits box plots of the averages of several performance measures (accuracy, precision, recall, f-measure, MCC, and AUC) on the original datasets. For god class, the overall average performance of all models is 0.93, 0.96, 0.88, 0.92, 0.86, and 0.96, respectively. For data class, it is 0.96, 0.91, 0.95, 0.93, 0.90, and 0.98, respectively. For long method, it is 0.96, 0.95, 0.93, 0.94, 0.91, and 0.97, respectively. Lastly, for feature envy, it is 0.90, 0.85, 0.83, 0.84, 0.77, and 0.92, respectively.

Fig. 5 exhibits box plots of the same measures on the balanced datasets. For god class, the overall average performance of all models is 0.96, 0.96, 0.98, 0.97, 0.93, and 0.98, respectively. For data class, it is 0.98, 0.97, 0.99, 0.98, 0.96, and 0.99, respectively. For long method, it is 0.98, 0.97, 0.99, 0.98, 0.97, and 0.99, respectively. Lastly, for feature envy, it is 0.95, 0.94, 0.98, 0.96, 0.91, and 0.96, respectively.

On the balanced datasets, the AUC values range from 0.97 to 1.00, which signifies a high level of discriminatory ability. The results presented in this study showcase the potential of our models in detecting code quality issues in software development, despite the lack of information regarding the datasets and code quality metrics utilized. After analysing the outcomes generated by our presented ML models across all datasets, it is clear that the models achieved impressive scores on all of them. This suggests that our proposed models performed well and that the data-balancing method utilized was instrumental in enhancing the accuracy of ML models for code smell detection.

Conclusion
Code smell detection has significant positive effects on software quality. In this study, we presented a method based on ML techniques combined with a data-balancing method (the random oversampling technique) to detect code smells. Our proposed method was evaluated on four different types of code smells; the evaluation involved conducting experiments with five different ML algorithms and assessing the results using various performance measures. The proposed models' average accuracy on the original datasets was 93% for god class, 96% for data class, 96% for long method, and 90% for feature envy. On the balanced datasets, the average accuracy was 96% for god class, 98% for data class, 98% for long method, and 95% for feature envy. The results indicate that the models' accuracy improved by 3%, 2%, 2%, and 5%, respectively, on the balanced datasets compared with the original datasets. The experimental results showed that combining ML algorithms with a random oversampling technique can enhance the process of code smell detection and that software metrics play significant and critical roles in detecting code smells. Analysis of the results shows that our method gives better results than the other methods previously studied for code smell detection when evaluated on the same datasets. However, a limitation of this study is that some of the results obtained with our models (K-NN and MLP) are not very high, so in future work we will try to improve the architecture of these models to get better results. In addition, we intend to assess the robustness of our method by testing it on various datasets. Furthermore, our goal is to enhance the accuracy of code smell detection models by incorporating additional ML algorithms, such as neural networks and deep learning, and by utilizing random undersampling techniques for data balancing.

Fig. 1. Overview of the proposed research process for detecting code smells.

Fig. 2. Distribution of learning instances in all datasets.

Fig. 3. The best models' accuracy values on the original and balanced datasets.

Fig. 4. Box plots of the models' performance measures on all considered code smells (original datasets).

Fig. 5. Box plots of the models' performance measures on all considered code smells (balanced datasets).

Fig. 6. The ROC curves obtained by the models on all considered code smells (original datasets). The highest AUC on the original datasets (god class) is 99%, obtained by the XGB and MLP models.

Table 6. Results for the class-level dataset: god class (original and balanced datasets).

Table 7. Results for the class-level dataset: data class (original and balanced datasets).

Table 8. Results for the method-level dataset: long method (original and balanced datasets).

Table 9. Results for the method-level dataset: feature envy (original and balanced datasets).

Table 10. Comparison between the proposed method and pre-existing methods in terms of their respective accuracy values.

Table 11. Comparison between the proposed method and pre-existing methods in terms of their respective AUC values.