Breast Cancer Classification based on Ultrasound Images using the Support Vector Machine (SVM) Algorithm

According to statistics from the World Health Organization (WHO) Global Burden of Cancer Study (GLOBOCAN), cancer, and breast cancer in particular, is a severe health issue in Indonesia, with 68,858 new cases and roughly 22,000 deaths recorded in 2020. Ultrasonography (USG) is acknowledged as one technology with the potential to support early detection, which is vital in reducing mortality from breast cancer. This study classifies breast ultrasound images using the Support Vector Machine (SVM) algorithm with GLCM feature extraction, Min-Max normalization, and Mutual Information feature selection via SelectKBest. Experiments with various SVM parameter combinations and two testing schemes, a Train/Test Split with an 80/20 proportion and K-Fold Cross Validation, show that the SVM algorithm can classify breast cancer ultrasound images into two categories (Benign Tumor and Malignant Tumor) with the same maximum accuracy of 79%, whether or not the SMOTE data-balancing technique is applied. As a result, the Support Vector Machine (SVM) algorithm has the potential to be an effective model for identifying breast cancer ultrasound images, both on the original, unbalanced data and on the balanced data.


Introduction
Cancer is a significant health issue in Indonesia and ranks second among causes of death after cardiovascular disease [1]. One of the deadliest cancers is breast cancer, which is generally experienced by women [2]. Statistical data from the World Health Organization (WHO) Global Burden of Cancer Study (GLOBOCAN) recorded 68,858 new cases of breast cancer, or 16.6% of the total 392,914 new cancer cases reported in Indonesia, with a death toll of more than 22 thousand people in 2020 [1] [3].
Early detection of breast cancer is an important step in reducing death rates; it involves identifying early signs before the cancer spreads, with the aim of increasing the chances of a cure and reducing the risk of complications [4]. One method that can support early detection of breast cancer is ultrasound technology. Breast ultrasound (USG) uses sound waves to create images of the breasts, which helps detect changes such as cysts or lumps that are difficult to see on a mammogram. Although not a primary screening tool, ultrasound is effective, more affordable, and safe, involving no radiation [5].
Digital image classification is the process of grouping images based on visual characteristics. It is used in fields such as medicine, for example to recognize breast cancer in ultrasound images with machine learning algorithms that process image features as input to build classification models. Several previous studies [6] [7] [8] have applied Machine Learning algorithms such as the Support Vector Machine (SVM) to classify digital images in different cases and obtained quite good accuracy values. Against this background, this research applies the Support Vector Machine (SVM) algorithm to determine its level of accuracy and whether it is good enough for classifying digital images, namely ultrasound images of breast cancer.

Gray Level Co-Occurrence Matrix (GLCM)
According to Rao et al. [13], the Gray Level Co-Occurrence Matrix (GLCM) is a matrix representation used to identify the extent to which certain pairs of pixels appear at certain distances and angles in an image. GLCM is used to compute various features from this matrix. The approach has many applications, including image classification, texture pattern recognition, image segmentation, object identification, and color analysis in images.
Many texture characteristics can be extracted from the co-occurrence matrix according to the concept proposed by Haralick [13]. However, this research focuses on six attributes of the GLCM that are often used in analysis, namely ASM (Angular Second Moment), Contrast, Correlation, Dissimilarity, Energy, and Homogeneity.
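As an illustration of how these six attributes are derived from the co-occurrence matrix, the sketch below builds a symmetric, normalized GLCM for one distance/angle pair and computes the features by hand with NumPy. It is a simplified stand-in for library routines (e.g. scikit-image's `graycomatrix`/`graycoprops`), and the 4×4 image is invented for the example.

```python
import numpy as np

def glcm_features(img, levels=4):
    """Build a symmetric, normalized GLCM at distance 1, angle 0 degrees,
    then derive six Haralick-style texture features from it."""
    glcm = np.zeros((levels, levels), dtype=float)
    for row in img:
        for a, b in zip(row[:-1], row[1:]):   # horizontal neighbor pairs
            glcm[a, b] += 1
            glcm[b, a] += 1                   # make the matrix symmetric
    p = glcm / glcm.sum()                     # normalize counts to probabilities
    i, j = np.indices(p.shape)
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt(((i - mu_i) ** 2 * p).sum())
    sd_j = np.sqrt(((j - mu_j) ** 2 * p).sum())
    asm = (p ** 2).sum()
    return {
        "ASM": asm,
        "energy": np.sqrt(asm),
        "contrast": ((i - j) ** 2 * p).sum(),
        "dissimilarity": (np.abs(i - j) * p).sum(),
        "homogeneity": (p / (1 + (i - j) ** 2)).sum(),
        "correlation": ((i - mu_i) * (j - mu_j) * p).sum() / (sd_i * sd_j),
    }

# toy 4-level grayscale image, invented for illustration
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
feats = glcm_features(img)
```

In the full pipeline, this computation is repeated for each distance/angle combination and the resulting values become one row of the feature table.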

Synthetic Minority Over-sampling Technique (SMOTE)
SMOTE is an oversampling technique that creates synthetic data in a minority class by taking examples from that class, finding their nearest neighbors with k-nearest neighbors, and combining the differences between examples and neighbors, which avoids the overfitting caused by simply duplicating samples [14]. The SMOTE algorithm begins by calculating the difference between a feature vector in the minority class and its nearest minority-class neighbor. This difference is then multiplied by a random number between 0 and 1, and the result is added back to the original feature vector, producing a new synthetic feature vector [15].
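The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the imbalanced-learn implementation, and the toy minority samples are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_oversample(X_min, n_new, k=5):
    """Generate n_new synthetic minority samples: for each, pick a real
    minority sample, one of its k nearest minority neighbors, and move a
    random fraction of the way along the difference vector between them."""
    synthetic = []
    for _ in range(n_new):
        idx = rng.integers(len(X_min))
        x = X_min[idx]
        d = np.linalg.norm(X_min - x, axis=1)   # distances to all minority samples
        d[idx] = np.inf                         # exclude the sample itself
        neighbors = np.argsort(d)[:k]           # k nearest minority neighbors
        nn = X_min[rng.choice(neighbors)]
        gap = rng.random()                      # random number in [0, 1)
        synthetic.append(x + gap * (nn - x))    # x + gap * (neighbor - x)
    return np.array(synthetic)

# hypothetical toy minority class: 4 samples, 2 features
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]])
X_new = smote_oversample(X_min, n_new=6, k=3)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies between existing points rather than being a copy of one.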

Mutual Information
In feature selection, Mutual Information (MI) is used to measure how much information a feature carries, helping to assess the feature's contribution to the model's ability to perform accurate classification [16]. The Mutual Information value between two random variables is always non-negative and indicates their level of dependence: an MI of zero indicates independence, while a high MI indicates strong dependence between the two variables [17].
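As a sketch, the plug-in estimate of MI between two discrete variables can be computed directly from empirical counts. This is illustrative only; in practice libraries such as scikit-learn provide `mutual_info_classif` for scoring features against a label.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) = sum over (a,b) of
    p(a,b) * log(p(a,b) / (p(a) * p(b))), from empirical counts."""
    n = len(x)
    pxy = Counter(zip(x, y))   # joint counts
    px = Counter(x)            # marginal counts of x
    py = Counter(y)            # marginal counts of y
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) written with counts: (c/n) * n * n / (cx * cy)
        mi += p_ab * np.log(p_ab * n * n / (px[a] * py[b]))
    return mi

# identical variables -> MI equals the entropy of x (strong dependence)
x = [0, 0, 1, 1, 2, 2]
print(mutual_information(x, x))                        # ln(3) ≈ 1.0986
# variables carrying no information about each other -> MI of zero
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
</n```

The two printed cases match the property quoted above: zero for independence, large for strong dependence.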

Support Vector Machine (SVM)
Support Vector Machine (SVM) is an algorithm used for classification. This method looks for the largest margin in the form of a hyperplane (separating vector line) as the dividing boundary between two data classes in binary classification. Although originally designed for two-class problems, SVMs have been extended to multi-class classification by combining multiple binary classifiers [18] [19]. The SVM algorithm searches for the optimal hyperplane by measuring the distance between classes and finding the maximum point. The accuracy of the SVM model depends on the kernel and its parameters. SVM is divided into linear SVM, which separates data linearly, and non-linear SVM, which uses the kernel trick in a high-dimensional space [20].
(Sistemasi: Jurnal Sistem Informasi, Volume 13, Nomor 4, 2024: 1438-1448, ISSN 2302-8149, e-ISSN 2540-9719)
Several types of kernels commonly used in the SVM algorithm are described below.

Linear Kernels
The linear kernel is the most basic type of kernel function and is used when the data can be separated linearly. It is more suitable for data with many features, since mapping to a higher-dimensional space does not always improve performance, especially in classification. SVM uses the linear kernel by default, separating the data with a hyperplane [21].

Polynomial Kernels
Polynomial kernels are a more general form of linear kernels and are used to measure the similarity between training sample vectors in a feature space.This kernel is suitable for normalized datasets.
K(x, y) = (γ(x · y) + r)^d  (1)

The parameter d in equation (1) determines the degree of the polynomial and thus the curvature of the decision boundary: the higher the value of d, the more curved the separating hyperplane becomes, which can make the accuracy unstable [22].

Radial Basis Function Kernels (RBF)
The Gaussian Radial Basis Function (RBF) kernel is used to classify data that cannot be separated linearly. With the right parameters, RBF can provide good performance and reduce errors in model training. RBF uses a feature space with an unlimited number of dimensions that can be adjusted through its parameters, so it can handle complex data, especially when the data does not have a clear linear pattern [23].

K(x, y) = exp(−γ‖x − y‖²)  (2)
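Assuming the standard forms of these two kernels, K(x, y) = (γ(x · y) + r)^d for the polynomial kernel of equation (1) and K(x, y) = exp(−γ‖x − y‖²) for the RBF kernel of equation (2), both can be sketched directly in NumPy:

```python
import numpy as np

def poly_kernel(x, y, gamma=1.0, r=1.0, d=3):
    """Polynomial kernel, equation (1): (gamma * (x . y) + r) ** d."""
    return (gamma * np.dot(x, y) + r) ** d

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel, equation (2): exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(poly_kernel(x, y, gamma=1.0, r=1.0, d=2))  # (1*2 + 1)^2 = 9.0
print(rbf_kernel(x, x))                          # identical vectors -> 1.0
print(rbf_kernel(x, y))                          # exp(-5) ≈ 0.0067
```

Note how the RBF value decays toward 0 as the two vectors move apart and equals 1 only when they coincide, which is what lets it model non-linear boundaries.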

Research Method
This research uses quantitative research methods, which is an approach that utilizes data in the form of numbers to analyze and compile scientific information based on numerical data [24].
The research flow serves as the basic sequence of steps for building a breast cancer classification model from ultrasound images using the Support Vector Machine (SVM) algorithm; a general overview of this flow can be seen in Figure 1 below. In the feature extraction stage of this research, the GLCM (Gray Level Co-Occurrence Matrix) calculation is used, and the resulting features serve as input to the classification process.

Pre-Processing Data
In the pre-processing stage of this research, there are two experiments that will be carried out.
The first experiment uses a dataset without data balancing and proceeds immediately to the normalization stage with Min-Max Normalization and feature selection using Mutual Information with SelectKBest. The second experiment uses a dataset that is first balanced with the SMOTE method before proceeding to the next pre-processing stages. The aim of pre-processing is to prepare a higher-quality dataset for further analysis and classification.

Split Data
At this stage, the dataset is divided into training data (80%) and testing data (20%).
Training data is used to train the model so that it can learn the patterns and relationships of features in the dataset. Test data is used to evaluate the performance of the trained model, measuring its ability to predict data it has never seen before.
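A minimal sketch of this 80/20 split, using invented stand-in features (scikit-learn's `train_test_split` offers the same with stratification options):

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical feature matrix (e.g. GLCM features) and binary labels
X = rng.normal(size=(647, 24))
y = (rng.random(647) < 0.32).astype(int)   # rough benign/malignant mix

# shuffle indices, then take the first 80% for training, the rest for testing
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
X_train, X_test = X[idx[:cut]], X[idx[cut:]]
y_train, y_test = y[idx[:cut]], y[idx[cut:]]
print(len(X_train), len(X_test))  # 517 130
```

With 647 samples, the split yields 517 training and 130 test samples, which matches the test-set size implied by the confusion-matrix counts reported later.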

Building a Classification Model (SVM)
In the development stage of the classification model, the Support Vector Machine (SVM) method is used to predict cancer classes: Benign (class 0) and Malignant (class 1). Grid Search is used to find the best hyperparameters that maximize the performance of the SVM model, including the C parameter (error penalty), kernel type (kernel function), and gamma (kernel coefficient).
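A sketch of this hyperparameter search using scikit-learn's `GridSearchCV`, with synthetic stand-in data in place of the GLCM features; the grid values are illustrative, not the paper's exact grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# stand-in data; in the paper the inputs are GLCM texture features
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# grid over the hyperparameters named in the text: C, kernel, and gamma
param_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf"],
    "gamma": [0.01, 0.1, 1, 10],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)            # best C / kernel / gamma combination
print(round(search.best_score_, 3))   # mean cross-validated accuracy
```

Each parameter combination is scored by cross-validation on the training data, and the best combination is then refit on the full training set.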

Evaluation of Classification Model Performance
The algorithm evaluation stage involves measuring the performance of the classification model.K-Fold Cross Validation and Confusion Matrix are used to generate evaluation metrics, including accuracy, precision and sensitivity.The results of these metrics are then compared to determine the best evaluation.
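A sketch of this evaluation step, assuming scikit-learn utilities and synthetic stand-in data: cross-validated predictions are combined into a single confusion matrix, from which accuracy, precision, and sensitivity (recall) are derived.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score)

# stand-in data; the paper evaluates on the BUSI GLCM features
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# 5-fold cross-validated predictions for every sample
y_pred = cross_val_predict(SVC(kernel="rbf", C=10), X, y, cv=5)
cm = confusion_matrix(y, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("accuracy ", round(accuracy_score(y, y_pred), 2))
print("precision", round(precision_score(y, y_pred), 2))
print("recall   ", round(recall_score(y, y_pred), 2))  # sensitivity
```

Deriving all three metrics from the same confusion matrix keeps the comparison between experiments consistent.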

Analysis
The evaluation results analysis stage involves understanding and interpreting the evaluation metrics from the model or experiment.It involves analyzing and comparing metrics such as accuracy, precision, and sensitivity to gain an understanding of the model performance or results obtained.

Data Acquisition
The data used in this research is the Breast Ultrasound Images (BUSI) Dataset, taken from Kaggle at https://www.kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset. The dataset consists of a total of 647 ultrasound images: 437 images of benign tumors (Benign) and 210 images of malignant tumors (Malignant). A visual representation of this dataset can be observed in Figure 2.

Feature Extraction Using GLCM
The feature extraction stage uses the GLCM (Gray Level Co-Occurrence Matrix) method to produce texture features such as dissimilarity, correlation, homogeneity, contrast, ASM, and energy. GLCM captures the relationship between two neighboring pixels of given gray intensities, taking distance and angle into account. Four angles are used: 0°, 45°, 90°, and 135°. The results of this feature extraction are exemplified in Table 1. The GLCM calculation results are then saved in .csv format. The researcher assigned the labels manually using Excel, dividing the images into 2 classes, as can be seen in Table 2. After manual labeling in Excel, the .csv file is uploaded to Google Colaboratory for the pre-processing stage. Before continuing, however, visualization is carried out to understand the relationships between features in the dataset and to gain insight into the correlations, distributions, and patterns of the data as a whole. This visualization covers all pairwise feature combinations in the dataset, as seen in Figure 3.

Data Balancing Using SMOTE
Table 2 indicates an imbalance between classes in the breast cancer dataset, which can affect classification quality. Therefore, the Synthetic Minority Over-Sampling Technique (SMOTE) is used to create synthetic data in the minority class so that its size matches the majority class. The goal is to improve classification quality by ensuring the model is not biased toward the majority class. After applying SMOTE, the amount of data in each class is the same, as can be seen in Table 3.
From Table 3, it can be concluded that 227 synthetic samples were added to the Malignant class to balance the minority class with the majority class.

Data Normalization Using Min-Max
The balanced dataset still shows attributes on widely varying scales. These differences can affect the ability of the Machine Learning model to classify optimally. It is therefore important to normalize the data so that the attributes share a uniform scale when building Machine Learning models. In this research, the Min-Max normalization technique is used to rescale the attributes to values in the range 0 to 1. The results of this normalization process are shown in Figure 4.
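Min-Max normalization maps each attribute value x to (x − min) / (max − min), computed per column. A minimal NumPy sketch with invented feature values:

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column of X to the range [0, 1]: (x - min) / (max - min)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

# hypothetical GLCM features on very different scales
X = np.array([[120.0, 0.02, 5.0],
              [300.0, 0.90, 9.0],
              [210.0, 0.45, 7.0]])
X_norm = min_max_normalize(X)
print(X_norm.min(axis=0))  # [0. 0. 0.]
print(X_norm.max(axis=0))  # [1. 1. 1.]
```

After rescaling, each column spans exactly [0, 1], so no single attribute dominates the SVM's distance computations purely because of its units.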

Feature Selection Using Mutual Information with SelectKBest
At the feature selection stage, the Mutual Information method with SelectKBest was used to improve the quality and efficiency of data analysis and modeling. This method calculates the Mutual Information score between each feature variable and the target variable (label) in the dataset. The resulting Mutual Information scores can be seen in Figure 5. Features with higher Mutual Information scores are considered more informative and important for building a good model or performing accurate analysis. In this research, SelectKBest was used to select the k=6 features with the highest Mutual Information scores, so that only the most informative subset of features is retained. From Figure 5, the six features with the highest Mutual Information scores are Feature 15 (contrast 90°), Feature 3 (dissimilarity 90°), Feature 16 (contrast 135°), Feature 2 (dissimilarity 45°), Feature 4 (dissimilarity 135°), and Feature 14 (contrast 45°).
Figure 6 shows the confusion matrix from the Support Vector Machine experiment using the 'RBF' kernel with parameters C=100 and Gamma=10 on the dataset without SMOTE data balancing. From these results it can be seen that:
1. A total of 80 Benign (benign tumor) samples and 23 Malignant (malignant tumor) samples were predicted correctly.
2. A total of 11 Benign samples were predicted incorrectly as Malignant.
3. A total of 16 Malignant samples were predicted incorrectly as Benign.
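As a sanity check on the reported figures, the metrics implied by these confusion-matrix counts can be recomputed directly; the precision and sensitivity below are computed for the Malignant class only, while the accuracy reproduces the 79% reported for this experiment.

```python
# Confusion-matrix counts reported in the text (Benign = 0, Malignant = 1)
tn = 80   # Benign correctly predicted as Benign
fp = 11   # Benign incorrectly predicted as Malignant
tp = 23   # Malignant correctly predicted as Malignant
fn = 16   # Malignant incorrectly predicted as Benign

total = tn + fp + tp + fn
accuracy = (tn + tp) / total
precision = tp / (tp + fp)      # for the Malignant class
sensitivity = tp / (tp + fn)    # recall for the Malignant class
print(total)                    # 130 test samples (20% of 647)
print(round(accuracy, 2))       # 0.79
print(round(precision, 2))      # 0.68
print(round(sensitivity, 2))    # 0.59
```

The 103 correct predictions out of 130 test samples give the 79% accuracy quoted for this configuration.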

Analysis
From the two types of testing, it can be concluded that both the Train/Test Split and K-Fold Cross Validation tests achieve the same highest evaluation value, although the methods and parameter values differ. A comparison of the best evaluation values of the two methods can be seen in Table 7. Based on this comparison, the Train/Test Split 80/20 test on the dataset without balancing, using an SVM model with the 'RBF' kernel, C=100, and Gamma=10, and the K-Fold Cross Validation test on the dataset balanced with SMOTE, using an SVM model with the 'RBF' kernel, C=10, and Gamma=100, both reach the same highest evaluation values, namely accuracy, precision, and sensitivity of 79%.
Furthermore, this study was compared with previous research that used a similar dataset, namely the Breast Ultrasound Images (BUSI) dataset with two classes, Benign (benign tumors) and Malignant (malignant tumors), totaling 210 samples per class, with details that can be observed in Table 8.

Conclusion
Based on the results of the preceding analysis, it can be concluded that the Support Vector Machine algorithm, implemented with GLCM feature extraction, Min-Max normalization, and Mutual Information with SelectKBest feature selection, produces quite good scores in classifying breast cancer ultrasound images into two classes (Benign and Malignant). Across the experiments and tests carried out, the maximum accuracy value was the same, 79%, whether or not the SMOTE data-balancing technique was applied. Furthermore, compared with previous research that used the same type of dataset, the Breast Ultrasound Images (BUSI) Dataset, the accuracy value increased by 2% when applying the Support Vector Machine algorithm with the 'RBF' kernel, the GLCM feature extraction technique, Min-Max normalization, and Mutual Information with SelectKBest feature selection.

Figure 2. Examples of benign images

Figure 5. Balanced data mutual information score

Table 1. Example of benign image extraction results (columns: Image, Features)

Table 8. Research comparison (columns: Researcher Name, Dataset, Method)
In the comparison with previous research using the BUSI dataset, SVM with DCT feature extraction achieved 77% accuracy in breast cancer classification. In this study, with a larger dataset (647 samples, or 874 after SMOTE balancing), SVM achieved a maximum accuracy of 79% after several experiments and parameter tests. These results show a 2% increase in accuracy from applying the SVM algorithm together with the GLCM feature extraction technique and the Mutual Information with SelectKBest feature selection method.