Analysis Of Factors Affecting Breast Cancer At Haji General Hospital Medan Using The Principal Component Analysis Method

ABSTRACT


INTRODUCTION
Cancer is defined as the growth of abnormal cells that spread uncontrollably and have the potential to harm other parts of the body [1]. One of the most common cancers is breast cancer. Breast cancer is a malignant cancer of the breast that originates from gland cells, gland ducts, and the supporting tissue of the breast, but does not include the skin of the breast [2]. According to 2020 data from the International Agency for Research on Cancer in Globocan (Globocan cancer statistics), breast cancer is the most common cancer in Indonesia. Of the 396,914 cases recorded, around 65,858 (16.6%) were new cases of breast cancer detected nationally, with 22,430 deaths (9.6%) due to this disease, and North Sumatra is among the 10 provinces that contribute the most cases in Indonesia. The incidence of breast cancer increases every year in North Sumatra, especially in the city of Medan. In 2021, the number of breast cancer sufferers in the city of Medan reached 824 cases. In 2022 there were approximately 100 cases of breast cancer at the Haji Provincial General Hospital (RSUP) Medan.
Analyzing the data above requires a method that can handle several variables and also measure the relationships between them. A statistical method capable of analyzing several variables is multivariate statistics. In multivariate statistics, each method has its own benefits. Factor analysis is a commonly used multivariate statistical technique [3]. Principal component analysis is a multivariate statistical method that can be used to describe how a set of correlated data (parameters) can be transformed into several independent parameters (principal components) [4].
Factor analysis using principal components is a type of data analysis that can be used to extract assessment indicators to create new factors that are not correlated with each other and that strengthen an assessment category. The degree of data variance in all indicators is calculated in the principal components [5].
Previous research conducted by [6], entitled "Breast Cancer Detection with Feature Selection based on Principal Component Analysis and Random Forest", aimed to achieve a high level of accuracy in breast cancer detection. That research used a classification model, namely Principal Component Analysis based on Random Forest, to predict the problem under study, and the model was evaluated by its accuracy value. Its findings show that when Random Forest and LogitBoost feature selection are combined, the principal component analysis-based feature selection method significantly improves classification performance.
Based on this background and on previous research, the author is interested in using the PCA (Principal Component Analysis) method to analyze several factors that influence breast cancer at Haji Medan General Hospital, considering factors that have the potential to cause breast cancer. The author therefore conducts this research under the title "Analysis of Factors Affecting Breast Cancer at Haji General Hospital Medan Using the Principal Component Analysis Method".

Factor Analysis
Factor analysis is a multivariate statistical analysis technique that attempts to identify relationships between several variables that previously did not depend on each other in order to create one variable or group (Simarmata et al., 2015).
The initial model of factor analysis is as follows:

$$X_i = A_{i1}F_1 + A_{i2}F_2 + \cdots + A_{ij}F_j + \cdots + A_{im}F_m + V_i U_i$$

where
$X_i$ = standardized $i$-th variable (mean = 0, standard deviation = 1)
$A_{ij}$ = partial regression coefficient of variable $i$ on the $j$-th common factor
$F_j$ = $j$-th common factor
$V_i$ = standardized factor coefficient of variable $i$ on the $i$-th unique factor
$U_i$ = unique factor of the $i$-th variable
$m$ = number of common factors

The methods available in the factor analysis model are principal components, unweighted least squares, generalized least squares, maximum likelihood, principal axis factoring, alpha factoring, and image factoring. Among these methods, only two are often used in parameter estimation, namely PCA, because it can overcome multicollinearity problems [7], and maximum likelihood, because it can provide the best estimation results [8].
According to [8], because correlation is the basic idea of factor analysis, the correlational assumptions that apply are:
1. The correlation between independent variables must be strong enough, for example greater than 0.5.
2. The partial correlation, namely the correlation between two variables when the other variables are held constant, must be small.
3. The Measure of Sampling Adequacy (MSA) or Bartlett's Test of Sphericity is used to test the full correlation matrix.

The steps for carrying out factor analysis essentially consist of (Kusno, 2019: 63):
a. To find out how closely the variables are related to each other, a data matrix is compiled in the form of a correlation matrix between the original variables.
b. Bartlett's Test of Sphericity, KMO, and MSA are used to test the relationship between the variables.
c. Extracting factors (factoring) reduces the data from several indicators (variables) to a smaller number of factors that can explain the relationship between the indicators studied.
d. Rotating factors is carried out if the extraction (factoring) still does not yield clear main factor components.
e. Interpreting the factor rotation results (done by looking at the factor loadings).

Principal Component Analysis (PCA)
There are two methods for extracting factors, namely principal component analysis and common factor analysis [9]. PCA is quite effective at overcoming the problem of multicollinearity because it removes the correlation between independent variables. Its advantage is that it eliminates correlation without removing the original variables. In detail, the aim of PCA is to eliminate, or in other words simplify away, factors that are less influential or less related without losing the intent and purpose of the original data [10].
The steps in Principal Component Analysis (PCA) are as follows [11]:

1. Calculate the variance-covariance matrix from the observational data. The variance $\mathrm{Var}(x)$ measures the spread of the data in a sample, that is, how far the observations deviate from their mean. The covariance matrix is a matrix whose cells contain the covariance values $\mathrm{Cov}(x, y)$ obtained from the sample, where $x$ and $y$ are random variables [12]:

$$\mathrm{Var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \mathrm{Cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

where $n$ = number of observations, $x_i$, $y_i$ = values of the $i$-th observation, and $\bar{x}$, $\bar{y}$ (also written $\mu_x$, $\mu_y$) = sample means of the variables $x$ and $y$. Applying these formulas to every pair of variables gives a square covariance matrix.

2. Calculate the eigenvalues ($\lambda$) and eigenvectors ($X$). This step is carried out after the covariance matrix $A$ has been obtained. The eigenvalues are found from the characteristic equation [11]:

$$\det(A - \lambda I) = 0$$

Then the eigenvectors are calculated by solving:

$$[A - \lambda I][X] = 0$$

where $I$ is the identity matrix. The resulting factors are later transformed by orthogonal varimax rotation, which minimizes the number of variables that have high loadings on a factor.

3. Finally, from the eigenvalue of each new variable (PC), determine the proportion of variance of the Principal Component (PC) in percent using the formula:

$$\mathrm{PC}_k(\%) = \frac{\lambda_k}{\sum_{j}\lambda_j} \times 100\%$$
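The PCA steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes NumPy, the sample (n − 1) form of the covariance, and a randomly generated toy data set; the function name `pca_from_covariance` is hypothetical.

```python
import numpy as np

def pca_from_covariance(X):
    """PCA via the covariance matrix. X has one row per observation."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centered = X - X.mean(axis=0)            # subtract each variable's mean
    cov = centered.T @ centered / (n - 1)    # sample variance-covariance matrix (assumed n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort from largest to smallest eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs              # new variables (principal components)
    proportion = 100.0 * eigvals / eigvals.sum()  # share of variance per PC, in percent
    return cov, eigvals, eigvecs, scores, proportion

# Toy data (assumption, for illustration only): 50 observations, 3 variables
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 3))
toy[:, 1] += 0.8 * toy[:, 0]                 # make two variables correlated
cov, eigvals, eigvecs, scores, proportion = pca_from_covariance(toy)
```

The component scores are uncorrelated with each other, which is the property the text refers to when it says PCA eliminates correlation between variables.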

RESULTS AND ANALYSIS
In factor analysis using PCA, the first step was to collect data from the medical records of inpatients and outpatients at Haji Medan Hospital in 2022. The data collected comprised 153 records of breast cancer patients with several causal factors. The next step was to check the data for empty entries (missing values). After this check, 143 records were obtained.
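A missing-value check of this kind can be sketched as follows. The sketch assumes that records with any empty field are dropped, which the paper does not state explicitly; the field names and the four example records are hypothetical and are not the hospital data.

```python
def drop_incomplete(records, fields):
    """Keep only records in which every listed field has a value."""
    complete = []
    for rec in records:
        # A field counts as missing when it is absent, None, or an empty string
        if all(rec.get(f) not in (None, "") for f in fields):
            complete.append(rec)
    return complete

# Hypothetical records for illustration only
fields = ["age", "gender", "smoking"]
records = [
    {"age": 52, "gender": "F", "smoking": "no"},
    {"age": None, "gender": "F", "smoking": "no"},   # missing age
    {"age": 47, "gender": "F", "smoking": ""},       # missing smoking status
    {"age": 60, "gender": "M", "smoking": "yes"},
]
clean = drop_incomplete(records, fields)
```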

Correlation Matrix Between Variables
Next, a correlation matrix is formed to see how closely the variables are related to each other.
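As a minimal sketch of this step, the correlation matrix can be computed by standardizing each variable and averaging the cross-products. NumPy is assumed, the helper name `correlation_matrix` is hypothetical, and the data are randomly generated stand-ins for the seven variables.

```python
import numpy as np

def correlation_matrix(X):
    """Pearson correlation matrix. X has one row per observation."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Standardize: mean 0 and sample standard deviation 1 for every variable
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    return Z.T @ Z / (n - 1)

# Random stand-in data (assumption): 143 observations, 7 variables
rng = np.random.default_rng(1)
data = rng.normal(size=(143, 7))
R = correlation_matrix(data)
```

The result agrees with `np.corrcoef(data, rowvar=False)`; the explicit version is shown only to make the link to the standardized variables of the factor model visible.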

a. Bartlett's Test and KMO (Kaiser Meyer Olkin)
Bartlett's test of sphericity is used to test the correlation between variables in the sample [13]. The requirement for Bartlett's test is a significance value < α = 0.05. The results of Bartlett's test show a significance value of 0.0000, which means that there is a correlation between the variables and the process can continue.
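For reference, the Bartlett statistic can be computed directly from a correlation matrix. The sketch below uses the standard formula, with chi-square statistic −(n − 1 − (2p + 5)/6) ln|R| and p(p − 1)/2 degrees of freedom; it assumes NumPy and a small made-up correlation matrix, and it is not the code used in the study.

```python
import numpy as np

def bartlett_sphericity(R, n):
    """Bartlett's test of sphericity.

    R: p x p correlation matrix, n: number of observations.
    Returns the chi-square statistic and its degrees of freedom.
    """
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)          # ln|R|, computed stably
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    df = p * (p - 1) // 2
    return statistic, df

# Made-up 2 x 2 correlation matrix, with n = 143 as in the cleaned data set
example_R = np.array([[1.0, 0.5], [0.5, 1.0]])
statistic, df = bartlett_sphericity(example_R, n=143)
```

The significance value is the upper tail of the chi-square distribution with `df` degrees of freedom at `statistic`; with SciPy installed this would be `scipy.stats.chi2.sf(statistic, df)`.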


Analysis Of Factors Affecting Breast Cancer At Haji General Hospital Medan Using The Principal Component Analysis Method (Khairiyah Nurfalija Daulay)

The KMO test is carried out to assess the adequacy of sampling in a study. The results of the KMO test show that the data are suitable for factor analysis because the KMO value is > 0.5 and falls in the range 0.5 < KMO ≤ 0.6, which is classed as sufficient for analysis. Based on calculations using Python, the KMO value is 0.522771, meaning that the sampling adequacy is acceptable because it is greater than 0.5. The results in Table 1 show that the data are suitable for factor analysis. The next stage is testing the anti-image correlation matrix. The degree of correlation can be seen from the MSA values in Table 2. These results show that the MSA value is > 0.5, which means that the seven variables are sufficiently correlated and can be analyzed further (Manullang et al., 2023).
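The paper states that KMO was computed in Python but does not say how. The following is one possible sketch under that caveat: it derives the partial (anti-image) correlations from the inverse of the correlation matrix and compares their squared sizes with the squared ordinary correlations. NumPy is assumed, and `kmo_and_msa` is a hypothetical helper name.

```python
import numpy as np

def kmo_and_msa(R):
    """Overall KMO and per-variable MSA from a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)           # partial correlations between variable pairs
    off = ~np.eye(R.shape[0], dtype=bool)     # mask selecting off-diagonal cells
    r2 = np.where(off, R ** 2, 0.0)           # squared correlations
    a2 = np.where(off, partial ** 2, 0.0)     # squared partial correlations
    msa = r2.sum(axis=0) / (r2.sum(axis=0) + a2.sum(axis=0))
    kmo = r2.sum() / (r2.sum() + a2.sum())
    return kmo, msa

# Made-up example: three variables that all correlate 0.5 with each other
example_R = np.array([[1.0, 0.5, 0.5],
                      [0.5, 1.0, 0.5],
                      [0.5, 0.5, 1.0]])
kmo, msa = kmo_and_msa(example_R)
```

Small partial correlations relative to the ordinary correlations push both values toward 1, which is why a value above 0.5 is read as adequate.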

a. Communalities
The next stage is the extraction process using the PCA method. After the MSA value of each variable has been obtained, the communalities are computed. This value shows whether the variable being studied is adequately represented by the factors formed. A variable is considered adequate if its extraction value is > 0.50. The higher the communality, the stronger the relationship between the variable and the resulting factors.
The results in Table 3 show that the variable with the highest communality is gender, at 0.579526 = 57.9526%. This means that the factors formed explain 57.9526% of the variance of the gender variable. The greater the communality, the closer the relationship between the variable and the factors formed. So, it can be concluded from Table 3 that only the gender and smoking habits variables can be used to explain factors related to breast cancer.
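To make the communality concrete: under principal component extraction, a variable's communality is the sum of its squared loadings on the retained components. The sketch below assumes NumPy and a made-up correlation matrix; `pca_communalities` is a hypothetical name and the numbers are not those of Table 3.

```python
import numpy as np

def pca_communalities(R, n_components):
    """Loadings and communalities for the first n_components of a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:n_components]   # largest eigenvalues first
    # Loading = eigenvector entry scaled by the square root of its eigenvalue
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    communalities = (loadings ** 2).sum(axis=1)         # one value per variable
    return loadings, communalities

# Made-up example: three variables that all correlate 0.5 with each other
example_R = np.array([[1.0, 0.5, 0.5],
                      [0.5, 1.0, 0.5],
                      [0.5, 0.5, 1.0]])
loadings, comm = pca_communalities(example_R, n_components=1)
```

If all components are kept, every communality equals 1; retaining fewer components lowers it, and the 0.50 threshold asks that at least half of a variable's variance survive the reduction.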

b. Total Variance Explained
Next, the total variance explained is calculated to obtain the eigenvalue and explained variance of each component. More specific results of the PCA extraction can be seen from the total variance explained table, which lists the eigenvalues. Only components with eigenvalues greater than or equal to one form factors. Thus, from the 7 variables extracted in this study, three principal components were formed, where component 1 had an eigenvalue of 1.595666, component 2 of 1.293432, and component 3 of 1.138518, as shown in Table 4.
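The retention rule and the share of variance can be illustrated with the three eigenvalues reported above. The sketch assumes the analysis was run on standardized variables, so that the total variance equals the number of variables (7); this assumption is not confirmed by the text, and the helper names are hypothetical.

```python
# Eigenvalues of the three retained components, as reported in the text
paper_eigenvalues = [1.595666, 1.293432, 1.138518]
n_variables = 7   # assumed total variance for standardized variables

def kaiser_retained(eigenvalues, threshold=1.0):
    """Keep only eigenvalues greater than or equal to the threshold."""
    return [ev for ev in eigenvalues if ev >= threshold]

def percent_of_variance(eigenvalues, total_variance):
    """Share of the total variance carried by each eigenvalue, in percent."""
    return [100.0 * ev / total_variance for ev in eigenvalues]

def cumulative(values):
    """Running total of a list of numbers."""
    out, running = [], 0.0
    for v in values:
        running += v
        out.append(running)
    return out

retained = kaiser_retained(paper_eigenvalues)
percents = percent_of_variance(retained, n_variables)
cumulative_percents = cumulative(percents)
```

Under the stated assumption, `cumulative_percents[-1]` gives the share of the total variance captured by the three components together.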

Factor Rotation
Before interpreting the factor results, the factors must first be rotated to clarify the correlations between factors and variables. Factor rotation is also carried out to revisit variables whose placement is still not appropriate and to make the position of each variable clearer. Only factor loadings (the correlations between a variable and a factor) with a value > 0.30 are considered quite strongly correlated [14]. The component matrices before and after rotation are shown in Tables 5 and 6.
In the component matrix in Table 5, it is still not easy to find the appropriate place for each variable; for example, for the Smoking Habits ($X_4$) and Marital Status ($X_5$) variables, the difference in loading values between components is very small.
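The effect of rotation can be illustrated with a common SVD-based varimax iteration. This is a sketch under assumptions: NumPy is used, the loadings are made up (a simple two-factor structure turned by 30 degrees), and the paper does not say which implementation produced Table 6.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation. Returns rotated loadings and the rotation matrix."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Gradient-like matrix of the varimax criterion at the current rotation
        G = L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag((Lr ** 2).sum(axis=0)))
        u, s, vt = np.linalg.svd(G)
        R = u @ vt                    # closest orthogonal matrix to G
        d_new = s.sum()
        if d_new < d * (1 + tol):     # stop when the criterion no longer improves
            break
        d = d_new
    return L @ R, R

def varimax_criterion(L):
    """Sum over factors of the variance of the squared loadings."""
    return float((np.asarray(L) ** 2).var(axis=0).sum())

# Made-up loadings: a clean two-factor structure, deliberately blurred by a 30 degree turn
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.6]])
theta = np.pi / 6
turn = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta), np.cos(theta)]])
unrotated = simple @ turn
rotated, rotation = varimax(unrotated)
```

Because the rotation matrix is orthogonal, each variable's communality is unchanged; only the way its variance is divided among the factors shifts, which is what makes the variable groups easier to read.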

Interpretation of results
The loading value indicates the relationship between the factors formed and the variables. The higher the loading value, the closer the variable is to the factor. In the results above, each variable is assigned to a factor based on its largest loading value, and the factors are interpreted accordingly in Table 8. As Table 8 shows, the main factor (component) formed is the gender factor, with an eigenvalue of 1.595666. So, the factor that most influences breast cancer at Haji General Hospital Medan in 2022 is gender.
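The assignment rule described here, where each variable joins the factor on which its absolute loading is largest, can be sketched as follows. The variable names and loading values are hypothetical and do not reproduce Table 6.

```python
def assign_to_factors(loadings, names):
    """Group variable names by the factor with the largest absolute loading."""
    groups = {}
    for name, row in zip(names, loadings):
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        groups.setdefault(best + 1, []).append(name)   # factors numbered from 1
    return groups

# Hypothetical rotated loadings: one row per variable, one column per factor
names = ["V1", "V2", "V3", "V4"]
hypothetical_loadings = [
    [0.82, 0.10, -0.05],
    [0.12, -0.71, 0.20],
    [0.05, 0.33, 0.64],
    [0.30, 0.55, 0.31],
]
groups = assign_to_factors(hypothetical_loadings, names)
```

The absolute value matters because a strongly negative loading ties a variable to a factor just as closely as a positive one.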

CONCLUSION
The factors that influence breast cancer at RSUP Haji Medan consist of 7 variables, which are classified into 3 factors, namely: the first factor with an eigenvalue of 1.595666, the second factor with an eigenvalue of 1.293432, and the third factor with an eigenvalue of 1.138518. The factor that most influences breast cancer at RSUP Haji Medan in 2022 is gender.
$$[A - \lambda I][X] = 0$$

where
$A$ = $n \times n$ matrix
$I$ = identity matrix
$\lambda$ = eigenvalue
$X$ = eigenvector

The new variables, also called the Principal Components (PC), are determined by multiplying the original variables by the eigenvector matrix.

Figure 4.1. Scree plot of the eigenvalues
Figure 4.1 shows the components that have eigenvalues > 1, namely the three highest factors. The figure presents the eigenvalue of each factor formed in graphical form.

Table 1. Bartlett's test and KMO

b. MSA (Measure of Sampling Adequacy)

Table 2. MSA values for 7 variables

Table 4. Total Variance Explained

c. Scree Plot

Table 5. Component Matrix Before Rotation

The rotated component matrix shows a clearer and more appropriate distribution of variables. It can be seen from Table 6 that the previously small factor loadings become smaller and the large factor loadings become larger.

Table 6. Component Matrix After Rotation

I. Factor 1 has only 1 forming variable: Gender.
II. Factor 2 has 3 forming variables: Age, Smoking Habits, and Occupation.
III. Factor 3 has 3 forming variables: Genetics (Family History), Marital Status, and Last Education.

The following table shows which factor (factor group) each variable belongs to:

Table 7. Rotation Result Factor Group

Table 8. Variable Interpretation Results