Comparison of Machine Learning Algorithms for Classification of Ultraviolet Index

ABSTRACT


INTRODUCTION
Indonesia is a tropical country and is therefore exposed to high-intensity sunlight. Part of the solar radiation that reaches the earth's surface is ultraviolet (UV) radiation. UV radiation is classified by wavelength into UV-A, UV-B, and UV-C. UV-B is the component with the strongest effect in causing photodamage to the skin [1]. Sun exposure can also be beneficial, since it increases vitamin D production. The Indonesian Agency for Meteorology, Climatology, and Geophysics (BMKG) issues advisories to the public on the impact of the level of UV radiation reaching the earth's surface, expressed as a UV index value, as shown in Figure 1. The UV index is a unitless number that describes the level of exposure to ultraviolet radiation in relation to human health. Knowing the UV index makes it possible to monitor when ultraviolet exposure is beneficial and when it can be harmful [6]. The UV index parameterization and the associated advisories are the material for building a recommender system.
Collaborative filtering is a recommender system technique that uses rating information from multiple users to predict item ratings for a given user. Recommender systems have already been applied in practice. For example, a hospital applied collaborative filtering to classify illnesses based on symptoms and to advise patients on the best doctor [7]. In the model-based approach to collaborative filtering, recommendations are made from the outputs of models that have been trained to find patterns in the input data [8]. Collaborative filtering combined with machine learning is less time consuming and more reliable [9]. In recent years, machine learning has substantially changed the structure of recommender systems and opened up more opportunities to improve their performance, and machine learning-based recommender systems have received significant attention for achieving high recommendation quality. For example, a school graduate recommender system built with the K-nearest Neighbor (KNN) and Support Vector Machine (SVM) algorithms achieved an accuracy of 0.606 [10].
Machine learning is a technique that learns from data in order to produce predictions. According to recent research, it is divided into three groups: reinforcement learning, supervised learning, and unsupervised learning [11]. Supervised learning relies on labeled data and has two categories, classification and regression. The goal of classification is to predict the class of labeled data. Regression instead predicts a continuous value, that is, a floating-point number within a predetermined range, from the data [12].
This study uses a classification technique to classify UV index level categories in order to provide recommendations, in the form of advisories on daily activities, about the impact of UV rays. Classification assigns a data object to one of several available classes [13]. The technique predicts the class label of the data based on a training set, a data table that consists of several attributes and one label [14]. Algorithms used in supervised machine learning include Linear Regression, Naive Bayes Classifier, Perceptron, Support Vector Machine, Quadratic Classifiers, Decision Trees, Random Forests, and others [15]. Another classification algorithm is K-nearest neighbor [16]. Both the K-nearest neighbor technique and the Support Vector Machine were tested in this study. The K-nearest neighbor (KNN) algorithm projects the training data into a multidimensional space whose dimensions represent the features of the data. KNN belongs to the instance-based learning group and is a lazy learning technique that classifies data according to the closest data points [17].
Vapnik introduced the Support Vector Machine (SVM) algorithm as a kernel-based machine learning model for classification and regression applications. The fundamental principle of SVM is the linear classifier. For non-linear problems, a kernel technique can be added to map the data into a high-dimensional feature space [18]. SVM has attracted the attention of the data mining, pattern recognition, and machine learning communities in recent years because of its remarkable generalizability, optimal solutions, and discriminatory power [19].

Jurnal Teknologi Informasi dan Pendidikan
The literature reviewed covers two areas: the calculation of the UV index for health advisories, and the use of machine learning classification algorithms to classify objects. One UV index study built a climatology of the UV index at local noon from satellite observations of solar UV radiation. Using OMI satellite products gathered over the campus of King Abdul Aziz University in Saudi Arabia to estimate changes in UV exposure, it found a considerable upward trend in the UV index over the 2004-2020 period [2]. A subsequent study measured the UV index with an analog UV index sensor that converts UV radiation into an analog voltage, and obtained the distribution of the UV index over time. Cloud cover reduced the UV index by four levels, and some materials attenuated the UV index. Based on the collected sample, the safe times for sunbathing were before 10.00 WIB and after 14.00 WIB [3]. In research on classifying water quality, the KNN, Naive Bayes, and Decision Tree techniques were used. KNN was the most accurate, with an accuracy of 86.88% compared with 80.84% for Decision Tree and 63.60% for Naive Bayes [14].
In this study, a UV index recommender system was created to provide daily activity advice to users. The recommender system uses a model-based collaborative filtering method assisted by machine learning classification with the K-nearest Neighbor and Support Vector Machine algorithms, which produced higher accuracy performance.

RESEARCH METHOD
This section explains the stages of the recommender system process, starting from the preparation of the dataset, through the testing of the algorithms, to the performance results expressed as accuracy values. The stages are outlined in a flowchart so that they can be easily understood. Figure 2 shows the flowchart for building a machine learning-based recommender system, which requires several steps: collecting the data, pre-processing the data, dividing the data into train data and test data, training and testing with KNN and SVM, analyzing the accuracy of the classification results for each algorithm, and finally producing a trained model along with recommendations on daily activities based on the UV index level. The algorithms were tested using RapidMiner Studio, a standalone data mining analysis tool that also serves as an environment for machine learning, data mining, text mining, and predictive analytics [20]. The data are processed quantitatively: each dataset consists of numerical values obtained from the measurement sensors. Such quantifiable information can be used for mathematical calculations and statistical analysis, so that decisions can be made in real cases.

The UV-A and UV-B data were obtained from in-situ measurements using UV-A and UV-B radiometers at the Global Atmospheric Watch Observation Station in Palu City, Central Sulawesi, as shown in Figure 3. The data used are daily data for local time 06.00-18.00 in January, February, and March 2021, measured every minute. These data are processed into the dataset used as training data and testing data. The raw data collected from the measurement equipment amount to 2155 rows, as shown in Figure 4. The raw data contain the local time and the UV-A and UV-B radiation values in W/m2. The collected data are the input for the calculations that determine the UV index, which in turn is the material for building the recommender system.
Data pre-processing is a crucial step that converts raw data, sometimes referred to as unstructured data, gathered from numerous sources into information that can be used for further processing. The UV index is derived by calculating weighting factors. The UV index value is obtained from the erythemal UV value of the instrument measurements, which is calculated using the weighting factors for UV-A and UV-B according to the CIE spectral action function. The weighting factor is wavelength dependent, and more UV radiation is received at the surface at longer wavelengths. Figure 5 shows UV radiation between 100 nm and 400 nm classified into three spectral bands, including UV-A at 320-400 nm and UV-B at 280-320 nm. UV-A accounts for 95% of the sun's UV radiation that reaches the earth's surface [21]. UV-A penetrates deeper into surfaces than UV-B. However, because energy is inversely proportional to wavelength, UV-A has lower energy than UV-B, and as a result the weighting factor for UV-B is higher than for UV-A.
The radiometer operates over the spectral range of both types of UV. However, the reported data include only the integrated intensity of the measured UV irradiation. Since the choice of wavelength can be arbitrary, we state the assumptions that serve as the reference for our calculations. For UV-A, although the total intensity received at the surface is lower, the wavelength of 325 nm was chosen as a compromise for the higher contribution of UV radiation to the calculations. For UV-B, the chosen wavelength was 305 nm, which is near the UV-A spectral range and has a lower weighting factor than the shorter wavelengths but arrives at the surface in a larger proportion. The weighting factors for the UV spectra at 305 nm and 325 nm are 0.22 and 0.0029, respectively. These factors are used to calculate the erythemal UV intensity. Equation 1 shows how the UV index is calculated [22]:
$$ \mathrm{UVI} = \frac{E_{\mathrm{ery,UVA}} + E_{\mathrm{ery,UVB}}}{0.025\ \mathrm{W/m^2}} \tag{1} $$
where $E_{\mathrm{ery,UVA}}$ and $E_{\mathrm{ery,UVB}}$ are the erythemal UV intensities for UV-A and UV-B.
The two erythemal UVs are both expressed in W/m2. The denominator of 0.025 W/m2 is the standard increment that corresponds to how much total UV is potentially damaging to living tissue. In other words, an increase of one unit on the UV index scale is equivalent to 25 mW/m2 of exposure to UV radiation.
In this test, the data are prepared for the classification stage. Classification can be done using a supervised learning algorithm. For classification, the data are divided into two categories: train data and test data. Train data are used to instruct the algorithm on how to develop a suitable model, while test data are used to assess and evaluate the performance of that model at the testing stage. Following earlier research, the train data and test data in this study are distributed in a ratio of 70:30 [23]. In tests that use 30% testing data and 70% training data, the accuracy can reach 90.33% [24]. The total UV index data used as train data in this study was 1689, and the total used as test data was 646.
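The UV index calculation referred to as Equation 1 can be illustrated with a minimal Python sketch. It assumes that each erythemal term is the measured band irradiance multiplied by its weighting factor (0.0029 for UV-A, 0.22 for UV-B) and that the sum is divided by 0.025 W/m2. The function names, the sample reading, and the category band edges (which follow the commonly used WHO-style bands) are assumptions for illustration; the text names the five classes but does not state their edges.

```python
# Weighting factors quoted in the text (325 nm for UV-A, 305 nm for UV-B).
W_UVA = 0.0029
W_UVB = 0.22
UVI_STEP = 0.025  # W/m2 of erythemal irradiance per UV index unit


def uv_index(uva_wm2, uvb_wm2):
    """Weighted (erythemal) irradiance divided by the 0.025 W/m2 increment.

    Assumption: erythemal term = measured irradiance * weighting factor.
    """
    erythemal = W_UVA * uva_wm2 + W_UVB * uvb_wm2
    return erythemal / UVI_STEP


def uv_category(uvi):
    """Map a UV index to a class name using assumed WHO-style band edges."""
    if uvi < 3:
        return "low"
    if uvi < 6:
        return "moderate"
    if uvi < 8:
        return "high"
    if uvi < 11:
        return "very high"
    return "extreme"


# Hypothetical radiometer reading, not taken from the study's dataset.
sample_uvi = uv_index(uva_wm2=30.0, uvb_wm2=0.5)
sample_label = uv_category(sample_uvi)
print(round(sample_uvi, 2), sample_label)
```

With these assumptions, the category label is what a classifier would later be trained to predict from the measured values.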

RESULTS AND DISCUSSION
This section presents the research findings together with their discussion. The results are shown in tables and figures, and the discussion is divided into several subsections.

Testing Data based on Machine Learning
Testing with the KNN and SVM algorithms is the stage of system development that follows data pre-processing, in which the processed data are analyzed quantitatively. The quantitative processing uses mathematical equations and statistics, assisted by the RapidMiner application, to calculate the accuracy, precision, and recall values of the classification results in the algorithm testing process.

K-nearest Neighbor
In the K-nearest Neighbor (KNN) algorithm, the parameter k denotes the number of nearest neighbors. Given a query, the KNN algorithm locates its nearest neighbors in the training data set, that is, the data points with the shortest distance to the query point. After locating the k nearest data points, it applies a majority vote rule to determine which class appears most often, and the query is assigned to that class [25]. The shortest distance can be calculated using the Euclidean distance formula [26], given in Equation 2:
$$ d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2} \tag{2} $$
where x = data 1, y = data 2, d = Euclidean distance, and i = iteration index.
Figure 7 shows the testing process for the KNN algorithm. The dataset is read as input in Excel format. The nominal data in the category attribute are then converted into numerical form. The data are split in a ratio of 0.7 for train data and 0.3 for test data. Next, the KNN model is selected to classify the data, and the apply model step applies the classification model. Once the model has been trained on the train data and applied to the test data, the performance step reports the accuracy, precision, and recall values of the model.
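The KNN steps described above (Euclidean distance, majority vote, and a 0.7/0.3 split) can be sketched in plain Python. This is an illustrative sketch rather than the RapidMiner process used in the study: the toy dataset, the value of k, the random seed, and the helper names are all assumptions.

```python
import math
import random
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_rows, query, k=3):
    """Majority vote among the k training rows closest to the query."""
    nearest = sorted(train_rows, key=lambda row: euclidean(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def split_70_30(rows, seed=0):
    """Shuffle, then keep 70% of the rows for training and 30% for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def toy_label(uvi):
    """Hypothetical band edges, used only to label the toy data."""
    for upper, name in ((3, "low"), (6, "moderate"), (8, "high"), (11, "very high")):
        if uvi < upper:
            return name
    return "extreme"


# Toy dataset: one feature (a UV index value) per row, not the study's data.
rows = [([step / 2], toy_label(step / 2)) for step in range(28)]
train_rows, test_rows = split_70_30(rows)

# Fraction of held-out rows whose predicted class matches the true class.
correct = sum(knn_predict(train_rows, features) == label for features, label in test_rows)
toy_accuracy = correct / len(test_rows)
print(len(train_rows), len(test_rows), toy_accuracy)
```

KNN keeps the whole training set and defers all computation to prediction time, which is why it is described as a lazy, instance-based method.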

Support Vector Machine
Support Vector Machine (SVM) is a machine learning technique based on statistical learning theory that can identify predictive systems and handle non-linear regression in a high-dimensional feature space. The algorithm is also flexible enough to be applied to data modeling, where data classification and analysis follow a regressive pattern. SVM is an algorithm for producing predictions in the context of regression or classification [27]. This test uses SVM with the Radial Basis Function (RBF) kernel, also known as the Gaussian kernel. The RBF kernel function is given by Equation 3 [28]:
$$ K(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right) \tag{3} $$
where γ is a parameter that sets the distance.
Classifying with the SVM algorithm involves several steps: determining the hyperplane, or dividing line, between two support vectors; determining the margin, or distance, between the support vectors and the hyperplane; and mapping the support vectors into a class in the same dimension.
Figure 8 shows the testing process for the SVM algorithm. The dataset is read as input in Excel format. The category attribute in the data table is of nominal type and is converted to a numerical attribute to make classification with SVM easier. The data are then split in a ratio of 0.7 for train data and 0.3 for test data. Next, the SVM model is selected to classify the data, and the apply model step applies the classification model. Once the model has been trained on the train data and applied to the test data, the performance step reports the accuracy, precision, and recall values of the model.
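The RBF (Gaussian) kernel used in this test can be illustrated with a short Python sketch. It evaluates the standard form exp(-γ‖x − y‖²) for pairs of points. The value γ = 0.5 and the sample points are assumptions chosen for illustration, not the settings used in the study.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical two-feature points, used only to show the kernel's behavior.
points = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]

# Gram matrix: kernel value for every pair of points.
gram = [[rbf_kernel(p, q) for q in points] for p in points]
for row in gram:
    print([round(value, 4) for value in row])
```

The kernel equals 1 for identical points and decays toward 0 as points move apart. A larger γ makes the decay faster, so each training point influences a smaller neighborhood.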

Accuracy Measurement with Confusion Matrix
The confusion matrix is a method for evaluating a classification method based on the accuracy of its classification results. Classification accuracy affects classification performance. The confusion matrix is a matrix of predictions that are compared with the original categories in the input data [29], so it is used to compare the classification results produced by the system with the actual classes. It describes the performance of a classification model on a set of test data with known true values, and the accuracy measurements derived from it show how accurate the trained model is. In multi-class classification, the metrics defined for binary classification do not fully apply [30]. Table 1 shows the confusion matrix for multi-class problems with k classes [31]. In this study there are five classes: low, moderate, high, very high, and extreme. The results discuss the accuracy, precision, and recall values obtained from the multi-class confusion matrix for the KNN and SVM algorithms.
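A multi-class confusion matrix and the metrics derived from it can be sketched in Python as follows. Rows are taken to be the actual classes and columns the predicted classes, which is an assumed orientation. The label lists are invented toy values, not results from this study.

```python
CLASSES = ["low", "moderate", "high", "very high", "extreme"]


def confusion_matrix(actual, predicted, classes=CLASSES):
    """Count matrix with rows = actual class and columns = predicted class."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def class_precision(matrix, i):
    """Of everything predicted as class i, the fraction that truly is class i."""
    predicted_as_i = sum(row[i] for row in matrix)
    return matrix[i][i] / predicted_as_i if predicted_as_i else 0.0


def class_recall(matrix, i):
    """Of everything that truly is class i, the fraction predicted as class i."""
    actual_i = sum(matrix[i])
    return matrix[i][i] / actual_i if actual_i else 0.0


# Invented toy labels for illustration only.
actual = ["low", "low", "moderate", "high", "high", "extreme"]
predicted = ["low", "moderate", "moderate", "high", "high", "very high"]

matrix = confusion_matrix(actual, predicted)
print("accuracy:", round(accuracy(matrix), 3))
for i, name in enumerate(CLASSES):
    print(name, class_precision(matrix, i), class_recall(matrix, i))
```

In the multi-class case, precision and recall are computed per class from the corresponding column and row, while accuracy is a single value for the whole matrix.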

Accuracy results of the K-nearest neighbor algorithm
Testing the machine learning algorithms produces class recall, class precision, and accuracy values. The accuracy is determined from a multi-class confusion matrix. Table 2 shows the results of the classification test with the UV index confusion matrix using the KNN algorithm. Table 3 shows the results of the classification test with the UV index confusion matrix using the SVM algorithm. Table 4 shows the results of testing the KNN and SVM algorithms in terms of accuracy, precision, and recall.