Generalized Normalized Euclidean Distance Based Fuzzy Soft Set Similarity for Data Classification

Classification is one of the data mining processes used to accurately predict predetermined target classes from learned data. This study discusses data classification using a fuzzy soft set method to predict target classes accurately, and aims to form a data classification algorithm based on the fuzzy soft set method. In this study, the fuzzy soft set was calculated based on the normalized Hamming distance. Each parameter in this method is mapped to a power set from a subset of the fuzzy set using a fuzzy approximation function. In the classification step, a generalized normalized Euclidean distance is used to determine the similarity between two fuzzy soft sets. The experiments used the University of California, Irvine (UCI) Machine Learning datasets to assess the accuracy of the proposed data classification method. The dataset samples were divided into training (75% of samples) and test (25% of samples) sets. Experiments were performed in MATLAB R2010a. The experiments showed that: (1) in order of computation time, from fastest to slowest, the compared measures are matching function, distance measure, similarity, and normalized Euclidean distance; (2) the proposed approach improves accuracy and recall by up to 10.3436% and 6.9723%, respectively, compared with baseline techniques. Hence, the fuzzy soft set method is appropriate for classifying data.


Introduction
Nowadays, Big Data arises in many fields: tuberculosis (TBC) patient data in healthcare, stock data in economics and business, BMKG data (containing weather, temperature, and rainfall records), and so on. Data mining is the process of extracting knowledge from large amounts of data [1]; it is done by extracting information and analyzing data patterns or relationships [2,3].
The critical issue in fuzzy soft sets is the similarity measure. In recent years, similarity measurement between two fuzzy soft sets has been studied from different aspects and applied to various fields, such as decision-making, pattern recognition, region extraction, coding theory, and image processing. For example, similarity measures for fuzzy soft sets based on distance, set-theoretic approaches, and matching functions have been investigated [20]. Sut [21] and Rajarajeswari [22] used the notion of the similarity measure in Majumdar and Samanta [20] to make decisions. Several similarity measures based on four types of quasi-metrics were introduced for fuzzy soft sets [23]. Sulaiman [24] researched a set-theoretic similarity measure for fuzzy soft sets and applied it to group decision-making. However, some studies investigated distance-based similarity measures for fuzzy soft sets without regard to efficiency, resulting in high computational costs [20,23]. Feng and Zheng [25] showed that similarity measures based on the Hamming distance and the normalized Euclidean distance are reasonable for fuzzy soft sets. Thus, in the present paper, a similarity measure based on the generalized normalized Euclidean distance is applied to fuzzy soft sets for classification; the similarity is used to assign class labels to data. The experimental results show that the proposed approach improves classification accuracy.

The Proposed Method/Algorithm
This section presents the basic definitions of fuzzy set theory, soft set theory, and some useful definitions from Roy and Maji [12].

Fuzzy Set
Definition 2.1 [10] Let U be a universe. A fuzzy set A over U is defined by a membership function μ_A: U → [0, 1], where the value μ_A(x) is the membership value of x ∈ U and represents the degree to which x belongs to the fuzzy set A. Thus, a fuzzy set A over U can be represented as in (2):

A = {(x, μ_A(x)) : x ∈ U, μ_A(x) ∈ [0, 1]}.    (2)

The set of all fuzzy sets over U is denoted by F(U).

Fuzzification
Fuzzification is the process of converting a crisp value into a fuzzy set; the reverse process converts a fuzzy quantity back into a crisp quantity [26]. This process uses membership functions and fuzzy rules. The fuzzy rules can be written as fuzzy implications, such as IF (x₁ is A₁) ∘ (x₂ is A₂) ∘ … ∘ (xₙ is Aₙ) THEN y is B, where ∘ is the operator "AND" or "OR". B can be determined by combining all antecedent values [14].
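As a minimal sketch of the fuzzification step, the snippet below maps crisp attribute values onto [0, 1] membership degrees using min-max normalization. This is one common choice of membership function, not necessarily the one used in the cited experiments; the function name and bounds are illustrative.

```python
def fuzzify(values, lo=None, hi=None):
    """Map crisp attribute values to [0, 1] membership degrees
    via min-max normalization (one common fuzzification choice)."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = (hi - lo) or 1.0  # avoid division by zero on constant attributes
    return [(v - lo) / span for v in values]

print(fuzzify([2.0, 4.0, 6.0]))  # -> [0.0, 0.5, 1.0]
```

Each attribute of the dataset would be fuzzified independently in this way before the fuzzy soft set is built.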

Fuzzy Soft Set (FSS)
Definition 2.5 [12] Let U be an initial universe set and E a set of parameters. Let P(U) denote the set of all fuzzy subsets of U, and let A ⊆ E. Γ_A is called a fuzzy soft set over U, where γ_A is a mapping γ_A: A → P(U). Here, γ_A is the approximate function of the fuzzy soft set Γ_A, and the value γ_A(e) is called the e-element of the fuzzy soft set for all e ∈ A. A fuzzy soft set Γ_A over U can therefore be represented by the set of ordered pairs Γ_A = {(e, γ_A(e)) : e ∈ A, γ_A(e) ∈ P(U)}. Note that the set of all fuzzy soft sets over U is denoted by FS(U).
Example 1 [14] Let a fuzzy soft set Γ_A describe the attractiveness, with respect to the given parameters, of the shirts that the authors are going to wear. U = {u₁, u₂, u₃, u₄, u₅} is the set of all shirts under consideration, and P(U) is the collection of all fuzzy subsets of U. Let E = {e₁ = "colorful", e₂ = "bright", e₃ = "cheap", e₄ = "warm"} and A = {e₁, e₂, e₃}. The family {γ_A(eᵢ); i = 1, 2, 3} of elements of P(U) is then a fuzzy soft set Γ_A. The tabular representation of the fuzzy soft set Γ_A is shown in Tab. 1.
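A fuzzy soft set of this kind is naturally represented as a mapping from parameters to fuzzy subsets of U. The sketch below uses nested dictionaries; the membership values are illustrative placeholders, not the entries of Tab. 1.

```python
# A fuzzy soft set as a mapping from parameters to fuzzy subsets of U.
# Membership values are illustrative, not those of Tab. 1.
U = ["u1", "u2", "u3", "u4", "u5"]
gamma_A = {
    "e1_colorful": {"u1": 0.5, "u2": 0.9, "u3": 0.0, "u4": 0.3, "u5": 1.0},
    "e2_bright":   {"u1": 0.8, "u2": 0.2, "u3": 0.6, "u4": 0.4, "u5": 0.7},
    "e3_cheap":    {"u1": 0.1, "u2": 0.7, "u3": 0.9, "u4": 0.5, "u5": 0.2},
}
# The e-element gamma_A(e1) is the fuzzy subset for parameter e1;
# gamma_A["e1_colorful"]["u2"] is the membership value of u2 in it.
print(gamma_A["e1_colorful"]["u2"])  # -> 0.9
```

Each row of a tabular representation such as Tab. 1 corresponds to one object of U, and each column to one e-element of the fuzzy soft set.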
|U| denotes the cardinality of the universe U, and the set of all cardinal sets of fuzzy soft sets over U is denoted by cFS(U).

Classification
Classification involves learning a target function that maps each collection of data attributes to one of several predefined classes. The purpose of classification is to predict the target class for each case in the data as accurately as possible. A classification algorithm consists of two stages. In the training stage, the classifier is trained on predefined classes or data categories. A tuple X, represented by the n-dimensional attribute vector X = {x₁, x₂, …, xₙ}, is described by the measurements made on the tuple over n attributes A₁, A₂, …, Aₙ. Each tuple belongs to a class, identified by its class attribute. Class attribute labels take discrete, unordered values, and each value acts as a category or class. The second stage is classification. In this stage, the built classifier is used to classify data, and the accuracy of the classification algorithm is estimated on the test data. If a training set were used to measure the classifier's accuracy, the estimate would be overly optimistic, because the data used to form the classifier comprise the training set. Therefore, a test set (a set of tuples and their class labels selected randomly from the dataset) is used. The test set is independent of the training set, because it was not used to build the classifier.
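The two-stage protocol above relies on an independent, randomly selected test set. A minimal sketch of such a split (the 75/25 proportion matches the experiments reported later; the function name and seed are illustrative):

```python
import random

def split(samples, train_frac=0.75, seed=0):
    """Shuffle labeled samples and split them into independent
    training and test sets, so the test set is never used to
    build the classifier."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(train_frac * len(samples))
    train = [samples[i] for i in idx[:cut]]
    test = [samples[i] for i in idx[cut:]]
    return train, test

data = [(i, "A" if i % 2 else "B") for i in range(20)]  # (features, label) pairs
train, test = split(data)
print(len(train), len(test))  # -> 15 5
```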

Similarity Measurement
A measurement of similarity or dissimilarity defines the relationships between samples or objects. Similarity measurements were used to determine which patterns, signals, images, or sets are alike. For the similarity measure, the resemblance is more critical when its value increases, but, conversely, for a dissimilarity measurement, the resemblance is more robust when its value decreases [27]. An example of the dissimilarity measure is a distance measure. Measuring similarity or distance between two entities is crucial in various data mining and information discovery tasks, such as classification and clustering. Similarity indicators calculate the degree that various patterns, signals, images, or sets are alike. A few researchers have measured the similarity between fuzzy sets, fuzzy numbers, and vague sets. Recently [14,20,28] studied the similarity measure of the soft set and fuzzy soft set. They explained the similarity between the two generalized fuzzy soft sets as follows.
Let U = {x₁, x₂, …, xₙ} be the universal set and E = {e₁, e₂, …, eₘ} the set of parameters. The similarity between F and G is denoted by M(F, G), and the similarity between the two fuzzy sets ρ and δ is denoted by m(ρ, δ). Then, the similarity between the two generalized fuzzy soft sets F_ρ and G_δ is denoted by S(F_ρ, G_δ) = M(F, G) · m(ρ, δ).

Therefore, M(F, G) = maxᵢ Mᵢ(F, G), where

Mᵢ(F, G) = 1 − (Σⱼ |Fᵢ(xⱼ) − Gᵢ(xⱼ)|) / (Σⱼ (Fᵢ(xⱼ) + Gᵢ(xⱼ))).

Furthermore, if we use the universal fuzzy soft set, then ρ = δ = 1 and m(ρ, δ) = 1, so the similarity reduces to S(F_ρ, G_δ) = M(F, G).

Example 2. In this example, U = {x₁, x₂, x₃, x₄} and E = {e₁, e₂, e₃}. Let F_ρ and G_δ be two generalized fuzzy soft sets over the parameterized universe (U, E).
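The Majumdar-Samanta-style similarity between two fuzzy soft sets (with ρ = δ = 1, so only M(F, G) matters) can be sketched as follows. The per-parameter formula Mᵢ(F, G) = 1 − Σ|f − g| / Σ(f + g) is taken from the cited literature [20]; the membership matrices are illustrative.

```python
def param_similarity(f_i, g_i):
    """M_i(F, G) = 1 - sum|f - g| / sum(f + g) for one parameter e_i,
    over the membership values of all objects."""
    num = sum(abs(f - g) for f, g in zip(f_i, g_i))
    den = sum(f + g for f, g in zip(f_i, g_i))
    return 1.0 if den == 0 else 1.0 - num / den

def similarity(F, G):
    """M(F, G) = max_i M_i(F, G) over all parameters.
    F and G are parameter-by-object membership matrices."""
    return max(param_similarity(f_i, g_i) for f_i, g_i in zip(F, G))

F = [[0.2, 0.8, 0.5], [0.1, 0.4, 0.9]]  # rows: parameters, columns: objects
G = [[0.3, 0.7, 0.5], [0.2, 0.5, 0.8]]
print(similarity(F, G))  # approximately 0.9333
```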

Distance Measurement
In this study, the fuzzy soft set was calculated based on the normalized Hamming distance [25]. We assume that the fuzzy soft sets (F, A) and (G, B) have the same set of parameters, i.e., A = B. The normalized Hamming distance and the normalized Euclidean distance in the Fuzzy Soft Set (FSS) setting are given by Eqs. (13) and (14):

d_H(F, G) = (1 / (mn)) Σᵢ₌₁ᵐ Σⱼ₌₁ⁿ |F(eᵢ)(xⱼ) − G(eᵢ)(xⱼ)|,    (13)

d_E(F, G) = [ (1 / (mn)) Σᵢ₌₁ᵐ Σⱼ₌₁ⁿ (F(eᵢ)(xⱼ) − G(eᵢ)(xⱼ))² ]^{1/2},    (14)

where m is the number of parameters and n is the number of objects.
Example 3. As in Roy and Maji [12], let U = {u₁, u₂, u₃} be a set with parameters A = {a₁, a₂, a₃}. Two FSS (G, A) and (H, A) are represented by Tabs. 2 and 3, respectively.
From Eq. (14), the normalized Euclidean distance between (G, A) and (H, A) can then be computed.
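The two distances of Eqs. (13) and (14) can be sketched directly over parameter-by-object membership matrices; the matrices below are illustrative, not the entries of Tabs. 2 and 3.

```python
import math

def hamming_distance(F, G):
    """Normalized Hamming distance, Eq. (13): mean absolute
    membership difference over all m parameters and n objects."""
    m, n = len(F), len(F[0])
    total = sum(abs(f - g) for f_i, g_i in zip(F, G) for f, g in zip(f_i, g_i))
    return total / (m * n)

def euclidean_distance(F, G):
    """Normalized Euclidean distance, Eq. (14): root mean squared
    membership difference over all m parameters and n objects."""
    m, n = len(F), len(F[0])
    total = sum((f - g) ** 2 for f_i, g_i in zip(F, G) for f, g in zip(f_i, g_i))
    return math.sqrt(total / (m * n))

F = [[0.2, 0.8], [0.4, 0.6]]  # rows: parameters, columns: objects
G = [[0.4, 0.8], [0.1, 0.6]]
print(hamming_distance(F, G))    # approximately 0.125
print(euclidean_distance(F, G))  # approximately 0.1803
```

Both values lie in [0, 1] because every membership difference does, which is what makes them directly usable as (dis)similarity scores.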

Discussion
In this section, the proposed approach and experimental results of the Fuzzy Soft Set Classifier (FSSC) using the normalized Euclidean distance are discussed.

Proposed Approach
This study proposes a new classification algorithm based on the fuzzy soft set, which we call the Fuzzy Soft Set Classifier (FSSC). The algorithm uses a normalized Euclidean distance-based similarity between two fuzzy soft sets to classify unlabeled data. Before the training and classification steps, fuzzification is performed and a fuzzy soft set is constructed.

Training Step
The goal of training the algorithm is to determine the center of each existing class.
Let U = {u₁, u₂, …, u_N}, let E be the set of parameters, A ⊆ E, and A = {eᵢ; i = 1, 2, …, M}. There are k classes with n_r samples in each class, where r = 1, 2, …, k and Σ_{r=1}^{k} n_r = N. Let C_r ⊆ U be the data of class r, and let Γ_{C_r} be the fuzzy soft set of the class-r data. The center set of class C_r, denoted Γ_{PC_r}, is defined as in Eq. (17):

γ_{PC_r}(eᵢ) = (1 / n_r) Σ_{u ∈ C_r} γ_{C_r}(eᵢ)(u),   i = 1, 2, …, M.    (17)
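Assuming the class center of Eq. (17) is the parameter-wise mean of the membership values of the class members (a common choice in the FSSC literature), the training step can be sketched as:

```python
def class_center(samples):
    """Center of one class: parameter-wise mean of the membership
    vectors of its samples (a sketch of Eq. (17))."""
    n = len(samples)
    m = len(samples[0])
    return [sum(s[i] for s in samples) / n for i in range(m)]

# Illustrative fuzzified samples of one class C_r (rows: samples).
C_r = [[0.25, 0.75], [0.75, 0.25]]
print(class_center(C_r))  # -> [0.5, 0.5]
```

Training then amounts to computing one such center vector per class, which is all the classifier retains from the training data.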

Classification Step
In the classification step, the class centers obtained in the training step are used to determine the class of new data; that is, the similarity between two fuzzy soft sets, the class-center vector Γ_{PC_r} and the new data Γ_G, is measured.

Thus, the similarity measure becomes S*(Γ_{PC_r}, Γ_G). After the similarity value for each class has been obtained, the algorithm assigns to the new data Γ_G the label of the class with the maximum similarity.
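A minimal sketch of this decision rule, assuming S* is taken as 1 minus the normalized Euclidean distance between the class center and the new sample (the center vectors and sample below are illustrative):

```python
import math

def euclid_similarity(center, sample):
    """S* = 1 - d_E: similarity from the normalized Euclidean distance."""
    n = len(center)
    d = math.sqrt(sum((c - s) ** 2 for c, s in zip(center, sample)) / n)
    return 1.0 - d

def classify(centers, sample):
    """Assign the label of the class center with maximum similarity."""
    return max(centers, key=lambda label: euclid_similarity(centers[label], sample))

centers = {"A": [0.9, 0.1], "B": [0.2, 0.8]}  # one center vector per class
print(classify(centers, [0.85, 0.2]))  # -> A
```

Because every membership value lies in [0, 1], the distance is bounded by 1 and the similarity stays in [0, 1], so the maximum is well defined.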

Experimental Results
We conducted experiments using the University of California (UCI) dataset to assess the accuracy of the proposed data classification method. The dataset samples were divided into training (75% of samples) and test (25% of samples) sets. Experiments were performed in MATLAB R2010a software. Figs. 1-4 show the classification results obtained by our fuzzy soft set method and other baseline techniques.
As seen in Fig. 1, the normalized Euclidean distance method yields the highest accuracy. Fig. 2 shows that the normalized Euclidean distance method obtains the second-highest precision; the highest precision is obtained by the comparison-table method in MATLAB. Fig. 3 shows that the normalized Euclidean distance method produces the highest recall, whereas Fig. 4 illustrates that it also has the highest computation time.
In order of computation time, from fastest to slowest, the measures are matching function, distance measure, similarity, and normalized Euclidean distance. Comparisons are shown in Tab. 4.

Conclusions
In this study, a new classification algorithm based on fuzzy soft set theory was proposed. Experimental results show that the normalized Euclidean distance method improves accuracy by up to 10.3436% and recall by up to 6.9723% compared to baseline techniques. We also find that all similarity measures considered in this paper are reasonable.
Funding Statement: The authors received no specific funding for this study.