A Direct Data-Cluster Analysis Method Based on Neutrosophic Set Implication

Raw data are classified using clustering techniques in a reasonable manner to create disjoint clusters. Many clustering algorithms based on specific parameters have been proposed to handle high-volume datasets. This paper focuses on cluster analysis based on neutrosophic set implication, i.e., a k-means algorithm combined with a threshold-based clustering technique. This algorithm addresses the shortcomings of the k-means clustering algorithm while overcoming the limitations of the threshold-based clustering algorithm. To evaluate the validity of the proposed method, several validity measures and validity indices are applied to the Iris dataset (from the University of California, Irvine, Machine Learning Repository) along with the k-means and threshold-based clustering algorithms. The proposed method produces more segregated datasets with more compact clusters, thus achieving higher validity indices, and eliminates the limitations of the threshold-based clustering algorithm.


Introduction
Today, data repositories have become the most favored systems; to name a few, we have relational databases, data mining systems, and temporal and transactional databases. However, due to the high volume of data in these repositories, prediction has become complex and difficult. Today's scenarios also indicate the diversity of these data (for example, from scientific to medical, geographic to demographic, and financial to marketing). The diversity and extensive volume of these data led to the emergence of the field of data mining in recent years [Hautamäki]. Clustering is an unsupervised learning problem: given a dataset, we are asked to infer structure within it, in this case the latent clusters or categories in the data. This differs from classification problems; although deep artificial neural networks are very good at classification, clustering is still a very open problem, because for clustering we lack labeled information. This is why data clustering is more complicated and challenging under unsupervised learning. A good example to illustrate this is predicting whether or not a patient has a common disease based on a list of symptoms. Many researchers [Boley, Gini, Gross et al. (1999); Arthur and Vassilvitskii (2007); Cheung (2003); Fahim, Salem, Torkey et al. (2006); Khan and Ahmad (2017)] proposed partitioning-based methodologies, such as k-means, edge-based strategies, and their variants. The k-means strategy is perhaps the most widely used clustering algorithm, being an iterative process that divides a given dataset into k disjoint groups. Jain [Jain (2010)] presented a study that indicated the importance of the widely accepted k-means technique.
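As a minimal sketch of the iterative k-means process just described (our own illustrative NumPy implementation, not the exact variant evaluated later in the paper):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means: k random data objects seed the centroids, then
    assignment and centroid-update steps alternate until convergence."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every object to its nearest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # final assignment against the returned centroids
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    return labels, centroids
```

Each iteration assigns every object to its nearest centroid and then recomputes each centroid as the mean of its members, stopping when the centroids no longer move.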
Many researchers have proposed variations of partitioning algorithms to improve the efficiency of clustering algorithms [Celebi, Kingravi and Vela (2013); Erisoglu, Calis and Sakallioglu (2011); Reddy and Jana (2012)]. Finding the optimal solution with a k-means algorithm is NP-hard, even when the number of clusters is small [Aloise, Deshpande, Hansen et al. (2009)]; therefore, a k-means algorithm finds a local minimum as an approximate optimal solution. Nayini et al. [Nayini, Geravand and Maroosi (2018)] overcame k-means weaknesses by using a threshold-based clustering method. That work proposed a partitioning-based method that automatically generates clusters by accepting a constant threshold value as an input. The authors used similarity and threshold measures for clustering to help users identify the number of clusters; they also identified outlier data and decreased their negative impact on clustering. The time complexity of this algorithm is O(nk), which is better than k-means [Mittal, Sharma and Singh (2014)]. In this algorithm, instead of providing initial centroids, only one centroid is taken, which is one of the data objects; afterwards, the formation of a new cluster depends upon the distance between the existing centroids and the next randomly selected data object. Even on the same dataset, clustering algorithms' results can differ from one another, particularly the results from the k-means and edge-based techniques. Halkidi et al. proposed quality scheme assessment and clustering validation techniques [Halkidi, Batistakis and Vazirgiannis (2000); Halkidi, Batistakis and Vazirgiannis (2001)]. Clustering algorithms produce different partitions for different values of the input parameters; the scheme selects the best clustering scheme, i.e., the best number of clusters for a specific dataset, based on the defined quality index.
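The single-pass, threshold-based scheme described above can be sketched as follows (an illustrative reconstruction; the cited algorithm's exact rules may differ):

```python
import numpy as np

def threshold_cluster(X, threshold, seed=0):
    """One randomly chosen object seeds the first cluster; each later object joins
    its nearest cluster if within `threshold` of that centroid, else opens a new one."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    centroids = [X[order[0]]]
    labels = np.empty(len(X), dtype=int)
    labels[order[0]] = 0
    for idx in order[1:]:
        d = np.linalg.norm(np.asarray(centroids) - X[idx], axis=1)
        j = int(np.argmin(d))
        if d[j] <= threshold:
            labels[idx] = j            # close enough: join the nearest existing cluster
        else:
            centroids.append(X[idx])   # too far from every centroid: start a new cluster
            labels[idx] = len(centroids) - 1
    return labels, np.asarray(centroids)
```

Note that the number of clusters emerges from the threshold rather than being fixed in advance, which is the property the method uses to sidestep choosing k.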
The quality index validates and assures good candidate estimation based on separation and compactness, the two components contained in a quality index. The Davies-Bouldin index (DBI) was proposed for cluster validation [Davies and Bouldin (1979)]; this validity index is, in fact, a ratio of compactness to separation. In this internal evaluation scheme, validation is done by evaluating quantities and features inherent in the dataset. Yeoh et al. [Yeoh, Caraffini and Homapour (2019)] proposed a unique optimized stream (OpStream) clustering algorithm with three variants; the variants were built from different optimization algorithms, and the best variant was chosen to analyze robustness and resiliency. Uluçay et al. [Uluçay and Şahin (2019)] proposed an algebraic structure of neutrosophic multisets that allows membership sequences, each a set of real values between 0 and 1; their proposed neutrosophic multigroup works with neutrosophic multiset theory, set theory, and group theory. Various methods and applications of the k-means algorithm for clustering have been worked out recently. Wang et al. [Wang, Gittens and Mahoney (2019)] identify and extract a more varied collection of cluster structures than the linear k-means clustering algorithm; however, kernel k-means clustering is computationally expensive when the non-linear feature map is high-dimensional and there are many input points. On the other hand, Jha et al. used a different clustering technique to address stock market prediction, combining rigorous machine learning approaches with clustering of high-volume data. Applications of hierarchical (ward, single, average, centroid, and complete linkage) and k-means clustering techniques have also been studied on almost 40 years of air pollution data.
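The compactness-to-separation ratio behind the Davies-Bouldin index can be computed directly; a minimal sketch (Euclidean distances, centroid-based scatter):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: for each cluster, take the worst (S_i + S_j) / d(c_i, c_j)
    ratio against every other cluster, then average; lower values are better."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # S_i: average distance of a cluster's members to its centroid (compactness)
    scatter = np.array([np.mean(np.linalg.norm(X[labels == c] - centroids[i], axis=1))
                        for i, c in enumerate(clusters)])
    k = len(clusters)
    ratios = [max((scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i)
              for i in range(k)]
    return float(np.mean(ratios))
```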

Neutrosophic basics and definitions
In this section, we proceed with fundamental definitions of neutrosophic theory, which include truth (T), indeterminacy (I), and falsehood (F). The degrees of T, I, and F are evaluated with their respective membership functions. The respective derivations are explained below.

Definitions in the neutrosophic set
Let S be a space of objects with generic element s ∈ S. A neutrosophic set (NS), N, in S is characterized by a truth membership function TN(s), an indeterminacy membership function IN(s), and a falsity membership function FN(s), where, for each s ∈ S, TN(s), IN(s), FN(s) ∈ [0, 1] and 0 ≤ TN(s) + IN(s) + FN(s) ≤ 3.
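As an illustration, a single-valued neutrosophic value can be represented as a (T, I, F) triple whose components lie in [0, 1] (the class name is our own):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NSTriple:
    t: float  # truth membership TN(s)
    i: float  # indeterminacy membership IN(s)
    f: float  # falsity membership FN(s)

    def __post_init__(self):
        if not all(0.0 <= v <= 1.0 for v in (self.t, self.i, self.f)):
            raise ValueError("each of T, I, F must lie in [0, 1]")
        if not 0.0 <= self.t + self.i + self.f <= 3.0:
            raise ValueError("T + I + F must lie in [0, 3]")
```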

Definition of the states of a set: former and latter
Let us assume that A and B are two ordinary subsets joined by an ordinary relation R ⊆ A × B; A is called the former set, and B is called the latter set.

Definition of neutrosophic algebraic products: triangle product and square product
Let us assume that R1 ⊆ V1 × V2 and R2 ⊆ V2 × V3 are two relations. The triangle product R1 ▹ R2 can be defined, for each (x, z) ∈ V1 × V3, as (R1 ▹ R2)(x, z) = inf y∈V2 [R1(x, y) → R2(y, z)]. Correspondingly, R1 □ R2, a square product, can be defined as (R1 □ R2)(x, z) = inf y∈V2 [R1(x, y) ↔ R2(y, z)].

Definition of neutrosophic implication operators
If α is a binary operation on [0, 1] that is non-increasing in its first argument and non-decreasing in its second, then α can serve as an implication operator on [0, 1].

Definition of generalized neutrosophic products
Let us extend the Lukasiewicz implication operator to a neutrosophic valued environment.
If we consider only the membership degrees of µ and ν, the operator is unable to reflect the dominance of the neutrosophic environment; therefore, we also consider the indeterminacy degrees Iµ, Iν and the non-membership degrees Fµ, Fν, and define a generalized product Φ(µ, ν) accordingly. The value of Φ(µ, ν) satisfies the conditions of the neutrosophic valued environment; in fact, this follows from Eq. (4). Along with the neutrosophic Lukasiewicz implication, the square product, and the traditional triangle product, we introduce the neutrosophic triangle product and the neutrosophic square product as follows.
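For reference, the classical Lukasiewicz implication on [0, 1] that is being extended here:

```python
def lukasiewicz(a: float, b: float) -> float:
    """Classical Lukasiewicz implication: I_L(a, b) = min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)
```

The neutrosophic extension applies this kind of operator while also accounting for the indeterminacy and non-membership degrees.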

Definitions of neutrosophic relations
Neutrosophic relations build on conventional arithmetic, algebraic, and geometric theories, which are used in dealing with various real-time engineering problems. Neutrosophic relations also relate various neutrosophic sets.
Indeed, the neutrosophic triangle product and the neutrosophic square product, defined for any pair of neutrosophic relations, are firmly related to each other: the neutrosophic triangle product is the basis of the neutrosophic square product.

Applications of the two neutrosophic products
In this subsection, we use the neutrosophic triangle product to compare multi-attribute decision making with neutrosophic information. Subsequently, we use the neutrosophic square product to construct a neutrosophic similarity matrix, which is used for analyzing the neutrosophic clustering method.
Assume a multiple attribute decision making problem in which each alternative is evaluated on a set of attributes by neutrosophic values; the degree of uncertainty of each evaluation is carried by its indeterminacy component.

Neutrosophic triangle product's application
The characteristic vectors of two alternatives for the problems described above, say Si and Sj, can be compared using the neutrosophic triangle product. This yields the degree to which alternative wi is preferred to alternative wj. The ordering of the alternatives wi and wj can then be obtained from Eqs. (14) and (15).

Neutrosophic square product's application
As we know from Eq. (10), the neutrosophic square product (S1 × S2)ij can be interpreted as follows: (S1 × S2)ij measures the degree of similarity between the i-th row of neutrosophic matrix S1 and the j-th row of neutrosophic matrix S2. Therefore, considering the issue expressed at the start of Section 4, this product expresses the similarity of alternatives wi and wj. Eq. (16) can be used for constructing a neutrosophic similarity matrix, and it has several desirable properties. From the above analysis, we can determine that Eq. (16) satisfies the neutrosophic similarity relation conditions; thus, it can be used to construct a neutrosophic similarity matrix.
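A sketch of building a symmetric neutrosophic similarity matrix; the component-wise measure below is an illustrative choice and not necessarily the paper's Eq. (16):

```python
import numpy as np

def similarity(p, q):
    """Similarity of two neutrosophic values (T, I, F):
    1 minus the mean absolute difference of the three components."""
    return 1.0 - sum(abs(a - b) for a, b in zip(p, q)) / 3.0

def similarity_matrix(alts):
    """Pairwise similarity matrix over a list of (T, I, F) triples;
    symmetric, with 1.0 on the diagonal."""
    n = len(alts)
    M = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = similarity(alts[i], alts[j])
    return M
```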

Direct neutrosophic cluster analysis method
After constructing a neutrosophic similarity matrix with the abovementioned method, an equivalent matrix is not required before cluster analysis; the required cluster analysis results can be obtained directly from the neutrosophic similarity matrix. In fact, Luo [Luo (1989)] proposed a direct method for clustering fuzzy sets that considers only the membership degrees of the fuzzy sets. Our proposed direct neutrosophic cluster analysis technique considers the membership degrees, indeterminacy degrees, and non-membership degrees of the neutrosophic valued set under the neutrosophic conditions presented below. The proposed method is based on Luo's method and proceeds in stages. In the final stage (Stage C), we take further confidence levels and analyze clusters according to the procedure of the preceding stage (Stage B), until all alternatives are clustered into one category. One significant advantage of the proposed direct neutrosophic cluster analysis method is that cluster analysis can be performed simply by relying on the subscripts of the alternatives. As the process described above shows, obtaining even a λ-cutting matrix is not necessary in this method. In real-world application scenarios, we simply need to confirm the alternatives' positions in the neutrosophic similarity matrix after choosing some appropriate confidence levels, and afterward, we can obtain the categories of the considered objects on the basis of their position subscripts.
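The confidence-level procedure can be sketched as follows: for a given level λ, alternatives are grouped directly from the similarity matrix, without computing an equivalent matrix or a λ-cutting matrix (union-find merging is our illustrative mechanism, not the paper's exact bookkeeping):

```python
def lambda_clusters(M, lam):
    """Cluster indices 0..n-1 by merging every pair (i, j) with M[i][j] >= lam;
    lowering lam merges clusters until all alternatives form one category."""
    n = len(M)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if M[i][j] >= lam:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Running this for a decreasing sequence of confidence levels reproduces the stage-wise merging until a single category remains.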

Performance evaluation
For the performance evaluation, a k-means algorithm and a threshold-based algorithm were run on the Iris dataset from the University of California, Irvine (UCI) Machine Learning Repository. A variable number of clusters (from 2 to 10) was generated for the experiments. For the k-means and threshold-based algorithms, the cluster number is the input parameter. In the k-means algorithm, k data objects were selected randomly and taken as the initial centroids of the clusters. In the threshold-based method, on the other hand, only one object was selected randomly; the selected object was assigned as the initial centroid of the cluster and was a member of the first cluster. We observed that the proposed method generates more segregated and compact clusters, and that there was significant enhancement in the validity indices. The following analysis supports these statements. For any cluster-based intuitionistic neutrosophic implication, let X(Ti, Fa) → Y(Tj, Fb), where T and F depict truthfulness and falsehood. Then, we can define various classes of cluster-based neutrosophic set (CNSS) implications. The proposed new cluster-based intuitionistic neutrosophic (CIN) implication extends this as follows: Ti → Tj is any cluster of intuitionistic neutrosophic implications, combined with any neutrosophic conjunction (∧) and any fuzzy disjunction (∨). Referring to the definition proposed by Broumi et al. [Broumi, Smarandache and Dhar (2014)], the classical logical equivalence and predicate relationship now becomes (X → Y) ↔ (¬X ∨ Y). The above class of neutrosophic implications can therefore be depicted with the operator ¬X ∨ Y. Let us have two cluster-based neutrosophic propositions: X〈0.3, 0.4, 0.2〉 and Y〈0.7, 0.1, 0.4〉.
Then, X → Y has the neutrosophic truth value of ¬X ∨ Y, i.e., 〈0.2, 0.4, 0.3〉 ∨ 〈0.7, 0.1, 0.4〉, or 〈max{0.2, 0.7}, min{0.4, 0.1}, min{0.3, 0.4}〉, or 〈0.7, 0.1, 0.3〉. Here, ¬〈T, I, F〉 = 〈F, I, T〉 for neutrosophic negation, and 〈T1, I1, F1〉 ∨ 〈T2, I2, F2〉 = 〈max{T1, T2}, min{I1, I2}, min{F1, F2}〉 for neutrosophic disjunction. The dataset that we referred to from Stappers et al. [Stappers, Cooper, Brooke et al. (2016)] and [Systems (2020)] contains 16,259 spurious examples caused by radio frequency interference (RFI)/noise, and 1,639 real pulsar examples, with each candidate having eight continuous variables. The first four variables are obtained from the integrated pulse profile, an array of continuous variables that describe a longitude-resolved version of the signal. The remaining four variables are similarly obtained from the dispersion measure (DM)-SNR curve. These are summarized in Tab. 2, which describes a sample of pulsar candidates collected during the high time-resolution universe survey. The first column is the mean of the integrated profile. Mean1 is the mean of the DM-SNR curve, SD1 is the standard deviation of the DM-SNR curve, ET1 is the excess kurtosis of the DM-SNR curve, and Skewness1 is the skewness of the DM-SNR curve. In Tab. 2, the mean of the integrated profile is compared with Mean1, which varies significantly across pulsar candidates.
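The negation, disjunction, and implication used in the worked example above can be sketched as plain operations on (T, I, F) tuples:

```python
def neg(p):
    """Neutrosophic negation: not <T, I, F> = <F, I, T>."""
    t, i, f = p
    return (f, i, t)

def disj(p, q):
    """Neutrosophic disjunction: <max T, min I, min F> component-wise."""
    return (max(p[0], q[0]), min(p[1], q[1]), min(p[2], q[2]))

def implies(p, q):
    """X -> Y evaluated as (not X) or Y."""
    return disj(neg(p), q)

X = (0.3, 0.4, 0.2)
Y = (0.7, 0.1, 0.4)
print(implies(X, Y))  # (0.7, 0.1, 0.3), matching the example in the text
```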

Conclusions and future work
One of the major issues in data clustering is the selection of the right candidates. In addition, choosing the appropriate algorithm for the right candidates has been a challenging issue in cluster analysis, especially finding an efficient approach that best fits the given data. In this paper, a cluster analysis method based on neutrosophic set implication generates clusters automatically and overcomes the limitations of the k-means algorithm. Our proposed method generates more segregated and compact clusters and achieves higher validity indices in comparison with the mentioned algorithms. The experimentation carried out in this work focused on cluster analysis based on NSI through a k-means algorithm along with a threshold-based clustering technique. We found that the proposed algorithm eliminates the limitations of the threshold-based clustering algorithm. The validity measures and respective indices applied to the Iris dataset, along with the k-means and threshold-based clustering algorithms, prove the effectiveness of our method. Future work will handle data clustering in various dynamic domains using neutrosophic theory. We also intend to apply a periodic search routine by using propagations between datasets of various domains. The data clustering performed by our proposed algorithms was found to be workable on a low computational configuration. In the future, we will also evaluate more datasets.
Funding Statement: The work of Gyanendra Prasad Joshi is supported by a Sejong University new faculty research grant.

Conflicts of Interest:
The authors declare no conflicts of interest.