1 Introduction

Twitter is one of the platforms most commonly used by social users around the world to share opinions and experiences through tweets. Health care data are an emerging need for society, and the analysis of health-related tweets must be automated to identify major health problems in society. Health-care tweet data are usually extensive and need to be assessed to extract knowledge about significant health problems (or health clusters). This is the crucial motivation for addressing the health cluster tendency problem. Visual techniques, such as VAT [1], cVAT [2], and MVCS-VAT [3], can be used to assess the number of clusters in tweet health data (or social health data). Popular topic models, including nonnegative matrix factorization (NMF) [4], latent semantic indexing (LSI) [5], probabilistic LSI (PLSI) [6], and latent Dirichlet allocation (LDA) [7], are used to extract the topic features of tweet data. The topic-tweet document matrix is created by applying the topic models to the set of tweet documents. An alternative is the TF-IDF matrix [8], which describes tweet document features based on term analysis. Tweet document analysis using topics is more practical than using the TF-IDF matrix because the TF-IDF matrix suffers from data sparsity.

The topic-document matrix (TDM) is the most recommended representation in text clustering applications [9] [25]. In the VAT, dissimilarity features are derived using the Euclidean distance. In the cVAT, they are derived using the cosine distance metric. In the majority of text clustering applications [10] [23] [26], the authors showed that cosine-based cluster assessment is more informative than assessment with the standard Euclidean distance. In the cVAT, the cosine-based similarity is measured using a single reference viewpoint, i.e., the origin. The MVCS-VAT [3] extends the cVAT by deriving the cosine-based similarity values from multiple viewpoints, which is more accurate than the single viewpoint approach of the cVAT. Cluster assessment based on multiviewpoint cosine similarity values is therefore better justified than assessment based on a single viewpoint. The recent MVCS-VAT method assesses the clusters of health data very well [27] [31]. Each cluster represents a health cluster, that is, a group of tweets that discuss the same health topic. The tweets are categorized into health clusters based on the similarities among tweet documents. The problem with the MVCS-VAT is that it requires more computational time and memory space because it assesses health clusters using multiple viewpoints. For example, the similarity between two tweet documents t1 and t2 among n documents is computed using n-2 viewpoints: every tweet among the n tweets except t1 and t2 is taken as a viewpoint. The cosine similarity between the two tweet documents is computed for each of the n-2 viewpoints. This computation is repeated for all n(n-1)/2 document pairs. Thus, the total number of cosine similarity computations is n(n-1)(n-2)/2.
Therefore, the MVCS-VAT is an expensive cluster assessment model for a large number of tweet documents. The proposed work extends the MVCS-VAT [28] with an effective sampling procedure. The proposed sampling-based MVCS-VAT (S-MVCS-VAT) algorithm uses a constant number of sample viewpoints instead of all n-2 viewpoints. The algorithm and experimental details are presented in the following sections.
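The cost argument above can be checked with a few lines of arithmetic. The following minimal sketch counts cosine similarity evaluations for the full multiviewpoint scheme and for a scheme with a fixed number of sample viewpoints. The function names and the sample size m = 50 are illustrative assumptions and do not come from the paper.

```python
def full_viewpoint_cost(n: int) -> int:
    """Cosine evaluations when every other document serves as a viewpoint."""
    pairs = n * (n - 1) // 2      # unordered document pairs
    return pairs * (n - 2)        # n-2 viewpoints per pair


def sampled_viewpoint_cost(n: int, m: int) -> int:
    """Cosine evaluations with at most m sample viewpoints per pair.

    m is a hypothetical constant sample size chosen for illustration.
    """
    pairs = n * (n - 1) // 2
    return pairs * min(m, n - 2)


if __name__ == "__main__":
    for n in (5, 1000, 100000):
        print(n, full_viewpoint_cost(n), sampled_viewpoint_cost(n, 50))
```

With a constant m, the count grows quadratically in n instead of cubically, which is the saving that the sampling idea relies on.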

The key contributions of the paper are summarized as follows:

  1. Health clusters from big social data are assessed.

  2. A sampling-based visual technique for determining the health clusters in a visual form is proposed.

  3. Crisp partitions are derived from the visual images from the proposed S-MVCS-VAT.

  4. Significant social health data cluster results are derived.

  5. The performance of visual techniques for social and benchmark health data is empirically demonstrated.

The remaining sections are organized as follows: Sect. 2 presents the literature on visual techniques for precluster assessment; Sect. 3 introduces the proposed sampling-based MVCS-VAT; Sect. 4 illustrates the experimental study; and, finally, Sect. 5 provides the conclusion and future scope of the work.

2 Literature of visual techniques for precluster assessment

Popular clustering methods, such as k-means [11] and hierarchical clustering, are widely used in clustering-related applications [12]. The data clustering process depends on two crucial steps: determining the number of clusters and partitioning the data. Determining the number of clusters is known as the cluster tendency problem. Social health data are the health-related opinions or views that social users tweet on Twitter. Finding the clusters of social health data based on health topics is known as finding the health cluster tendency [29]. The preassessment of the number of health topics in social data is a challenging problem. With this motivation, several visual techniques for the precluster assessment of social health data are surveyed. Bezdek et al. [1] proposed a basic model, namely the visual assessment of (cluster) tendency (VAT), for determining the number of clusters in numerical data. Its algorithm is shown below.

figure a

Because the VAT works only on numerical data, social data are first preprocessed into the topic-document matrix using various topic models [13]. This is a better representation of social data than the TF-IDF matrix. Four topic models, latent Dirichlet allocation (LDA), latent semantic indexing (LSI), probabilistic latent semantic indexing (PLSI), and nonnegative matrix factorization (NMF), are recommended in text clustering-related applications. These models convert the social data into a numeric topic-document matrix, which gives the social health data a numeric representation. In the VAT [14], the Euclidean distance matrix is computed from the social health topic-document matrix to obtain the dissimilarity features. The reordered dissimilarity matrix (RDM) [15] is derived according to the steps of the VAT and is then displayed as an image. The number of health clusters (or health cluster tendency) is obtained by counting the square-shaped dark colored blocks along the diagonal of the RDM image (also known as the VAT image). The cosine metric compares the directions of two vectors and is insensitive to their magnitudes, whereas the Euclidean distance also depends on the magnitudes. Therefore, in text clustering applications, cosine-based cluster assessment is more successful than Euclidean distance-based assessment. Based on the cosine metric, another visual technique, the cosine-based VAT (cVAT), was developed in [12] for the precluster assessment problem.
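The reordering step described above can be sketched in plain Python. This is a minimal sketch of the commonly described VAT ordering (start at an object involved in the largest dissimilarity, then repeatedly append the unordered object closest to the ordered set). It is an illustration under that assumption, not a transcription of the algorithm figure, and all function names are hypothetical. Passing `euclidean` gives a VAT-style matrix, and passing `cosine_distance` gives a cVAT-style matrix.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0  # convention chosen here for zero vectors
    return 1.0 - dot / (nu * nv)


def dissimilarity_matrix(rows, dist):
    n = len(rows)
    return [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def vat_order(R):
    """Return the VAT ordering of the objects of dissimilarity matrix R."""
    n = len(R)
    # Start from an object that takes part in the largest dissimilarity.
    start = max(range(n), key=lambda i: max(R[i]))
    order = [start]
    remaining = set(range(n)) - {start}
    # nearest[q]: distance from q to the closest already ordered object.
    nearest = {q: R[start][q] for q in remaining}
    while remaining:
        j = min(remaining, key=lambda q: (nearest[q], q))
        order.append(j)
        remaining.remove(j)
        for q in remaining:
            nearest[q] = min(nearest[q], R[j][q])
    return order


def reorder(R, order):
    """Build the reordered dissimilarity matrix (RDM)."""
    return [[R[i][j] for j in order] for i in order]


if __name__ == "__main__":
    # Toy topic-document rows: two documents per topic.
    rows = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
    R = dissimilarity_matrix(rows, cosine_distance)
    order = vat_order(R)
    print(order)
    for line in reorder(R, order):
        print(["%.2f" % x for x in line])
```

In the printed RDM, documents that share a topic end up adjacent, so small values gather in square blocks along the diagonal. These are the dark blocks that are counted in the VAT image.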

In the cVAT, the similarity (or dissimilarity) features between two data objects are derived using a single viewpoint, i.e., the origin. Similarity features computed from a single viewpoint provide a less informative assessment. Thus, multiple viewpoints are used in later visual techniques, such as the multiviewpoint-based cosine similarity VAT (MVCS-VAT) [3]. The MVCS-VAT is the most recommended visual technique for acquiring accurate similarity features because it uses multiple viewpoints instead of a single viewpoint. For n tweet documents, the MVCS-VAT needs n-2 viewpoint computations to find the cosine-based similarity between any two tweet documents. The average of the n-2 similarity values, one per viewpoint, is taken as the similarity feature between the two tweet documents. This method is the most accurate for visualizing the number of clusters in a set of n tweet documents [30]. The similarity feature computation between any two documents in a set of five tweet documents is shown in Fig. 1.

Fig. 1
figure 1

Sampling viewpoints using cosine similarity
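The multiviewpoint computation for five documents can be sketched as follows. This sketch assumes that the similarity of documents i and j seen from viewpoint h is the cosine of the angle between the difference vectors (d_i - d_h) and (d_j - d_h), and that the final value is the plain average over the n-2 viewpoints. The exact formula used in [3] may differ, and the function names and toy topic vectors are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (nu * nv)


def mvcs_similarity(docs, i, j, viewpoints=None):
    """Average cosine similarity of docs i and j over the given viewpoints.

    With viewpoints=None, every document except i and j is used,
    which gives the n-2 viewpoints of the full multiviewpoint scheme.
    """
    if viewpoints is None:
        viewpoints = [h for h in range(len(docs)) if h not in (i, j)]
    if not viewpoints:
        raise ValueError("at least one viewpoint is required")
    total = 0.0
    for h in viewpoints:
        u = [a - b for a, b in zip(docs[i], docs[h])]
        v = [a - b for a, b in zip(docs[j], docs[h])]
        total += cosine(u, v)
    return total / len(viewpoints)


if __name__ == "__main__":
    # Five toy documents described by three topic weights each.
    docs = [
        [0.8, 0.1, 0.1],
        [0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.1, 0.7, 0.2],
        [0.1, 0.1, 0.8],
    ]
    print(mvcs_similarity(docs, 0, 1))  # same dominant topic
    print(mvcs_similarity(docs, 0, 2))  # different dominant topics
```

For the pair (0, 1), the three viewpoints are documents 2, 3, and 4, which matches the n-2 = 3 viewpoints available in a five-document set.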

The key limitation of the MVCS-VAT is that it demands more computational time and memory for finding social data clustering results from a large set of tweet documents. The proposed method presents a sampling-based MVCS-VAT for the scalable computation of social health data clustering results.

The proposed work finds the similarity features between the tweet documents with respect to sample viewpoints instead of all n-2 viewpoints. Social data are enormous; thus, the proposed sampling idea reduces the time and memory requirements of finding health cluster tendencies. This optimized approach to finding the health cluster tendency from social data is derived in the next section.

3 Proposed sampling-based MVCS-VAT (S-MVCS-VAT)

The clustering of social data (tweet health data) depends on the similarity features of data objects. Cosine-based similarity features are very successful in text data clustering applications. In the cVAT, the similarity features are derived with respect to a single reference viewpoint, the origin. The MVCS-VAT uses multiple viewpoints and therefore finds more accurate similarity features among the tweet documents than a single reference viewpoint does. Because the MVCS-VAT is expensive, the proposed work uses sample viewpoints to obtain quality social health data clustering results. Algorithm 1 illustrates the procedural steps of the proposed work.

figure b

The proposed algorithm uses topic models, such as LDA, LSI, PLSI, and NMF, to extract the features of health tweets in topic-document matrix form, which reduces the sparsity problem of tweet data. The topic-document matrix is then converted into a bag-of-features representation of the tweet data. The features of tweets are denoted in the vector representation {TF1,TF2,…TFN}.

A tweet document feature TFr is selected at random, the distances between TFr and {TF1,TF2,…TFN} are computed, and the distances are saved into ‘Dist_Start.’ The tweet document at the maximum distance is determined using the argmax function, and its document number is saved into the variable ‘index.’ These operations form Step 1 and Step 2. Next, in Step 3, the distance array DistI is updated according to the tweet documents explored so far. The tweet document with the largest deviation is again selected by applying the argmax function to DistI, and the index it returns is another centroid of the tweet dataset. The same steps are repeated until the expected number of cluster centroids is found. After the centroids are selected, each remaining tweet document is assigned to its nearest centroid based on the distances measured in Step 4. The distances are measured using the cosine distance metric. The number of sample viewpoints is determined by a sampling percentage s, and this percentage of samples is drawn equally from every cluster (except clusters TF1 and TF2). The similarity features with respect to the sample viewpoints are then computed, as given by the C_MVCS computational statement. Dissimilarity values are stored in DM, and normalized matrix values are stored in NormS.
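The centroid selection, assignment, and per-cluster sampling described above can be sketched as a farthest-first (maximin) procedure. This is a minimal sketch under several assumptions that the text does not settle: the first centroid is the document farthest from the random start, DistI holds each document's distance to its closest selected centroid, and the sampling percentage s is rounded to at least one document per cluster. The names `dist_start` and `dist_i` follow 'Dist_Start' and DistI, and the function names are hypothetical.

```python
import math
import random


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0  # convention chosen here for zero vectors
    return 1.0 - dot / (nu * nv)


def maximin_centroids(features, k, seed=0):
    """Pick k centroid indices by farthest-first traversal."""
    rng = random.Random(seed)
    n = len(features)
    r = rng.randrange(n)  # random starting document
    dist_start = [cosine_distance(features[r], f) for f in features]
    index = max(range(n), key=dist_start.__getitem__)  # first centroid
    centroids = [index]
    # dist_i[i]: distance from document i to its closest centroid so far
    dist_i = [cosine_distance(features[index], f) for f in features]
    while len(centroids) < k:
        index = max(range(n), key=dist_i.__getitem__)
        centroids.append(index)
        for i in range(n):
            d = cosine_distance(features[index], features[i])
            dist_i[i] = min(dist_i[i], d)
    return centroids


def assign_to_centroids(features, centroids):
    """Group document indices by their nearest centroid."""
    clusters = {c: [] for c in centroids}
    for i, f in enumerate(features):
        nearest = min(centroids, key=lambda c: cosine_distance(features[c], f))
        clusters[nearest].append(i)
    return clusters


def sample_viewpoints(clusters, s, seed=0):
    """Draw about s percent of the documents from every cluster."""
    rng = random.Random(seed)
    sample = []
    for members in clusters.values():
        if not members:
            continue
        size = max(1, round(len(members) * s / 100))
        sample.extend(rng.sample(members, size))
    return sorted(sample)


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
                [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]]
    centroids = maximin_centroids(features, k=2)
    clusters = assign_to_centroids(features, centroids)
    print(centroids, clusters, sample_viewpoints(clusters, s=34))
```

Sampling from every cluster rather than from the whole collection keeps at least one viewpoint per group, so small health topics are still represented among the viewpoints.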

The reordered dissimilarity matrix is computed by applying the visual assessment of tendency (VAT) to NormS, as shown in Step 7. The RDM image is visualized to assess the number of clusters by counting the square-shaped dark colored blocks that appear along the diagonal. The crisp partitions of the RDM image give the predicted cluster labels of the health tweets, which yield the health data clustering results; these operations are Step 8 and Step 9.

In the proposed algorithm, the similarity for a pair of tweet documents is computed with respect to every sample viewpoint, and the average of the obtained values is taken as the similarity feature of the pair. The computation is less expensive because a sample of viewpoints is used instead of all viewpoints. This provides a considerable improvement in finding social data clustering results compared to the state-of-the-art visual topic models.
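The averaging over sample viewpoints can be sketched as a dissimilarity matrix computation. The sketch assumes that the dissimilarity is one minus the averaged cosine similarity of the difference vectors and that the normalization is a min-max rescaling of the off-diagonal entries. Neither choice is confirmed by the text, and the function names are hypothetical. The variables `dm` and `norm_s` mirror DM and NormS.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (nu * nv)


def diff(u, v):
    return [a - b for a, b in zip(u, v)]


def sampled_mvcs_matrix(docs, viewpoints):
    """Dissimilarity matrix using only the given sample viewpoints."""
    n = len(docs)
    dm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # A document is never used as a viewpoint for its own pair.
            vps = [h for h in viewpoints if h != i and h != j]
            if not vps:
                raise ValueError("no usable viewpoint for pair (%d, %d)" % (i, j))
            sims = [cosine(diff(docs[i], docs[h]), diff(docs[j], docs[h]))
                    for h in vps]
            dm[i][j] = dm[j][i] = 1.0 - sum(sims) / len(sims)
    return dm


def min_max_normalize(dm):
    """Rescale off-diagonal entries to [0, 1]; the diagonal stays 0."""
    n = len(dm)
    values = [dm[i][j] for i in range(n) for j in range(n) if i != j]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0.0:
        return [[0.0] * n for _ in range(n)]
    return [[0.0 if i == j else (dm[i][j] - lo) / span for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    docs = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1],
            [0.1, 0.7, 0.2], [0.1, 0.1, 0.8]]
    dm = sampled_mvcs_matrix(docs, viewpoints=[2, 3, 4])
    norm_s = min_max_normalize(dm)
    for row in norm_s:
        print(["%.2f" % x for x in row])
```

The inner loop runs over the sample viewpoints only, so its length is bounded by the sample size instead of n-2.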

In the recent MVCS-VAT technique, effective social health data clustering results are derived using all given viewpoints. For small datasets, the MVCS-VAT is very effective at determining the clustering tendency and the social data clustering results. However, the amount of social data is massive, so the MVCS-VAT must use many viewpoints, which demands large computational and memory costs. The MVCS-VAT is therefore suitable for finding social data clustering results but expensive for big social data. Our proposed S-MVCS-VAT uses a sampling scheme to perform scalable computations for big social data clustering. The experimental demonstrations are presented in the following section.

4 Experimental study

Tweet data [2] are collected on different health topics to assess health data clustering results. Each subset of data is created with specific health topics. Table 1 presents the details of the social health data in terms of the number of health topics [18], the names of the diseases, and the size of the datasets.

Table 1 Social health datasets topics description

Benchmark health datasets are retrieved based on the health keywords provided by TREC [16] [17], which are listed in the same table.

After extracting the tweet features in the form of a bag-of-features, various big social data visual clustering methods are tested in the experimental study. Three traditional visual methods, the VAT, cVAT, and MVCS-VAT, and the proposed S-MVCS-VAT are applied to the big social data. Both the S-MVCS-VAT and MVCS-VAT produce clearer visual images than the other visual methods. The notable improvement of the proposed method is that it derives health data clustering results faster than the MVCS-VAT.

The crisp partitions are derived based on the diagonal and nondiagonal pixel intensity values. The cluster labels of data objects are derived based on these cluster partitions, and the results are shown in Fig. 5b for three data topics.
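One simple way to turn a reordered dissimilarity matrix into cluster labels is sketched below. The text does not spell out the exact rule based on diagonal and nondiagonal intensities, so this sketch swaps in a different, commonly used stand-in: for each object in VAT order it takes the distance to the closest earlier object and cuts the ordering at the k-1 largest of these links. The function name and the toy matrix are hypothetical.

```python
def crisp_partition(rdm, k):
    """Label the objects of a VAT-ordered matrix with k contiguous clusters.

    rdm must already be in VAT order, and k is the number of dark blocks
    counted in the image. The labels refer to positions in that order.
    """
    n = len(rdm)
    # link[i - 1]: distance from ordered object i to its closest predecessor
    link = [min(rdm[i][j] for j in range(i)) for i in range(1, n)]
    # The k-1 largest links mark where a new diagonal block starts.
    by_size = sorted(range(1, n), key=lambda i: link[i - 1], reverse=True)
    cuts = set(by_size[:k - 1])
    labels, current = [], 0
    for i in range(n):
        if i in cuts:
            current += 1
        labels.append(current)
    return labels


if __name__ == "__main__":
    rdm = [
        [0.00, 0.10, 0.90, 0.80],
        [0.10, 0.00, 0.85, 0.90],
        [0.90, 0.85, 0.00, 0.05],
        [0.80, 0.90, 0.05, 0.00],
    ]
    print(crisp_partition(rdm, k=2))
```

Because the labels follow the VAT order, they have to be mapped back through the ordering before they are compared with ground truth labels.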

Tweet document features are extracted through the four different topic models: LDA, LSI, PLSI, and NMF. Figures 2, 3, 4, and 5a show the results of visual health data clustering for these topic models. In these results, the S-MVCS-VAT shows the visual clusters in the form of diagonal square-shaped dark colored blocks with outstanding clarity under all four topic models.

Fig. 2
figure 2

Visual health data clustering results for big social health data (2 Topics)

Fig. 3
figure 3

Visual health data clustering results for big social health data (5 Topics)

Fig. 4
figure 4

Visual health data clustering results for big social health data (10 Topics)

Fig. 5
figure 5

a Visual health data clustering results for big social health data (15 Topics). b Crisp partitions for three data topics

The visual clusters of the proposed work with sampling viewpoints are very clear. The approaches with and without viewpoint sampling showed almost the same clarity of visual clusters.

Crisp partitions and the quality of the consequent clustering results depend on the clarity of the visual image clusters. The S-MVCS-VAT obtains social health data clustering results with reduced time and space costs. Four proposed variants are developed with the four specified topic models: the LDA-S-MVCS-VAT, LSI-S-MVCS-VAT, PLSI-S-MVCS-VAT, and NMF-S-MVCS-VAT. Figures 6, 7, and 8 compare the four variants of the existing and proposed models in terms of the speed, memory space, and time parameters, respectively. This empirical analysis shows that the proposed S-MVCS-VAT is faster and more memory efficient, and hence more scalable, than the other visual health data clustering models.

Fig. 6
figure 6

Speed parameter analysis of visual social health data clustering models compared with the MVCS-VAT

Fig. 7
figure 7

Memory space analysis of visual social health data clustering models (S-MVCS-VAT vs. MVCS-VAT)

Fig. 8
figure 8

Time analysis of visual social health data clustering models (S-MVCS-VAT vs. MVCS-VAT)

The performance or quality of the visual data clustering models is evaluated using four parameters: the cluster accuracy (CA) [19], normalized mutual information (NMI) [20], precision [21], and recall [21]. These values are given in Tables 2, 3, 4, and 5, respectively.

Table 2 Cluster Accuracy (CA) for the visual health data cluster models
Table 3 Normalized mutual information (NMI) for the visual health data cluster models
Table 4 Precision (P) for the visual health data cluster models
Table 5 Recall (R) for the visual health data cluster models

From the crisp partitions, the data object labels are predicted, and the performance of the visual health cluster models is evaluated by matching the predicted cluster labels against the ground truth labels using CA, NMI, precision, and recall.
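Two of these measures can be sketched with the standard library. The sketch assumes that CA is the best fraction of matching labels over all one-to-one mappings from predicted to true labels (found here by brute force, which is practical only for a small number of clusters) and that NMI divides the mutual information by the arithmetic mean of the two entropies. The definitions in [19] and [20] may use a different normalization. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def cluster_accuracy(truth, pred):
    """Best label-matching accuracy over one-to-one label mappings."""
    t_labels = sorted(set(truth))
    p_labels = sorted(set(pred))
    counts = Counter(zip(pred, truth))
    # Pad with None so that surplus predicted clusters can stay unmatched.
    padded = t_labels + [None] * max(0, len(p_labels) - len(t_labels))
    best = 0
    for perm in permutations(padded, len(p_labels)):
        matched = sum(counts[(p, t)] for p, t in zip(p_labels, perm))
        best = max(best, matched)
    return best / len(truth)


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information divided by the mean of the two entropies."""
    n = len(truth)
    ct, cp = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum((c / n) * math.log((c * n) / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    denom = (entropy(truth) + entropy(pred)) / 2
    return 1.0 if denom == 0.0 else mi / denom


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    pred = [0, 0, 0, 1, 2, 2]
    print(cluster_accuracy(truth, pred), nmi(truth, pred))
```

Both measures ignore the names of the cluster labels, which matters because the labels produced from a crisp partition are arbitrary.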

4.1 Critical observations

The proposed method uses only the sample viewpoints to assess the cluster tendency and data clustering results. Thus, the proposed method is faster than the MVCS-VAT. The proposed method yields crisp partition images with the best clarity and goodness-of-fit. The proposed work is able to produce quality clustering results for large social health data.

Table 6 presents the goodness-of-fit of the visual images of the existing and proposed methods and shows that the S-MVCS-VAT scored higher than the other methods under the four topic models.

Table 6 Goodness-of-fit of the visual images

The overall experimental analysis shows that the proposed S-MVCS-VAT method improved the accuracy by 5 to 10% under the four topic models NMF, LDA, LSI, and PLSI for big social health data.

5 Conclusion and future scope

Health data assessment is an emerging need in society. Twitter is one of the richest social sources for people to exchange views or opinions on any topic. Big social data comprising lakhs of tweets are extracted from Twitter. For lakhs of tweets, finding social health data clusters is very expensive. The recent visual technique, the MVCS-VAT, effectively conducts social health data cluster assessment with n-2 multiple viewpoints. The proposed work uses an efficient sampling strategy and four topic models to enhance the MVCS-VAT. Experiments are carried out on 18 different case studies, i.e., 18 different subsets of health datasets. Overall, these experiments show that the proposed S-MVCS-VAT improves the quality of social health data clusters by 5 to 10%. The goodness-of-fit of the visual cluster images is much improved in the S-MVCS-VAT for all these datasets. Two scalability parameters, i.e., computational time and memory, are measured for the proposed S-MVCS-VAT and the existing MVCS-VAT under different topic models for all 18 case studies (i.e., 2 topics to 15 topics; 2 topics to 5 topics in TREC 2018) carried out in the experimental work. The results show that the proposed S-MVCS-VAT is more scalable with respect to computational time and memory allocation. Future work can be extended to develop scalable visual techniques for the analysis of ailments and for socially recommended solutions.