Opinion Texts Clustering Using Manifold Learning Based on Sentiment and Semantics Analysis

Nowadays, opinion texts are quickly published on websites and social networks by various users in the form of short texts, in high volumes, and in various fields. Because these texts reflect the opinions of many users, their processing and analysis, such as clustering, can be very useful in a variety of applications including politics, industry, commerce, and economics. The high dimensionality of text representations decreases clustering efficiency, and an effective solution to this challenge is reducing the dimensions of the texts. Manifold learning is a powerful tool for nonlinear dimension reduction of high-dimensional data. Therefore, in this paper, to increase the efficiency of opinion texts clustering, manifold learning is used to reduce the dimensions of the represented opinion texts based on sentiment and semantics and to extract their intrinsic dimensions. Then, the clustering algorithm is applied to the dimension-reduced opinion texts. The proposed approach makes it possible to cluster opinion texts with simultaneous consideration of sentiment and semantics, which has received very little attention in previous works. This type of clustering helps users of opinion texts to obtain more useful information from the texts and also provides more accurate summaries in applications such as opinion text summarization. Experimental results on three datasets show better performance of the proposed approach on opinion texts in terms of important measures for evaluating clustering efficiency. An improvement of about 9% in accuracy is observed on the third dataset for clustering based on sentiment and semantics.


Introduction
In recent years, social networks and microblogs have expanded widely and have become good platforms for sharing users' opinions in the form of short texts in various fields [1]. There are many people on social networks who easily publish their opinions in various fields including politics, industry, trade, and economics [2]. Analysis of these opinion texts can be very important and influential in decision-making and policy-making in diverse fields. Due to the very high volume of the produced texts, it is not easy for users of these data to find what they need [3]. Therefore, there is a need for different techniques to automatically process these texts and extract the necessary information [4]. These techniques perform the required analysis on the texts in order to extract public opinions and determine the different polarities of opinions in political, social, economic, and other contexts [5]. Opinion mining is an emerging and ever-dynamic field with many challenges. Some of the general challenges of this field are given in [6].
Clustering is one of the challenges and techniques used for the analysis of opinion texts. Clustering is a descriptive unsupervised technique in data mining, in which data are grouped into clusters so that similar samples are placed inside a cluster with respect to specific measures, while dissimilar or less similar ones are placed in other clusters [7]. The main strength of clustering is that it can detect clusters without prior knowledge [8]. Opinion texts clustering can be used in applications such as text summarization [9], topic detection [10], sentiment analysis [11], question-answering systems [12], and recommender systems [13].
For opinion texts clustering, it is necessary to represent the texts in vector form [14]. In general, the methods used for the representation of texts are divided into two categories: statistics-based and semantics-based methods. In statistics-based methods, the text is represented based on the number of repetitions of each word in the text. Common methods in this category include term frequency-inverse document frequency (TF-IDF) [15] and latent semantic analysis (LSA) [16]. Important advantages of statistics-based representation models are intuitive, simple, and language-independent representation. Important disadvantages of statistics-based representation models are the loss of all word order information, loss of word meaning, high dimensionality, and data sparsity problems [17]. In semantics-based methods, the text is represented based on the semantics and syntax of words. Common methods in this category include Word2Vec [18] and Doc2Vec [19]. Important advantages of semantics-based representation models are the use of semantics for representation, low-dimensional representation, and the absence of data sparsity problems. Important disadvantages of semantics-based representation models are the need for a semantics dictionary, language dependence, and computation cost [17]. As opinion texts are short and sparse, this sparse representation and their high dimensions have posed a major challenge to the clustering of such texts [20]. Using Word2Vec and Doc2Vec as text representation models solves the problem of the sparse representation of short texts and, to some extent, the problem of representing high-dimensional texts [21], but these methods represent the text with 200-500 dimensions, so the problem of high dimensionality remains.
High-dimensional data increase the computational time and memory required for processing; moreover, the existence of noise and the low number of samples compared to the high dimensionality of the data influence processing efficiency negatively.
Thus, to reduce the problems of such data, dimension reduction is considered [22]. Dimension reduction methods take samples x with D features and generate samples y with d features, in which d << D.
f: {x_1, x_2, ..., x_n} ∈ R^D → {y_1, y_2, ..., y_n} ∈ R^d. (1)

Dimension reduction is one of the first steps in efficient data analysis and is an important preprocessing step in many fields of information processing including data retrieval, pattern recognition, text processing, data visualization, and data compression. Also, reducing text dimensions can increase clustering efficiency on texts [8, 23]. In terms of clustering with/without dimension reduction, an overview of opinion texts clustering is presented in Figure 1.
In this paper, opinion texts are clustered using nonlinear dimension reduction methods. Manifold learning, as a nonlinear dimension reduction method, is used, and the dimensions of the opinion texts are reduced based on sentiment and semantics. Then, the clustering algorithm is applied to these dimension-reduced texts. Reducing the dimensions of the represented opinion texts makes it possible to optimally represent these texts based on sentiment and semantics. For this purpose, the dimensions of these texts are reduced based on the sentiment and semantics of the text, and their intrinsic dimensions are extracted by applying the ISOMAP algorithm to the texts represented in Doc2Vec format. Then, the K-Means algorithm [24] is applied to the resulting low-dimensional texts, and clustering is performed. Nonlinear dimension reduction methods are often more powerful than linear methods, because the relationship between intrinsic and measured features may be much richer than a linear one. Among the different nonlinear dimension reduction methods, manifold learning methods have shown desirable results on various datasets. Because the relationship between the intrinsic features of opinion texts and the observed features in these texts is nonlinear, manifold learning is used in this paper to reduce the dimensions of the texts. This paper has two main contributions:
(1) A new approach is presented to reduce the dimensions of opinion texts and represent them based on sentiment and semantics by manifold learning.
(2) This approach makes it possible to cluster opinion texts with high efficiency and simultaneous consideration of sentiment and semantics after the opinion texts are represented in a low-dimensional manner based on sentiment and semantics.
The rest of the paper is organized as follows. In Section 2, the related and previous literature in the field of this study is reviewed.
In Section 3, manifold learning is explained and the ISOMAP algorithm, one of the common manifold learning algorithms, which is used here to reduce dimensions, is investigated. Section 4 describes the proposed method for opinion texts clustering in detail. In Section 5, the simulations performed to evaluate the efficiency of the proposed method and the relevant results are presented. Finally, the conclusion is provided in Section 6.

Related Works
Dimension reduction is a successful method for opinion texts clustering. In this section, first, the previous methods proposed for opinion texts clustering by reducing their dimensions are reviewed. It is noteworthy that manifold learning, as a nonlinear dimension reduction method, has not been used for clustering of high-dimensional opinion texts yet, but it has been applied to represent the texts with reduced dimensions. In the second part of this section, the previous methods presented for reducing text dimension using manifold learning will be reviewed.

Opinion Texts Clustering by Dimension Reduction.
High dimensionality of data is a barrier to extracting useful information, and dimension reduction techniques are used as a successful method to overcome this challenge. In the previous works presented for opinion texts clustering using dimension reduction, mostly linear dimension reduction methods, including principal component analysis (PCA), singular value decomposition (SVD), nonnegative matrix factorization (NMF), and linear discriminant analysis (LDA), have been used, and in some cases, feature selection methods have been applied. In [25-30], first, the dimensions of the opinion texts were reduced by linear dimension reduction methods, and then, clustering was performed. In [25], the TF-IDF model was used to represent the text for clustering; then, dimensions were reduced using the SVD and NMF methods. Finally, clustering was performed on the dimension-reduced texts by the K-Means algorithm. Simulation on the 20 Newsgroups dataset showed an improvement in clustering performance when the SVD and NMF dimension reduction methods were utilized. In [26], the short text clustering-linear discriminant analysis (SKP-LDA) method was presented for the clustering of short Chinese texts. This method was proposed to solve the problem of the low efficiency of sentiment analysis and semantics extraction and, as a result, low clustering accuracy. In the proposed method, first, a word bag was defined based on the synchronicity of sentimental words, and then, a definition of topic-specific words and topic-related words was provided. For improving the quality of semantics analysis, information from topic-specific words and topic-related words was entered into the LDA model. Finally, 30 high-ranking special word sets obtained by the LDA model were clustered by the K-Means algorithm. In [27], different graph-based clustering methods based on dimension reduction were presented for the clustering of microblogs.
The results of the simulation done on several datasets collected from microblogs showed that the proposed methods performed better than standard and classical text clustering algorithms. Shortness, noisy nature, and the large volume of texts in microblogs are among the main challenges of clustering these texts. In [28], for the clustering of short texts, first, the texts and their descriptions were converted into a low-density representation using a convolutional neural network (CNN). Then, through this representation, the similarity between each text and other comments was measured. On the other hand, given that short texts must be strongly related to their descriptions, the differences and similarities between related and unrelated descriptions were taken into account in CNN training. Finally, the representation was expanded with multidimensional properties obtained from location, time, and other information. This provides a strong representation of short texts and increases clustering efficiency. In [29], another method was presented for the clustering of microblog texts, which uses the feature selection technique as a dimension reduction method. In the proposed method, after preprocessing of the texts, the LDA model is used as a topic modeler. This model is used to specify a property that is compatible with the topic of a database and provides a list of reduced properties that can be used to represent any topic in the entire database. Finally, for determining the cluster of each tuple, the Hamming distance between each topic's property vector and the tuple is measured, and the cluster with the minimum distance is selected as the final cluster for the tuple. This process continues until the entire database is clustered. The proposed method was applied to a microblog database related to four incidents, and the simulation results showed better performance of the proposed method compared to several existing clustering techniques.
In [30], a hybrid dimension reduction method was presented for opinion texts clustering. In the proposed method, dimension reduction was performed by combining feature selection and feature extraction methods. For this purpose, first, two sets of features were selected from the primary features by two attribute selection methods. Then, these two feature sets were merged using one of the three methods of Union, Intersection, and Modified Union. In the Union method, all the properties of the two sets were selected. In the Intersection method, only the common features between the two feature sets were selected. In the Modified Union method, first, the Union method was applied to the top-ranked properties of the two feature sets, and then, the Intersection method was utilized for the other features. Next, the integrated feature set was reduced by the PCA feature extraction method, and the final feature set was clustered using the K-Means algorithm. Finally, the sentiment score of each cluster was calculated using the SentiWordNet database.

Dimension Reduction of Texts Using Manifold Learning.
Manifold learning is one of the successful methods of dimension reduction and has been used in various studies to reduce the dimensions of texts. In [31], a new method was provided for the representation of biomedical sentences based on manifold learning. In the proposed method, first, biomedical sentences were represented using a trained sentence representation method. Then, a neighborhood representation graph was constructed using manifold learning to determine the local geometric structure of the sentences. In this way, the basic rules of the sentences were revealed and re-embedded into the sentence representation using manifold learning. As a result, the geometric structure of the relationships between sentence representations was effectively described, having a positive effect on the efficiency of subsequent operations performed on the representation. In [32], a simple and effective approach was introduced to represent words. This approach works in a postprocessing manner and removes the top components from all words. This simple operation can be applied to word embeddings in downstream tasks or used as initialization to train task-specific embeddings. In [33], manifold learning was used to reduce the dimensions of texts while maintaining their semantic distances. The main goal was preserving the semantic connections between the texts while reducing dimensions from high to low. The proposed method uses manifold learning and reduces dimensions by preserving the mutual information between the texts. This method has been used to summarize texts. Large volumes, on the one hand, and high dimensionality, on the other hand, have posed a major challenge to operations such as the classification and clustering of big data. In [34], a new method was proposed to reduce the dimensions of big data.
This method uses the ISOMAP and LLE algorithms to perform dimension reduction. For evaluating the efficiency of the proposed method, the SVM and Random Forest classification algorithms were used to classify the dimension-reduced data, confirming the good performance of the proposed dimension reduction method. Word embedding methods seek to discover a space based on a Euclidean measure to map words into vectors, which is performed based on the co-occurrence of words in a corpus. Word embeddings may misjudge the similarity between words. For solving this problem, a method was proposed in [35] to re-embed pretrained word embeddings through a manifold learning step.
This method also takes into account the geometric information of the words, which helps to better estimate the similarity between the words. In [36], statistical manifold learning was used to represent texts. The proposed method is an effective text learning framework whose main purpose is using hidden topics to represent and measure texts. Assuming that words with the same topic follow the same Gaussian distribution, the texts are represented as a combination of topics. In [37], a word representation method was introduced for retrieving a semantic space criterion.
This method integrates existing word representation algorithms and applies the manifold learning algorithm to them. For this purpose, the word co-occurrences in the corpus were compared with the semantic similarity of words, and it was shown that the co-occurrence of words is consistent with the Euclidean semantic space hypothesis. Table 1 summarizes the related works reviewed in Sections 2.1 and 2.2 to help compare the studies. In cases where the dimension reduction method is listed as "manifold learning", the algorithm used is not specified in the relevant paper and may include any manifold learning algorithm.

Manifold Learning
In general, there are two methods to reduce the dimensions of high-dimensional data: feature selection and feature extraction. In the feature selection method, some features are selected from the initial features. To do this, the features with greater power to distinguish samples are chosen. In the feature extraction method, new features are generated based on the initial features, so that the number of features is less than the original one. In feature extraction, the generated features are new features in another space, and no direct correspondence can be found between the initial and generated features.
Dimension reduction based on feature extraction methods is divided into two categories: linear and nonlinear [38]. Linear methods are suitable in cases where there is a linear relationship between the data, which is also considered a limitation of these methods: they cannot manage nonlinear and complex real-world data. The most popular linear dimension reduction methods are PCA [39], LDA [40], NMF [41], and SVD [42]. Nonlinear dimension reduction methods, often known as manifold learning, are used in cases where there is no linear and apparent relationship between the data, and they can overcome the limitations of linear methods [43]. The goal of manifold learning is mapping a set of high-dimensional data to a set of low-dimensional data. The reason for the name "manifold learning" is that this method tries to construct a manifold from a set of learning points. It is a powerful tool for nonlinear dimension reduction of data, in which the intrinsic parameters of the system, as the main factors distinguishing data from each other, are identified, and the whole set is placed on a manifold that expresses the actual relationship of the parameters. Thus, the relationship between the data is expressed in a low-dimensional space [38]. If the primary dataset has D dimensions, the manifold learning problem is determining the position of each record of the primary data in a d-dimensional space, where d is much smaller than D. Figure 2 shows an overview of the main purpose of manifold learning.
ISOMAP [45] and LLE [46] are the most famous manifold learning algorithms. The main idea in these methods is reducing the system's dimensions while maintaining the relationships between the data, which can be expressed as maintaining distances. In this paper, the ISOMAP algorithm is used to reduce text dimensions, as explained in the following. The ISOMAP algorithm, a global method, is among the first manifold learning algorithms; it preserves the geometric features of the input dataset and reduces dimensions nonlinearly by maintaining the distances between the data. This algorithm is an extension of the linear dimension reduction method of multidimensional scaling (MDS), a classic method for embedding heterogeneous information in a Euclidean space. The ISOMAP algorithm maps high-dimensional data to low-dimensional data by maintaining the distance between each data pair and consists of three general steps: (1) finding the neighboring points of each data point, (2) calculating the geodesic distances between the data points, and (3) applying the MDS algorithm (Algorithm 1). In Step 1, the k-nearest neighbors are calculated for each data point. To do this, first, a matrix of distances or similarities between the data points is created, and then, based on this matrix, the k-nearest neighbors of each data point are found and a corresponding matrix is created. The geodesic distance between two data points is the shortest path between the two points on the graph, which can be determined in Step 2 by shortest-path algorithms, such as Dijkstra's and Floyd's algorithms. Finally, in Step 3, the MDS algorithm is applied and the final dimension reduction step is performed. The MDS algorithm maps high-dimensional data to low-dimensional data. After calculating the geodesic distances and storing them in matrix D, equation (2) is established.
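As an illustration, the first two ISOMAP steps described above can be sketched in pure Python. This is a minimal sketch, not the paper's implementation: it uses a brute-force k-nearest-neighbor search for Step 1 and the Floyd-Warshall algorithm for the geodesic distances of Step 2; Step 3 (MDS) is omitted.

```python
import math

def knn_graph(points, k):
    """Step 1: weighted k-nearest-neighbor graph (edge weight = Euclidean distance)."""
    n = len(points)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph symmetric

    return graph

def geodesic_distances(graph):
    """Step 2: all-pairs shortest paths (Floyd-Warshall) approximate the geodesics."""
    n = len(graph)
    d = [row[:] for row in graph]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d

# Four points on a line: the geodesic from one end to the other is the sum of the hops.
points = [(0.0,), (1.0,), (2.0,), (3.0,)]
geo = geodesic_distances(knn_graph(points, k=1))
```

On this toy input, the end-to-end geodesic distance is 3.0, the sum of the three unit hops along the neighborhood graph.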
Then, using equation (3), the eigenvalues and eigenvectors of the matrix X^T X are calculated. In this equation, λ is the diagonal matrix of eigenvalues and U is the orthogonal matrix of eigenvectors.

Table 1: Summary of the related works (continued).
LDA: Texts modeled and reduced by LDA, clustered by K-Means
[27] LDA and TF-IDF: Texts modeled and reduced by LDA and TF-IDF, clustered by graph-based methods
[28] CNN: Texts and their descriptions converted to a low-density representation by CNN
[29] Feature selection: Texts reduced by feature selection, modeled by LDA
[30] Feature selection and extraction: Texts reduced by a combination of feature selection and extraction methods, clustered by K-Means
Dimension reduction of texts using manifold learning:
[31] Manifold learning: Biomedical sentences reduced by manifold learning methods
[32] Manifold learning: Words represented in a simple and effective approach by manifold learning
[33] LLE: Text reduction by LLE while maintaining semantic distances
[34] LLE and ISOMAP: Big data dimension reduction by LLE and ISOMAP
[35] Manifold learning: Word embedding by manifold learning
[36] Manifold learning: Text representation by statistical manifold learning
[37] SVD: Word representation for retrieving a semantic space criterion by SVD

Figure 2: The main purpose of manifold learning [44].

Finally, the required mapping to the low-dimensional space is performed by keeping the eigenvectors corresponding to the d largest eigenvalues.

Proposed Method
As mentioned in the previous sections, the purpose of this study is to improve opinion texts clustering by reducing their dimensions using manifold learning. Therefore, the proposed method mainly focuses on opinion texts clustering with simultaneous consideration of sentiment and semantics. For this purpose, after preparing the texts for clustering, first, their dimensions are reduced using the ISOMAP algorithm, and then, clustering is performed. The dimension reduction step is performed in three modes: based on semantics, based on sentiment, and based on a combination of the two. In fact, dimension reduction based on a combination of sentiment and semantics makes it possible to cluster opinion texts in terms of both sentiment and semantics, which is necessary for many applications, while very little attention has been paid to this important issue in previous works. Figure 3 shows an overview of the proposed method. This method is described in detail in the following sections.

Acquisition of Opinion Texts and Preprocessing.
In the acquisition step, the opinion texts to be clustered are received in the form of datasets. The datasets used in this paper have two main fields. The first field is a text field and contains the opinion written by an individual on a website or social network; this is the field that is analyzed and clustered. The second field is a label field and contains the label of the actual cluster of the opinion text, i.e., the actual cluster to which the text field belongs. For clustering, the text field is processed and clustered; the cluster field is not used during clustering, but it is used after the completion of clustering to evaluate clustering performance. The main purpose of text preprocessing is to minimize structural and writing errors in the text. Due to the shortness of opinion texts, abbreviations, irregular expressions, and infrequent words are widely used in these texts; these produce high noise levels and influence the quality of text analysis. For solving this challenge, different preprocessing techniques should be used [2, 47]. In the proposed method, preprocessing was performed with high sensitivity, and an attempt was made to greatly reduce the noise of the opinion texts and also bring the texts as close as possible to structured texts. The preprocessing performed in the presented method consists of two categories of methods. The first category removes unwanted and noisy elements from the texts; the methods used for this purpose are removing duplicate letters, chunks, emojis, emails, tweet signs, hashtags, extra spaces, and special characters. The second category of methods tries to take the texts out of their unstructured state and bring them into a structured state as much as possible; the methods used for this purpose are converting acronyms, removing contents in parentheses, lowercasing, removing stop words, and lemmatizing.
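A minimal sketch of both categories of cleaning is given below. The regular expressions and the stop-word list are illustrative only; a full pipeline would also handle duplicate letters, emojis, acronym conversion, and lemmatization (e.g., via NLTK).

```python
import re
import string

# Illustrative stop-word list; a real pipeline would use a fuller list (e.g., NLTK's).
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "to", "of", "in"}

def preprocess(text):
    """Lightweight cleaning in the spirit of the two categories described above."""
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[#@]\w+", " ", text)        # strip hashtags and mentions
    text = re.sub(r"\(.*?\)", " ", text)        # drop contents in parentheses
    text = text.lower()                         # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()    # collapse extra spaces
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

cleaned = preprocess("Loving the NEW phone (so far)!!! #apple @john http://t.co/x")
```

Here the hypothetical tweet is reduced to "loving new phone": the URL, mention, hashtag, parenthesized aside, punctuation, and stop word are all removed.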

Dimension Reduction of the Opinion Texts by the ISOMAP Algorithm.
After preprocessing, the texts are converted into vector form. For this purpose, the Doc2Vec model was used in the proposed method. Although the Doc2Vec model takes the representation of the texts out of sparse mode and greatly reduces the dimensions of the text, the representation still has high dimensions; in the proposed method, a text represented with the Doc2Vec model has 300 dimensions. For increasing the efficiency of the analysis of these texts, it is necessary to reduce the text dimensions. To do so, first, the intrinsic dimensions of the texts need to be estimated. For this purpose, a scree plot is created, in which the index of the text elements is placed on the horizontal axis, the corresponding eigenvalue is placed on the vertical axis, and a curve is drawn based on these values. Then, elbow points are marked on the curve, each of which is a candidate for the intrinsic dimension of the texts. After estimating the intrinsic dimension, the ISOMAP algorithm, with its three main steps, is used to reduce the text dimensions. The first and most important step is building a similarity matrix. In the proposed method, for constructing the neighborhood graph, a similarity matrix of the texts is extracted first. In this paper, the opinion texts are clustered based on semantics, based on sentiment, and based on a combination of the two. Therefore, the similarity matrix is created according to the mode in which the dimensions are reduced.
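The elbow points on the scree curve can also be approximated programmatically. The heuristic below, which is an assumption of this sketch rather than the paper's procedure, ranks candidate dimensions by the size of the drop to the next eigenvalue; the eigenvalue curve itself is hypothetical.

```python
def elbow_candidates(eigenvalues):
    """Rank candidate intrinsic dimensions (1-based) by the size of the drop
    from each eigenvalue to the next; the biggest drops mark elbow points."""
    drops = [
        (eigenvalues[i] - eigenvalues[i + 1], i + 1)
        for i in range(len(eigenvalues) - 1)
    ]
    return [dim for _, dim in sorted(drops, reverse=True)]

# Hypothetical scree curve with a sharp drop after the 3rd eigenvalue.
ev = [9.0, 7.5, 6.8, 1.2, 1.0, 0.9]
best_dim = elbow_candidates(ev)[0]
```

On this curve, the largest drop (6.8 to 1.2) occurs after the third eigenvalue, so 3 is the strongest candidate for the intrinsic dimension.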
In the first mode, for clustering the texts based on semantics, the similarity matrix is constructed based on the semantic distance between the texts. The metric used to measure the semantic distance is the Euclidean distance:

d(t_i, t_j) = sqrt( Σ_{k=1..n} (t_ik − t_jk)^2 ),
where t_i and t_j are two opinion texts represented by the Doc2Vec model, n is the length of the Doc2Vec vectors, and t_ik is the kth element of the Doc2Vec vector of the ith opinion text. In the second mode, the similarity matrix is constructed based on the sentiment distance between the texts. The Valence Aware Dictionary for Sentiment Reasoning (VADER) tool [48] was used to measure the sentiment of each text. VADER is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed on social media. This tool is sensitive to both the polarity (positive/negative) and the intensity (strength) of opinion texts. VADER is included in the NLTK package and can be applied directly to unlabeled opinion texts. For constructing the sentiment similarity matrix, first, the degree of sentiment of each text is determined using the VADER tool, and then, the sentiment distance between all the texts is measured by calculating the difference in the degrees of sentiment of the texts in question. The degree of sentiment indicates the amount of sentiment in the text and reveals the degree of positivity/negativity of the topics of each text [3].
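Assuming each text's degree of sentiment has already been computed (e.g., as a VADER compound score in [−1, 1] from NLTK's SentimentIntensityAnalyzer; the paper does not specify which VADER score it uses, so this is an assumption), the sentiment distance matrix can be sketched as the pairwise absolute differences of those scores:

```python
def sentiment_distance_matrix(scores):
    """DSN[i][j] = |s_i - s_j|: the difference in degree of sentiment
    between opinion texts i and j."""
    return [[abs(si - sj) for sj in scores] for si in scores]

# Hypothetical VADER compound scores for four opinion texts.
scores = [0.8, 0.6, -0.4, -0.9]
DSN = sentiment_distance_matrix(scores)
```

Texts with similar sentiment (e.g., the first two) end up close in this matrix, while a strongly positive and a strongly negative text are far apart.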
In the third mode, where the main goal is clustering the texts based on both the sentiment and semantics of the texts, in constructing the similarity matrix, first, the semantic distance between the texts is calculated using the cosine similarity metric (equation (6)) and is stored in a matrix. Then, the sentiment distance between the texts is measured using the VADER method and is stored in another matrix. Since, in this mode, clustering considers semantics and sentiment simultaneously and it is necessary to combine the semantics and sentiment of the texts, the cosine similarity metric shows better efficiency for the semantic part of this combination.
cos(t_i, t_j) = (Σ_{k=1..n} t_ik · t_jk) / (sqrt(Σ_{k=1..n} t_ik^2) · sqrt(Σ_{k=1..n} t_jk^2)), (6)

where t_i and t_j are two opinion texts represented by the Doc2Vec model, n is the length of the Doc2Vec vectors, and t_ik is the kth element of the Doc2Vec vector of the ith opinion text.
Eventually, the final similarity matrix is created by combining these two matrices using Algorithm 2. In this algorithm, DSM is the matrix that stores the semantic distances between the texts, and DSN is the matrix that stores the sentiment distances between the texts. The two matrices are combined according to the procedure specified in the algorithm, forming the final similarity matrix D. In combining the semantics and sentiment similarity matrices to construct the distance matrix, the main contribution comes from the sentiment similarity matrix, while, depending on the values of this matrix, the semantics matrix modulates its effect. The constants used in combining these matrices were obtained through experiments. After constructing the similarity matrix between the texts, the neighborhood graph of the texts is constructed; then, the MDS algorithm is applied to this graph and the dimension-reduced vectors are obtained.
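Since the experimentally tuned constants of Algorithm 2 are not reproduced here, the combination can only be sketched with hypothetical weights. The simple weighted sum below, with a sentiment-dominant weight alpha, is an illustrative stand-in, not the paper's exact Algorithm 2.

```python
def combine_matrices(DSM, DSN, alpha=0.7):
    """Hypothetical combination: the sentiment distance DSN dominates (weight
    alpha) and the semantic distance DSM modulates the rest. Illustrative only;
    the paper's Algorithm 2 uses experimentally tuned constants."""
    n = len(DSM)
    return [
        [alpha * DSN[i][j] + (1 - alpha) * DSM[i][j] for j in range(n)]
        for i in range(n)
    ]

# Tiny 2x2 example matrices (semantic and sentiment distances).
DSM = [[0.0, 0.2], [0.2, 0.0]]
DSN = [[0.0, 1.0], [1.0, 0.0]]
D = combine_matrices(DSM, DSN)
```

With alpha = 0.7, a pair of texts that is semantically close but sentimentally distant still receives a large combined distance, reflecting the sentiment-dominant design described above.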

Clustering of Dimension-Reduced Opinion Texts.
After performing dimension reduction on the opinion texts, the final step, clustering, takes place. In this step, the dimension-reduced texts, stored as vectors, are clustered using the clustering algorithm. The clustering algorithm used in this paper is the K-Means algorithm, a very famous and widely used clustering algorithm with high efficiency in text clustering. The pseudocode of the K-Means algorithm is given in Algorithm 3.
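The K-Means procedure can be sketched in a few lines of Python. This minimal sketch deviates from the pseudocode in one respect, stated here as an assumption: the initial centroids are the first k points rather than random ones, so that runs are reproducible.

```python
import math

def k_means(data, k, max_iter=100):
    """Minimal K-Means: assign points to the closest centroid, recompute
    centroids as cluster means, and stop when assignments no longer change."""
    centroids = [list(p) for p in data[:k]]  # deterministic initialization
    assignment = [0] * len(data)
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in data
        ]
        if new_assignment == assignment:
            break  # centroids (and assignments) no longer change
        assignment = new_assignment
        # Recompute each centroid as the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Two well-separated 2-D groups.
data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
labels, _ = k_means(data, k=2)
```

On these four points, the two tight groups are recovered as two clusters.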

Experiments and Evaluation
Simulations were performed to show the effectiveness of the proposed approach on the clustering performance of opinion texts. The datasets used, the performance measures, and the simulation results are presented in the following.

Datasets.
For increasing the validity of the evaluation of the proposed approaches, various datasets should be used.

Input: N data points to be clustered and k, the number of clusters
Output: k clusters of the input data
Algorithm:
Randomly choose k data points as the centroids of the clusters
Repeat
Assign each data point to its closest centroid
Compute the new centroid of each cluster
Until the cluster centroids do not change
ALGORITHM 3: The pseudocode of the K-Means clustering algorithm.

Therefore, three diverse datasets are used in the simulation. The datasets used are shown in Table 2. The "Search Snippets" dataset contains 2280 texts collected by Google search, which are grouped into 8 clusters, namely, Business, Computers, Culture-Art-Entertainment, Education-Science, Engineering, Health, Politics-Society, and Sports, according to the semantics of the texts. The "Twitter Dataset" includes 2000 tweets collected from Twitter, which are tagged based on sentiment into positive and negative clusters and are related to various topics, such as Sports, Saints, Funny Images, etc. The "Twitter-Sentiment-Corpus-3" dataset contains 3424 tweets collected from Twitter on 4 topics, which are clustered into three clusters of positive, negative, and neutral based on sentiment and into four clusters of Apple, Google, Facebook, and Microsoft according to semantics. In this paper, 1091 tweets with positive or negative sentiment were selected from this dataset and used in three labeled modes. In the first mode, the 1091 tweets formed four clusters of Apple, Google, Facebook, and Microsoft based on semantics. In the second mode, the 1091 tweets formed two clusters, positive and negative, based on sentiment. In the third mode, considering semantics and sentiment simultaneously, the 1091 tweets formed 8 clusters: Apple-Positive, Apple-Negative, Google-Positive, Google-Negative, Facebook-Positive, Facebook-Negative, Microsoft-Positive, and Microsoft-Negative.

Evaluation Measures.
In this paper, various measures were used to evaluate the efficiency of the clustering methods, including accuracy, precision, recall, F-score, adjusted Rand index (ARI), normalized mutual information (NMI), completeness, and homogeneity.
Accuracy. This measure determines what percentage of the texts are properly clustered and placed in their respective clusters. In other words, accuracy is the closeness of the predicted clusters to the true clusters.
F-Score. It is the harmonic mean of precision and recall and helps to achieve a trade-off between them; if one is strengthened while the other is weakened, this metric quickly decreases.
ARI. This measure is the corrected-for-chance version of the Rand index (RI) and determines the degree to which the true and clustered labels match each other.
In this regard, the value of RI is calculated as RI = (number of agreeing pairs) / (number of pairs).
NMI. This measure calculates the statistical similarity between the created clusters and the predefined labels.
Here, C is a random variable representing the cluster assignments of the points, and K is a random variable representing the class labels. Also, I(C; K) = H(C) − H(C|K) is the mutual information between the variables C and K, H(C) is the entropy of C, and H(C|K) is the conditional entropy of C given K.
Completeness. This measure determines the degree of completeness of the clustering algorithm. A complete clustering is achieved when all the data belonging to the same class are assigned to the same cluster.
For calculating some of the above measures, TP, TN, FP, and FN are required. These values are defined and calculated as follows:
TP: if a pair of observations in the same category is placed in the same cluster, the clustering result for this pair is counted as TP.
TN: if a pair of observations in two separate categories is placed in two separate clusters, the clustering result for this pair is counted as TN.
FP: if a pair of observations in two separate categories is placed in the same cluster, the clustering result for this pair is counted as FP.
FN: if a pair of observations in the same category is mistakenly placed in two separate clusters, the clustering result for this pair is counted as FN.
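As an illustration, the pair-based counts above and the measures derived from them (RI, precision, recall, and F-score) can be computed as follows. This is a minimal sketch; the function names are ours.

```python
from itertools import combinations

def pair_counts(true_labels, pred_labels):
    """Count TP/TN/FP/FN over all pairs of observations, as defined above."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_category = true_labels[i] == true_labels[j]
        same_cluster = pred_labels[i] == pred_labels[j]
        if same_category and same_cluster:
            tp += 1
        elif not same_category and not same_cluster:
            tn += 1
        elif not same_category and same_cluster:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def pairwise_scores(true_labels, pred_labels):
    """RI, pairwise precision/recall, and F-score from the pair counts."""
    tp, tn, fp, fn = pair_counts(true_labels, pred_labels)
    ri = (tp + tn) / (tp + tn + fp + fn)  # number of agreeing pairs / number of pairs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"RI": ri, "precision": precision, "recall": recall, "F": f_score}
```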

Experimental Results and Discussion.
In this section, the results of evaluating the efficiency of the proposed method on each of the datasets are given. Recently, various methods have been proposed for clustering; for example, successful clustering methods are proposed in [49], but these methods are not intended for opinion texts clustering. On the other hand, K-Means is the most successful and common clustering method used in opinion texts clustering; therefore, this algorithm is used as the basic algorithm. For a more accurate evaluation, the proposed method was compared with the basic algorithm (K-Means) and with the basic algorithm combined with linear dimension reduction methods, namely the PCA and SVD algorithms, which have been widely used in previous opinion texts clustering research [50]. Also, the proposed method was compared with the basic algorithm combined with a nonlinear dimension reduction method (the KPCA algorithm).

Intrinsic Dimension of Opinion Texts.
For determining the intrinsic dimension of opinion texts, first, the intrinsic dimension of the texts is estimated by the proposed method, as explained in Section 4.2. Given that several intrinsic dimensions were estimated on the datasets, to obtain the final intrinsic dimension, experiments are performed and the clustering accuracy on each dataset is obtained. Then, the final dimensions are determined based on the highest obtained accuracy. The results of the experiments performed on the different datasets are shown in Figure 4, in which the horizontal axis shows the estimated intrinsic dimension and the vertical axis shows the accuracy obtained for each estimated intrinsic dimension.
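The selection step described above (embed at each candidate dimension, cluster, and keep the dimension with the highest accuracy) can be sketched as follows, assuming scikit-learn's Isomap and K-Means and a Hungarian-matching definition of clustering accuracy; the candidate dimensions and the n_neighbors value are illustrative, not those of the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.manifold import Isomap
from sklearn.cluster import KMeans

def clustering_accuracy(true_labels, pred_labels):
    """Best-match clustering accuracy via Hungarian alignment of cluster labels."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    n = max(true_labels.max(), pred_labels.max()) + 1
    cost = np.zeros((n, n))
    for t, p in zip(true_labels, pred_labels):
        cost[t, p] -= 1  # negative counts, so the assignment maximizes matches
    rows, cols = linear_sum_assignment(cost)
    return -cost[rows, cols].sum() / len(true_labels)

def best_intrinsic_dimension(X, true_labels, k, candidate_dims, n_neighbors=5, seed=0):
    """Pick the candidate dimension whose Isomap embedding clusters most accurately."""
    scores = {}
    for d in candidate_dims:
        Z = Isomap(n_neighbors=n_neighbors, n_components=d).fit_transform(X)
        pred = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
        scores[d] = clustering_accuracy(true_labels, pred)
    return max(scores, key=scores.get), scores
```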

Results.
Since the Search Snippets dataset is labeled only by semantics, only semantic-based dimension reduction was applied to it, and the results are shown in Table 3. It should be noted that each clustering method was run 50 times, and the average of each measure is reported in the tables. Table 3 shows the results regarding the performance of the proposed method and the compared methods on the Search Snippets dataset. As can be seen, on this dataset, dimension reduction does not have a positive effect on the performance of the clustering algorithms, and the best performance was achieved by the K-Means algorithm without dimension reduction.
Given that the Twitter Dataset is only labeled based on sentiment, the sentiment-based dimension reduction was applied to this dataset, and the results are shown in Table 4. As can be seen in Table 4, in this dataset, the proposed method has a positive effect on opinion texts clustering and better results are obtained.
The Twitter-Sentiment-Corpus-3 dataset is labeled based on semantics, sentiment, and a combination of them. Therefore, three dimension-reduction modes were applied to this dataset, and the results are shown in Tables 5-7. Table 5 shows the simulation results on the Twitter-Sentiment-Corpus-3 dataset for the mode in which the dataset is labeled based on semantics and, in the proposed method, dimension reduction is done only based on semantics. As can be seen, in this mode, the proposed method also shows the best performance compared to the other methods. Table 6 shows the simulation results on the same dataset for the mode in which the dataset is labeled based on sentiment and dimension reduction is done only based on sentiment. As can be seen in Table 6, in this mode, the proposed method also improves clustering performance. Table 7 shows the simulation results for the mode in which the dataset is labeled based on both sentiment and semantics; accordingly, in the proposed method, dimension reduction is done based on sentiment and semantics. In this mode, the proposed method also has a positive effect on opinion texts clustering.
Given that accuracy is one of the most important measures for evaluating the efficiency of clustering methods, Figure 5 shows a comparison between the accuracy of the proposed method and that of the other methods.

Discussion.
According to the results presented in Tables 3-7 and Figure 5, dimension reduction using the proposed method improved clustering performance, except on the Search Snippets dataset. This result can be due to the fact that this dataset contains texts collected from Google search, which are structured and somewhat free of abbreviations, irregular expressions, and infrequent words, so these texts contain a minimal amount of noise. Meanwhile, the other datasets contain tweets collected from Twitter, which are unstructured and noisy, and on these datasets, dimension reduction led to improved clustering efficiency. The simulation results on the Twitter Dataset show that the proposed method yielded an improvement of about 2.5% in accuracy as well as in the other measures and an improvement of about 8% in NMI. The results on this dataset show that the proposed method can have a positive effect on opinion texts clustering. The simulation results on the Twitter-Sentiment-Corpus-3 dataset demonstrate that the performance of the proposed method is reliable. Table 5 shows the results of clustering based on semantics. According to these results, the proposed method has a positive effect on opinion texts collected from social networks; in this case, it yields an improvement of about 10% in accuracy and about 17% in NMI. The results presented in Table 6 also confirm the good and acceptable performance of the proposed method in sentiment-based clustering; in this case, the proposed method achieved an improvement of about 21% in accuracy.
As stated previously, the main purpose of the proposed method is opinion texts clustering based on sentiment and semantics, the results of which are given in Table 7. As can be seen in Table 7, in this case as well, the proposed method performs positively and yields improvements of about 9% and 1% in accuracy and NMI, respectively. This good performance of the proposed method can be attributed to several reasons. First, dimension reduction leads to an optimal representation of the texts and subsequently to high efficiency in processing them, and, as observed, it also results in higher clustering quality. Second, the dimension reduction is guided by the purpose of the clustering, which is also useful in increasing the accuracy of clustering; here, depending on the dataset used, dimension reduction was based on sentiment, semantics, or a combination of both, which likewise increased the quality of the clustering.

Conclusions
In this paper, a new approach for opinion texts clustering was presented. According to previous studies, dimension reduction yields an optimal representation of texts and has a positive effect on the efficiency of machine learning algorithms. Therefore, in this paper, manifold learning was used to reduce dimensions and to extract the intrinsic dimensions of opinion texts. Using the ISOMAP algorithm, one of the global manifold learning algorithms, the dimensions of the texts were first reduced based on semantics and sentiment, and then the texts were clustered using the K-Means algorithm. In the dimension reduction phase, three reduction modes were examined. In the first mode, dimension reduction was performed based on semantics; in the second mode, it was performed based on sentiment; and in the third mode, it was done based on both semantics and sentiment. The simulation results on the three datasets show that the proposed method does not have acceptable performance on structured or noise-free texts, but it improves clustering performance on tweets collected from Twitter, which are unstructured and contain noise. The third mode of dimension reduction, in addition to improving the efficiency of clustering, allows us to cluster opinion texts with high efficiency while simultaneously considering semantics and sentiment, which has received very little attention in previous works. An improvement of about 9% is observed in terms of accuracy, about 4% in F-score, about 10% in ARI, about 1% in NMI, and about 2% in completeness on the third dataset with clustering based on sentiment and semantics. High-precision clustering based on semantics and sentiment helps us to summarize opinion texts with high quality. In future works, progress can be achieved in terms of both dimension reduction algorithms and methods. In the dimension reduction step, other manifold learning algorithms, such as LLE, can be used.
Also, the initial dimension reduction can be done using linear dimension reduction methods or feature selection methods, and then, the final dimension reduction can be performed using manifold learning algorithms.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.