Semantic Analysis Techniques using Twitter Datasets on Big Data: Comparative Analysis Study

This paper conducts a comprehensive review of word and sentence semantic similarity techniques proposed in the literature. Corpus-based, knowledge-based, and feature-based techniques are categorized under word semantic similarity, while string- and set-based, word order-based, POS-based, and syntactic dependency-based techniques are categorized under sentence semantic similarity. Building on these techniques, we propose a model for computing the overall accuracy of a Twitter dataset. The proposed model has been tested with the following four measures: Atish's measure, Li's measure, Mihalcea's measure with path similarity, and Mihalcea's measure with Wu and Palmer's (WuP) similarity. Finally, we evaluate the proposed method on three real-world Twitter datasets. The proposed model based on Atish's measure offers good results on all datasets when compared with the models based on the other sentence similarity measures.


INTRODUCTION
The problem of calculating semantic similarity between two words, texts, sentences, or phrases is a long-standing issue in the field of Natural Language Processing (NLP). Generally, semantic similarity is a metric of the conceptual distance between two terms, based on the closeness of their meanings [1]. Sentence similarity approaches play an increasingly significant role in studies and applications associated with text in several fields such as document clustering, text classification, IR, topic tracking, topic detection, text summarization, machine translation, and so on. Semantic similarity among documents, sentences, phrases, texts, and words is extensively studied in different areas, encompassing NLP, semantic search engines, the semantic web, and Artificial Intelligence (AI). Numerous word-level semantic similarity strategies have been proposed. Jiang and Conrath [7] proposed a metric to measure the semantic similarity among concepts and words, wherein corpus statistical information is combined with lexical taxonomy structure. The Weighted Path (WPath) measure proposed by Zhu [8] combines the two approaches of path length and IC to measure the semantic similarity among words. With respect to the semantic similarity of sentences, Li et al. [9] proposed an approach that takes into consideration the aggregation of semantic similarity and word order similarity in phrases, texts, or sentences. In this measure, the semantic similarity of short text pairs is computed utilizing information from both an organized lexical taxonomy and a corpus. Mihalcea et al. [10] proposed an algorithm to assess the semantic resemblance of sentences using knowledge-based and corpus-based similarity measures. Hliaoutakis et al. [11] proposed an approach for calculating the semantic similarity among medical words utilizing MeSH and among general words utilizing WordNet.
The Semantic Text Similarity (STS) measure, which identifies the similarity between two texts from syntactic and semantic information, was presented by Islam [12]. Ramage [13] presented a measure that combines relatedness information through a random walk over a graph built from WordNet. The Semantic Similarity Based Model (SSBM) introduced by Gad and Kamel [14] calculates semantic similarities by exploiting WordNet. SSBM assigns new semantic weights to document terms and updates the frequency weights by including the semantic similarity values between words.
This paper presents a comprehensive review of various word and sentence semantic similarity techniques proposed in the literature. Corpus-based, knowledge-based, and feature-based techniques are categorized under word semantic similarity, while string- and set-based, word order-based, POS-based, and syntactic dependency-based techniques are categorized under sentence semantic similarity. We then propose a new model for computing the overall accuracy of an entire Twitter dataset, based on the sentence semantic similarity between tweets. The proposed method proceeds in the following steps: first, the semantic similarity between tweets is computed, i.e., each tweet in the dataset is compared against every other tweet in the same dataset, and this process is repeated for all tweets. The overall accuracy of the dataset is then calculated using Equations 29 and 30.
The rest of this paper is organized as follows: Section 2 presents a survey of the literature on semantic similarity measures and a comparison between the different similarities. Section 3 gives a view of the creation of the datasets and Section 4 describes our proposed model. The analysis of experiments and the obtained results are provided in Section 5, and case studies of the experimental results are illustrated in Section 6. Section 7 discusses the results and the final section presents the conclusions.

SEMANTIC SIMILARITY TECHNIQUES
Semantic similarity turns out to be a very complicated problem, and many measures exist for computing word and sentence semantic similarity. Hence, the problem is tackled by considering word similarity and sentence similarity separately. Figure 1 shows the classification of semantic similarity approaches.

Semantic Similarity of Words
The approaches to word-level semantic similarity are explained in the first part of this section. These approaches assign numerical similarity values to terms/words in order to reflect the semantic distance between them. In computational linguistics, semantic relatedness is the reverse of semantic distance: if two words have any sort of semantic relation, then they are semantically related [15][16]. The commonality of two concepts or words is represented by a particular metric known as semantic similarity, which depends on hierarchical relations between concepts [17]. Semantic similarity is a special case of semantic relatedness, which is a more general notion and does not necessarily depend on hierarchical relations [16][17]. Several word similarity methods have already been reported in the literature, ranging from distance methods calculated over semantic networks to measures based on distributional similarity models learned from corpora. In this context, we concentrate on corpus-based approaches, knowledge-based approaches [8], and feature-based approaches. Corpus-based approaches depend primarily on the contextual information of words appearing within a corpus, and mainly evaluate the semantic relatedness among words. In knowledge-based measures, the similarity between words is derived from WordNet hierarchy relations. Feature-based approaches take into consideration the features or characteristics that are common to both terms; the measure of resemblance between two terms is described as a function of their characteristics.

Corpus-Based Methods
Corpus-based word semantic similarity measures depend on word associations that determine the degrees of similarity among words learned from big corpora [17]. These measures are calculated from word co-occurrence and word distribution statistics. It is presumed that two words are more similar if their adjacent contexts are very similar or if they appear together frequently. There are several count-based approaches built on various computational models, such as Point-wise Mutual Information (PMI) [18][19] and Latent Semantic Analysis (LSA) [20]. Prediction-based approaches such as Word2Vec [21] generate high-quality, continuous dense vector representations of words by anticipating a word from its adjacent context. Count-based approaches enumerate word co-occurrences and build a word-word matrix whose statistics are used directly with probabilistic models [18], dimension reduction [22], and matrix factorization [23]. The Continuous Bag of Words (CBOW) approach, as proposed by the authors of Word2Vec [21], is computationally more effective and therefore more appropriate for bigger corpora than the skip-gram approach. The CBOW approach trains word vectors in a Neural Network (NN) comprising three layers, viz. an input, a projection, and an output layer, to predict a word from the words adjacent to it. Two measures, namely LSA [19] and PMI-IR [20], are described in the following sections.
• Latent Semantic Analysis (LSA) method
The LSA method suggested by Landauer [19] is another corpus-based method of semantic similarity. In LSA, the similarity of paragraph meaning is identified by analyzing a large volume of corpora. Term co-occurrences in a corpus are captured through dimensionality-reduction approaches applied to the term-by-document matrix T that represents the corpus, using Singular Value Decomposition (SVD). SVD reduces the dimensionality of the matrix while preserving the relationships among words.
• Point-wise Mutual Information-Information Retrieval (PMI-IR) method
This method was proposed by Turney [20] as a straightforward unsupervised learning metric to recognize synonyms and assess semantic resemblance among words. To compute the similarity of word pairs, the PMI-IR method combines the familiar semantic similarity metric PMI with IR. The measure is based on the co-occurrence of words in enormous collections of documents indexed in very large corpora, such as those behind modern web search engines. Given two words word_i and word_j, their PMI-IR is evaluated as given in Equation 1.
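To make the formula concrete, PMI can be computed directly from raw co-occurrence counts. The sketch below is a minimal Python illustration with made-up counts, not the paper's implementation; the helper name pmi and the toy numbers are our own.

```python
import math

def pmi(count_ij, count_i, count_j, total):
    """Point-wise mutual information from raw counts.

    count_ij: documents (or windows) containing both words,
    count_i / count_j: documents containing each word,
    total: total number of documents.
    """
    p_ij = count_ij / total
    p_i = count_i / total
    p_j = count_j / total
    # log2(p(i,j) / (p(i) * p(j))): positive when the words
    # co-occur more often than chance.
    return math.log2(p_ij / (p_i * p_j))

# Toy example: "car" in 100 of 10_000 docs, "automobile" in 80, both in 40.
score = pmi(40, 100, 80, 10_000)
```

A score well above zero, as here, indicates the pair co-occurs far more often than independence would predict.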

Knowledge-Based Methods
A number of methods calculate the semantic similarity among terms/words depending on an ontology, and these methods have been improved in order to identify how closely two word meanings are related utilizing information obtained from semantic networks [10]. If two words are placed closer together in a given ontology, they are considered to be more similar. We present several measures that operate efficiently on the WordNet hierarchy. The lexical database WordNet [24] is the most prevalent semantic network for calibrating knowledge-based approaches among words. It is used as the background ontology, classifying words into sets of synonyms known as synsets. Each synset is a collection of words that share a common sense (synonyms). Synsets are connected both by conceptual-semantic and by lexical relations. WordNet is organized into a concept taxonomy by the hierarchy of relationships between synsets (i.e., hyponymy and hypernymy). All these measures take a pair of concepts Con_i and Con_j as input and yield a value that shows their semantic relatedness. The following approaches were chosen based on their results in other language processing applications and their comparatively high computational effectiveness. A brief description of each of these measures follows.
(1) Path length measures There are several measures of semantic similarity based on path length. This section gives a brief overview of path length-based semantic similarity measures and surveys their respective merits and demerits.
• Shortest path method
Several knowledge-based approaches for calibrating similarity among concepts in the lexical database WordNet have been provided in the literature [25]. One of these approaches, the shortest path method, is a simple metric over hierarchical semantic networks. The fundamental idea of this measure is to count the number of edges between two concepts (synsets) in WordNet: if the two concepts are close to each other in WordNet, then they are likely to be more similar. Let Con_i and Con_j be two concepts and path(Con_i, Con_j) the shortest path between them. In the shortest path method [2], the semantic similarity measure Sim_path is formulated as given in Equation 2.
Sim_path(Con_i, Con_j) = 1 / (1 + path(Con_i, Con_j))    (2)

• The Leacock & Chodorow method
Leacock and Chodorow [3] suggested a semantic similarity metric, namely LCh, to compute the semantic similarity between two given concepts Con_i and Con_j in the lexical database WordNet. This approach gives a score that shows how similar two words/concepts are, based on the shortest path linking them and the maximum depth of the taxonomy in which they occur. The LCh measure is formulated as given in Equation 3.
where MaxDepth(Con) is the maximum taxonomy depth with Con ∈ WordNet, and Length(Con_i, Con_j) is the shortest path length between Con_i and Con_j using node counting.
• WuP method The WuP metric was presented by Wu and Palmer [4]; it computes the semantic similarity between two concepts on the basis of the depth of the WordNet taxonomy. The method takes into consideration the positions of the concepts Con_i and Con_j in the taxonomy relative to the position of LCS(Con_i, Con_j), the most specific ancestor concept shared by Con_i and Con_j. The measure combines the LCS and depth to produce a similarity score and is expressed as in Equation 4.
Sim_wup(Con_i, Con_j) = 2 · depth(LCS(Con_i, Con_j)) / (depth(Con_i) + depth(Con_j))    (4)

where LCS(Con_i, Con_j) is the least common subsumer of concepts Con_i and Con_j, and depth(Con_i) is the length of the path from Con_i to C_root, where C_root is the root concept of the taxonomy.
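The two path-based formulas above can be illustrated on a toy IS-A taxonomy. The hierarchy, helper names, and depth convention (root at depth 1, via node counting) below are our own assumptions for illustration, not WordNet itself:

```python
# Toy IS-A taxonomy: each concept maps to its parent (root maps to None).
parent = {
    "entity": None,
    "object": "entity",
    "animal": "object",
    "dog": "animal",
    "cat": "animal",
}

def ancestors(con):
    """Chain from a concept up to the root, inclusive, deepest first."""
    chain = []
    while con is not None:
        chain.append(con)
        con = parent[con]
    return chain

def depth(con):
    # Node counting: the root is at depth 1.
    return len(ancestors(con))

def lcs(con_i, con_j):
    """Least common subsumer: the deepest shared ancestor."""
    anc_i = set(ancestors(con_i))
    for con in ancestors(con_j):  # iterated deepest-first
        if con in anc_i:
            return con

def path_len(con_i, con_j):
    """Edge count between two concepts via their LCS."""
    shared = lcs(con_i, con_j)
    return (depth(con_i) - depth(shared)) + (depth(con_j) - depth(shared))

def sim_path(con_i, con_j):
    # Equation 2: inverse of the (offset) shortest path.
    return 1 / (1 + path_len(con_i, con_j))

def sim_wup(con_i, con_j):
    # Equation 4: shared depth relative to the concepts' own depths.
    return 2 * depth(lcs(con_i, con_j)) / (depth(con_i) + depth(con_j))
```

For "dog" and "cat" the LCS is "animal", giving a path length of 2, so the two measures reward both closeness and the depth at which the pair meets.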

• Li similarity method
Li et al. [5] suggested a similarity metric to calculate sentence similarity by integrating a semantic vector and word order. At the word level, this measure combines the Shortest Path (SP) between the concepts Con_i and Con_j and the depth of their subsumer in the taxonomy in a non-linear function. The metric is formulated as in Equation 5.
where SP is the shortest path between the concepts Con_i and Con_j, H is the depth of the subsumer, and β > 0 and α ≥ 0 are scaling parameters for the depth and the SP, respectively. The measure's score therefore lies between 0 and 1 (1 for identical concepts). The optimal parameters for this measure are β = 0.6 and α = 0.2 based on [5].
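Li's word-level score can be sketched directly from this description: the non-linear combination is an exponential decay in the path length multiplied by a depth factor of the form (e^(βH) − e^(−βH)) / (e^(βH) + e^(−βH)), which is simply tanh(βH). The function name is our own, and the defaults use the parameter values quoted above:

```python
import math

def sim_li(sp, h, alpha=0.2, beta=0.6):
    """Li et al.-style word similarity: exponential path decay
    scaled by a tanh of the subsumer depth."""
    path_term = math.exp(-alpha * sp)
    # tanh(beta*h) == (e^{beta*h} - e^{-beta*h}) / (e^{beta*h} + e^{-beta*h})
    depth_term = math.tanh(beta * h)
    return path_term * depth_term
```

With a zero path length and a deep subsumer the score approaches 1; longer paths or shallow subsumers drive it toward 0.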
(2) Information content-based methods Information content-based methods associate probabilities with concepts in an ontology; IC is formulated as in Equation 7. The different IC-based semantic similarity measures are described below.

• Resnik similarity method
The Resnik similarity method proposed by Resnik [6] computes relatedness by taking into account the depth of the two concepts in WordNet and the depth of their LCS. The result is a score that denotes how similar two word senses are. IC is estimated from the frequency of concepts in a text corpus: in WordNet, the count associated with a concept is incremented every time the concept is encountered, as are the counts of its ancestor concepts in the WordNet hierarchy (for verbs and nouns). IC can be calculated in WordNet only for verbs and nouns, as these are the only POS whose concepts are structured in hierarchies. Thus, the semantic similarity of the two concepts Con_i and Con_j can be expressed as given in Equation 6.
IC(Con) = − log(p(Con))    (7)
where IC(Con) is the amount of information the concept carries in a corpus of text, Con refers to a concept in WordNet, and p(Con) refers to the likelihood of encountering the concept Con in a large-scale corpus.
• Lin similarity method This measure, suggested by Lin [1], relies on Resnik's similarity measure and takes into consideration the IC shared by the two concepts Con_i and Con_j. The Lin measure sought a similarity measure that is both theoretically justified and universal: it relates the amount of information necessary to describe the commonality of the two concepts to the information required to completely describe each of them. The proposed metric is defined as given in Equation 8.
• Jiang and Conrath similarity method Resnik's measure [6] considers only the IC of the LCS concept. To address this shortcoming, Jiang and Conrath's method [7] bases the similarity measure on both corpus statistics and taxonomic links (a hierarchical ontology). The measure computes a semantic distance and derives the semantic similarity from it: the semantic distance between two given concepts Con_i and Con_j is defined as the difference between the sum of the IC of the two concepts and the IC of their LCS. The semantic similarity is the inverse of this semantic distance, as expressed in Equation 9.
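The three IC-based measures can be compared on a handful of hypothetical corpus probabilities. Everything here is illustrative: the probability table, the function names, and the toy concepts are our own, and real systems estimate p(Con) from corpus counts over the WordNet hierarchy.

```python
import math

# Hypothetical corpus probabilities for a few WordNet-like concepts.
prob = {"entity": 1.0, "animal": 0.2, "dog": 0.02, "cat": 0.03}

def ic(con):
    """Information content, Equation 7: IC(Con) = -log p(Con)."""
    return -math.log(prob[con])

def sim_resnik(con_i, con_j, lcs):
    # Resnik: similarity is the IC of the least common subsumer.
    return ic(lcs)

def sim_lin(con_i, con_j, lcs):
    # Lin: shared information relative to each concept's own description.
    return 2 * ic(lcs) / (ic(con_i) + ic(con_j))

def dist_jcn(con_i, con_j, lcs):
    # Jiang-Conrath semantic distance; the similarity is its inverse.
    return ic(con_i) + ic(con_j) - 2 * ic(lcs)
```

Note that the root concept has IC 0, so pairs meeting only at the root score 0 under Resnik and Lin, which is exactly the intended behavior.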
• Weighted Path (WPath) semantic similarity method The WPath measure was developed by Zhu [8]; it incorporates both path length and IC to calibrate the semantic similarity among concepts. The main idea is that the path length between concepts represents their difference, while the IC of their LCS captures their commonality. The measure is formulated as given in Equation 11.
where p ∈ (0, 1]. The variable p controls the contribution of the LCS's IC, which represents the information shared between the concepts Con_i and Con_j.
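A minimal sketch of the WPath idea, assuming the commonly cited form in which the path length is discounted by p raised to the IC of the LCS; the function name and default parameter are our own:

```python
def sim_wpath(path_len, ic_lcs, p=0.8):
    """WPath-style similarity: the path length is weighted by p^IC(LCS),
    so concept pairs sharing an informative ancestor score higher.
    p in (0, 1] controls the contribution of the LCS's IC."""
    return 1 / (1 + path_len * p ** ic_lcs)
```

With p = 1 the weighting vanishes and the score reduces to the plain shortest path measure of Equation 2; smaller p rewards pairs whose LCS is informative.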

(3) Feature-based and Hybrid methods
Several methods have been developed for calculating the similarity among words. The existing work can roughly be classified into two major groups, namely: (i) feature-based methods and (ii) hybrid methods. A brief overview of each of these semantic similarity methods is presented in the following paragraphs.

(3.1) Feature-based methods
Feature-based approaches take into consideration the features or properties that are common to both terms, as well as the distinguishing properties of each term. The measure of resemblance between two terms is described as a function of their characteristics. Several feature-based measures of semantic similarity have been proposed; a brief description of each is as follows.
• Tversky method
The Tversky approach [26] takes into account the features/properties of concepts in order to compute the resemblance among diverse concepts, although the position of the concepts in the taxonomy and their information content are disregarded. Each concept should be defined by a set of words that indicate its characteristics. In this approach, the similarity between two concepts Con_i and Con_j increases with the features they share and diminishes with the differences between them. The disadvantage of this measure is that it cannot work properly if a feature set is incomplete. The measure is expressed as in Equation 12.
Sim_Tvsk(Con_i, Con_j) = |Con_i ∩ Con_j| / (|Con_i ∩ Con_j| + α·|Con_i \ Con_j| + (1 − α)·|Con_j \ Con_i|)    (12)

where the value of α is adjustable with α ∈ [0, 1], and Con_i and Con_j here denote the description (feature) sets of the two concepts, respectively.
• Basic feature method The basic feature-based approach supposes that each concept is defined by a set of words that indicate its features or properties. The more features two concepts share, the more similar they are considered to be [27]. According to [28, 29], the metric is formulated as in Equation 13.
where Ans(Con_i) and Ans(Con_j) denote the description sets (reachable ancestor nodes) of the concepts Con_i and Con_j, respectively; Ans(Con_i) ∪ Ans(Con_j) represents the union of the two sets, and Ans(Con_i) ∩ Ans(Con_j) the nodes reachable from both.
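Both set-based formulas are straightforward over Python sets. The feature sets below are invented for illustration; sim_tversky follows the ratio form with α ∈ [0, 1], and sim_feature the shared-over-all-ancestors form of Equation 13:

```python
def sim_tversky(feats_i, feats_j, alpha=0.5):
    """Tversky ratio model over concept feature sets (alpha in [0, 1])."""
    common = len(feats_i & feats_j)
    only_i = len(feats_i - feats_j)
    only_j = len(feats_j - feats_i)
    return common / (common + alpha * only_i + (1 - alpha) * only_j)

def sim_feature(ans_i, ans_j):
    """Basic feature measure: shared features over all features (Jaccard-like)."""
    return len(ans_i & ans_j) / len(ans_i | ans_j)

# Invented feature sets for two concepts.
dog = {"alive", "four-legged", "barks", "domestic"}
cat = {"alive", "four-legged", "meows", "domestic"}
```

With α = 0.5 the Tversky score weights the two difference sets equally; raising α penalizes features unique to the first concept more heavily.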

• Feature-based similarity using Wikipedia
A feature-based similarity model fully based on Wikipedia for calculating the semantic similarity among concepts was proposed by Jiang [30]. The features/characteristics were chosen according to the organization of Wikipedia pages. In this model, the authors first presented a formal representation of Wikipedia concepts. A feature-based similarity model built on this formal representation of Wikipedia concepts was then provided. Finally, a variety of feature-based semantic similarity methods arising from instantiations of the model were investigated.

(3.2) Hybrid methods
Several hybrid methods have already been presented to measure similarity between words/concepts. A brief overview of each of these semantic similarity methods is presented in the following paragraphs.

• Knappe method
The Knappe method [28] proposes a similarity measure using the specifications of the two compared concepts Con_i and Con_j together with their generalization information. The metric depends mainly on the multiple routes/paths joining the two given concepts. The proposed metric is defined as in Equation 14.
where p ∈ (0, 1] defines the degree of influence of generalizations. The Knappe measure scores between 0 and 1, and Ans(Con_i) and Ans(Con_j) correspond to the description sets of the concepts Con_i and Con_j, respectively; Ans(Con_i) ∩ Ans(Con_j) is the intersection of the two sets of parent nodes.
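A sketch of the Knappe idea, assuming the common form in which the shared generalizations are weighted against each concept's own ancestor set by p; the ancestor sets and function name are invented for illustration:

```python
def sim_knappe(ans_i, ans_j, p=0.5):
    """Knappe-style measure: the shared generalizations weighted against
    each concept's own ancestor set; p in (0, 1] sets the balance."""
    common = len(ans_i & ans_j)
    return p * common / len(ans_i) + (1 - p) * common / len(ans_j)

# Invented ancestor (generalization) sets for two concepts.
ans_dog = {"entity", "object", "animal", "dog"}
ans_cat = {"entity", "object", "animal", "cat"}
```

The score is 1 only when the two ancestor sets coincide, and p shifts the emphasis between how much of each concept's own description is shared.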

• Zhou method
The Zhou method was proposed by Zhou [31] to evaluate semantic similarity in a taxonomy; it takes into account both path-based and IC-based measures between two concepts. The relative weight of the two metrics can be adjusted manually. The method is expressed as given in Equation 15.
where LCS(Con_i, Con_j) is the least common subsumer of the concepts Con_i and Con_j. From Equation 15, it can be observed that both the path measure and the IC measure are taken into account when calculating similarity. The variable k is a weight factor that must be tuned manually to obtain good results.

Comparison between different Methods
Table 1 compares all the word similarity measures, which can be grouped into two kinds: knowledge-based methods, and feature- and hybrid-based methods.

Sentence Semantic Similarity Measures
The meaning of a sentence is reflected by the words it contains [9]. The literature presents various measures that can estimate the similarity between short texts; they are classified into syntactic-based measures, semantic-based measures, and hybrid measures. The following paragraphs present a brief overview of sentence semantic similarity measures and also highlight their advantages and disadvantages.

Word Order-Based Similarity
Word order similarity evaluates the similarity of sentences based on the order of words. Usually, two sentences are considered similar or identical if the same words appear in both sentences in the same order. There are several word order-based measures of sentence semantic similarity. A brief description of each of these measures follows.
• Li sentence similarity measure A sentence similarity measure was proposed by Li [9] which takes into account the aggregation of a semantic vector and word order similarity. This metric is used to calibrate the semantic similarity among very short texts/sentences. The proposed method uses all the distinct words in the two short texts or sentences to dynamically create a joint word set. The semantic similarity between the two short texts is computed for each sentence by utilizing information from the lexical database WordNet [24] and a corpus. An order vector is also created for each sentence. It is to be noted that each word in a sentence contributes differently to the interpretation and meaning of the entire sentence/text. The importance of a word is scaled through the IC obtained from a corpus. A semantic vector for each of the two sentences is derived by aggregating the IC from the corpus with a raw semantic vector. The two order vectors are used to compute the word order similarity, while the semantic similarity computation depends on the two semantic vectors. Lastly, the overall similarity of the sentence pair is formulated as an aggregation of the word order similarity and the semantic similarity, and the metric is expressed as given in Equation 16 below.
where δ ∈ (0.5, 1] determines the relative contribution of semantic and word order information, and T_i and T_j refer to the pair of sentences/texts. In Equation 16, S_s refers to the semantic similarity between T_i and T_j, defined as the cosine similarity between the pair of vectors S_i and S_j, and S_r refers to the word order similarity measure. S_i and S_j in Equation 17 refer to the lexical-semantic vectors of the two sentences T_i and T_j, respectively, derived from the joint word set; r_i and r_j refer to their word order vectors.
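The overall combination in Equation 16 can be sketched as follows. The cosine for S_s is standard; for S_r we assume the usual norm-ratio form S_r = 1 − ||r_i − r_j|| / ||r_i + r_j||, and δ = 0.85 is an assumed default within the stated range (0.5, 1]:

```python
import math

def cosine(v_i, v_j):
    """Cosine similarity between two equal-length vectors (S_s)."""
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm = math.sqrt(sum(a * a for a in v_i)) * math.sqrt(sum(b * b for b in v_j))
    return dot / norm

def word_order_sim(r_i, r_j):
    """Assumed word order similarity: S_r = 1 - ||r_i - r_j|| / ||r_i + r_j||."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r_i, r_j)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r_i, r_j)))
    return 1 - diff / total

def sentence_sim(s_i, s_j, r_i, r_j, delta=0.85):
    """Overall similarity, Equation 16: delta * S_s + (1 - delta) * S_r."""
    return delta * cosine(s_i, s_j) + (1 - delta) * word_order_sim(r_i, r_j)
```

Identical semantic and order vectors give a score of 1; δ close to 1 makes the semantic component dominate, as the constraint δ > 0.5 requires.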

• Semantic Text Similarity (STS) Measure
Islam [12] proposed the STS measure, which identifies the similarity between two sentences or texts using syntactic and semantic information. In order to produce a more general text or sentence similarity measure, three similarity functions are taken into account. Firstly, the string similarity is calculated utilizing a modified version of the longest common subsequence string-matching approach. Secondly, the semantic word similarity is computed, after which the authors utilize common-word order similarity to integrate syntactic information into the suggested approach. In the end, the sentence/text similarity is obtained by combining the three functions, string similarity, semantic similarity, and common-word order similarity, with normalization. The proposed STS method obtained a very good Pearson correlation on thirty sentence pairs and was found to surpass the results achieved by Li [9].
• Atish Pawar sentence similarity measure Atish [32] proposed an approach for computing the semantic resemblance between two paragraphs, sentences, or words. Initially, this measure filters and disambiguates the two input texts and tags them with their POS. The method for calculating the semantic resemblance between two texts is split into three components, namely word resemblance, sentence resemblance, and word order resemblance. The similarity among words is computed with an edge-based method, using a lexical database to compare word senses. For each sentence, a semantic vector containing the word similarities is created and is utilized to calculate the sentence similarity. In order to capture the effect of the sentence's syntactic structure, word order vectors are also established. Using the two semantic vectors, the semantic similarity is calculated. Finally, the overall semantic similarity is computed based on Equation 18.
where S is the magnitude of the normalized vectors, given as S = V_i · V_j; ζ is a variable given as in Equation 26; and C_i and C_j are the numbers of valid elements in V_i and V_j, respectively. Sum(C_i, C_j) is the summation of C_i and C_j. In order to restrict the similarity value to the range of 0 to 1, γ is set to 1.8.

String or Set-Based Measures
The literature presents various string- or set-based measures of sentence semantic similarity. The following paragraphs present a brief overview of string-based sentence semantic similarity methods.

• Mihalcea Text Semantic Similarity Measure
Mihalcea et al. [10] suggested an approach for assessing the semantic resemblance of texts or sentences utilizing the similarities of Knowledge-based approaches and corpus-based approaches. The proposed approach calculates the overall sentence semantic similarity between the two input texts T i , T j and is formulated as given in the following Equation 20.
where T_i and T_j are the two input sentences, IDF stands for the Inverse Document Frequency used to define the specificity of words, and MaxSim(word, T_j) is the highest semantic similarity obtained by comparing a word of text T_i against every word of sentence T_j, computed using either the Wu and Palmer WordNet similarity or the path similarity measure.
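A sketch of the symmetric idf-weighted aggregation behind Equation 20; the word-level similarity function and the idf table are supplied by the caller, so the exact-match scorer and uniform idf weights below are purely illustrative:

```python
def mihalcea_sim(t_i, t_j, word_sim, idf):
    """Symmetric idf-weighted aggregation of best word matches:
    average of the two directed text-to-text similarities."""
    def directed(src, dst):
        # Each source word contributes its best match in the other
        # sentence, weighted by its idf.
        num = sum(max(word_sim(w, v) for v in dst) * idf[w] for w in src)
        den = sum(idf[w] for w in src)
        return num / den
    return 0.5 * (directed(t_i, t_j) + directed(t_j, t_i))

# Toy example: exact-match word similarity and uniform idf weights.
idf = {"dog": 1.0, "barks": 1.0, "meows": 1.0}
word_sim = lambda w, v: 1.0 if w == v else 0.0
score = mihalcea_sim(["dog", "barks"], ["dog", "meows"], word_sim, idf)
```

In practice word_sim would be a WordNet measure such as WuP or path similarity, and idf would come from document statistics.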

• Using Word Sense Disambiguation(WSD)
A new measure was proposed by Abdalgader [33] which uses WSD to measure sentence resemblance. In this approach, each word is linked to a WordNet sense in a pre-processing stage. A unit vector that includes all the words of both sentences is created.
The original word set of each sentence is extended using WordNet synonyms, after which a vector representation is created for each sentence. The components of this vector are computed based on the resemblance between the extended words of that sentence and the unit vector. Lastly, the cosine similarity of the two vectors gives the sentence semantic similarity.

Part-Of-Speech (POS) Similarity Measures
The preceding paragraphs presented details related to string/set-based measures. Measures that adopt POS tags to compute similarity between sentences are overviewed in the following paragraphs.

• Features-Based Measure of Sentence Semantic Similarity (FM3S)
In order to calibrate the semantic similarity of two sentences, FM3S was suggested by Taieb [34]. The FM3S measure depends on the combination of the following three constituents: verb-based semantic similarity utilizing tense information, noun-based semantic similarity including compound nouns, and common-word order similarity, aggregated in a non-linear manner through a tuning parameter α ∈ [0, 1]. The measure uses an IC-based quantification technique [35] in combination with Lin's method [1] and the WordNet taxonomy to assess the degree of semantic similarity among words. The FM3S measure is defined as given in Equation 21.
where T_i and T_j are the two input sentences and α ∈ [0, 1] is a parameter used to weight each component's contribution to the final result. In Equation 21, SS_Nouns(T_i, T_j) is the noun semantic similarity function applied to the sentences T_i and T_j, SS_Verbs(T_i, T_j) the verb semantic similarity function, and SS_Cwo(T_i, T_j) the common-word order similarity function. The proposed measure produced competitive outcomes when compared to previous measures on Li's benchmark [9].

• Part-Of-Speech Tags Short-Text Semantic Similarity (POST-STSS) Measure
A new measure, namely POST-STSS, was proposed by Vuk Batanović [36] to calculate the semantic resemblance of short texts, in which POS tags are utilized as indicators of the deeper syntactic knowledge generally obtained with more advanced tools such as semantic role labelers and parsers. The proposed model includes a POS tag weighting scheme and builds on the bag-of-words (BOW) model. The POST-STSS measure needs neither advanced syntactic tools nor hand-crafted knowledge bases, making it more easily applicable to languages with scarce NLP resources. The authors concluded that the proposed method yields higher accuracy when compared to other methods that utilize advanced syntax-processing tools.

• Aggarwal Measure
A new measure was proposed by Aggarwal [37] to compute the semantic resemblance among sentences. This measure integrates knowledge-based semantic similarity scores with a corpus-based semantic relatedness measure computed over the whole sentence, obtained for those words falling under the same syntactic roles in both sentences. All these scores are fed as features to ML models, such as bagging models and linear regression, to obtain a single score representing the degree of similarity among the sentences.

Syntactic Dependency-Based Similarity
The different measures of sentence semantic similarity based on syntactic dependency are described as follows:
• Syntax-Based Measure for Semantic Similarity (SyMSS) Oliva et al. [38] proposed the SyMSS measure to calculate sentence semantic similarity. This measure considers the meaning and structure of a sentence, with the sentence meaning composed from the meanings of its individual words. A deep syntactic analysis of each text is performed with a dependency parser, and the semantic information is obtained from the lexical database WordNet. With this syntactic analysis, SyMSS measures the semantic similarity between words that fill the same syntactic role. The SyMSS measure is defined as given in Equation 22.
where T_i is a sentence/text consisting of n phrases whose heads are h_i1, ..., h_in, and T_j is a sentence/text consisting of n phrases whose heads are h_j1, ..., h_jn; the phrases headed by h_ik and h_jk have the same syntactic function. L refers to the syntactic roles that are present in only one of the sentences. If one sentence contains a phrase that is not shared by the other, a penalization factor (PF) is introduced to reflect the fact that one of the sentences carries an extra piece of information.
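A simplified reading of this scheme can be sketched as follows: average a word-level similarity over heads sharing the same syntactic function, and penalize each role present in only one sentence. The PF value and the head representation are illustrative assumptions, not the exact SyMSS formulation.

```python
# Simplified SyMSS-style score: compare phrase heads with the same syntactic
# function and subtract a penalization factor (PF) per unmatched role.
# The PF value is illustrative.

def symss(heads_a, heads_b, word_sim, pf=0.1):
    """heads_a/heads_b: dicts mapping syntactic function -> head word."""
    shared = heads_a.keys() & heads_b.keys()
    unmatched = (heads_a.keys() | heads_b.keys()) - shared
    if not shared:
        return 0.0
    # average similarity of heads filling the same syntactic function
    base = sum(word_sim(heads_a[r], heads_b[r]) for r in shared) / len(shared)
    # penalize extra information carried by only one of the sentences
    return max(0.0, base - pf * len(unmatched))
```

In the full measure the word-level similarity would come from a WordNet-based metric rather than the toy function used in the test below.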
• Dan Measure
Dan et al. [39] proposed a method to evaluate the semantic similarity between sentences based on the assumption that the meaning of a sentence is captured by its syntactic constituents and the dependencies between them. A syntactic parser was used to obtain both the constituents and their dependencies. The method assumes that two sentences have the same meaning if there is a strong mapping between their chunks and if the chunk dependencies in one text are preserved in the other. The measure considers every chunk to have its own importance with respect to the overall meaning of the sentence, calculated from the information content of the words in the chunk. This measure is expressed as given in Equation 23.
where T_i and T_j are the sets of chunks in the first and second sentences respectively, and W_k(t_i, t_j) are the similarity scores computed between chunks in T_i and chunks in T_j. All calculations are carried out recursively by applying Equation 23, following Rus and Lintean's approach.

• Wali Wafa Sentence Similarity Measure [40]
Wali W. et al. [40] presented a generic hybrid measure that improves sentence similarity estimation by applying semantic and syntactico-semantic knowledge, taking advantage of the standardized Lexical Markup Framework (LMF) dictionary [41]. The method comprises three phases: first, the sentence pairs are preprocessed; second, the lexical, semantic, and syntactico-semantic similarity scores are computed; finally, the overall score is calculated using supervised learning. This measure is expressed as given in Equation 24.
where the parameters α, β, γ are the weights attributed to lexical similarity, semantic similarity, and syntactico-semantic similarity respectively, and C is a constant. SimLex is the lexical similarity between sentences, which uses the Jaccard coefficient and is defined as SimLex(T_i, T_j) = MC / (MS1 + MS2 − MC). SemSM is the semantic similarity score, computed as the cosine similarity between V_i and V_j, the semantic vectors of sentences T_i and T_j respectively. SSM is the syntactico-semantic degree between sentences T_i and T_j: SSM(T_i, T_j) = ASC / (ASS1 + ASS2 − ASC), where ASS1 and ASS2 are the counts of semantic parameters included in sentences T_i and T_j respectively, while ASC is the count of semantic parameters shared between the two sentences.
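The three component scores and their weighted aggregation can be sketched as follows; α, β, γ and C would be tuned by the supervised learning step, so the values here are illustrative only.

```python
import math

# Sketch of the three component scores in the hybrid measure and their
# weighted aggregation. Weights alpha/beta/gamma and the constant c are
# illustrative placeholders for the learned parameters.

def sim_lex(words_a, words_b):
    # Jaccard coefficient: MC / (MS1 + MS2 - MC)
    mc = len(words_a & words_b)
    denom = len(words_a) + len(words_b) - mc
    return mc / denom if denom else 0.0

def sem_sm(v_i, v_j):
    # cosine similarity between the sentences' semantic vectors
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm = math.sqrt(sum(a * a for a in v_i)) * math.sqrt(sum(b * b for b in v_j))
    return dot / norm if norm else 0.0

def sim_wali(words_a, words_b, v_i, v_j, ssm,
             alpha=0.4, beta=0.4, gamma=0.2, c=0.0):
    # linear aggregation of lexical, semantic and syntactico-semantic scores
    return alpha * sim_lex(words_a, words_b) + beta * sem_sm(v_i, v_j) + gamma * ssm + c
```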

• WaliWafa Sentence Semantic Similarity [42]
A new measure, namely Sim_Wali(T_i, T_j), was introduced by Wali [42] to determine the semantic resemblance between two sentences/texts T_i and T_j. This measure aggregates the following three components in a linear function: lexical similarity, semantic similarity, and syntactico-semantic similarity. It is formulated as shown in Equation 25.
where A refers to the lexical similarity function between sentences T_i and T_j, namely LexSim(T_i, T_j), formulated as in Equation 26; B refers to the semantic similarity between the two sentences, namely SemSim(T_i, T_j), computed using the cosine similarity as in Equation 27; and C is the syntactico-semantic function between the two sentences, namely SynSemSim(T_i, T_j), determined as in Equation 28. The parameters α, β, and γ refer to the weights attributed to lexical similarity, semantic similarity, and syntactico-semantic similarity respectively. More details of this measure are given in [34].

DATA COLLECTION Dataset
To show the differences and effects of the proposed model under the different semantic similarity measures, the following three datasets were collected to verify the model. The Twitter Streaming API was used to collect the datasets, whose key information is presented in Table 2. Only English tweets were retained for analysis: tweets were filtered in the second stage of data cleansing to exclude all non-English tweets from each dataset.
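The second cleansing stage can be sketched as a simple filter on the `lang` field that the Twitter API attaches to each tweet object; the sample tweets are invented for illustration.

```python
# Filter out non-English tweets using the `lang` field of each tweet object
# returned by the Twitter Streaming API.

def keep_english(tweets):
    """tweets: iterable of tweet dicts as returned by the Streaming API."""
    return [t for t in tweets if t.get("lang") == "en"]

sample = [{"text": "Plane crash near Addis Ababa", "lang": "en"},
          {"text": "تحطم طائرة", "lang": "ar"}]
english = keep_english(sample)  # keeps only the first tweet
```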

PROPOSED MODEL
It is very important to compute the accuracy of the whole dataset in the proposed model. Therefore, a new model for computing the overall accuracy of the whole twitter dataset, based on the sentence semantic similarity between tweets, has been proposed. This model consists of two formulas, namely Accuracy semantic similarity(1) and Accuracy semantic similarity(2), as given in Equations 29 and 30 respectively. The model has been developed as per the following steps. First, the semantic similarity between tweets is computed: the semantic similarity of each tweet in the dataset with every other tweet in the same dataset is calculated, and the process continues over all tweets of the dataset. Second, the overall accuracy of the dataset is calculated using Equations 29 and 30. Table 3 presents the formulation of a dataset consisting of n tweets as a proximity matrix, where T_1 is the 1st tweet in the dataset, T_n is the n-th (last) tweet, and SS_Tij is the semantic similarity between the i-th and j-th tweets. The proximity matrix thus formed is presented in Table 3.
To compute the overall accuracy of each dataset, the semantic similarity among tweets/sentences can be calculated using any sentence semantic similarity measure [9, 10, 12, 32-34, 40, 42] together with word similarity measures [1][2][3][4][5][6][7]. This implies that the semantic similarity of each tweet with every other tweet has to be computed separately. For example, if a dataset consists of n tweets, the semantic similarity of each tweet with all other tweets in the dataset has to be calculated, so that the number of semantic similarity computations for the whole dataset is n × n. We propose two formulas, given in Equations 29 and 30, to calculate the accuracy of the whole dataset. The first formula of the proposed model is given in Equation 29.
A_SS(1) = Accuracy semantic_similarity(1), where Semantic_Similarity(T_i, T_j) is the sentence semantic similarity between the i-th and j-th tweets in the dataset, and n refers to the number of tweets in the dataset. The second formula is defined as given in Equation 30.
A_SS(2) = Accuracy semantic similarity(2). The computation time of Accuracy semantic similarity(2) is greater than that of Accuracy semantic similarity(1).
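The n × n computation described above can be sketched as follows, under one plausible reading of Equation 29 in which the overall accuracy is the mean of all pairwise sentence-similarity scores; since the exact forms of Equations 29 and 30 are not reproduced here, this is an illustrative sketch only.

```python
# Illustrative overall-accuracy computation: average the n * n pairwise
# sentence-similarity scores over the whole dataset (assumed reading of Eq. 29).

def overall_accuracy(tweets, sim):
    """tweets: list of n tweet strings; sim: sentence-similarity function."""
    n = len(tweets)
    # n * n similarity computations, each tweet against every tweet
    total = sum(sim(ti, tj) for ti in tweets for tj in tweets)
    return total / (n * n) if n else 0.0
```

Any of the sentence similarity measures surveyed above can be plugged in as `sim`; the test below uses a toy exact-match similarity.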

EXPERIMENTS AND ANALYSIS
This section presents details of the experiments conducted to evaluate the performance and accuracy of the four methods on three twitter datasets. These methods have been used to compute the overall semantic similarity and overall accuracy of each dataset.

Experimental Setup
This section presents details of the datasets, the software tools and packages utilized, and the software and hardware particulars of the system. All experiments were implemented in Python 3.7.1.

Experimental Results on SR_DS2019 Dataset
Four hundred and forty-one tweets about the Sudanese Revolution (the SR_DS2019 dataset) were collected, as mentioned earlier, to conduct the experiments. The semantic similarity between the tweets of the SR_DS2019 dataset was calculated, and the overall accuracy of the dataset based on the semantic similarity obtained was computed using Equation 29. Table 4 shows the overall accuracy of the 441 tweets of the SR_DS2019 dataset using the following four sentence semantic similarity methods: Mihalcea's method [10] with path similarity [2], Mihalcea's method [10] with WuP similarity [4], Li's method [9], and Atish's method [32]. The overall accuracy levels of the entire dataset were found to be 0.612, 0.404, 0.341, and 0.641 using Mihalcea's method [10] with WuP similarity, Mihalcea's method with path similarity, Li's method, and Atish's method respectively. Table 4 and Figure 2 present the experimental results, which appear to show that the overall accuracy of the proposed model using Atish's method yields superior results when compared to the proposed model using the other measures.

Experimental Results on AO2MNZ_DS2019 Dataset
The semantic similarity results were obtained for a different number of tweets with respect to the AO2MNZ_DS2019 dataset. Table 5 shows the overall accuracy of the AO2MNZ_DS2019 dataset obtained using the following four sentence semantic similarity approaches: Mihalcea's method with path similarity, Mihalcea's method with WuP similarity, Li's method, and Atish's method. The overall accuracies of the entire dataset using these methods are 0.452, 0.642, 0.407, and 0.688 respectively. Accuracy was computed using Equation 29, based on the semantic similarity of each tweet with the other tweets in the dataset. As shown in Figure 3, the experiments show that the proposed model using Atish's measure appears to provide the highest overall accuracy and seems to outperform the proposed model using all the other sentence similarity measures.

Experimental Results on EAPC_DS2019 Dataset
The semantic similarity results on EAPC_DS2019 were obtained for a different number of tweets. The overall accuracy of the EAPC_DS2019 dataset using all four semantic similarity methods, viz. Atish's method, Li's method, Mihalcea's method with WuP similarity, and Mihalcea's method with path similarity, is presented in Table 6. The overall accuracy was computed using Equation 29, based on the semantic similarity results obtained. The overall accuracies of the entire dataset using the four methods are 0.734, 0.428, 0.700, and 0.529 respectively. It can also be observed that the overall accuracy using Atish's method seems to yield good results when compared to the proposed model using all the other semantic similarity measures, as shown in Figure 4.

CASE STUDY OF THE EXPERIMENT RESULTS
Samples consisting of 10 tweets derived from the Ethiopian Airlines Plane Crash dataset (EAPC_DS2019) were used, as listed in Table 7. The experiments on this sample were conducted with three measures, namely Mihalcea's algorithm with path similarity, Mihalcea's algorithm with WuP similarity, and Li's method, and the resulting accuracy scores were computed and compared in Tables 8, 9, and 10. It can be observed that the semantic similarity between tweet T_1 and tweet T_2 using the three methods is 0.41, 0.638, and 0.441 respectively, and the semantic similarity between tweet T_2 and tweet T_8 is 0.457, 0.632, and 0.34 respectively. In addition, the semantic similarity between tweet T_9 and tweet T_10 under all three methods is 1 because their texts are identical. The two formulas A_SS(1) (Equation 29) and A_SS(2) (Equation 30) were both applied. Tables 8, 9, and 10 indicate that the overall accuracies of the 10 tweets of the EAPC_DS2019 dataset using A_SS(2) are 0.567, 0.707, and 0.523 for the three methods, while the overall accuracies obtained using A_SS(1) are 0.520, 0.675, and 0.470 respectively. From these results, the performance of A_SS(2) seems to be superior to that of A_SS(1) for all methods. The best result among the three methods is obtained using Mihalcea's algorithm with WuP similarity and Equation 30. Table 11 presents the comparative accuracies using Equations 29 and 30, i.e., A_SS(1) and A_SS(2) respectively. It appears that the accuracy levels obtained using A_SS(2) are better than those obtained using A_SS(1). Figure 5 also indicates that, for a varied number of tweets, the accuracy levels obtained using Equation 30 are better than those obtained using Equation 29.
Equation 30, however, consumed more time to perform this task when compared to Equation 29.

Table 8 Tweet semantic similarity of n × n tweets using Mihalcea's measure with path similarity, and the overall accuracy for n × n tweets using the proposed model (Equations 29 and 30), where n = 10.

Table 9 Tweet semantic similarity of n × n tweets using Mihalcea's measure with WuP similarity, and the overall accuracy for n × n tweets using the proposed model (Equations 29 and 30), where n = 10.

Table 10 Tweet semantic similarity of n × n tweets using Li's method, and the overall accuracy for n × n tweets using the proposed model (Equations 29 and 30), where n = 10.

DISCUSSION
The following section discusses the results of the experiments conducted on the three twitter datasets, viz. EAPC_DS2019, SR_DS2019, and AO2MNZ_DS2019. On the SR_DS2019 dataset, the overall accuracy of the entire dataset using Atish's method [32] is 0.641, which seems to outperform all the other measures, i.e., Li's method, Mihalcea's method [10] with path similarity, and Mihalcea's method with WuP similarity, as shown in Table 4. It can also be observed that this dataset shows an accuracy shortfall of 0.359, owing to the following reasons: A) all the tweets in the dataset are posted by different users; B) the informal language, the lack of proper grammar, and the unstructured and uncertain nature of huge twitter data present new kinds of challenges; C) a broad range of anomalies, including slang, lengthening (repeated characters), concatenated words, complex spelling errors, unconventional use of acronyms, and multiple versions of abbreviations of the same words.
On the AO2MNZ_DS2019 dataset, the proposed model achieved a good overall accuracy of 0.688 for the entire dataset using Atish's method [32]; the overall accuracy using all methods is shown in Table 5. The accuracy shortfall on this dataset is 0.312, and Atish's method again seems to outperform all the other measures. The reasons for this shortfall are the same as those given in the previous paragraph.
On the EAPC_DS2019 dataset, the proposed model achieved a good overall accuracy of about 73% using Atish's method, and this performance also seems to outperform all the other methods. Table 7 presents the comparison of similarity between the proposed method and the other measures. Correspondingly, the accuracy shortfall on these samples is about 27%, and the reasons for this shortfall are those mentioned in the previous paragraph.

CONCLUSION
Semantic similarity measures are widely used in many fields, including Natural Language Processing and Web search. This paper investigated several techniques for computing semantic similarity at both the word and sentence levels. Three categories of word semantic similarity techniques, namely corpus-based, knowledge-based, and feature-based, were described. Four categories of sentence semantic similarity techniques, namely string and set-based, word order-based, POS-based, and syntactic dependency-based, were also described. The proposed model for calculating the overall accuracy of a twitter dataset based on the sentence semantic similarities was then presented, and the experiments conducted on the three twitter datasets to evaluate the proposed model were covered in detail. The experimental results seem to indicate that the proposed model based on Atish's measure is superior to the proposed model based on the other similarity measures.