Textual Similarity Measurement Approaches: A Survey

Abstract: Survey research is appropriate and necessary to address certain types of research questions. This paper provides a general overview of textual similarity in the literature. Measuring textual similarity plays an increasingly important role in related topics such as text classification, retrieval of specific information from data, clustering, topic retrieval, subject tracking, question answering, essay grading, summarization, and the currently trending Conversational Agents (CAs), programs that interact with humans through natural-language conversation. Finding the similarity between terms is the essential part of textual similarity and is then used as a major phase for sentence-level, paragraph-level, and script-level similarities. In particular, we are concerned with textual similarity in Arabic. Applying Natural Language Processing (NLP) tasks to the Arabic language is very challenging, as the language has many characteristics that act as obstacles. Nevertheless, many approaches for measuring the similarity of Arabic text have been presented; this paper reviews and compares them.


INTRODUCTION
Every second, millions of bytes are added to the data stored all over the world; the volume of information stored on the web is therefore enormous. As a result, search tools such as search engines index billions of web pages, which is still only a fraction of the information reachable on the Web. The search process, however, returns a large volume of information that varies in relevance and quality. Appraising information in terms of relevance and reliability is essential, since inappropriate use of information can result in poor decisions and grave consequences [1]. The ranking task reorders the results retrieved by a search tool based on the relevancy between each result and the original query. It is a central task in many NLP topics such as information retrieval, question answering, disambiguation, text summarization, plagiarism detection, paraphrase identification, and machine translation [2]. Finding the similarity between terms is the essential part of textual similarity and is then used as a major phase for sentence-level, paragraph-level, and script-level similarities. When measuring the relevancy between documents, many papers analyze the occurrence of surface words shared between documents.
Text representation is a significant task used to convert the unstructured form of textual data into a more formal structure before any additional analysis of the text. The approaches in the literature on textual similarity differ in the text representation scheme used before texts are compared. Researchers have suggested different representation schemes, such as Term Frequency-Inverse Document Frequency (TF-IDF), Latent Semantic Indexing (LSI), and graph-based representation [3], [4], [5]. Because of these differing schemes, the similarity measure used to compare text units also differs, since one similarity measure may not be suitable for all representation schemes. For example, cosine similarity, which is based on geometric distance, is an appropriate textual similarity measure for text represented as a bag of terms, but it is unclear whether cosine similarity will achieve acceptable results when text is represented as a graph [3]. While the majority of existing textual similarity measures were developed for and applied to English texts, very few measures have been developed specifically for the Arabic language [6]. Thus, in this work, we discuss the efforts made by researchers on the task of measuring similarity for many languages: English, Spanish, Arabic, etc.
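As a concrete illustration of the bag-of-terms scheme and the cosine measure discussed above, the following is a minimal Python sketch, assuming simple whitespace tokenization and a smoothed IDF; it is illustrative only, not the formulation used in any of the cited systems.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build smoothed TF-IDF vectors for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps weights positive even for ubiquitous terms.
        vectors.append({t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["the cat sat on the mat",
        "the cat lay on the mat",
        "stock prices fell sharply today"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # high: shared vocabulary
print(cosine(vecs[0], vecs[2]))  # zero: no terms in common
```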
The following section provides background on the concept of textual similarity and summarizes the most relevant related work. The paper then concludes.

TEXTUAL SIMILARITY CONCEPT AND LITERATURE REVIEW
Measuring textual similarity plays an increasingly important role in related topics such as text classification, retrieval of specific information from data, clustering, topic detection, subject tracking, question answering, essay grading, summarization, and the currently trending Conversational Agents (CAs), programs that interact with humans through natural-language conversation. Finding the similarity between terms is the essential part of textual similarity, used as a major stage for sentence-level, paragraph-level, and script-level similarities [7]. The relevancy of words can be estimated in two manners: semantically and lexically. If two terms have a similar chain of characters, they are lexically similar; if instead they share the same context and meaning, even though they contain different characters, they are semantically similar. More recently, a hybrid approach has been used, which integrates different similarity measurements [8].

A. Text Similarity Approaches
According to [7], [8], text-similarity approaches fall into three categories, illustrated in Figs. 1 and 2.
1) Lexical-Based Similarity (LBS): These techniques depend on computing the distance between two strings to recognize the similarity between them. LBS measurements are categorized into two groups: character-based and term-based distance measurements. Character-based measurements were proposed to handle typographical errors. Even so, these measurements fail to capture similarity when there are term-arrangement issues (e.g., "John Smith" versus "Smith, John"). Term-based similarity measurements attempt to compensate for this issue [7], [8]; the sketch following this paragraph illustrates the distinction. In Table 1, we summarize the most prominent attempts to measure lexical-based similarity and compare them, chronologically arranged, according to the applied technique, the dataset/sample used in the experiment, and the results reported by each approach.
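To make the contrast between character-based and term-based measurements concrete, here is a small sketch assuming the classic Levenshtein edit distance for the character level and a Jaccard coefficient over token sets for the term level; the specific measures listed in Table 1 differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (character-based)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaccard_tokens(a: str, b: str) -> float:
    """Jaccard overlap of token sets (term-based)."""
    sa = set(a.lower().replace(",", "").split())
    sb = set(b.lower().replace(",", "").split())
    return len(sa & sb) / len(sa | sb)

# "John Smith" vs "Smith, John": the character level sees them as far
# apart, while the token level sees them as identical.
print(levenshtein("John Smith", "Smith, John"))    # large edit distance
print(jaccard_tokens("John Smith", "Smith, John")) # 1.0
```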
2) Semantic-Based Similarity: Semantic similarity defines the similarity between sequences based on their meaning instead of character matching [8]. It is considered an essential part of Natural Language Processing (NLP) tasks such as word sense disambiguation, text summarization, entailment, machine translation, etc. [6]. In [9], the authors consider semantic similarity a challenging task: given two texts, the challenge is to measure how similar they are, or to decide whether a qualitative semantic relation holds between them. Generally, there are two main ways to calculate the similarity between sequences: corpus-dependent and knowledge-dependent measures. In corpus-based measures, a large corpus is used to define the similarity between words. However, Arabic is a weakly resourced language; the lack of data affects research into corpus-based and computational linguistics in Arabic. Knowledge-based similarity measurements, on the other hand, can be divided into two groups: measurements of semantic similarity/relevancy and measurements of semantic relatedness. Semantic similarity is a type of relatedness measurement between two terms in which a wide range of relations between connotations is covered [7]. In Table 2, we summarize the most prominent attempts to measure semantic-based similarity and compare them, chronologically arranged, according to the type of semantic similarity (knowledge-based or corpus-based), the applied technique, the dataset/sample used in the experiment, and the results reported by each approach.
3) Hybrid-Based Similarity: This is a combination of lexical-based and semantic-based similarity measures. Most recent research uses this kind of similarity measure. In Table 3, we summarize the most prominent attempts to measure hybrid-based similarity and compare them, chronologically arranged, according to the applied technique, the dataset/sample used in the experiment, and the results reported by each approach.

B. Literature Review
There is extensive literature dealing with the measurement of textual similarity, addressed in the following. Mihalcea et al. [10] developed a method that focuses on measuring the semantic similarity of short texts; eight semantic similarity measurements were examined, two corpus-based (Pointwise Mutual Information and Latent Semantic Analysis) and six knowledge-based (Leacock & Chodorow; Lesk; Wu and Palmer; etc.). They noted that the maximum similarity was sought only within classes of words with the same part of speech.
Islam and Inkpen [11] provided a method that estimates the similarity of two texts from the integration of semantic and syntactic information, using a corpus-based measure of semantic term similarity and a modified version of the Longest Common Subsequence (LCS) string-matching algorithm. They concluded that the main advantage of this system is its lower time complexity compared with other systems, because it uses only one corpus-based measure.
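A minimal sketch of an LCS-based similarity in the spirit of [11] follows; the quadratic normalization shown here is one common choice and may differ from the authors' exact modification.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (standard DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs(a: str, b: str) -> float:
    """Length-normalized LCS similarity in [0, 1]."""
    l = lcs_length(a, b)
    return (l * l) / (len(a) * len(b)) if a and b else 0.0

print(nlcs("similarity", "similarly"))  # close spellings score high
```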
Abouenour et al. [12] proposed two directions for improvement: first, semantic Query Expansion (QE) is used to obtain a higher level of recall when the Information Retrieval (IR) process retrieves passages; then a structure-based process is used for passage re-ranking, so that the expected answer appears at the top of the candidate-passage list. In the first step, they used the content and the semantic relations within the Arabic WordNet (AWN) ontology, and in the second step, they adopted the Passage Ranking (JIRS PR) module based on the Distance Density n-gram model. This system gives a higher similarity value to passages containing more grouped structures. The highest performance was obtained when the Java Information Retrieval System (JIRS) was used together with semantic Query Expansion (QE).
Dai and Huang [13] presented a word semantic similarity measure based on the Chinese-English HowNet ontology. The main aim of this work was to compute the similarity between terms by exploring their attributes and relations. For a given word pair, similarities between their attributes are computed by combining distance, depth, and related information; word similarity is then estimated through a combination scheme.
Refaat et al. [14] presented a method to assess Arabic free-text answers (essays) automatically based on an improved Latent Semantic Analysis (LSA) technique (preprocessing the input text, unifying the form of letters, deleting the formatting, replacing synonyms, stemming, and deleting stop words) to produce a matrix that represents texts better than the traditional LSA matrix.
Navigli and Ponzetto [15] presented an automatic approach to the construction of BabelNet, a very large, wide-coverage multilingual semantic network. It is based on the integration of lexicographic and encyclopedic knowledge from WordNet and Wikipedia. To achieve the best translation performance, they relied on recent advances in machine translation, applying Google's machine translation system to data from WordNet and Wikipedia.
Gomaa et al. [16] presented a different unsupervised approach to handling students' answers using document similarity. It is divided into three stages. The first stage measures the similarity between the model answer and the student answer using thirteen string-based algorithms, six of them character-based and the other seven term-based measures. The second stage measures the similarity using distributionally similar words based on co-occurrences (DISCO1 and DISCO2), which are corpus-based similarity measurements. Finally, the two are combined to achieve the maximum correlation value.
Nitish et al. [17] implemented two approaches to estimate how similar two sentences are. The first approach combined a corpus-based semantic relatedness measure over the entire sentences with the knowledge-based semantic similarity scores obtained for the words that have the same syntactic roles in both sentences; all these scores were then fed as features to machine learning models to estimate the similarity score. The second approach used a bipartite-based method over WordNet and the Lin measure, without any modification.
Daniel Bär et al. [18] first computed text similarity scores between pairs of texts and their sources using content similarity (the longest common substring measure), structural similarity (an n-gram model), and stylistic similarity (sequential TTR). They then used these scores as features for two machine learning classifiers from the WEKA toolkit, a Naïve Bayes classifier and a decision-tree classifier.
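The longest-common-substring feature used for content similarity in [18] can be sketched as follows; the scaling by the shorter text's length is an assumption for illustration, not the authors' exact feature definition.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run shared by a and b."""
    best, prev = 0, [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            run = prev[j - 1] + 1 if ca == cb else 0  # run resets on mismatch
            cur.append(run)
            best = max(best, run)
        prev = cur
    return best

def lcsubstr_feature(a: str, b: str) -> float:
    """Substring length scaled by the shorter text; one feature among many."""
    return (longest_common_substring(a, b) / min(len(a), len(b))
            if a and b else 0.0)

print(lcsubstr_feature("the cat sat on the mat", "a cat sat on a mat"))
```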
Zou et al. [19] introduced bilingual word embeddings: semantic embeddings associated across two languages in the context of neural language models. A single semantic similarity feature induced with bilingual embeddings added nearly half a BLEU point to the results of the NIST08 Chinese-English machine translation task.
Kaleem et al. [20] presented a sentence similarity approach designed to mitigate the free-word-order issue in the Urdu language. The main objective was to alleviate the complex word-order issue of Urdu by matching all possible word-order variations against a single scripted pattern, reducing the time and effort required to script an Urdu conversational agent.
F. Elghannam [21] proposed a new corpus-based method to measure the semantic similarity between short texts in order to rank them. It uses the statistical lexical similarity between the vectors of similar words (second-order word vectors) extracted from the corpus instead of relying only on word-distribution similarity calculations. To determine the degree of similarity between two texts, the method measures the lexical similarity between their second-order word vectors.
J. Tian et al. [22] proposed a universal model that combines traditional NLP methods and deep learning methods to determine semantic similarity. First, they translate all sentences into English through a machine translation (MT) system, namely Google Translate.
S. Xiaolong et al. [23] proposed a new framework for computing semantic similarity. The model learns word segmentation automatically; the overall architecture uses an LSTM network to extract semantic features and an attention model to enhance semantics. Specifically, an LSTM based on the Siamese network architecture extracts the semantic features of each sentence, the attention model weights the semantic output of each moment, and a policy network determines whether the sentences need to be segmented or not. The experiments showed that the model improved accuracy to 95.7% compared with the previous baseline models.
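The Siamese-LSTM core of such models can be sketched in PyTorch as below; this is a bare-bones illustration with toy dimensions, omitting the attention and policy components of [23].

```python
import torch
import torch.nn as nn

class SiameseLSTM(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids):
        _, (h, _) = self.lstm(self.embed(token_ids))
        return h[-1]  # final hidden state as the sentence vector

    def forward(self, ids_a, ids_b):
        # Both sentences pass through the SAME encoder (shared weights).
        va, vb = self.encode(ids_a), self.encode(ids_b)
        return torch.cosine_similarity(va, vb, dim=1)

model = SiameseLSTM()
a = torch.randint(0, 1000, (2, 7))  # batch of 2 toy token-id sentences
b = torch.randint(0, 1000, (2, 7))
print(model(a, b))                  # similarity scores in [-1, 1]
```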
Kim et al. [24] proposed an attention mechanism to capture the semantic correlation between two sentences and appropriately align their elements, using a densely connected co-attentive recurrent neural network (DRCN). They connected the recurrent and co-attentive features from the bottom up with no distortion. The results showed that dense connections over co-attentive features were more effective. DRCN showed the highest mean, indicating that it not only achieves competitive performance but also performs consistently.
Khafajeh et al. [25] proposed an automatic Arabic thesaurus. They used term frequency-inverse document frequency (TF-IDF) for index-term weights, built a similarity thesaurus using the Vector Space Model (VSM) with four similarity measurements (cosine, Dice, inner product, and Jaccard), and built an association thesaurus by applying a fuzzy model, yielding an automatic Arabic thesaurus based on term-term similarity.
The experiments showed that the Jaccard and Dice similarity measurements behave identically under the VSM model, and likewise the cosine and inner-product measurements, though the latter pair is slightly better than Jaccard and Dice and gives nearly the same ranking for all queries. Using stemmed words with a similarity and association thesaurus in an Arabic retrieval system is much better than using full words without a thesaurus.
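For reference, the four vector-space measures compared in [25] can be written over sparse term-weight vectors as follows; the weighted generalizations of Dice and Jaccard shown here are the standard IR formulations and are assumed rather than taken from [25].

```python
import math

def inner(u, v):
    """Inner product of two sparse term-weight dictionaries."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def cosine(u, v):
    nu, nv = math.sqrt(inner(u, u)), math.sqrt(inner(v, v))
    return inner(u, v) / (nu * nv) if nu and nv else 0.0

def dice(u, v):
    denom = inner(u, u) + inner(v, v)
    return 2 * inner(u, v) / denom if denom else 0.0

def jaccard(u, v):
    denom = inner(u, u) + inner(v, v) - inner(u, v)
    return inner(u, v) / denom if denom else 0.0

u = {"book": 1.2, "library": 0.8}
v = {"book": 0.9, "reader": 0.5}
print(cosine(u, v), dice(u, v), jaccard(u, v), inner(u, v))
```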
As shown in Table 1, some research has been concerned with string-based similarity. S. H. Mustafa et al. [6] examined the performance of the bigram and trigram techniques in the context of Arabic free-text retrieval, where the N-gram conflation scheme transforms a word into a chain of N-grams. The experiments were carried out on a corpus of thousands of distinct textual words drawn from several sources representing various disciplines. K. Shaalan et al. [26] proposed a spelling-checking tool for Arabic that relies on a trigram language model to approximate knowledge about permissible character sequences and classify generated words as valid or invalid, a finite-state automaton that measures the edit distance between input words and candidate corrections, the Noisy Channel Model, and knowledge-based rules, together with an Arabic term list that includes 9,000,000 fully inflected surface words. They concluded that their system performed better than Hunspell in choosing the best solution, but it was still below the MS Spell Checker; testing of the language model gave a precision of 98.2% at a recall of 100%.
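The N-gram conflation scheme examined in [6] can be sketched as follows; Dice overlap over character bigrams is used here as a common choice, and the exact matching function in [6] may differ.

```python
def char_ngrams(word: str, n: int = 2) -> set:
    """Conflate a word into its set of character n-grams."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_dice(w1: str, w2: str, n: int = 2) -> float:
    """Dice overlap between the n-gram sets of two words."""
    g1, g2 = char_ngrams(w1, n), char_ngrams(w2, n)
    return 2 * len(g1 & g2) / (len(g1) + len(g2)) if g1 and g2 else 0.0

print(char_ngrams("kitab", 2))        # {'ki', 'it', 'ta', 'ab'}
print(ngram_dice("kitab", "kitaab"))  # high overlap despite spelling variant
```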

Al-Ramahi et al. [27] implemented a system that computes the similarity among course descriptions of the same subject from various universities or similar academic programs. Three different bi-gram techniques were used. The first is the vector model, which represents each document so that each bi-gram is associated with a weight reflecting its importance in the document; cosine similarity is then used to compute the similarity between two vectors. The other two techniques were word-based and whole-document-based evaluation techniques; in both, Dice's similarity measure is applied to calculate the similarity between any given pair of documents.
A. Magooda et al. [2] proposed a ranking system of three components: a TF-IDF-based module, a language model (LM)-based module, and a Wikipedia-based module. The three relevancy values calculated are then converted into one relevancy score using weighted summation, and the retrieved documents are re-ordered based on the new weighted-sum scores.
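The weighted-summation step of [2] reduces to a few lines; the module names and weight values below are illustrative assumptions, not values from the paper.

```python
def combined_relevancy(scores: dict, weights: dict) -> float:
    """Combine per-module relevancy scores by weighted summation."""
    return sum(weights[m] * scores[m] for m in scores)

# Hypothetical per-document module outputs and assumed weights.
candidates = [
    ("doc1", {"tfidf": 0.62, "lm": 0.48, "wiki": 0.71}),
    ("doc2", {"tfidf": 0.55, "lm": 0.60, "wiki": 0.40}),
]
weights = {"tfidf": 0.5, "lm": 0.3, "wiki": 0.2}

# Re-order retrieved documents by the new weighted-sum scores.
reranked = sorted(candidates,
                  key=lambda kv: combined_relevancy(kv[1], weights),
                  reverse=True)
print([name for name, _ in reranked])
```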
As shown in Table 2, semantic-based similarity is divided into two types: corpus-dependent and knowledge-dependent measures. In corpus-based similarity, a large corpus is used to define the similarity between words, whereas knowledge-based similarity uses lexical resources such as WordNet, VerbNet, etc. Among the systems summarized there, an automatic Arabic essay scoring system using the LSA similarity measurement was developed, with a reported correlation of 0.91 between the human assessor and the system. Other reported results include correlations varying from 0.78 to 0.87, a 4% decrease in the computation time per answer, and two runs ranked 10th and 12th in the primary track of task 1, with correlations of 0.619 and 0.617, respectively.
As shown in Table 3, the research mentioned above combines string-based and semantic-based similarity to achieve the best results. For example, F. Elghannam [21] achieved an accuracy of 97% in sentence tests.
Ref. [37] categorized the existing approaches to measuring textual similarity between texts into three types depending on the text level: document similarity, sentence similarity, and word similarity. In the past, most researchers focused on document-level similarity (two long texts, or a long text with a short one). Recently, sentence-level similarity has attracted more interest, which has led to the provision of training and test data in multiple languages and the deployment of different approaches for sentence similarity. Generally, these approaches are divided into three categories: vector-space-based approaches, in which the text is represented as a vector of features using bag-of-words (BoW) and the similarity between the vectors is then computed; alignment-based approaches, which assume that linguistic expressions with similar meanings can be aligned; and machine-learning-based approaches, which rely on supervised machine learning together with semantic similarity measures and features (lexical, syntactic, and semantic) [38].
According to [37], the existing approaches that measure semantic similarity for Arabic texts, whether documents, sentences, or words, are divided into four types of techniques: word co-occurrence approaches, which ignore the word order of the sentence but successfully extract keywords from documents; statistical corpus-based approaches, which use Latent Semantic Analysis (LSA) as a language-understanding model; descriptive feature-based approaches, which depend on semantic features extracted from dictionaries or from WordNet as a lexical resource; and, finally, neural-network-based approaches with word embeddings. Following this taxonomy, we review some of these approaches below.
Nagoudi et al. [36] combined word embeddings (the CBOW model), a word-alignment method, IDF weighting, and POS weighting to extract semantic and syntactic features from documents and capture the most relevant ones; however, the approach was weak on data of higher dimensionality, and it is confined to working on short local context windows rather than counting global co-occurrences.

M. Al-Samdi et al. [38] proposed an approach for paraphrase identification (PI) and semantic text similarity (STS) analysis in Arabic news tweets. It employs a set of extracted features divided into text-overlap features (such as n-grams, stemmed n-grams, and POS overlap features), word-alignment features, and semantic features (such as NER overlap features and topic-modeling features) to detect the level of similarity between tweet pairs. They noted that the lexical-overlap features play a notable role in improving the results of PI and STS analysis, and semantic features enhance the results of both tasks. Word-alignment features significantly enhance the results of PI, whereas overlap features based on NER and POS act as bad features when used alone with the lexical-overlap features. The best results in both tasks were obtained when using the lexical-overlap features together with the word-alignment and topic-modeling features.

M. Zrigui and A. Mahmoud [39] presented a semantic approach that identifies whether an unseen document pair is paraphrased or not. It consists of two phases: in the feature-extraction phase, they used global-vector representations combining global co-occurrence counting with a contextual skip-gram model; in the paraphrase-identification phase, a convolutional neural network is used to learn more contextual and semantic information between documents.

Konopik et al. [40] introduced a system for estimating semantic textual similarity in SemEval 2016. The core of the system exploits distributional semantics to compare the similarity of sentence chunks. They used a broad range of machine learning algorithms in addition to several types of features (lexical features, including word base-form overlap, word lemma overlap, and chunk-length difference; syntactic features, including POS tagging; and semantic features, including GloVe, Word2Vec, and the WordNet database).
Almarwani et al. [41] addressed the problem of textual entailment in the Arabic language. Their approach combines traditional features, such as sentence lengths and similarity scores (Jaccard and Dice), with named entity recognition and word embeddings (Word2Vec). S. A. Al Awaida et al. [42] proposed an automated Arabic essay grading model that achieves better accuracy by combining the F-score technique, used to extract features from student answers and model answers, with Arabic WordNet (AWN), used to find all words in the student answer relevant for semantic similarity.
A. El-Hadi et al. [43] presented a new approach to semantic similarity measurement based on the MapReduce framework and WordNet after a translation phase, to compute the similarity between Arabic queries and documents. The experiments were run on a varying number of corpus documents stored in HDFS in an Arabic search engine. M. Zrigui and A. Mahmoud [44] presented a deep-learning-based method for paraphrase detection between documents. The word2vec model extracts the relevant features by predicting each word from its neighbors; the obtained vectors are then averaged to generate a sentence-vector representation (Sen2vec). Finally, a convolutional neural network (CNN) is used to capture more contextual information and compute semantic similarity.
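The Sen2vec step of [44], averaging word vectors into a sentence vector before comparison, can be sketched as follows; random vectors stand in for trained word2vec embeddings, and cosine replaces the CNN stage for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy vocabulary of 100-dimensional "embeddings" (random stand-ins).
embeddings = {w: rng.normal(size=100)
              for w in ["the", "cat", "dog", "sat", "on", "mat"]}

def sen2vec(sentence: str) -> np.ndarray:
    """Average the word vectors of in-vocabulary tokens."""
    vecs = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(100)

def cos(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(sen2vec("the cat sat on the mat"),
          sen2vec("the dog sat on the mat")))
```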
A. Omar and W. Hamoda [45] studied the effect of document length on measuring semantic similarity in the text clustering of Arabic news through many experiments with different normalization techniques, such as byte-length normalization, cosine normalization, maximum normalization, and mean normalization, in order to choose a reliable one for this purpose. The study proposed the integration of TF-IDF for ranking the words within all the documents and deduced that the byte-length normalization method is the most appropriate for text clustering with TF-IDF values.
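Typical formulations of the normalization variants studied in [45] are sketched below; they follow standard IR definitions (including a pivoted form of byte-length normalization) and may differ in detail from that paper's implementation.

```python
import math

def cosine_normalize(weights: dict) -> dict:
    """Scale a TF-IDF vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

def maximum_normalize(weights: dict) -> dict:
    """Scale so the largest term weight becomes 1."""
    m = max(weights.values(), default=0.0)
    return {t: w / m for t, w in weights.items()} if m else weights

def byte_length_normalize(weights: dict, doc_bytes: int,
                          pivot: float = 1000.0, slope: float = 0.375) -> dict:
    # Pivoted byte-length normalization: documents longer than the pivot
    # are damped, shorter ones boosted; pivot/slope values are assumptions.
    factor = (1.0 - slope) * pivot + slope * doc_bytes
    return {t: w / factor for t, w in weights.items()}
```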
Based on the discussion of Wali et al. [46], we note that most of the previous research estimated semantic similarity based only on word order or syntactic dependency and the synonymy relationship between terms in sentences, without taking into consideration the semantic arguments, namely the semantic class and the thematic role. Wali et al. [46] presented a hybrid method for measuring semantic similarity between sentences based on supervised learning and three types of linguistic features (lexical, semantic, and syntactic-semantic) extracted from a learning corpus and Arabic dictionaries such as the LMF dictionary. The method has two phases: the learning phase, which consists of a pre-processing process that produces an annotated corpus and a training process that derives a hyperplane equation via the learning algorithm; and the testing phase, which applies the learning results of the first phase to compute the similarity score and classify sentence pairs as similar or not similar.
Wali et al. [47] proposed an original idea that had not been employed in prior research. They presented a hybrid similarity measure that aggregates, in a linear function, three components: lexical similarity (LexSim); semantic similarity (SemSim), which uses the synonym words extracted from WordNet; and syntactic-semantic similarity (SynSemSim), based on common semantic arguments such as the thematic role and semantic class. The determination of the semantic arguments is based on the VerbNet database.
Wali et al. [48] proposed a multilingual semantic approach based on similarity to calculate the degree of similarity between a user's answer and the correct one saved in the dataset. It supports three languages, English, French, and Arabic, and uses a hybrid similarity measure (lexical, semantic, and syntactic-semantic) built on knowledge bases such as WordNet and the LMF dictionary. They concluded that short sentences achieved the best recall and precision; as a sentence gets longer, more calculation is required, which reduces the system's performance.
We summarize these works in Table 4 according to the applied technique (word co-occurrence, feature-based approach, Latent Semantic Analysis, or hybrid approach), the dataset/sample used in the experiment, the aim of each work, the similarity type (string-based, corpus-based, knowledge-based, or hybrid-based), and the results obtained by each approach. As can be seen from Table 4, measuring semantic similarity for documents with the word-embedding technique achieves better results; for sentence semantic similarity, a hybrid technique that joins Latent Semantic Analysis (LSA) with word co-occurrence achieves better results; and for word similarity, the feature-based approach provides the best results. In Table 5, we present the most important textual-similarity measurement tools released in recent years, classified by the type of textual similarity they provide and whether or not they support the Arabic language.

CONCLUSION
In this paper, we presented a chronological overview of textual similarity measurement in a variety of languages. Generally, we found that textual similarity is divided into three categories: lexical-based similarity, semantic-based similarity, and hybrid similarity. We then shed light on semantic analysis in the Arabic language, which can be divided into four types of approaches: word co-occurrence, Latent Semantic Analysis, feature-based, and hybrid-based. The word co-occurrence approach ignores the term order of the sentence and does not take into consideration the meaning of a term according to its context, but it successfully extracts keywords from documents. Latent Semantic Analysis (LSA) looks like a complete model of language understanding and is a successful approach to information extraction, especially for documents, but it ignores word order and function words; moreover, it is based on Singular Value Decomposition (SVD), which is computationally expensive and difficult to update as new documents appear. The third type is the feature-based approaches, in which a word in a short text is represented using semantic features based on dictionaries or WordNet, which means a high-quality resource is needed, and this is not always available. Finally, the hybrid-based approaches use neural networks and word embeddings, which have two limitations for short texts: first, word embeddings do not consider term order; second, they are unable to capture polysemy, since they cannot learn separate embeddings for multiple senses of a word. In the future, further research is needed to raise the accuracy of Arabic similarity measurements to the level achieved for English.

TABLE I :
CHRONOLOGICAL REPRESENTATION OF THE MOST IMPORTANT RELATED RESEARCH ON LEXICAL-BASED SIMILARITY.

TABLE III :
CHRONOLOGICAL REPRESENTATION OF THE MOST IMPORTANT RELATED RESEARCH ON HYBRID-BASED SIMILARITY.

TABLE IV :
CHRONOLOGICAL REPRESENTATION OF THE MOST IMPORTANT RELATED RESEARCH IN ARABIC.