Tag Recommendation for Short Arabic Text by Using Latent Semantic Analysis of Wikipedia

Social media sites allow users to share content such as text and images, and give them the freedom to attach keywords, called tags, to that content. This freedom, however, has drawbacks, including redundancy resulting from uncontrolled vocabulary, ambiguity, scatter, spelling errors, and uniqueness, all of which hamper the organization and retrieval of data in these systems. In this work we present a tag recommendation system for short Arabic texts that exploits the Arabic Wikipedia as a source of information, employing latent semantic analysis to discover the similarity between the short text and Wikipedia articles. Apache Spark was used to handle the huge size of the Wikipedia content and the heavy computations of latent semantic analysis, which decomposes the content of the Wikipedia articles into three matrices. Given a short Arabic text, the system compares it with the article contents, assigns each article a weight according to its relatedness and similarity to the input text, and then selects candidate tags from the titles and categories of the articles most similar to the text. The proposed system was evaluated on a set of 100 short texts collected from Twitter in three different domains, with two experts in each domain judging the tags produced by the system. It achieved a mean average precision of 84.39% and a mean reciprocal rank of 96.53%, which demonstrates its suitability and accuracy for tagging Arabic texts, although it faces difficulties related to the Arabic language and to the frequencies of rare words. A detailed analysis and discussion of the evaluation results is also presented, covering the strengths and shortcomings of the system, in addition to recommendations for future work. Keywords: short texts, tag recommendation, Arabic language, Wikipedia, latent semantic analysis, Spark


Dedication
To my dear mother and father who have given me all their love and support over the years, and for their unwavering commitment through good times and hard times.
To my wonderful, brilliant and supportive wife, Niveen, for her patience, forbearance and sustenance through my studying and preparing of this thesis. To my elegant sons Sary and Tameem, and my sweet daughter Yumna, for whom I do all of this.
To my brothers and sister for their love and care.
To the spirit of martyr, my brother Sary.
To my father-in-law and mother-in-law for their encouragement and belief in me.
To my best friends Ashraf Qahman and Murad Abu Jarad for their support and encouragement.
To all my friends and colleagues who supported me.

Chapter 1

Introduction
With the massive daily increase of data on the internet, especially text, automatic tag recommendation, which detects and adds informative and descriptive tags to documents, has become a necessity for information aggregation and sharing services (Oliveira et al., 2012).
Tagging is the practice of creating and managing labels, called tags, that categorize or describe content using simple keywords. It is not a new concept: journals, conference proceedings, and even dissertations have for years required keywords from authors to improve information retrieval performance (Jeong, 2009).
Tagging is considered as the way to organize the stuff you don't have time to organize (Fallows, 2007).
Social activities on Twitter, Facebook, Flickr, personal blogs, etc. are becoming very popular among users who want to share local or global news, their knowledge, or their opinions (Kywe, Hoang, Lim, & Zhu, 2012). Lately, users have also been using these services to search for information, so some services include tag or category information to better facilitate search. However, these tags are typically free-form in nature, with users permitted to adopt their own conventions and interests without restriction, which can make the set of tags noisy and sparse. Moreover, many works have addressed tagging documents, whereas short texts are peculiar regarding length, composition, and formality (Garcia Esparza, O'Mahony, & Smyth, 2010).
A solution to the above problem is to recommend tags (Garcia Esparza et al., 2010) or categorizations to users, to enrich and clarify the content, facilitate retrieval, and reduce cognitive effort. On the one hand, if done properly, this improves text retrieval, linking, classification, clustering, and recommendation; simplifies archiving; gives the user or the application insight into the content; and facilitates seeing the data from different dimensions, enriching the context of the tagged text. On the other hand, manual creation of tags or metadata is costly in terms of time and effort, and users are unwilling to provide an adequate number of tags, a problem known as tag sparsity.
Many works have addressed the tag recommendation problem, but the special characteristics of short texts have made tag recommendation a new and even more challenging problem. It has been shown statistically that social texts are extremely short, poorly composed, and tend to be more informal (Guo, Li, Ji, & Diab, 2013), so the application of conventional statistical techniques becomes impractical.
When we search for a text, what we really want is to look for the meaning behind the words of the text, not the exact terms. Latent Semantic Analysis (LSA) has the ability, beyond other techniques, to discover these meanings, depending on a powerful linear algebra technique called the Singular Value Decomposition (SVD) (Ryza, Laserson, Owen, & Wills, 2015). SVD can describe the intensities of the relations between the components of an input matrix, e.g. documents and terms, revealing different relations between the components, such as term-to-term, term-to-document, and document-to-document relations (Turney, 2001). This property gives LSA an advantage over techniques such as natural language processing (NLP) (Guo et al., 2013; Laclavik, Šeleng, Ciglan, & Hluchý, 2012) or machine learning techniques (Allahyari & Kochut, 2016a; Tang, Hong, Li, & Liang, 2006) that lack semantics, because it goes deeper than comparing terms, to comparing the meanings behind those terms (Ryza et al., 2015).
LSA has so far been applied to datasets other than the Arabic Wikipedia, since the Arabic language may pose additional problems: few (or less reliable) resources are available to extract the needed data from the text. And while the Arabic Wikipedia has recently been used in fields other than tagging, this field remains unexplored, especially for short texts.
Our work aims to recommend tags for short Arabic texts, e.g. tweets, depending on Arabic Wikipedia articles and categories, in an effort to select proper tags, such as the titles and categories of the articles pertinent to the text, by utilizing LSA and its computationally heavy dimensionality reduction. In order to do that, we need to handle a massive collection of data, the Arabic Wikipedia, which contains over a million articles and seven million terms, more than any single computer accessible to us can deal with, leading to our need for an Apache Spark cluster (Zaharia, Chowdhury, Franklin, Shenker, & Stoica, 2010).
The choice of the Arabic Wikipedia as a source of tags is motivated by its large coverage of different knowledge areas, which makes it adequate for recommending tags in any domain of knowledge. Given an Arabic short text, the system suggests ranked tags for that text, selected from the titles and categories of the Arabic Wikipedia. (Figure 1.1) presents the system as simple steps; the details will be discussed later in Chapter 3.

Figure (1.1): The system described in simple steps
First, the system constructs the term-document matrix by employing the term frequency-inverse document frequency (Tf-idf) weighting scheme on the bodies of the articles after segmentation and lemmatization. Then latent semantic analysis (LSA) is applied to that matrix by performing the singular value decomposition. This step allows the system to discover hidden semantics between the input short text and the Wikipedia articles by calculating cosine similarity. Tags are selected from the titles and categories of the articles most similar to the short text, and the selected tags are ranked in order to present the best tags first.
As far as we are aware, this is the first effort that aims to offer tag suggestion for Arabic text using Wikipedia. While the English version of Wikipedia has been widely utilized in several research areas related to information retrieval and natural language processing, not all researchers and developers have the computational resources to process such a volume of information, and there have been few efforts to utilize the Arabic Wikipedia for similar research. The proposed system is expected to act as a baseline for research tackling Wikipedia-based tagging of Arabic text.
The tag recommender was assessed over a dataset of 100 short texts gathered randomly from Twitter in three domains: Sports, Technology, and News. The tags generated by the system were examined and judged by two human experts in each field. Our recommender achieved 84.38% mean average precision and 96.53% mean reciprocal rank.

Statement of the problem
The main problem addressed by this research is how to recommend semantically related tags for Arabic short texts by exploiting the Arabic Wikipedia. No effort, to our knowledge, has explored the use of the Arabic version of Wikipedia for tagging Arabic texts.
Besides, the tags generated by existing techniques mostly rely on statistical approaches and lack semantics. Existing techniques are also restricted to the English language or applicable to long documents only. In addition, many existing approaches are domain specific, have limited coverage of knowledge areas, and often do not suit extremely short, poorly composed, and informal texts.

Objectives
In this section, we present both main and specific objectives of the research work.

Main Objective
The main objective of this research is to design and implement an automatic semantic tag recommender for short Arabic texts that is accurate and reliable, by exploiting the Arabic Wikipedia.

Specific objectives
The specific objectives of this research are:
1. Explore how the massive content of Wikipedia can be processed effectively.
2. Explore the best processing and NLP techniques for Arabic lemmatization and segmentation, compare them, and select the most suitable for our work in order to access, preprocess, clean, and filter the content of the Arabic Wikipedia.
3. Investigate the implementation of LSA and how to identify the most relevant and similar documents.
4. Provide a novel technique for tagging Arabic short texts from the titles and categories of relevant Wikipedia articles.
5. Assess the performance of our system by annotating short texts obtained from social networks (Twitter); the performance is evaluated by a number of experts in different fields and by evaluation metrics.

Importance of Research
1. Recommend semantically related tags for Arabic short texts, which give insight into and enrich the text. Tags are becoming ever more significant for improving search and text retrieval; simplifying archiving, linking, classification, clustering, and recommendation; and providing consistency among users.
2. Due to the scarcity of works oriented towards the Arabic language in the field of automatic tag recommendation, this work could be a first step in the field of Arabic tag recommendation. Our technique is still general, but the evaluation is limited to Arabic short texts.
3. Extend the coverage of our tagger by exploiting the Arabic Wikipedia, with its massive content, as background knowledge. This provides a system that is more general than domain-specific taggers.

Scope:
 Our work is limited to short Arabic texts, but the process is easily applicable to any language.
 Our technique considers standard Arabic as well as the non-standard Arabic texts published by ordinary users.
 The evaluation of the system was done using a specific dataset gathered from posts on Twitter in the fields of Sports, Technology, and News, matching the fields of our experts. It was not possible to conduct a comparative study due to the lack of similar tagging approaches for Arabic text.
 Apache Spark was used as a parallel framework to process the content of Wikipedia and build the LSA-based system.

Limitations:
1. The low efficiency of existing Arabic segmenters and stemmers affects the quality of the results.
2. Some Arabic Wikipedia pages have misspellings and incomplete content.
3. The tweets used for testing contain words of daily dialect (slang) and misspellings, which have a negative influence on the results.
4. Non-Arabic names (e.g. people, places, scientific experiments, compounds) are sometimes written in different ways in Arabic, which affects the quality and accuracy of the results. Also, the system excludes terms written in Latin characters.
5. Terms of the input short text that are not found in Wikipedia are excluded from the short text.
6. Comparing a short text with a long one increases the computational load on the system.

Research contribution
The work in this thesis makes the following research contributions:
1. A comparison was conducted between several Arabic NLP tools to select the best one, based on the suitability of their output for our work and regardless of execution time.
2. LSA was implemented on the whole Arabic Wikipedia; as noted, LSA has mostly been used to tackle the English Wikipedia rather than the Arabic one.
3. A novel system is presented that can be considered a guideline for future efforts utilizing the structure of the Arabic Wikipedia in real-life applications.
4. A standard dataset of Arabic short texts and tags was generated.

Structure of Thesis
The thesis consists of five chapters, organized as follows: Chapter 1: Introduction: this chapter gives an overview of the problem and of work done in the field, and focuses on the proposed solution. It also discusses the challenges and difficulties of using Arabic text and the Arabic Wikipedia.
Chapter 2: Literature Review: this chapter focuses on related works that employed Wikipedia or LSA as well as the works on the tagging field.
Chapter 3: Methodology: this chapter explains the detailed steps of the tagging system, and presents a scenario of the system and the results of each phase.
Chapter 4: Results and Discussion: this chapter explains the assessing process of our system, test dataset, evaluation metrics, and discusses the results focusing on the sources of strengths and weaknesses.
Chapter 5: Conclusions: this chapter presents a conclusion of the thesis and possible future works.

State of the Art
The world-wide-web has become the largest ever free-access information repository with billions of web pages (Abdeen & Tolba, 2010). With the massive daily increase of data, especially text, novel approaches are needed to mine such data efficiently and effectively. One way to improve efficiency is to provide proper tags.
As stated in Chapter 1, our work aims to recommend tags for short Arabic texts, e.g. tweets, depending on Arabic Wikipedia articles and categories, by utilizing LSA and its computationally heavy dimensionality reduction. This requires handling a massive collection of data, the Arabic Wikipedia, with over a million articles and seven million terms, which no single computer accessible to us can deal with, leading to our need for an Apache Spark cluster (Zaharia et al., 2010).
The following sections present a brief background on Apache Spark, latent semantic analysis, singular value decomposition, and the Arabic Wikipedia. We restrict our attention to Spark because it provides a highly-optimized machine learning library called MLlib (Meng et al., 2016), which has several features that are particularly attractive for matrix computations (Bosagh Zadeh et al., 2016; Zadeh et al., 2015):

1. Resilient Distributed Datasets (RDDs) are essentially distributed, fault-tolerant vectors on which operations can be performed as in local mode (Gittens et al., 2016).
2. RDDs allow user-defined data partitioning, and the execution engine can exploit this to co-partition RDDs and co-schedule tasks to avoid data movement.
3. Spark logs the history of the operations used to build an RDD, enabling reconstruction of lost partitions upon failures.
4. Spark provides a high-level API in Java that can be easily extended, which led to a coherent API for matrix computations.
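To make the RDD abstraction concrete, the following minimal Scala sketch (an illustration only, not code from our system; the input file name is hypothetical) builds an RDD from a text file and counts terms across partitions:

import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))
    // An RDD behaves much like a local collection but is partitioned across the cluster.
    val articles = sc.textFile("articles.txt")      // hypothetical input file
    val termCounts = articles
      .flatMap(_.split("\\s+"))                     // tokenize each line
      .map(term => (term, 1))
      .reduceByKey(_ + _)                           // combined per partition, then shuffled
    termCounts.take(10).foreach(println)
    sc.stop()
  }
}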
Hadoop (Zikopoulos, 2011) is another big data processing framework: a software library that allows for the distributed processing of large data sets (big data) across computer clusters using simple programming models. But Spark is preferable for us because (Spark, 2014): first, it is easier to use than Hadoop and allows writing applications in Java and other languages; second, Spark runs programs up to 100 times faster than Hadoop; third, Spark powers a stack of libraries including MLlib for machine learning, which is essential to our work, and also provides near-real-time analysis suitable for machine learning.
Many works have used Spark and MLlib for data analysis purposes (Agnihotri, Mojarad, Lewkow, & Essa, 2016; Moss, Shaw, Piper, Hawthorne, & Kinsella, 2016), noting their adequacy for processing the terabytes and petabytes of data that are commonplace in modern society, where both machines and humans generate petabytes of data every day.

Latent Semantic Analysis (LSA)
Latent Semantic Analysis, as the name indicates, is the analysis of hidden semantics in a corpus of text. Any collection of documents can be represented as a huge term-document matrix, and properties such as how close two documents are, or how close a document is to a query, can be deduced by cosine similarity. However, such models have two drawbacks that are common in many languages: polysemy and synonymy (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990), where polysemy refers to a word that has different meanings in different contexts, and synonymy to a concept having multiple forms of representation, i.e. two or more words denoting the same concept.
LSA transforms the original data into a different space so that two (or more) documents/words about the same concept are grouped together (so that they are most similar to each other). LSA achieves this by Singular Value Decomposition (SVD) of term-document matrix.

How Latent Semantic Analysis Works
When we try to find documents relevant to search words, the problem arises because what we really want is to compare the meanings or concepts behind the words.
LSA attempts to solve this problem by mapping both words and document into a concept space and doing comparisons in that space (Deerwester et al., 1990).
In order to make this problem solvable, LSA introduces some dramatic simplifications.
1. Documents are represented as "bags of words", where the order of the words in a document, sentence structure, and negation are not important; only the number of occurrences of each word in the document matters.
2. Words are assumed to have only one meaning. This is clearly not the case ("جدول", for instance, could be a table "صفوف وأعمدة", a schedule, or a spring "ينبوع"), but it makes the problem tractable.
To build the term-document matrix, words are usually pre-processed by means of tokenization, stop-word removal, and stemming (Sarwar, Karypis, Konstan, & Riedl, 2001). Then each token is assigned a weight proportional to its frequency, normalized using various schemes, the best known being the term frequency-inverse document frequency (Tf-idf) scheme (Han, Pei, & Kamber, 2011). Tf-idf is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. In this matrix, each column represents a document, and each row in the column represents a term frequency in that document.
We apply Tf-idf weighting because it negates the effect of high-frequency words in determining the importance of a document, and we use log base 10 to diminish the magnitudes of the results, since we are dealing with a huge number of documents and terms. As a simple example, (Table 2.1) shows the occurrences of each term in every document, on which we depend in calculating the Tf-idf for each term-document pair: with tf(t1, d3) = 0.6 and 2 of the 5 documents containing t1, idf(t1) = 1 + log(5/2) = 1.398, and finally Tf-idf(t1, d3) = 0.6 × 1.398 = 0.8388. This is performed for every term in each document.
In LSA, the matrix approximation is performed by singular value decomposition, which can relate documents and terms to concepts. The documents and terms in each concept are all semantically related, which makes this approach superior to frequency-based approaches. Example: let M be a 5×7 matrix (5 documents × 7 terms) whose values, for simplicity, reflect term counts in documents. We need to perform the SVD on the matrix, then perform the dimensionality reduction setting k = 2, where 2 is the number of concepts to map the documents into.

M = (the entries of M are shown in the original figure)
The result after performing the SVD and the dimensionality reduction with k = 2 is shown in the original figure: the shaded values in U represent the documents related to the shaded concept in S, and the terms related to the same concept are the ones shaded in V^T.
An example of the terms and documents that can be found in a concept is shown in (Table 2.2).

We notice that the documents in the concept have a thematic coherence with each other and with the terms related to the same concept, and the terms are also semantically related to each other.
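To make the decomposition concrete, the following Scala sketch uses the Breeze linear algebra library (bundled with Spark) to factorize a small 5×7 document-term matrix and truncate it to k = 2 concepts. The matrix entries below are illustrative stand-ins, not the values of the original example:

import breeze.linalg.{DenseMatrix, svd}

object ToySvd {
  def main(args: Array[String]): Unit = {
    // 5 documents x 7 terms; entries are raw term counts (illustrative values only).
    val m = DenseMatrix(
      (2.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0),
      (1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0),
      (1.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0),
      (0.0, 0.0, 0.0, 0.0, 3.0, 1.0, 2.0),
      (0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0))

    val svd.SVD(u, s, vt) = svd(m)    // full SVD of the dense matrix
    val k   = 2                       // keep only the two strongest concepts
    val uk  = u(::, 0 until k)        // 5 x 2: documents in concept space
    val sk  = s(0 until k)            // the two largest singular values (concept strengths)
    val vtk = vt(0 until k, ::)       // 2 x 7: terms in concept space
    println(s"concept strengths: $sk")
    println(s"document-concept matrix:\n$uk")
  }
}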

Querying and scoring with the low dimensional representation
The Tf-idf matrix presents only shallow knowledge about the relationships between entries, depending on simple frequency counts, whereas LSA can base scores (similarities) on a deeper understanding of the corpus. What about new documents? Simply the same, except that instead of finding the row of the document in the matrix, we need to create it. This can be done by setting the value of each term in the query (the new short text) to its inverse document frequency, to maintain the weighting scheme used in the original term-document matrix (Ryza et al., 2015). After forming the short text vector and before the comparison, the vector is multiplied by the matrix V^T to compute the concept-space vector of the short text.
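A minimal Breeze sketch of this folding-in step, assuming the per-term idf weights, the term index, and the reduced V^T matrix from the decomposition are already available (all names are illustrative):

import breeze.linalg.{DenseMatrix, DenseVector, norm}

object QueryFoldIn {
  // vtk: k x n term-concept matrix from the SVD; idf: per-term idf weights;
  // termIndex: maps a term to its column in the term-document matrix.
  def conceptVector(queryTerms: Seq[String],
                    termIndex: Map[String, Int],
                    idf: DenseVector[Double],
                    vtk: DenseMatrix[Double]): DenseVector[Double] = {
    val q = DenseVector.zeros[Double](idf.length)
    for (t <- queryTerms; i <- termIndex.get(t))   // terms absent from the corpus are skipped
      q(i) = idf(i)                                // keep the original weighting scheme
    vtk * q                                        // k-dimensional concept-space vector
  }

  // Cosine similarity between two concept-space vectors.
  def cosine(a: DenseVector[Double], b: DenseVector[Double]): Double =
    (a dot b) / (norm(a) * norm(b))
}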

Arabic Wikipedia
Wikipedia in general has been adopted in many works, especially in text processing.
Wikipedia is currently the most popular free-content online encyclopedia, which surpasses many conventional encyclopedias in scope and provides a cornucopia of world knowledge (Gabrilovich & Markovitch, 2006).

Related Works
Recently, automatic semantic tagging and annotation of documents has attracted a great deal of attention, since it can add significant benefits to many text mining tasks (Allahyari & Kochut, 2016a) such as information retrieval (Shapira et al., 2015), text classification (Wang et al., 2009), and text clustering and cluster labeling (Tonella et al., 2003), and many attempts have been made to address this issue. In the field of our work, several efforts employed different techniques and knowledge bases; some of them targeted documents, and others targeted short texts. In the following sections we review short and long text tagging in association with the works that applied LSA in their approaches.
Depending on the title and the abstract of scientific papers, Bhowmik (Bhowmik, 2008) utilized a set of pre-weighted keywords to weight and extract keywords and sentences according to their importance and position. His work is domain specific, depends on a set of keywords that needs updating, and cannot enrich very short texts. Likewise, Hulth (Hulth, 2003) built a supervised rule-induction classifier that uses the abstract of a paper to generate tags, and then added linguistic knowledge to the representation, so that each word has its part of speech as a new feature, which improved the results. In addition, HaCohen-Kerner (HaCohen-Kerner, 2003) used the frequencies of words and phrases to create a weight matrix from abstracts, then sorted these weights and chose the highest as tags. All the previous works consider only the occurrences of the words, and the resulting tags are contained in the original text and may lack semantics; in our work we consider semantic relations, and the generated tags are mostly not contained in the original text.

Text tagging:
In recent times, several attempts have been made to annotate documents and web pages. For example, Tang et al. (Tang et al., 2006) were concerned with semantic annotation of hierarchically dependent data, where target instances can have hierarchical dependencies with each other. Ontea (Laclavik et al., 2012) is a platform for automated semantic annotation, or semantic tagging; an implementation based on regular expression patterns was presented, and it was tested on job offers as documents, with an evaluation of the results. Both of the above works use linguistic techniques to address document annotation, and differ from our work in that they focus primarily on specific entities mentioned in the documents, whereas we take all the words into consideration.
Other works similar to ours include Schönhofen's (Schönhofen, 2009). Gong and Liu (Gong & Liu, 2001) performed SVD on an m×n term-sentence matrix (m terms, n sentences, m ≥ n), where each column represents a sentence and each row holds a term's frequency in that sentence. They used a few hundred CNN news articles to obtain the singular value matrix S and the right singular vector matrix V^T, then selected the k-th right singular vector from V^T, and finally selected the sentence with the largest index value in the k-th right singular vector and included it in the summary. Likewise, the term-sentence matrix was used by Yeh and others (Yeh et al., 2005), accompanied by a modified corpus-based approach, to select the best sentences summarizing one hundred political articles from New Taiwan Weekly. Both works are analogous to ours, except that we construct a term-document matrix instead of a term-sentence matrix, we tag with titles and categories, we deal with an enormous number of documents whereas they use hundreds, and we use short texts instead of long documents.
The algorithm proposed by Symeonidis et al. (Symeonidis, Nanopoulos, & Manolopoulos, 2008) performs latent semantic analysis and dimensionality reduction using the higher-order singular value decomposition technique. It was tested on two datasets, from Last.fm and BibSonomy, and the authors state that the results showed substantial improvements in effectiveness measured through recall.
All the works that exploited LSA used it to tag a document with the help of other documents in the same corpus, while in our work we use Wikipedia as a corpus to tag new short texts that are not in the corpus.

Introduction
This chapter presents the design of our tag recommender system, which applies latent semantic analysis to the Arabic Wikipedia. It details the steps of the tagging system: first, configuring the Arabic Wikipedia and preprocessing its text; second, computing the Tf-idf matrix and the SVD dimensionality reduction; third, preprocessing the short text to be tagged; fourth, the tag selection procedure, which exploits the titles and categories of the articles. Finally, a case study is presented to show the functional steps of the tagging process.

Configuring Arabic Wikipedia
This section briefly explains the configuration needed for our tagging system.
The description of the system is depicted in (Figure 3.1).

Figure (3.1): The tag recommender system
This configuration includes parsing and preprocessing of Arabic Wikipedia to enable fast information access and retrieval. Note that all the configuration settings are performed only once. (Figure 3.1) shows the complete system processes from preparation until tag selection. The solid arrows are for the system preparation, the dashed arrows are for the tagging process. Detailed description is provided below.
Code, data set and results can be found at https://github.com/YousefSamra/ShortTextTagging

Parsing and information extraction from Arabic Wikipedia XML Dump
In this section we briefly explain the steps taken to gather the content essential to our work. We selected the most recent XML dump file of the Arabic Wikipedia; (Table 3.1) presents some information about the file and the pages it contains. After removing all the pages listed in that table, the relevant remaining 435,672 articles used in our system were stored, after preprocessing, in a text file, to be later distributed among the working nodes of the Spark cluster. All other pages were swiftly inspected for any misclassified ones, and there were none.

Text Preprocessing
In order to better match the terms of the input text with the Arabic Wikipedia terms, it is important to perform some text preprocessing on both of them. The steps we undertook include cleansing, tokenizing, stemming, and stop-word removal, performed only on the bodies of the articles; titles and categories remain untouched.
While these steps are significant to our work, they are also tricky, because they require considerable investigation and comparison between the available tools.

Cleansing:
This step is meant to remove all text that increases the size of the corpus without improving the performance of the system, or even harming it. This includes all Latin characters, special characters, numbers, and punctuation on one hand. On the other hand, we found some terms that are repeated in most of the articles and add no information related to the context, but may in some cases cause performance deviations; these terms are mostly found at the end of many articles and are used for redirections or external links. (Table 3.2) presents the texts that require deletion. After this step the corpus contains pure Arabic content. Latin characters mostly refer to names of persons, locations, etc. that are also written in Arabic; for example, "Twitter" is written "تويتر". And while punctuation is common in most languages, it can make words differ: "عربي" ("Arabic") is not equal to "عربي." with a period. This step is vital because it is not performed in the subsequent steps.

Tokenization and stemming:
Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens, while stemming is the process of reducing inflected or sometimes derived words to their word stem, base or root form. Tokenization and stemming (also called lemmatization) are crucial to our system, because the generated terms are the input to Latent Semantic Analysis.
Different term formation may influence the system's ability to match terms. To optimize this step we carried out a comparison between four commonly known Arabic language processors: two stemmers, Al-Khoja (Khoja, 2001) and SnowBall (snowballstem, 2016), and two segmenters, Stanford (CoreNLP, 2016) and Farasa (QCRI, 2016). To perform the experiment we randomly selected 5 articles and applied each tool to their terms after removing all stop words and repetitions; the final set consists of 751 unique terms. (Table 3.3) shows a snippet of the results. Tools were judged according to correctness and suitability for our work; execution time is out of scope. Besides, this step is a one-time execution, performed only once before the system runs; the only need for preprocessing after that is for constructing the input vector. Results in (Table 3.4) show that Farasa has the best measures, and all tools outperformed the Stanford segmenter in both precision and execution time. This is because Stanford did not remove "ال" from the beginning of most terms that contain it, such as the first term in (Table 3.3), which is why we consider it inappropriate. While investigating the results, we noticed that both Al-Khoja and Stanford unify Arabic terms that have different meanings in context, or generate wrong roots. For example, "ضفة" ("bank/shore"), "ضيف" ("guest") and "يضيف" ("add") all became "ضيف", while "مصب" ("estuary") was wrongly rooted to "صبأ" ("renounce") where the correct root is "صبب". Both also make errors when dealing with terms containing Hamza ("ؤ", "ئ"). In addition, they work badly on both Arabic words and non-Arabic names such as "تمويه" ("camouflage"), "مياه" ("waters"), "ماراثون" ("Marathon") and "مارتن" ("Martin"); this last fault was produced by the SnowBall stemmer too. On the contrary, the Farasa segmenter works well on Arabic terms as well as on non-Arabic names; besides, it does not completely root the Arabic terms, which helps our system distinguish between them, while still having the ability to strip the Arabic additive letters and pronouns, which makes it the algorithm of choice. We chose Farasa over SnowBall despite the difference in execution time because we are concerned with the correctness of the results more than with efficiency; besides, Wikipedia is processed by Farasa only once, when the system is built. We will call Farasa a stemmer because it partially stems terms by removing the attached letters and additive pronouns.

"A Strong match between Chelsea and Manchester City and Liverpool awaits"
the output of the algorithm was "

Stop-words removal:
Stop-words are commonly used words that appear frequently in a corpus.
Such words increase the size of the text, and removing them does not affect retrieval efficiency (Al-Shalabi, Kanaan, Jaam, Hasnah, & Hilat, 2004). We applied a stop-word removal algorithm to reduce the size of the corpus and improve retrieval efficiency. Since our text is already cleansed and stemmed, the algorithm simply iterates over the text and removes any of the 266 listed words it finds. For example, for the text "اليوم موقعة قوية بين تشيلسي ومانشستر سيتي وليفربول يترصد" ("A strong match between Chelsea and Manchester City, and Liverpool awaits"), the output of the algorithm is "موقعة قوية تشيلسي مانشستر سيتي ليفربول يترصد" after removing "ال" ("the"), "يوم" ("today"), "بين" ("between") and "و" ("and").
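A minimal sketch of this step, assuming the 266-word list is stored one word per line in a UTF-8 text file (the path is hypothetical):

import scala.io.Source

object StopWords {
  // Load the stop-word list once; the file name is hypothetical.
  val stopWords: Set[String] =
    Source.fromFile("arabic_stopwords.txt", "UTF-8").getLines().map(_.trim).toSet

  // Drop every token that appears in the stop-word list.
  def remove(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(stopWords.contains)
}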
After this step, each Arabic Wikipedia article is represented as a title, a list of tokens (cleansed, stemmed, and with stop-words removed), and a list of the categories the article is associated with. These articles are now ready, in a file, to be distributed among the working nodes of a standalone Spark cluster.
At this point, our knowledge source contains only Arabic Wikipedia articles and each article body is presented as a list of tokenized and partially stemmed tokens.
(Table 3.5) shows some information about our base knowledge.

Tag Recommendation system
After preparing the data, it is ready to go through the system. In the following steps we generate the singular value decomposition matrices, which are searched for the articles most similar to the input short text; but first we need to calculate the Tf-idf weights and convert the document representations into vectors.

Computing the Tf-idfs
At this point all the articles are represented as arrays of terms, each array corresponding to a document. The next step is to compute the frequency of each term in its document (tf) and the document frequency of each term within the entire corpus (df). We apply Tf-idf weighting because it negates the effect of high-frequency terms in determining the importance of a document, and we use log base 10 to diminish the magnitudes of the results, since we are dealing with a huge number of documents and terms.
Tf-idf is a well-known numerical statistic intended to reflect how important a term is to a document in a collection or corpus (Han et al., 2011), and we employ it to gain statistics about our corpus as follows:

Tf-idf(t, d) = tf(t,d) × (1 + log10(N / df(t)))        (3.1)

where tf(t,d) is the number of appearances of the term in the document, N is the total number of documents in the corpus, and df(t) is the number of documents in the corpus that contain the term.
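The following Spark sketch illustrates how Equation 3.1 can be computed over the corpus with RDDs; it is a simplified illustration under the weighting stated above, not our exact implementation:

import org.apache.spark.rdd.RDD

object TfIdf {
  // docs: each article as a sequence of preprocessed terms.
  def tfIdf(docs: RDD[Seq[String]]): RDD[Seq[(String, Double)]] = {
    val n = docs.count().toDouble                  // N: total number of documents
    // df(t): the number of documents that contain term t
    val df = docs.flatMap(_.distinct.map((_, 1))).reduceByKey(_ + _).collectAsMap()
    val bcDf = docs.sparkContext.broadcast(df)
    docs.map { terms =>
      val counts = terms.groupBy(identity).mapValues(_.size)
      counts.map { case (t, tf) =>
        (t, tf * (1 + math.log10(n / bcDf.value(t))))   // Equation 3.1
      }.toSeq
    }
  }
}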

Vectorization
With the Tf-idf matrix in hand, we can perform the singular value decomposition, but first we need to convert the Tf-idf rows into sparse vectors, for two reasons. The first is that this form is required to perform the singular value decomposition. The second stems from the nature of our data, which contains mostly zeros for each document: a sparse vector implementation is more space efficient, since it stores only the indices of the terms with non-zero values and discards all zero entries, which also helps speed up calculations.
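For illustration, here is a sparse MLlib vector over a 7-term vocabulary that stores only three non-zero Tf-idf weights (the indices and values are illustrative):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

object SparseDemo {
  def main(args: Array[String]): Unit = {
    // Non-zero Tf-idf weights only at term indices 0, 3 and 5; zeros are not stored.
    val docVec: Vector = Vectors.sparse(7, Array(0, 3, 5), Array(0.84, 1.40, 0.26))
    println(docVec)   // prints (7,[0,3,5],[0.84,1.4,0.26])
  }
}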

The singular value decomposition factorizes the term-document matrix M into three matrices:

M(m×n) ≈ U(m×k) S(k×k) V^T(k×n)        (3.2)

where m, n, and k are the number of documents, the number of terms, and the number of concepts respectively. S is a k×k diagonal matrix that holds the singular values; each diagonal element in S corresponds to a single concept, or topic, which relates to a column in U and a column in V, and its magnitude corresponds to the importance of this concept for the corpus. A key insight of LSA is that only a small number of concepts is important to represent the data (Ryza et al., 2015). On that ground we chose k to be 1000 concepts, which is more than enough to represent the Arabic Wikipedia.
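A sketch of how such a decomposition can be computed with MLlib's distributed RowMatrix, assuming the Tf-idf vectors are held in an RDD (a simplified illustration of the approach, not our exact code):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

object WikiSvd {
  // rows: one sparse Tf-idf vector per Wikipedia article.
  def decompose(rows: RDD[Vector], k: Int = 1000) = {
    val mat = new RowMatrix(rows)
    // computeU = true keeps U, which is needed later to compare articles with the query.
    val svd = mat.computeSVD(k, computeU = true)
    (svd.U, svd.s, svd.V)   // U: distributed m x k; s: k singular values; V: n x k (local)
  }
}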
To make this as simple as possible, consider the example presented in chapter 2.
After performing the SVD on Tf-idf matrix of 5 articles that contain 7 unique terms, the resulted 3 matrices will be theoretically as shown in (Figure 3.2) taking the number of concepts k=2.

Figure (3.2): Result of SVD for a 5 documents 7 terms matrix
U is an m×k matrix whose columns form a basis for the article space. S is a k×k diagonal matrix, each of whose entries corresponds to the strength of a concept. V^T is a k×n matrix whose rows form a basis for the term space.
It is obvious from the values of S that the first concept is the most important in representing the corpus (5 documents), because it holds the largest value, 12.4. This concept is related to the first column in U, which holds 3 articles, and also to the first row in V^T, which holds 4 terms. Note that the article "معاهدة أوسلو" in U is the most important to the first concept, with value 0.58, while the article "الدولة المدنية" is the least important to the concept, with value 0.15. Furthermore, the term "الدولة" in V^T is the most important to the same concept, with value 0.56. As well, the first three documents in U and the first 4 terms in V contribute to the first concept but not to the second, since their values corresponding to the second concept are zeros. In other words, the first column in U and the first row in V^T are mapped to the first concept.
At this stage, we can refer to a concept as the main topic that describes the articles it contains. Concepts do not come with names, they are just concepts; however, we can simplify things by naming them. For example, we can name the first concept "Policy" ("سياسة") or "International affairs" ("شؤون دولية"), and we can name the second concept "Sports" ("رياضة") or "Football" ("كرة قدم"). A key insight of LSA is that only a small number of concepts is important to represent the data; two are sufficient in this example. So the corpus of the example basically talks about policy and football.
The system now is ready to receive the input short text and select the appropriate tags.

Tag Selection
After performing the SVD on the Arabic Wikipedia, we can select tags for the input short text. The input text first has to pass through the preprocessing steps; then the top similar articles are selected. Preprocessing the short text is vital because it allows us to map the terms of the short text to the terms of Wikipedia, which went through the same preprocessing in earlier steps. This allows two terms in the short text and a Wikipedia article to be identified as equal, and consequently the short text and the article to be identified as similar.
It is clear now that the matrices U and S give the article space and the concept space respectively. Having a new preprocessed input short text, we can compute the cosine similarity between it and every article simply by taking the dot product of the vectors and dividing the result by the product of their lengths (Sidorov, Gelbukh, Gómez-Adorno, & Pinto, 2014). (Figure 3.3) shows this part of the system, and Equation 3.3 gives the cosine similarity between two vectors A and B:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)        (3.3)

Cosine similarity was chosen for two reasons: 1. The normalization by vector lengths is a benefit of cosine similarity over Euclidean distance, Minkowski distance, and Manhattan distance. 2. Compared to Jaccard similarity, adjusted cosine similarity, and correlation-based similarity, which are likewise used to calculate how similar the items of a matrix are to each other, cosine and Jaccard similarities take less execution time, and cosine similarity performs excellently on huge matrices.
It is also worth mentioning that comparing two long vectors that share only a small number of terms is time inefficient; however, both the document and the tweet are represented as sparse vectors, which keep only the indices of the terms with non-zero values. This speeds up computations and increases space efficiency, although it may increase the time needed to create the vector of the input tweet.
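A compact sketch of this scoring step, assuming the rows of US and the concept-space vector of the query are available as Breeze vectors (all names are illustrative):

import breeze.linalg.{DenseVector, norm}

object Scoring {
  // usRows: one (articleId, row of U*S) pair per article, in concept space;
  // query: the concept-space vector of the short text.
  def topN(usRows: Seq[(Long, DenseVector[Double])],
           query: DenseVector[Double],
           n: Int = 7): Seq[(Long, Double)] = {
    val qNorm = norm(query)
    usRows
      .map { case (id, row) => (id, (row dot query) / (norm(row) * qNorm)) } // Equation 3.3
      .sortBy(-_._2)          // highest cosine similarity first
      .take(n)
  }
}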

Text Preprocessing
The input short text goes through all the text preprocessing procedures that the Wikipedia articles did.

Cleansing:
As discussed before, all Latin characters, special characters, and punctuation, which are presented in (Table 3.2), are removed.

Tokenization and stemming:
All Arabic additive pronouns are separated from the terms, and the terms are then partially stemmed with the help of the Farasa stemmer.

Stop-Word Removal
All Arabic stop-words are removed, including the additive letters and pronouns that were separated in the previous step.

Vectorization
The preceding preprocessing produces the clean terms of the short text. These terms have to be formed into a vector to be compared with the Wikipedia articles in the concept space resulting from the SVD. In fact, they may include terms that do not occur in Wikipedia at all, because of a misspelling for example; such terms have to be removed before creating the short text vector. The remaining terms are used to create the query vector by setting the value of each term to its inverse document frequency, to maintain the weighting scheme used in the original term-document matrix (the input of the SVD). After forming the short text vector and before the comparison, the vector is multiplied by the matrix V^T to compute the concept-space vector of the short text.

Selecting the top N Similar articles
Selecting the similar articles depends mainly on computing the cosine similarity between the vector of the short text and the rows of the US matrix. As explained previously, it is exactly like comparing documents in the concept space; the only difference is that we compare a new document (the short text) presented as a vector, and then return the documents with the highest scores. This enables LSA to discover hidden semantics between the short text and the documents. To give more insight into the importance of limiting the number of top articles, we note that an experiment based on only 10 short texts resulted in around 2000 different tags; imagine the number of tags that 100 short texts would produce.

Selecting Tags
In Wikipedia, each article is assigned to a number of categories, and each category groups a number of Wikipedia articles together. The articles of a category are similar to each other; if we look closely at these articles, we find that they describe the name of the category they belong to, or vice versa. That is, if we consider the category name to be the title of a book, each article is a chapter in that book: any chapter in an English grammar book can be tagged "English grammar". An article can also belong to a number of categories; consider the chapter "Introduction" that is found in many books.
In our system, tags are the categories and titles of some of the 7 articles most similar to the short text. Because the tweet is similar to these top articles, a category that contains some of them can also include the input tweet; in other words, such a category, one that contains some of the 7 articles, describes the content of the tweet in a general way and can be used as a tag for it. Likewise, because the tweet is similar to the content of these articles, their titles may be suitable as tags for the tweet.
We consider a title to be appropriate if it contains some terms of the tweet. Titles that satisfy this condition are more specific than Wikipedia categories. Speculating on the example of (Figure 3.2), the short text "معاهدة السلام الفلسطينية الإسرائيلية" ("Palestinian Israeli peace treaty") may result in similarities with the first two articles, which share the category "الصراع العربي الإسرائيلي" ("Arab Israeli conflict"); this category is considered an appropriate tag in a broad manner, and illustrates selecting categories as tags, discussed below. Furthermore, the title of the first article contains the term "معاهدة" ("treaty"), which exists in the short text. This allows it to be elected as a tag, so it is given a higher score, letting the title "معاهدة أوسلو" ("Oslo treaty") appear among the top tag suggestions.
This tag describes the tweet in particular. This stage therefore has two steps: obtaining the categories of the 7 articles with the highest scores (note that we treat categories as if they have no hierarchy), then adding analogous titles of the 7 articles, as follows:

Categories as tags
It is obvious that if two articles are similar to each other, there is a chance that they are partners in a category. We can refer to it as a category, subject, topic, division, class, tag, etc., but let us call it a category, as it is in Wikipedia. This means it can be suggested as a tag. But our articles, which have been compared to the short text in the concept space, are assigned to various types of categories, and we are concerned preferably with the categories that involve some or all of them. One simple way to identify these categories, or tags, is to pick out the intersections between the categories of the articles; these tags are assigned a weight, or score, equal to the number of intersections.
The higher the score of the tag, the more appropriate it is. It is worth mentioning that categories cover the general aspects of the short text. The procedure is described below (Procedure 1). For example, for the short text "موقعة قوية بين تشيلسي ومان سيتي وليفربول يترصد" ("A strong match between Chelsea and Man City while Liverpool awaits"), the articles with the highest scores are shown in (Table 3.6).
Procedure 1: selecting tags from the categories of the top articles
Let D = {d1, d2, ..., d7} be the set of documents most similar to a short text based on the SVD.
Let C(di) = {c1, c2, ..., cj} be the set of categories of document di.
We compute the importance of each category c as the number of top documents assigned to it:

score(c) = |{di ∈ D : c ∈ C(di)}|
The category "أندية الدوري الإنجليزي الممتاز" ("English Premier League clubs") has 3 intersections, indicating that it is a category of three of the similar articles, and this makes it appear first in the suggestions, while "أندية رابطة الأندية الأوروبية" ("European Club Association clubs") appears last, as less relevant, because it has only 2 intersections.
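A minimal sketch of Procedure 1, assuming the categories of the 7 top articles are given as sets (names are illustrative):

object CategoryTags {
  // categoriesOf: the category set of each of the 7 top articles.
  // A category's score is the number of top articles assigned to it.
  def score(categoriesOf: Seq[Set[String]]): Seq[(String, Int)] =
    categoriesOf
      .flatMap(identity)                        // all category occurrences
      .groupBy(identity)
      .map { case (cat, occurrences) => (cat, occurrences.size) }
      .toSeq
      .sortBy(-_._2)                            // most intersections first
}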

Titles as Tags
This is the second part of the tag selection procedure. After selecting the categories, the system moves on to check the titles of the most similar articles: it simply selects any title that contains a term of the short text. This is very effective when the terms refer to names of persons, locations, etc. A title that satisfies this criterion is likely to be a most relevant tag; consequently, we set its score to the maximum category intersection + 1, and if the title contains more than one term, its score is incremented by the number of terms it contains. We can describe the procedure as follows:

Procedure 2: selecting tags from the titles of the top articles
Let T = {t1, t2, ..., tn} be the terms of the tweet.
Let MaxCatScore be the maximum score of the categories.
Let L = {l1, l2, ..., l7} be the set of titles of the 7 top articles.
For i = 7 down to 1:
    If li contains terms of T, then set score(li) = MaxCatScore + the number of terms of T it contains.

For example, referring to the example in (Table 3.6) for the short text "موقعة قوية بين تشيلسي ومان سيتي وليفربول يترصد", the titles of the selected articles that contain a term of the short text are "تشيلسي" ("Chelsea"), "مانشستر سيتي" ("Manchester City"), and "ليفربول" ("Liverpool"); they are more relevant and appropriate as tags than the categories, so they are assigned a higher weight. Each title has a score of 4, which equals 3 + 1. Checking the titles is carried out in reverse order, as the procedure suggests: we examine the titles with the lowest scores before the ones with high scores, which keeps the order of the selected titles unless one contains more than one term. In the example above, the order of the titles will be as presented in (Table 3.6) even though they have the same score. Titles cover the specific aspects of the short text, unlike categories, which are broader. The criteria we adopted let title tags appear at the top of the suggestions, while categories appear last.
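A minimal sketch of Procedure 2 under the same assumptions (the reversal mirrors the "for i = 7 down to 1" loop):

object TitleTags {
  // titles: the titles of the 7 top articles, most similar first;
  // terms: the terms of the tweet; maxCatScore: the highest category score.
  def score(titles: Seq[String], terms: Set[String], maxCatScore: Int): Seq[(String, Int)] =
    titles.reverse.flatMap { title =>
      val hits = terms.count(t => title.contains(t))   // tweet terms appearing in the title
      if (hits > 0) Some((title, maxCatScore + hits)) else None
    }
}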

Case study
In the following case study, we illustrate a full scenario of short text tag suggestion, from processing the short text to suggesting the tags. At this point our system has started: Wikipedia has been formed into three matrices, which are stored in memory in a distributed fashion, ready for any input.
Suppose a user posts "الفرق بين المبرمج ومصمم الجرافيك" ("Programmer vs. graphic designer") on a social media website, for example. Our system grabs the text of the post and suggests tags for it as follows:

Preprocessing
The input text is first cleansed of all non-Arabic letters, punctuation, and special characters. Afterwards, the text is tokenized and segmented. Finally, stop-word removal is applied, as (Figure 3.4) shows.

Vectorization
The terms of the short text are now ready to be formed into a vector. This is done by setting the value of each term to its inverse document frequency, maintaining the weighting scheme of the original matrix. (Figure 3.5) shows the text as a vector. The vector is then multiplied by the matrix V^T to compute the concept-space vector of the short text.

Figure (3.5): Short text as a vector
In the Tf-idf matrix of the Wikipedia articles, each column represents an article, and each row in that column is the importance of a term in that article; we can refer to this column as the vector of the article. We treat the tweet as an ordinary Wikipedia article whose Tf-idf score is calculated with reference to Wikipedia as a corpus. Tf-idf is calculated by first calculating the frequency of terms in the tweet; in our case we consider each term as if it appears once in the short text, since even if it appears twice or more the effect would be negligible compared to Wikipedia. Then the inverse document frequency is calculated by dividing the total number of Wikipedia articles by the number of articles containing the term, and taking the logarithm of that quotient; this formula was presented in Equation 3.1. The aim of vectorizing the tweet with reference to Wikipedia as a corpus is to make its Tf-idf representation comparable to the Tf-idf representations of the other Wiki articles, so that applying the similarity measure (cosine measure) of Equation 3.3 becomes possible.

Select 7 most similar articles:
The vector generated in the previous step is now compared to the rows of the US matrix, which represent the Wikipedia articles. The dot product between the tweet vector and each row of the US matrix yields the cosine similarities between them. The articles are then sorted according to that similarity, and the top 7 articles are retrieved. (Table 3.7) shows the 7 articles with the highest similarity scores for the short text in this case study.

Tag selection
With the articles in hand, the system looks for intersections between the categories of the top articles, setting the number of intersections as the score of the category (tag), according to Procedure 1. (Table 3.8) shows the categories and their scores. In this case the first category has score = 3, indicating better suitability than the other two, but the list will be updated in the next step. After finding all the category intersections, the system looks for titles that contain terms of the input short text. If found, the system sets the weight of the title to the maximum category score incremented by the number of terms of the short text it contains, according to Procedure 2. (Table 3.9) shows the titles selected by the system. The first title contains two terms of the short text, "مصمم" ("designer") and "جرافيك" ("graphic"), so its score is set to 3 + 2 = 5, while the others contain only one term each, so their weight is set to 3 + 1 = 4.
All the titles in (Table 3.9) are more appropriate as tags than the categories in (Table 3.8). Tags from both tables are presented to the user in descending order, with the highest-scored tag at the top of the list. Categories are replaced by the titles that equal them, such as the category "تصميم الجرافيك" ("Graphic design"). The full list of tags is presented in (Table 3.10). Fortunately, all the tags in both tables were considered suitable except for "تصميم المعلومات" ("Information design"). One can also notice that titles with the same score are presented in their order of relevance (refer to (Table 3.7)).

 Wikixmlj
Wikixmlj is a Java API for parsing Wikipedia XML dumps (wikixmlj, 2016). It is part of the larger WikiSense project aimed at understanding Wikipedia for semantic annotation of texts. It provides easy access to Wikipedia XML dumps and has been used in different works (Santoso, Nugraha, Yuniarno, & Hariadi, 2015). Wikixmlj is available on GitHub (wikixmlj, 2016).

 Farasa Segmenter
Farasa is a fast and accurate text processing toolkit for Arabic. It consists of a segmentation/tokenization module, a POS tagger, an Arabic text diacritizer, and a dependency parser. It has been used in recent works (Abdelali, Darwish, Durrani, & Mubarak, 2016). Farasa is available at (QCRI, 2016).

 Apache Spark
Apache Spark (Spark, 2016) is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley's AMPLab and open-sourced in 2010 as an Apache project (Zaharia et al., 2010).
It provides a highly-optimized machine learning library called MLlib (Meng et al., 2016) which has several features that are particularly attractive for matrix computations (Bosagh Zadeh et al., 2016;Zadeh et al., 2015). Spark enables us to maintain the huge data in memory in a distributed manner.

Summary
This chapter presented the methodology we followed to construct our tag suggestion system. First, the XML dump was parsed for complete articles (body, title, and categories), which were stored in a text file to be distributed among the working nodes. Then the tagging process begins by preparing the system: text preprocessing, namely cleansing, segmenting, and stop-word removal, in that order, is applied to the bodies of the articles. The third step is constructing the Tf-idf matrix and then the singular value decomposition.
The system is then ready to receive any input, which is the fourth step. The input is preprocessed, vectorized, and compared with the articles in the concept space to find the most similar ones. The final step is to generate tags from the titles and categories of these articles: category selection is based on intersections, while title selection depends on containing a term of the input text.

Introduction
This chapter presents the process we used to assess and evaluate our tag recommender system. The main objective of the evaluation is to assess the reliability of the tag recommendation system: we aim to explore the extent to which the proposed system can accurately suggest suitable and correct tags for the input tweet from relevant Arabic Wikipedia articles.
Similar approaches from the state of the art have been evaluated by being compared to other approaches (Hassan et al., 2012;Otsuka et al., 2014). However, we are not aware of any similar approach that utilizes the Arabic version of Wikipedia for the tagging of short texts to compare with. Therefore, we opted to assess our system by experts' evaluation of the results.

Dataset
The dataset is a set of 100 tweets selected randomly from three different domains: Sports, Technology, and News (mainly Palestinian news). The tweets were divided by subject as follows: Sports, 36 tweets; Technology, 41 tweets; and News, 23 tweets. The aim is to assess how the generated recommendations are affected by changing the domain of knowledge. In addition, we emphasize that the selected 100 tweets were used only for the evaluation step, and were not used beforehand to tune or test the system during design and implementation. (Table 4.1) shows a snapshot of the dataset. The complete dataset can be downloaded from https://github.com/YousefSamra/ShortTextTagging.

Experiment settings
Some data sizes cannot be processed on a single machine: operations on the data may require memory that cannot be allocated on one machine. Performing the singular value decomposition on the Arabic Wikipedia requires tens of gigabytes of memory to be feasible. Moreover, these heavy computations need an efficient environment to finish in a reasonable time, even though time efficiency is not a primary concern in our experiment. These reasons led us to use Apache Spark. We restrict our attention to Spark because it provides a highly-optimized machine learning library called MLlib (Meng et al., 2016), which has several features that are particularly attractive for matrix computations. The Spark cluster parallel environment provides us with sufficient memory space distributed among the nodes of a standalone cluster.
The experiment was carried out in a computer lab consisting of 20 identical laptops, which we used as a Spark cluster. The settings were as follows: 1. Master node: the computer that executes the system code, organizes communication with the worker nodes, and collects and saves the results. The specifications of this machine are listed in (Table 4.2). Our data, the cleaned, tokenized, and segmented Arabic Wikipedia articles, was transferred manually to every worker node; keeping the data on the worker nodes lightens the communication and data-transfer load across the cluster. We also had to deploy Apache Spark on the worker nodes and start them manually, since there was no way to start them automatically.
After starting the master and the worker nodes, we can run our code on the cluster and record the results to be evaluated.
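For illustration, connecting to such a standalone cluster from the driver looks roughly like this; the master host name and memory setting are assumptions, not the lab's actual values:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://master-node:7077")  # standalone cluster master URL
        .setAppName("ArabicTagRecommender")
        .set("spark.executor.memory", "4g"))    # per-worker executor memory
sc = SparkContext(conf=conf)
```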

Evaluation Process
The evaluation process consisted of two experiments. The first experiment aimed to determine the number of top articles the system should use in order to produce a sufficient number of qualified tags. The second experiment assessed the system itself: we ran the tag recommender on the dataset and recorded the results, an ordered set of titles and categories of the top articles (as tags) for each tweet. In the next sections we discuss the two experiments and their results in detail.

Experiment 1: Determining the top N articles
As explained in Section 3.4.3 in Chapter 3, the tweet is compared with Wikipedia articles using the cosine similarity measure; the top similar articles are then used to identify the recommended tags by exploiting their titles and categories (refer to Section 3.4.4). Therefore, we aim at this stage to explore how the accuracy, in terms of the correctness of generated tags, is affected by changing the number of top articles used for tag recommendation. We also aim to optimize our system by identifying the number of articles that gives the best possible recommendations.
We tested our recommendation system with only 10 tweets while varying the number of top similar articles from 2 to 20. For example, the first trial used only the top 2 Wikipedia articles to recommend tags, while the last trial used 20 articles. Tags generated from each trial were validated by six human experts, two in each field, who marked each tag as "Correct" or "Incorrect". A tag was considered correct if it highlighted the meaning of the tweet or could be used to categorize it. (Table 4.4) shows a tweet and a sample of the resulting tags. The first column presents the tweet. The second and third columns present correct tags: tags in the second column highlight the meaning of the tweet, while tags in the third column categorize the tweet, describing the topic (subject) it belongs to.
The last column presents incorrect results that the experts considered inappropriate as tags. It is important to note that the total number of generated tags from all trials was 2007. This large number of tags to be validated by the experts explains why we limited this experiment to only 10 tweets rather than 100. (Table 4.5) illustrates the results of changing the number of similar documents.
The first column shows the changing number of articles for the ten tweets. The second column shows the average number of correct recommended tags for each number of articles over all tweets, and the third column the average number of incorrect tags. The final column shows the accuracy of each trial. Notice that the total number of recommended tags increases as the number of articles increases. During an early investigation of the results, applying a step of 2 for the number of top articles, we noticed a small peak of accuracy (65.8%) at 10 top articles. Testing other values before and after this peak was required, so we also recorded results for 7, 9, and 11 top articles, as presented in (Table 4.5).
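To make the aggregation concrete, the figures in (Table 4.5) can be computed as follows; this is a minimal sketch in which `marks_per_tweet` (one list of 0/1 expert judgments per tweet at a given N) is an assumed, illustrative structure:

```python
# marks_per_tweet: for a fixed N, one list of 0/1 expert marks per tweet,
# e.g. [[1, 0, 1], [1, 1], ...]; names here are illustrative, not the thesis's.
def table_row(marks_per_tweet):
    n_tweets = len(marks_per_tweet)
    correct = sum(sum(m) for m in marks_per_tweet)
    incorrect = sum(len(m) - sum(m) for m in marks_per_tweet)
    total = correct + incorrect
    return {
        "avg_correct": correct / n_tweets,
        "avg_incorrect": incorrect / n_tweets,
        "accuracy": correct / total if total else 0.0,
    }
```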
Figure (4.1) shows that at a low number of articles the accuracy looks high, but the number of tags is very small. For example, at 2 articles the average accuracy was 71.4%, but there were only 1-2 tags on average per tweet, which is very few, and the tags were sometimes unrelated to the short text. Some texts had 100% accuracy while others had 0%. At 18 articles, the incorrect tags began to exceed the correct ones, causing accuracy to drop to 50.7%; the system produced 8 to 24 correct tags, and around the same number of incorrect ones, for each tweet.
Using a bigger number (N) of top articles to obtain more tags also increases the number of wrong ones. Comparing the results at N=18 and N=20, the system added 6 more correct tags at N=20 but also introduced 19 more incorrect ones, which decreases accuracy and increases tag ambiguity. Besides, while a bigger number of top articles increases the number of correct tags, the majority of these new tags are general or broad: they categorize the tweet rather than highlight its meaning. For example, (Table 4.6) shows results at N=18 for the tweet of the previous example in (Table 4.4); all of the new tags are general and similar to the tags in the third column, and no specific or highlighting tags were added. The best number of top articles (N) suggested by the experiment is 7, which preserves a balance between the number of correct tags (5-10 per tweet) and an acceptable accuracy of 67.4%. Moreover, 7 articles generate a reasonable number of tags for experts to review, around 10 per tweet. Accordingly, 7 top articles is the N we chose for our system, sparing our experts the burden of fruitlessly investigating an immense number of tags and leaving other choices for future work. However, restricting the experiment to only 10 tweets is a limitation, since repeating it on a different 10 tweets may result in a different number of top articles. Repeating the experiment would cost the experts additional time and effort, but we believe that the selected number of top articles would remain around 7, given the structure of Wikipedia.

Experiment 2: Evaluation of the system
For the assessment of our system, we ran the tag recommender on the dataset and recorded the results, an ordered set of tags for each tweet. Tweets with their corresponding generated tags were divided into 3 groups according to subject domain.
Each group was then handed to two human experts in that domain to examine the tags and mark the suitable ones. Since two human experts validated the tags, we considered only the tags that both experts agreed upon to be correct. (Table 4.7) shows how each tweet and its recommended tags are presented to the expert for validation.
The expert was asked to mark each tag as "1" if it is correct or "0" if it is incorrect. One should also notice that the order of recommended tags was preserved and considered in the evaluation. A good recommender approach should order recommendations so that most relevant ones come first.
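Since a tag counts as correct only when both experts mark it "1", the agreement step reduces to an element-wise AND over the two judgment lists; a minimal sketch with illustrative names:

```python
def agreed_marks(expert_a, expert_b):
    """Keep a tag as correct (1) only if both experts marked it 1.

    expert_a, expert_b: equal-length lists of 0/1 judgments over the same
    ordered tag list for one tweet.
    """
    return [a & b for a, b in zip(expert_a, expert_b)]

# e.g. agreed_marks([1, 1, 0, 1], [1, 0, 0, 1]) == [1, 0, 0, 1]
```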

Evaluation Metrics
Most state-of-the-art works have adopted precision (Gong & Liu, 2001), recall (Otsuka et al., 2014), and F-measure (Hassan et al., 2012) to evaluate the performance of their approaches. While simple and descriptive, recall, and consequently F-measure, requires prior knowledge of all possible correct tags for each short text, which is infeasible in our case.
Therefore, what is appropriate for our tag recommender is to take the rank of the items into account. In recommender systems, the most important result for the final user is to receive an ordered list of recommendations, from best to worst. So we adopted Precision at position K (P@K), where K runs from 1 to 10, Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR). Works such as (Allahyari & Kochut, 2016a; Bogers & Van den Bosch, 2008) have applied these metrics.
The first two metrics emphasize the quality of the top K tags, while the MRR focuses on a practical goal: "how deep the user has to go down a ranked list to find one useful tag?" (Sun, Chen, & Rudnicky, 2017).
The metrics are defined as follows (Liu, 2009). To define MAP, one first needs Precision at position k (P@k):

$$P@k(q) = \frac{\#\{\text{relevant documents in the top } k \text{ positions}\}}{k}$$

In our system, k denotes the number of recommended tags examined for each tweet. For example, P@5 corresponds to the fraction of relevant tags for a tweet among the first 5 results. We aim to explore how the precision is affected by changing the number of tags to be examined.
Then, the Average Precision (AP) is defined as:

$$AP(q) = \frac{1}{m} \sum_{k} P@k(q) \cdot l_k$$

where $l_k = 1$ if the document at position $k$ is relevant and $0$ otherwise, and $m$ is the total number of relevant documents associated with query $q$. The mean value of AP over all test queries is the MAP:

$$MAP = \frac{1}{n} \sum_{q=1}^{n} AP(q)$$

where $n$ is the number of queries.
Mean Reciprocal Rank (MRR): for query $q$, the rank position of its first relevant document is denoted $r(q)$, and $1/r(q)$ is the reciprocal rank of $q$. MRR is the mean of the reciprocal ranks over all queries:

$$MRR = \frac{1}{n} \sum_{q=1}^{n} \frac{1}{r(q)}$$

Clearly, documents ranked below $r(q)$ are not considered in MRR. Recommended tags for each tweet were first assessed by the human experts, and the above evaluation metrics were then calculated based on their judgments. (Table 4.8) depicts a sample short text, the ordered tag results, the expert evaluation, and the calculations of P@k, AP@k, and reciprocal rank, with maximum k=10. (Table 4.9) presents the evaluation metrics of the tag recommender, calculated over the 100 tweets in the three subject domains, each judged by the experts in its subject for tag suitability and relevance. Based on Experiment 1, we expected around a thousand tags, ten per tweet on average; we obtained 933 because some tweets had fewer than 10 tags. This happens when a tweet's top articles belong to different categories: strongly related top articles share more categories than weakly related ones, and only shared categories are suggested as tags.
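For concreteness, here is a minimal sketch of these metrics over binary relevance lists (1 = tag judged correct; list order is the recommender's ranking); it follows the definitions above rather than any specific library:

```python
def precision_at_k(rel, k):
    """P@k over a 0/1 relevance list ordered by rank."""
    return sum(rel[:k]) / k

def average_precision(rel):
    """Mean of P@k taken at each relevant position (AP)."""
    hits = [precision_at_k(rel, i + 1) for i, r in enumerate(rel) if r]
    return sum(hits) / len(hits) if hits else 0.0

def mean_average_precision(rel_lists):
    return sum(average_precision(r) for r in rel_lists) / len(rel_lists)

def mean_reciprocal_rank(rel_lists):
    def rr(rel):  # 1/r(q), or 0 if no relevant tag was recommended
        return next((1.0 / (i + 1) for i, r in enumerate(rel) if r), 0.0)
    return sum(rr(r) for r in rel_lists) / len(rel_lists)

# The tweet discussed in the next section, with its only correct tag at
# position 6: average_precision([0, 0, 0, 0, 0, 1]) == 1/6 ~ 16.67%,
# and its reciprocal rank is likewise 1/6.
```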

Results and Discussion
Inspection of the results revealed that the system achieved a good performance of 84.39% mean average precision, which, as we believe and the results suggest, is adequate for a tag recommendation system. The system also achieved a considerable mean reciprocal rank of 96.53%, which means the user will usually find a suitable tag as the first or, at most, the second result; this proves the effectiveness of our simple ranking algorithm. However, this was not the case for all input tweets: we recorded a few where a suitable tag appeared neither first nor second. As an example, the tweet "أبردين يعزز موقعه في المركز الثاني بالدوري الأسكتلندي" ("Aberdeen strengthens its position in second place in the Scottish league") had only one proper tag, at k=6, resulting in AP@k = 16.67% and a reciprocal rank of 16.67% as well. A detailed discussion is provided in the next section.
We were also interested in examining the differences across subject domains; results for each domain are depicted in (Table 4.10). Deeper investigation of the results showed that precision is higher at the top of the list: as we move down the list of recommendations, precision drops, indicating weaker relatedness of the tags at the rear of the list. (Figure 4.2) shows the average precision for the results of the 100 tweets at k = 1 to 10. This result is consistent, to a large extent, with most web search and information retrieval systems, since the system introduces more relevant tags at the top of the list than at the bottom.

Figure (4.2): AP(1-100)@k(1-10)
To further explain our results, we inspected them thoroughly to identify the main sources of strength and weakness. The strengths can be stated in the following points: 1. Comparison in the concept space: this is mainly the job of the singular value decomposition. Classifying articles into concepts before comparing them with the input tweet gives higher scores to the articles in the concept the tweet belongs to, leading to better matches to the input. For example, the term "زيدان" (Zidan) could refer to the philosopher "يوسف زيدان" (Youssef Ziedan), the actor "أيمن زيدان" (Ayman Zidan), or the media figure "بدر آل زيدان" (Badr Al Zidan), but the comparison in the concept space favors the article whose concept matches the rest of the tweet.

Chapter Conclusions
In this work, we have developed a tag recommender system for short Arabic texts by exploiting Arabic Wikipedia as a knowledge base. Given a short Arabic text, the system compares it to the Wikipedia articles in the concept space to find the most relevant articles, then uses these articles to suggest ranked tags drawn from their titles and categories.
The system process consists of the following steps. First, configuring Arabic Wikipedia: the XML dump is parsed for complete articles (body, titles, and categories), and text preprocessing is applied, including cleansing, segmenting, and stop-word removal. Second, preparing the system: this step constructs the Tf-idf matrix and then the Singular Value Decomposition. Third, comparing: the system compares the input to the articles in the concept space to find the most similar ones. Fourth, electing tags: tags are selected from the titles and the categories of the relevant articles; category selection is based on the intersection, while title selection depends on the title containing a term of the input text, and the tags are ranked using a simple ranking procedure. The tag recommender was evaluated over 100 short texts from Arabic tweets in three different subjects; the results were judged by experts in each subject, and the system was then assessed using the mean average precision and mean reciprocal rank metrics. The results indicated that the system achieved high relevance measures, with 84.39% mean average precision and 96.53% mean reciprocal rank.
This work makes the following research contributions: To our knowledge, this is the first work to explore Arabic short text tagging using Arabic Wikipedia. Arabic Wikipedia has only recently been exploited by Arab computing researchers, and the few efforts in the literature that extend to the Arabic version of Wikipedia target different purposes, such as determining relations between topics (Kanan et al., 2015) and named entity recognition (Althobaiti, Kruschwitz, & Poesio, 2014), but not tag recommendation.
Our work proposes a simple ranking procedure designed specifically for ranking results in our case. This differs from other ranking algorithms, but we believe the system can be used in other applications, such as suggesting links in a "Read More" section that offers documents similar to the current document on the same website. The system, as it is, can also be employed for automatic categorization of Wikipedia articles.
Our system is one of the few works that apply latent semantic analysis to a non-Latin language. These works, including ours, prove the feasibility of employing LSA to achieve high performance.
As far as we know, most works utilize LSA to summarize documents or to find similarities between existing documents. This work is one of the few to confirm the applicability of introducing a new document into an existing LSA model.
The results show that the system helps map poorly composed short texts onto real-life concepts, which can improve other information retrieval processes. It also helps unify tags among users, which can improve classification and linking by providing more insight into the content and the meaning (purpose) of the short text.
We presented an in-depth evaluation of our tag recommender and explored the potential shortcomings and strengths of each involved process. This detailed evaluation can inform Arab researchers of the various options and recommendations for designing similar approaches.
Given the uniqueness of this work, we have several aims for the future: 1. Evaluate the system in the field of question answering. Treating Arabic Wikipedia as the source knowledge and the question as a short text, the system should ideally provide a single article that contains the answer to the question.

2. Exploit the latent semantic analysis of Arabic Wikipedia for other applications, such as finding similarity between Arabic documents or recommender systems.
3. Explore solutions for the weaknesses discussed in Section 4.6. For example, results can be improved by unifying the way foreign words are written in Arabic.
4. Prove the generality of the tag recommender by applying it to the English Wikipedia.