PRo-Pat: Probabilistic Root–Pattern Bi-gram data language model for Arabic based morphological analysis and distribution

Based on 29,192,662 HTML files obtained from the ClueWeb collection, a bi-gram data language model for Arabic is constructed. The created dataset considers standard types of bi-gram analysis, with a particular focus on the root-pattern paradigm in Arabic. Root-pattern distributions in the form of P(root|pattern), P(pattern|root), and P(pattern|pattern) are additionally estimated. Maximum Likelihood Estimation (MLE) at the root-pattern level, as a higher level of abstraction, has been widely neglected in the Arabic research community despite its advantage in reducing ambiguities in Arabic morphological analysis and its impact on cognitive aspects of Arabic word perception [1]. In the preprocessing phase, the HTML files were converted to 974 unfiltered raw text files with a total size of about 180 GB. These files were morphologically analyzed to extract and count frequencies of patterns, roots, particles, and stems, and in particular root-pattern occurrences. Based on the resulting corpus of around 18,482,719 raw words, a language data model is constructed containing 9,311,246 bi-grams of morphologically analyzed wordforms, including around 3.49 million bi-directional P(root|pattern) and around 1.153 million P(pattern|pattern) bi-grams in the form of conditional probabilities, covering a subset of around 8086 roots with 20,413 possible pattern forms. As this data model captures the root-pattern phenomenon in Arabic, the created data are useful for researchers working on cognitive aspects of Arabic such as visual word cognition, morpho-phonetic perception, morphological analysis, spell-checking, and resolving ambiguities in morphological parsing.


The "Morphological Dataset" and the "Raw Text Corpus" are supplemental byproducts used to verify the extraction process of the major "PRo-Pat Dataset". Our current model uses these texts as a bag of words. These additional datasets were added upon the journal's requirement that all data used in creating the major dataset be made available. Advanced deep cleaning of the full textual corpus is left as future work: the text was extracted from the HTML files (around 30 million) and only partially cleaned. Nevertheless, these raw data remain useful for many purposes. The major goal of this presentation is a missing dataset at a higher level of abstraction, namely a root-pattern probabilistic data model, which has been successfully used in spelling correction, query expansion, and relevance assessment.

Value of the Data
• The main goal of a language model is to estimate the probability distribution of word sequences generated in a natural language such as Arabic. In statistical natural language processing, estimating probabilities is mainly based on Maximum Likelihood Estimation (MLE), which can be achieved by bi-gram analysis. In our data model, Arabic word sequence distributions were additionally considered from a cognitive point of view, taking into account the singularity of the root-pattern phenomenon in Arabic word distributions [1,2]. This allows us to estimate word sequences at a higher level of abstraction, treating words as root occurrences together with all their potential occurrences in multiple morpho-phonetic patterns [5,6]. Predictions on P(root|pattern), P(pattern|root), and P(root|root) represent a higher level of abstraction than computing co-occurrences of their instances.
• The dataset provides the natural language processing community with a prediction model for Arabic that is built on a representative large corpus [3]. The corpus contains 18,482,719 words gathered from 29,192,662 different HTML webpages.
• Additionally, this data model is useful for reducing ambiguities, particularly the non-determinism arising in the probabilistic bi-directional root or pattern extraction process. The PRo-Pat probabilistic data represent an additional aid for ranking ambiguous root candidates based on maximal root estimation under a given pattern. For example, the maximal (or ranked) root can be estimated with the Bayesian classification model root* = argmax_root P(pattern|root) · P(root). Conversely, the backward estimation, i.e., the maximal (or ranked) pattern under a given root, can be estimated with pattern* = argmax_pattern P(root|pattern) · P(pattern). These models provide researchers with supportive knowledge to resolve different types of ambiguity problems.
Furthermore:
• N-gram language models over roots, stems, and patterns may provide additional aid in instantiating surface-level sentences where ambiguities arise due to missing diacritics, or in predicting certain word sequences.
• Word similarity in the context of spelling error correction and term translation [4].
• Representing word sequences as a root-pattern associative network is useful in query expansion modeling and word-topic relevance assessment [5,6].
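As a sketch of the Bayesian root ranking described above, candidate roots for a given pattern can be scored by P(pattern|root) · P(root), which is proportional to P(root|pattern). All counts and root/pattern names below are purely hypothetical illustrations, not values taken from the PRo-Pat dataset:

```python
from collections import Counter

# Hypothetical unigram root counts (illustrative only).
root_counts = Counter({"KTB": 900, "DRS": 400, "BSM": 150})

# Hypothetical (root, pattern) co-occurrence counts (illustrative only).
pair_counts = Counter({("KTB", "maC1C2aC3"): 300,
                       ("DRS", "maC1C2aC3"): 250,
                       ("BSM", "maC1C2aC3"): 10})

def rank_roots(pattern):
    """Rank candidate roots for a pattern by P(pattern|root) * P(root),
    which is proportional to P(root|pattern) by Bayes' rule."""
    total_roots = sum(root_counts.values())
    scores = {}
    for root, n_root in root_counts.items():
        p_pattern_given_root = pair_counts[(root, pattern)] / n_root
        p_root = n_root / total_roots
        scores[root] = p_pattern_given_root * p_root
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_roots("maC1C2aC3")[0][0])  # most probable root for this pattern
```

The backward direction (ranking patterns under a given root) follows the same shape with the roles of root and pattern swapped.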

Objective
The initial objective behind creating this dataset is to provide researchers with probabilistic data to reduce ambiguities arising in morphological analysis and part-of-speech (POS) tagging, in the context of improving spell checkers and search engines. These aspects were useful in improving the performance of tools used by the researchers.

Data Description
The data model contains a corpus of 974 raw text files extracted from 29,192,662 ClueWeb HTML files, with a total size of 180 GB. Each text file contains on average 30,000 different webpages. The size of the corpus is 18,482,719 terms. The data are organized in 7 table files, covering unigram frequencies of occurrence (root, pattern, stem) and their bi-gram MLE in the form of conditional probabilities. The data model additionally contains a morphological database produced using three different morphological parsers and stemmers (the Arabic Textware and Alkhalil morphological analyzers, and the Khoja stemmer).
The data model contains 9,311,246 bi-grams of morphologically analyzed wordforms, with around 3.4 million bi-directional P(root|pattern) and around 1.153 million P(pattern|pattern) bi-grams in the form of conditional probabilities, covering around 8086 roots. Fig. 1 shows a sample of an initial raw ClueWeb HTML file before parsing and basic cleaning.

Fig. 3 shows a fragment of the morphological analysis based on the Arabic Textware and Alkhalil morphological analyzers. Content: the table contains multiple records describing the output of the morphological analysis performed by the Arabic Textware and Alkhalil morphological analyzers. For example, the first record of the excerpt shows the wordform "alBaSSaMyn", Stem: "bassamyn", Root: "BSM", Pattern: C1aC2C2āC3yn, where C1, C2, and C3 are the root's radicals. The root "BSM" occurs in the templatic pattern C1aC2C2āC3yn as follows: C1 = B, C2 = S, C3 = M. The basic meaning of this root is "smile". The remaining records are analogous.

Fig. 4 shows a sample fragment of root and pattern frequencies of occurrence, used for estimating their corpus MLE. Content: the table contains multiple records describing the result of frequency counting for Arabic roots and patterns. For example, the first record shows the root "BDL" with a frequency of occurrence of 579,228. The basic meaning of this root is "substitute". The frequency of the pattern "C1aC2C2C3" is 3,620,475,326, where C1 = B, C2 = D, and C3 = L. The remaining records are analogous.

Fig. 5 shows a sample fragment of root-pattern and pattern-pattern frequencies used for MLE estimation. Content: the table contains multiple records describing the result of frequency counting for Arabic root-pattern co-occurrences. For example, the first record shows the frequency of occurrence of the pair ŠKY – taC1aC1aC3u, which is 50,364. The basic meaning of ŠKY is "doubting". The pattern-pattern frequency of occurrence for, e.g., "C1aC2C3u – C1uC2C3i" is 2493. The remaining records are analogous.

Fig. 6 shows a sample fragment of P(root|pattern), P(pattern|pattern), and P(pattern|root) MLE. Content: the table contains multiple records describing the result of the statistical analysis and the estimations. For example, the first record shows the MLE estimate of the probability of the root DMR given the pattern taC1aC2C2uC3ii, which is 0.0014; i.e., P(DMR | taC1aC2C2uC3ii) = 0.0014. Analogously, P(C1iC2C3i | mustaC1yC3uu) = 0.0214, and so on.
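The templatic root-pattern mechanism described above (radicals C1, C2, C3 filled into a pattern skeleton) can be sketched as follows; the helper function is illustrative, not part of the dataset tooling:

```python
def instantiate(pattern, radicals):
    """Fill a templatic pattern's radical slots C1, C2, C3 with a
    root's consonants, yielding a (transliterated) surface form."""
    surface = pattern
    for i, radical in enumerate(radicals, start=1):
        surface = surface.replace(f"C{i}", radical)
    return surface

# Root BSM in pattern C1aC2C2āC3yn, as in the sample record above.
print(instantiate("C1aC2C2āC3yn", "BSM"))  # BaSSāMyn
```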

Experimental Design, Materials and Methods
The following processing steps were applied to create the dataset (see Fig. 7).

• Html files extraction
• As the ClueWeb database is organized as a set of WARC (Web ARChive) files, the HTML files were first extracted from these archives.

• Html files parsing
• The next step is to extract the plain text from the HTML files; this is done using an HTML parser.
• The HTML parsing process reads and loads an HTML file and provides a way to traverse the HTML DOM for further processing such as cleaning, formatting, or text extraction.
• The result is 974 text files with a total size of about 180 GB.
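As a minimal sketch of the text-extraction step, Python's standard-library html.parser can pull visible text out of an HTML document. The pipeline's actual parser is not specified in this description, so this is an illustrative stand-in, not the tool that was used:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text while skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Extract whitespace-joined visible text from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(html_to_text("<html><body><p>مرحبا</p><script>x=1</script></body></html>"))
```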

• Indexing
• Establishing a vector space of all the distinct terms of the parsed text and counting the frequencies of each pair of two consecutive terms.
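The consecutive-pair counting in the indexing step can be sketched as follows (the token list is an illustrative placeholder):

```python
from collections import Counter

def count_bigrams(tokens):
    """Count each pair of two consecutive terms in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

# Illustrative token stream; real input would be the parsed corpus text.
tokens = "a b a b c".split()
bigrams = count_bigrams(tokens)
print(bigrams[("a", "b")])  # 2
```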

• Filtering
• Filter out non-Arabic terms that might still remain (such as Persian words).
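A minimal character-level filter over the basic Arabic letter block is sketched below. This is only a first pass: Persian-specific letters such as پ, چ, ژ, and گ fall outside U+0621–U+064A and are rejected, but Persian words written entirely with shared Arabic letters would still slip through, so the filter used in the actual pipeline may be more elaborate:

```python
import re

# Basic Arabic letter block: U+0621 (hamza) through U+064A (yeh).
ARABIC = re.compile(r"[\u0621-\u064A]+")

def keep_arabic(terms):
    """Keep only terms made up entirely of basic Arabic letters."""
    return [t for t in terms if ARABIC.fullmatch(t)]

print(keep_arabic(["كتاب", "test", "گچ"]))  # only the Arabic term survives
```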

• Morphological Analysis
• The Arabic Textware and Alkhalil morphological analyzers and the Khoja stemmer were used to analyze the corpus terms; this yields, for each word, all possible roots, stems, and patterns.
• A sample of the morphological output: about 18.4 million words were analyzed, with an indicator of the source of the analysis for each word (Arabic Textware, Alkhalil, or Khoja); see Fig. 3. Content: in the first record, the wordform "muntadiate" (forums, plural form) is parsed into Prefix: al (article), Stem: muntada (forum), Suffix: at, Root: NDY, Pattern: muC1taC2C3, and so on.

• Basic Counting
• Counting the frequency of occurrence of roots, stems, and patterns appearing in the morphological database.

• Root-Pattern, Root-Root, Pattern-Pattern Analysis
• To extract the root-pattern probabilities, the data output is used directly to count the number of times a specific root-pattern combination appears in a word; the conditional probabilities are then calculated as MLE estimates based on Laplace smoothing.
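The Laplace-smoothed MLE of a conditional probability such as P(root|pattern) can be sketched as follows. All counts and the root vocabulary are purely hypothetical illustrations, not values from the dataset:

```python
from collections import Counter

# Hypothetical (root, pattern) co-occurrence counts (illustrative only).
pair_counts = Counter({("DMR", "taC1aC2C2uC3"): 7,
                       ("KTB", "taC1aC2C2uC3"): 3})

# Hypothetical root vocabulary; in the real data model this would be
# the ~8086 roots covered by the corpus.
roots = {"DMR", "KTB", "BSM"}

def p_root_given_pattern(root, pattern):
    """Laplace-smoothed MLE:
    P(root|pattern) = (c(root, pattern) + 1) / (c(pattern) + |roots|),
    where c(pattern) sums the pattern's counts over all roots."""
    pattern_total = sum(c for (r, p), c in pair_counts.items() if p == pattern)
    return (pair_counts[(root, pattern)] + 1) / (pattern_total + len(roots))

print(round(p_root_given_pattern("DMR", "taC1aC2C2uC3"), 3))
```

The add-one term ensures that a root never seen with a pattern still receives a small nonzero probability instead of zero.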