Grammar-aware phrase dataset generated using a novel Python package



Abstract
The past technique of manual dataset preparation was time-consuming and required much effort. Another approach to data acquisition was web scraping, but such web scraping tools also introduce numerous data errors. For this reason, we developed "Oromo-grammar", a novel Python package that accepts a raw text file from the user, extracts every possible root verb from the text, and stores the verbs in a Python list. Our algorithm then iterates over the list of root verbs to form the corresponding list of stems. Finally, it synthesizes grammatical phrases using the appropriate affixations and personal pronouns. The generated phrase dataset can indicate grammatical elements such as number, gender, and case. The output is a grammar-rich dataset applicable to modern NLP applications such as machine translation, sentence completion, and grammar and spell checking. The dataset also helps linguists and academia in teaching language grammar structures. The method can easily be reproduced for any other language with a systematic analysis and slight modifications to the affix structures in the algorithm. In this work, our objective is to design a system that automatically builds a clean dataset for NLP in an efficient way. To do so, we first collected suitable Oromo raw text from online sources and prepared it as a single text file for input. We designed Python-implemented software that accepts the raw text, pre-processes it, and generates the intended dataset. The system initially accepts the raw text as input, extracts all verbs from the text, trims the verbs into stems, and finally resynthesizes the stems with appropriate affixes and personal pronouns. Eventually, the system generates a phrase-based grammatical dataset that can be used to train different NLP models.

Data format
Raw and analyzed

Description of data collection
To prepare the dataset, we collected suitable Oromo raw text from online sources. We used a sample of 200 kB (about 100 pages) of raw text to produce the dataset. We prepared the data as a single text file and saved it in a local directory for data input. Before the data was extracted into tokens, we normalized the entire text using our pre-processing algorithm to remove white space, punctuation, and upper-case letters, which have no significant value for the intended data process.

Data source location
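The normalization step described above (removing white space runs, punctuation, and upper case) can be sketched as a small Python function. This is a minimal illustration, not the package's actual pre-processing code; the example words are only sample tokens, and the real Oromo-grammar rules may differ in detail.

```python
import re
import string

def normalize_text(raw_text: str) -> str:
    """Lowercase the text, strip punctuation, and collapse extra whitespace."""
    text = raw_text.lower()
    # Remove punctuation marks, which carry no value for phrase generation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_text("Inni   deeme.  ISHEEN dhufte!"))  # → inni deeme isheen dhufte
```

The cleaned string can then be split on spaces to obtain the tokens that feed the verb-extraction step.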

Value of the Data
• The dataset is used as the main input source to train neural networks and deep-learning models in modern NLP applications such as machine translation, sentence completion, and grammar and spell checking. It also helps linguists and academia in teaching language grammar structures. The dataset is not a mere collection of phrases; rather, it is a grammar-aware dataset that helps a machine easily recognize syntactic elements and grammar rules during model training.
• This dataset benefits researchers in the NLP domain, especially those dealing with low-resource languages, for which few or no datasets are available. It gives researchers the opportunity to dive directly into their research without the problem of finding data sources.
• NLP researchers who focus on the Oromo language can directly use the dataset to train sentence-completion and grammar and spell-checking models. They can also use it for machine translation after completing the English side of the phrases. The method can easily be reproduced for other languages with a systematic analysis and slight modifications to the source code, specifically to the affixation structures of the other language, to generate a similar dataset.

Objective
Over the past decades, machine translation (MT) and other NLP applications have seen many advancements [1]. In the past, many researchers used Rule-Based Machine Translation (RBMT) [2], Statistical Machine Translation (SMT) [3], and Neural Machine Translation (NMT) methods [4]. The last decade in particular has seen many improvements from NMT methods. However, these methods have not reached the quality of human translation [5]. Even the latest state-of-the-art NMT models are not as proficient as human translators, partly due to the lack of suitable datasets to train the models. The situation is worse for under-resourced languages like Afaan Oromo [6].
The two commonly used dataset types for language models are rule-based and phrase-based. In practice, data preparation used to be done manually by human translators, which required much time and effort. The other method of data acquisition was web scraping, which also introduces numerous data errors. Recent studies have focused on seeking solutions to these errors [7,8,9,10]. Currently, users demand well-structured and summarized textual datasets [11]. Inspired by the unfulfilled data demand of NLP applications, specifically for low-resource languages, we built the Oromo-grammar dataset [12]. The dataset is automatically generated using the custom Python package that we developed.

Data Description
So far, many researchers and the NLP community have relied heavily on manual dataset preparation for machine-learning tasks. This approach is tedious and time-consuming, since machine learning requires a large dataset for the best performance. For most well-resourced languages, datasets are readily available to different NLP applications. However, for under-resourced languages like Afaan Oromo, there were no publicly available datasets with which to build such applications. As already said, it is difficult to build a new dataset manually from scratch, and this is the main reason that motivated us to build this new dataset. Fig. 1 shows the sample input text consumed by the system and the generated output as a CSV file. The screenshot shows a very small snippet of the CSV file, which contains thousands of system-generated entries. When we feed kilobytes of raw text into the system, it generates megabytes of dataset.
As depicted in Fig. 1, the "root" column stores all the verb forms extracted from the raw text. The "stem" column stores the list of stems trimmed from the corresponding verbs. The remaining columns show the derivative phrases generated by our system for each stem. These phrase features will serve as a useful source for different NLP applications.
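The root-to-stem trimming and the pronoun/affix synthesis that populate these columns can be sketched as follows. The suffix table below is purely illustrative and greatly simplified; the actual Oromo-grammar package derives its affixation rules from a systematic analysis of Afaan Oromo verb morphology, and the function names here are our own.

```python
import csv

# Illustrative (simplified) pronoun -> past-tense suffix pairs; NOT the
# package's actual morphology tables.
PRONOUN_SUFFIXES = {
    "ani": "e",      # 1st person singular
    "inni": "e",     # 3rd person masculine singular
    "isheen": "te",  # 3rd person feminine singular
    "isaan": "an",   # 3rd person plural
}

def trim_to_stem(root: str) -> str:
    """Trim a final vowel from a root verb to obtain the stem (simplified rule)."""
    return root[:-1] if root and root[-1] in "aeiou" else root

def synthesize_phrases(roots, path):
    """Write one CSV row per root: root, stem, and one phrase per pronoun."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["root", "stem", *PRONOUN_SUFFIXES])
        for root in roots:
            stem = trim_to_stem(root)
            phrases = [f"{pron} {stem}{suf}" for pron, suf in PRONOUN_SUFFIXES.items()]
            writer.writerow([root, stem, *phrases])
```

Calling `synthesize_phrases(["deema"], "out.csv")` under these assumptions would produce a header row plus one row such as `deema, deem, ani deeme, inni deeme, isheen deemte, isaan deeman`, mirroring the column layout of Fig. 1.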

Experimental Design, Materials and Methods
We developed novel algorithms and methods to build our software package in the Python language. The algorithms are written in a logical flow that addresses the problem in stages: text pre-processing, root-word and stem formation, and dataset generation. The methods are shown in Algorithms 1, 2, and 3.
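The root-extraction stage can be sketched as a lexicon lookup over the normalized tokens. This is a minimal, self-contained illustration: the lexicon entries and the function name are hypothetical, and the actual package's verb list and matching rules are more elaborate.

```python
# Illustrative lexicon of known Afaan Oromo root verbs (example entries only).
KNOWN_ROOT_VERBS = {"deema", "dhufa", "nyaata"}

def extract_root_verbs(normalized_text: str) -> list:
    """Return the unique tokens found in the root-verb lexicon, in first-seen order."""
    seen = []
    for token in normalized_text.split():
        if token in KNOWN_ROOT_VERBS and token not in seen:
            seen.append(token)
    return seen

print(extract_root_verbs("inni deema isheen dhufa"))  # → ['deema', 'dhufa']
```

The resulting list of root verbs is what the subsequent stem-formation and dataset-generation stages iterate over.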