Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words

Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slangs and common Swahili typos. The main source for these datasets was short Swahili messages collected from a Tanzanian platform used by young people to convey their opinions on matters they care about. We derive the list of common Swahili stop-words by reviewing the most frequent words generated with a Python script from our corpus, compile common slangs and their corresponding proper words with the help of Swahili experts, and generate common Swahili typos by analysing the least frequent words generated by a Python script from the corpus. The datasets were exported into files for easy access and reuse, and can serve as resources in the pre-processing phase for Swahili textual data in natural language processing.


Value of the Data
• These datasets are important because they contribute to improving Swahili textual data pre-processing, especially since Swahili is a low-resource language. Other languages such as English have well-documented resources for textual data pre-processing, accessible through different libraries, which is not the case for Swahili.
• The datasets will benefit researchers, application developers and anyone interested in machine learning, especially those working in natural language processing with Swahili textual data.
• These datasets can be used during the data pre-processing stage of Natural Language Processing tasks such as topic analysis and sentiment analysis to remove stop-words and to replace slangs and typos in any Swahili textual data.
• The datasets can also be updated and reused to fit particular domain areas.

Data Description
This section provides an individual description of each dataset in the following paragraphs.
Common Swahili Stop-words: The dataset contains over 254 unique Swahili words that are regarded as stop-words, since they do not add much meaning to a sentence and can therefore be ignored without sacrificing the meaning of Swahili sentences. The entire dataset is lowercased and stored in a comma-separated values (CSV) file encoded in UTF-8. The dataset can also be saved in other formats, such as tab-separated values, plain text (.txt) or JSON, depending on how it will be used in machine learning tasks. The dataset is publicly accessible at https://data.mendeley.com/datasets/mmf4hnsm2n/1.
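A minimal sketch of how the stop-words list can be consumed in a pre-processing step. The inline entries below are a small illustrative sample (common Swahili function words), not the published file; in practice the full one-column CSV would be read the same way.

```python
import csv
import io

# Illustrative one-column CSV content; in practice, read the published
# UTF-8 stop-words file instead of this inline sample.
csv_text = "na\nya\nkwa\nni\nza\n"
stopwords = {row[0].strip() for row in csv.reader(io.StringIO(csv_text)) if row}

def remove_stopwords(sentence: str) -> str:
    """Drop stop-words from a lowercased, whitespace-tokenized Swahili sentence."""
    return " ".join(w for w in sentence.split() if w not in stopwords)

print(remove_stopwords("elimu ya afya kwa watoto"))  # -> "elimu afya watoto"
```

Because the dataset is plain CSV, the same loading code works unchanged with the full file downloaded from the Mendeley link above.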
Common Swahili Slangs: The dataset contains 2 columns and over 234 unique rows, one column for the slang and the other for the corresponding proper Swahili word. All words are lowercased and stored in a comma-separated values (CSV) file encoded in UTF-8. The dataset is publicly accessible at https://data.mendeley.com/datasets/b8tc96xf3h/1.
Common Swahili Typos: The dataset contains 2 columns and over 431 unique rows, one column for the typo and the other for the corresponding proper Swahili word. All words are lowercased and stored in a comma-separated values (CSV) file encoded in UTF-8 for easy use in the machine learning pre-processing stage [4]. The typo dataset, updated over time, is available at https://data.mendeley.com/datasets/3xmsjhdrc9/1. Table 1 below shows the required steps for the Python script that prepares the Swahili stop-words dataset. Table 2 below shows the required steps for the Python script that prepares the Swahili slangs dataset. Table 3 below shows the required steps for the Python script that prepares the Swahili typos dataset.
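The slang and typo datasets share the same two-column layout (variant, proper word), so both can be applied as token-level replacement tables. A minimal sketch, with hypothetical replacement pairs rather than rows from the published files:

```python
import csv
import io

# Hypothetical (variant, proper word) pairs for illustration only; the
# published slang and typo CSVs use this same two-column layout.
csv_text = "shkrani,shukrani\nasnte,asante\n"
replacements = dict(csv.reader(io.StringIO(csv_text)))

def normalise(sentence: str) -> str:
    """Replace slang or typo tokens with their proper Swahili words."""
    return " ".join(replacements.get(w, w) for w in sentence.split())

print(normalise("shkrani sana"))  # -> "shukrani sana"
```

Since both files are CSV, the slang and typo tables can be loaded into one dictionary and applied in a single normalisation pass.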

Experimental Design, Materials and Methods
This section provides details on the methodology used to prepare the datasets. We describe the procedures for developing the Common Swahili Stop-words, Common Swahili Slangs and Common Swahili Typos datasets.

Table 1. Required steps for Python script to prepare Swahili stop-words dataset.
1. Open the corpus dataset for reading
2. Remove punctuation marks
3. Lowercase the text
4. Perform tokenization
5. Count word occurrences in the list of words obtained in the step above
6. Generate a list of tuples for the most frequent words
7. Export to a text file for review

Table 2. Required steps for Python script to prepare Swahili slangs dataset.
1. Open the corpus dataset for reading
2. Remove punctuation marks
3. Lowercase the text
4. Select random messages from each topic
5. Export each batch corresponding to each topic to its respective text file for review
6. Combine results from reviewers with already known Swahili slangs from IKS
7. Remove duplicates based on the slang words

Table 3. Required steps for Python script to prepare Swahili typos dataset.
1. Open the corpus dataset for reading
2. Remove punctuation marks
3. Lowercase the text
4. Perform tokenization
5. Count word occurrences in the list of words obtained in the step above
6. Generate a list of tuples for the least frequent words
7. Create batches of words depending on frequencies
8. Export each batch to its respective text file for review

Preparing common Swahili stop-words dataset
We first used datasets from [3] to create a corpus that included only Swahili conversations. The corpus was collected from a Tanzanian SMS platform; the data consisted of 248,944 Swahili messages with a total of 4 million words and 320 thousand unique words. The corpus covers a wide scope of topics: Health, Education, Menstrual Hygiene, Corona, WASH, Nutrition, HIV, Violence against Children, and U-Report. We obtained our dataset by processing the generated corpus, as observed by [5] and [6], using a Python script that removes punctuation marks [7], lowercases the text [8], tokenizes the dataset [9], and generates a list of tuples with words and their corresponding frequencies using the FreqDist function from the Natural Language Toolkit (NLTK) [10]. After that, we took more than 1000 of the most frequent words to be reviewed by Swahili experts. They were reviewed by three people, including a member of the Institute of Kiswahili Studies (IKS) at the University of Dar es Salaam (UDSM), so that only words that can be ignored without sacrificing the meaning of Swahili sentences remained. The required steps of our Python script are shown in Table 1. In addition, we translated English stop-words [11] to Swahili; these were reviewed with the help of a member of IKS and combined with the previously obtained stop-words. Finally, the resulting stop-words were exported to a text file. Fig. 1 shows a word-cloud presentation of the top 200 most frequent Swahili stop-words as they appear in the corpus.
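The frequency-counting part of this pipeline (steps 1-6 of Table 1) can be sketched as follows. This is an illustrative re-implementation, using `collections.Counter` in place of NLTK's `FreqDist` (the two are interchangeable for this purpose); the three sample messages are invented, not from the corpus.

```python
import string
from collections import Counter

def most_frequent(messages, n):
    """Remove punctuation, lowercase, tokenize on whitespace, and return
    the n most frequent (word, count) tuples across all messages."""
    table = str.maketrans("", "", string.punctuation)
    tokens = []
    for msg in messages:
        tokens.extend(msg.translate(table).lower().split())
    return Counter(tokens).most_common(n)

# Hypothetical sample messages standing in for the 248,944-message corpus.
corpus = ["Habari ya leo!", "habari za asubuhi.", "Leo ni siku njema."]
print(most_frequent(corpus, 2))  # -> [('habari', 2), ('leo', 2)]
```

In the paper's workflow the resulting tuples were exported to a text file for expert review rather than used directly.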

Preparing common Swahili slangs dataset
We prepared the Swahili dataset of slangs and their corresponding proper Swahili words by reviewing textual data collected from an SMS platform based in Tanzania [3]. On this platform, young people from all regions express their opinions on issues they care about, connect with each other and with their leaders, and get real-time information and feedback on new initiatives and campaigns [3]. We obtained our dataset by processing the generated corpus using a Python script that removes punctuation marks [7] and lowercases the text [8], and then selecting 500 random messages from each topic to be reviewed with the help of Swahili experts from IKS, who identified words used as slangs and provided their corresponding proper Swahili words. The required steps of our Python script are shown in Table 2. The resulting dataset was then combined with already known Swahili slangs from IKS to create this dataset.
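The per-topic sampling step (step 4 of Table 2) can be sketched as below. The topic names and messages are placeholders, and the sample size is reduced from the paper's 500 for illustration; a fixed seed is used only to make the sketch reproducible.

```python
import random

# Hypothetical corpus grouped by topic; the real corpus covers topics such
# as Health, Education, Corona and others.
corpus_by_topic = {
    "Health": [f"ujumbe wa afya {i}" for i in range(20)],
    "Education": [f"ujumbe wa elimu {i}" for i in range(20)],
}

def sample_for_review(corpus, per_topic=5, seed=42):
    """Draw per_topic random messages from each topic for expert review."""
    rng = random.Random(seed)
    return {topic: rng.sample(msgs, min(per_topic, len(msgs)))
            for topic, msgs in corpus.items()}

batches = sample_for_review(corpus_by_topic)
# Each batch would then be exported to its own text file (step 5 of Table 2).
```

Sampling per topic, rather than from the pooled corpus, keeps the reviewed messages balanced across subject areas.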

Common Swahili typos dataset
We generated the common Swahili typos dataset by using datasets from [3] to create a corpus that included only Swahili conversations. The corpus was collected from a Tanzanian SMS platform; the data consisted of 248,944 Swahili messages with a total of 4 million words, covering a wide scope of topics: Health, Education, Menstrual Hygiene, Corona, WASH, Nutrition, HIV, Violence against Children, and U-Report. We obtained our dataset by processing the generated corpus using a Python script that removes punctuation marks [7], lowercases the text [8], tokenizes the dataset [9], and generates a list of tuples with words and their corresponding frequencies using the FreqDist function [10]. After that, we took more than 1500 of the least frequent words to be reviewed with the help of Swahili experts from IKS. The required steps of our Python script are shown in Table 3. The least frequent words were reviewed in batches, depending on their frequencies, to identify commonly misspelled words; the batches covered 5 to 10, 11 to 15, and 16 to 20 word occurrences. With the help of a Swahili expert from IKS, we then filled in the corresponding proper words to generate the typos dataset. Fig. 2 shows a word-cloud visual representation of the top 200 Swahili typos and the frequencies with which they appear in the corpus.
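The batching step (step 7 of Table 3) can be sketched as below, mirroring the 5-10, 11-15 and 16-20 occurrence bands described above. The example word counts are hypothetical, not taken from the corpus.

```python
def typo_candidates(word_counts, bands=((5, 10), (11, 15), (16, 20))):
    """Group low-frequency words into review batches by occurrence band.
    Words outside every band (e.g. very frequent ones) are excluded."""
    batches = {band: [] for band in bands}
    for word, count in word_counts.items():
        for lo, hi in bands:
            if lo <= count <= hi:
                batches[(lo, hi)].append(word)
    return batches

# Hypothetical (word, frequency) counts for illustration.
counts = {"habari": 5000, "shkrani": 7, "asnte": 12, "elimuu": 18}
batches = typo_candidates(counts)
# batches[(5, 10)] -> ["shkrani"]; "habari" is too frequent to be a typo candidate.
```

Each batch would then be exported to its own text file (step 8 of Table 3) so reviewers can pair each suspected typo with its proper word.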

Ethics Statement
The work does not involve human subjects or animals, but the ethical requirements for publication in the Data in Brief journal were observed.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.