Medical dataset classification for Kurdish short text over social media

Facebook is used as the source of the comments in this dataset. The dataset consists of 6756 comments, which form the Medical Kurdish Dataset (MKD). The samples are user comments gathered from posts on pages of different types (Medical, News, Economy, Education, and Sport). Six preprocessing steps are applied to the raw dataset to clean the comments and remove noise, for example by replacing characters. Each comment (short text) is labeled for text classification as either the positive class (medical comment) or the negative class (non-medical comment). The negative class makes up about 55% of the samples and the positive class about 45%.


Specifications
Subject: Applied Machine Learning
Specific subject area: Medical dataset classification for Kurdish short text over social media
Type of data: Text, Figure, Table
How the data were acquired: The Facepager application was used to collect the comments after configuration.
Data format: Raw
Description of data collection: Each post was carefully classified by type (medical or non-medical), then the link of the post was copied and pasted into the Facepager application to gather its comments.
Data

Value of the Data
• This is an effort to collect a dataset in the field of medical text classification for the Kurdish language. Moreover, it can be beneficial for supporting and modeling patient health systems, health policies, and regulations.
• The data is preprocessed and ready for use by researchers and scholars who work on languages written in the Arabic alphabet, such as Persian, Arabic, and Urdu.
• The dataset can be used with several further preprocessing steps, such as stemming and lemmatization.

Data collection
Public health is a serious subject that researchers follow closely [5, 6]. For this purpose, it is important to read people's views on social media. In this work, Facebook is used as the social media source for creating a proper MKD. For machines to predict people's views correctly, a good resource (dataset) is necessary. Many channels, websites, and live posts can be used for this purpose. The dataset in this work consists of 6756 samples, divided into two classes (medical and non-medical). The samples were collected from various pages and different areas, as shown in Table 2. There are 3076 medical comments (positive class) and 3680 non-medical comments (negative class).
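As a quick consistency check, the class counts reported above can be turned into percentages. The sketch below uses only the two counts stated in the text; the variable names are illustrative.

```python
# Class counts as reported in the text.
medical = 3076       # positive class
non_medical = 3680   # negative class

total = medical + non_medical
positive_share = 100 * medical / total
negative_share = 100 * non_medical / total

print(total, round(positive_share, 1), round(negative_share, 1))
# 6756 45.5 54.5
```

The counts sum to the stated 6756 samples, and the shares are consistent with the approximate 45% / 55% split given in the abstract.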

Methodology
On social media, data appears in various forms, such as images, video, and text. In this work, the dataset is collected from text: the comments of Facebook users. Several tools and techniques can be used for collecting comments; the Facepager tool is the one used here [7]. The data is obtained by following the steps shown in Fig. 1.
As shown in Fig. 1, the first step is downloading the Facepager software. The second step is locating the downloaded files and installing them. The third step is to open the software and create a new database file (.db format) in which the collected text is saved. The fourth step is adding a node with the Facebook ID of the specified link, obtained by converting the link with an online tool. The fifth step is to log into Facebook via the Facepager tool. The sixth step is setting the resource to (/<page-id>/posts) and the parameters field to (message), and specifying a start date and an end date so that only posts between those dates are fetched, as shown in Fig. 2.
The seventh step is configuring the tool for fetching comments: a specific post is selected, the resource is set to (/<post-id>/comments), and the parameters field is set to (message), as shown in Fig. 3.
The final step is exporting the comments as a CSV file, as shown in Fig. 4.
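Once the comments are exported, the CSV file can be read back for further processing. The following is a minimal sketch, not the authors' code: the column name `message` mirrors the parameter configured above, while the `;` delimiter, the `object_id` column, and the helper name `load_comments` are assumptions that should be checked against the header of the actual export.

```python
import csv
import io


def load_comments(csv_file, text_column="message", delimiter=";"):
    """Return the non-empty comment texts from an exported CSV file object.

    The column name and delimiter are assumptions; adjust them to match
    the header of the real export.
    """
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    comments = []
    for row in reader:
        text = (row.get(text_column) or "").strip()
        if text:  # skip rows without comment text
            comments.append(text)
    return comments


# Small in-memory stand-in for an exported file (hypothetical content).
sample = io.StringIO("object_id;message\n1;first comment\n2;\n3;second comment\n")
print(load_comments(sample))
# ['first comment', 'second comment']
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf-8")`, assuming the export is UTF-8 encoded.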

Data set preprocessing
Preprocessing is one of the most important challenges in reducing noise in social media text. Kurdish users on Facebook use different Unicode characters to share their opinions and views, which causes a major problem for recognizing text because the same character can appear in different shapes. Using different scripts also increases the number of features (words) [1, 4, 8, 9]. Accordingly, a new tool was written in Python to apply the following steps to the text, as shown in Table 3:
1. Removing noise (URLs, user mentions, and hashtags): social media users provide extra information for their relatives and friends through URLs, mentions (@user name), and hashtags (#special topic). This information is helpful for users but is noise for the machine, so it has to be removed.
2. Replacing elongated characters: users on social media sometimes elongate words on purpose to emphasize something, such as (جييييييييية) (chiye), which means (Whaaaaaaat). It should be replaced with the base word (جية) (chiye), which means (what).
3. Incorrect spelling and grammar: users can easily correct misspellings and grammar mistakes while reading, but machines cannot, which makes this step challenging. For example, the variants (ماشاڵڵا, ماشاء الله) (masha allh), which mean (Allah has willed it), are used as misspellings of the correct word (ماشااللە), which means (Allah has willed it).
4. Removing punctuation: users on social media use punctuation to express special emotions, which are easy for a human to recognize. However, punctuation is not useful for machines and makes text classification less efficient, so it is removed.
5. Removing numbers: numbers increase the number of features in social media text datasets and do not help the machine understand the text. In addition, Kurdish users write numbers with different digit sets (English, Arabic, and Kurdish).
6. Replacing characters: the Kurdish language shares part of its script with Arabic, and some users on social media write with an Arabic keyboard. This becomes a problem for matching and selecting features. The issue is solved by replacing the affected characters.
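The preprocessing steps can be sketched in Python as below. This is an illustrative sketch, not the authors' tool: the function name `preprocess`, the regular expressions, the rule that three or more repeats count as elongation, and the two-entry character map (Arabic yeh and kaf replaced by the forms commonly used in Kurdish) are all assumptions. Spelling correction (step 3) is left out because it would need a lookup table that the text does not provide.

```python
import re
import string

# Assumed, illustrative mapping from Arabic-keyboard letters to Kurdish forms.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
# Three or more repeats of one character are treated as elongation (assumption).
ELONGATION_RE = re.compile(r"(.)\1{2,}")
# ASCII punctuation plus the Arabic comma, semicolon, and question mark.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation + "\u060C\u061B\u061F") + "]")
# ASCII digits, Arabic-Indic digits, and Extended Arabic-Indic digits.
DIGIT_RE = re.compile(r"[0-9\u0660-\u0669\u06F0-\u06F9]+")


def preprocess(text):
    """Apply noise removal, de-elongation, and character replacement."""
    text = URL_RE.sub(" ", text)              # step 1: URLs
    text = MENTION_HASHTAG_RE.sub(" ", text)  # step 1: mentions and hashtags
    text = ELONGATION_RE.sub(r"\1", text)     # step 2: collapse elongation
    text = PUNCT_RE.sub(" ", text)            # step 4: punctuation
    text = DIGIT_RE.sub(" ", text)            # step 5: numbers
    text = text.translate(CHAR_MAP)           # step 6: character replacement
    return " ".join(text.split())             # normalize whitespace


print(preprocess("Whaaaaat!!! @user #topic http://example.com 123"))
# What
```

URLs are removed before punctuation so that the URL pattern can still match. Double letters are preserved because only runs of three or more are collapsed.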

Dataset labeling
After collecting the dataset, another important step is labeling the samples. For this purpose, three annotators read the samples carefully and manually labeled them with one of two classes (medical or non-medical). This process requires considerable effort and time. Each sample is annotated based on special words from the medical domain and on the meaning of the sentence, as shown in Table 4:
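The text does not state how the three annotators' labels were combined. One common option, shown here purely as an assumption, is a majority vote: with three annotators and two classes, one label always has at least two votes. The function name `majority_label` is hypothetical.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label among the annotators' labels."""
    label, _count = Counter(labels).most_common(1)[0]
    return label


print(majority_label(["medical", "non-medical", "medical"]))
# medical
```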

Ethics Statement
All comments in the dataset belong to Facebook users and were scraped. The data had already been distributed on Facebook and were therefore collected and labeled. Moreover, we confirm that all the data are non-sensitive and anonymized.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

Data Availability
Medical Sentiment Analysis Dataset for Kurdish Short Text over Social Media (Original data) (Mendeley Data).