Kurdish News Dataset Headlines (KNDH) through multiclass classification

The rapid growth of technology has massively increased the amount of available text data. These data can be mined and utilized for numerous natural language processing (NLP) tasks, particularly text classification. The core part of text classification is collecting data suitable for training a good model. This paper presents the Kurdish News Dataset Headlines (KNDH) for text classification. The dataset consists of 50,000 news headlines which are equally distributed among five classes, with 10,000 headlines for each class (Social, Sport, Health, Economic, and Technology). While the number of samples is equal for each category, the share contributed by each news channel differs. In total, 34 distinct channels were used to collect the headlines, with 8 channels for economics, 14 channels for health, 18 channels for science, 15 channels for social, and 5 channels for sport (a channel may contribute to more than one class). The dataset is preprocessed using the Kurdish Language Processing Toolkit (KLPT) for tokenizing, spell-checking, and stemming.


Subject: Applied Machine Learning
Specific subject area: Kurdish News Dataset Headlines (KNDH) through Multiclass Classification
Type of data: Text, Figure
How data were acquired: The specific URL is added to the page intended to collect data, and then some headlines are selected for fetching the texts.
Data source location: Charmo University
Data accessibility: Repository name: Mendeley Data. Data identification number: 10.17632/kb7vvkg2th.2. Direct link to the dataset [1]: https://doi.org/10.17632/kb7vvkg2th.2

Value of the Data
1. This is an attempt to build a large multi-category dataset for the Kurdish language. Moreover, it can be beneficial for advancing sentiment analysis in Kurdish.
2. The data include the headlines of popular Kurdish news websites, which researchers can use to conduct research on the language at a syntactic level.
3. Various algorithms can be applied to train different models for text classification.
4. This dataset provides another resource for the Kurdish language, moving it closer to being well-resourced.
5. Each news website organizes its articles into categories before publishing them, allowing users to quickly select the categories of news that interest them whenever they visit the site. For instance, some readers want to read about the most recent technological advancements, so they always click on the technology section when they visit a news website. They might be interested in politics, business, entertainment, or even sports, but they may not enjoy reading about technology. Currently, due to the lack of datasets, the content administrators of news websites categorize Kurdish news articles manually. Since KNDH is a large Kurdish dataset for news classification across five categories, they can instead use it to train a highly accurate machine learning model and deploy it on their websites to read a news headline or the news content and identify the category of the news.

Objective
The Kurdish language is classified as less-resourced in terms of natural language processing (NLP). Similar datasets have previously been built for other languages, but the resources for Kurdish are inferior, and only a small number of datasets related to the language are available [2,3]. The language needs essential tools such as named-entity recognition, lemmatization, POS tagging, etc. This issue is primarily rooted in the need for a more comprehensive corpus. The datasets available for the language mostly consist of comments and tweets collected from social media. The primary issue with these datasets is that they contain many grammatical and spelling errors. Since the language does not have an excellent tool to preprocess those data, the datasets need to be cleaned or require manual preprocessing. To better understand the syntactic and semantic nature of the Kurdish language and to provide an adequate dataset, our research group collected texts from news headlines, which are written by professionals and contain a small number of errors. The dataset is suitable for performing text classification and achieving satisfactory results.

Data Description
The Kurdish language belongs to the Indo-Iranian family of Indo-European languages. It is well known to be a close relative of the Persian language. Its speakers span the border regions of Iran, Turkey, Iraq, and Syria. Kurdish is one of the official languages in Iraq and has regional status in Iran. The language has 40 million speakers [4]. Central Kurdish (Sorani) and Northern Kurdish (Kurmanji) are the two main dialects of the Kurdish language [5]. However, there are other minor dialects, such as Gorani (Hawrami), spoken in some residential settings in Iraq and Iran, and Zazaki, which is used in Turkey [6]. Historically, many alphabets have been used for writing Kurdish, namely Cyrillic, Armenian, Latin, and Arabic. The dataset is in the Sorani dialect, which has 36 letters comprising vowels and consonants [7], as shown in Table 1.

The letters in this language do not have capitalization forms and are written starting from the right-hand side [8]. The Sorani dialect is distinguished by its lack of grammatical gender. In Sorani writing, possessive pronouns, definiteness markers, enclitics, and postpositions are inserted as suffixes [9,10]. Furthermore, it has two tenses, past and present, and singular and plural cases, but with complex morphology [11]. As for the future, the language uses its auxiliary verbs to denote actions that will occur in this period. The language is highly inflectional due to its many affixes and clitics [12]. Jügel (2014) states that Sorani does not apply gender or grammatical case to its nominals, although it has an entire article-marking system for definite, indefinite, and demonstrative in singular and plural forms [13]. Regarding verbs, Kurdish has around 300 single-word verbs, which are inflected based on personal pronouns (first, second, and third person, singular and plural), tense (past, present, future), aspect (indefinite, perfect, progressive, imperfective), and mood (indicative, subjunctive, conditional) [14]. The Kurdish language employs compound constructions to produce new vocabulary, namely (Noun + Verb), (Adjective + Verb), and (Preposition + Verb) forms [15], as shown in Table 2. Regarding syntax, the Kurdish language follows subject-object-verb (SOV) order. Since the language is a pro-drop (null-subject) language, the removal of the subject from a sentence has no effect on its meaning [16]. Instances are shown in Table 3. It can be seen in Table 3 that the words "من" (min, "I"), "ئارام و ئازاد" (Aram and Azad), and "کوڕەکە" (the boy) serve as subjects in the examples. Once they are omitted from the sentences, the meaning of the sentences is not affected by their removal. Due to this, the Kurdish language is recognized as a null-subject language.

Data Collection
Technological advancements today have made it possible for news to spread worldwide. News agencies have to cover many events daily due to the tremendous changes in the world's state. News in various categories floods the internet. Whenever an incident occurs anywhere in the world, every news agency broadcasts, reports, or writes eye-catching headlines to attract users. Designing a model that can categorize news headlines is an essential step, and an excellent dataset is required to train such a model. In this study, the total number of headlines is 50,000. Samples were collected from many websites such as Rudaw, Payam, Knnc, Kurdsat, etc. The samples are equally distributed across five classes (social, sport, science, health, and economy), as shown in Table 4. The number and percentage of collected data according to the channels are shown in Table 5.
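The balanced split described above can be verified programmatically once the headlines are collected. The following is a minimal sketch using an in-memory stand-in for the label column; in practice, the labels would be read from the exported XLSX file:

```python
from collections import Counter

# Toy label column standing in for the real 50,000-headline dataset:
# five classes, 10,000 headlines each
labels = ["Social", "Sport", "Health", "Economic", "Technology"] * 10000

counts = Counter(labels)
print(counts)  # each of the five classes appears 10,000 times
```

A `Counter` over the label column immediately exposes any imbalance: a balanced dataset yields exactly one distinct count value across all classes.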

Experimental Design, Materials and Methods
On the internet, different types of data are available; in this work, the collected data are text. Various methods and tools have been proposed for gathering text data. The ParseHub tool and the BeautifulSoup library have been used to collect the news headlines. The following eight steps are followed to obtain the data using ParseHub:
1. Download: https://www.parsehub.com/quickstart
2. Sign in: using a registered email account
3. New Project: create a new project for storing the texts
4. Add Link: click on "Start project on this URL"
5. Select Headlines: select the headlines on the webpage
6. Specify Page Numbers: specify the number of pages from which to collect headlines
7. Get Data: start collecting the data
8. Export Data: the dataset is exported in XLSX file format
As shown in the above steps, the first step is installing the ParseHub software for collecting the texts. The second step is signing in with an email account. The third step is creating a new project. The fourth step is adding the URL of the page from which the data are to be collected. The fifth step is selecting some headlines, as shown in Fig. 1. Notably, it is imperative to select three headlines; the program will then automatically select the others on the page based on the researcher's choice. In the sixth step, we specify the number of pages of the website from which we will extract headlines, as shown in Fig. 2. Using this software, users can extract texts from up to 200 pages of a website. The last step is exporting the headlines in XLSX file format, as shown in Fig. 3.

Dataset Preprocessing
One of the most important steps after collecting a dataset is preprocessing. For the Kurdish data obtained online, the preprocessing steps include removing non-Kurdish words, special characters, elongation (letter repetition), symbols, stop words, and ineffective numbers. Following that, we tokenized the texts in the dataset. It is crucial to point out that word tokenization is challenging due to the nature of the Kurdish language, which is highly morphological. Thus, the language requires its own tool for performing such tasks. Word tokenization is performed using KLPT (Kurdish Language Processing Toolkit) [2]. This tool tokenizes Kurdish texts according to the morphological features of the language. The word-tokenization feature helps find the stems of verbs, as shown in Table 6.
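The character-level cleaning steps listed above (removing non-Kurdish characters, elongation, symbols, and digits, before KLPT-based tokenization) can be sketched with plain regular expressions. This is an illustrative simplification, not the exact KNDH pipeline, and the function name is ours:

```python
import re

def clean_headline(text):
    # Collapse elongation: three or more repeats of the same character
    # down to one (a rough heuristic; legitimate doubles are kept)
    text = re.sub(r'(.)\1{2,}', r'\1', text)
    # Keep only Arabic-script letters (which cover Sorani Kurdish) and
    # whitespace; Latin letters, Latin digits, and symbols become spaces
    text = re.sub(r'[^\u0600-\u06FF\s]', ' ', text)
    # Drop Arabic-Indic and Extended Arabic-Indic digits, which sit
    # inside the Arabic block and survive the previous filter
    text = re.sub(r'[\u0660-\u0669\u06F0-\u06F9]', ' ', text)
    # Normalize runs of whitespace to single spaces
    return re.sub(r'\s+', ' ', text).strip()
```

For example, `clean_headline("نرخی دۆلار!!! 123 USD")` returns only the Kurdish words. Stop-word removal and morphology-aware tokenization are deliberately left to KLPT, since they require language-specific resources.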

Dataset Labeling
Dataset labeling has a significant effect on machine learning and deep learning tools. A dataset can be labeled in three ways. The first method involves reading and understanding texts through human effort. The second method is automatic labeling, which uses pre-trained annotation models to annotate the text. The third method, semi-automatic labeling, combines human and automatic labeling. In this work, automatic labeling is used, so the annotation process is independent of human effort. Because ParseHub extracts categories automatically, the category in which each news item was published can be determined. In other words, it uses the tags written under each news headline, as shown in Fig. 4.
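Once the category tag scraped alongside each headline is available, assigning one of the five KNDH labels reduces to a dictionary lookup. The English tag names below are hypothetical placeholders (real sites publish their tags in Kurdish, with spellings that vary per website), as is the function name:

```python
# Hypothetical mapping from scraped category tags to the five KNDH
# classes; a real mapping would need one entry per source channel's tag
TAG_TO_LABEL = {
    "economy": "Economic",
    "sport": "Sport",
    "health": "Health",
    "social": "Social",
    "technology": "Technology",
}

def label_headline(tag):
    """Return the KNDH class for a scraped tag, or None if unmapped."""
    return TAG_TO_LABEL.get(tag.strip().lower())
```

Headlines whose tag falls outside the five classes map to `None` and can be discarded, which keeps the automatic labeling free of human effort while enforcing the fixed class set.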

Ethics Statement
This manuscript contains data acquired using two web-scraping tools. Regarding the Terms of Service (ToS), all web resources listed in Table 4 and used in the dataset allow scraping and distributing data. Because Kurdish news websites are free and open to everyone, their articles may be scraped. We confirm that the data are not used for any fraudulent purposes, such as making profits (e.g., business), DDoS attacks, data theft, or any other bad intentions. Regarding copyright, the news reports are published on public news websites that can be accessed easily by anyone with Internet access, and our use is similar to search engines, which use bots to index web pages. It is important to note that the data in this article belong to the news websites. In this dataset, the privacy rights of individuals are protected. Even though the data are free and available to everyone, we have removed each website's identity (Uniform Resource Locator, URL). The data collection process did not involve the collection of personal information; we removed identities from the dataset wherever they appeared. The dataset was neutralized according to legal and ethical guidelines and policies. The purpose of this task is to build a dataset that can classify news texts into multiple classes, not to target users, channels, or any political parties. The data in the dataset were obtained from publicly available news channels. No data were scraped directly from social media platforms (such as Twitter and Facebook); thus, their scraping policies are not violated.

Fig. 3 .
Fig. 3. Sample of Corpus in XLSX.Additionally, because ParsHub allows users to extract data several times, we used the BeautifulSoup has been used to crawl the remaining data for the specified dataset.Researchers can use BeautifulSoup's Python library for data collection using the code below.import requests from bs4 import BeautifulSoup link = 'https://www.xendan.org/

Table 4
Number and Percentage of Collected Data according to the categories.

Table 5
Number and Percentage of Collected Data according to the channels.