AHD: Arabic healthcare dataset

With the soaring demand for healthcare systems, chatbots are gaining tremendous popularity and research attention, and language-centric research on healthcare is growing steadily. Despite significant advances in Arabic Natural Language Processing (NLP), challenges remain in natural language classification and generation, and the primary shortcoming of existing models is the lack of suitable Arabic datasets for training. To address this, the authors introduce a large Arabic Healthcare Dataset (AHD) of textual data. The dataset consists of over 808k questions and answers across 90 categories and is offered to the research community for Arabic computational linguistics. The authors anticipate that this dataset will aid a variety of NLP tasks on Arabic textual data, especially text classification and generation. The data is presented in raw form. AHD was scraped from a single medical website, Altibbi. AHD is public and freely available at http://data.mendeley.com/datasets/mgj29ndgrk/5.


© 2024 The Author(s

Value of the Data
• To our knowledge, AHD is the largest available and representative Arabic healthcare dataset, covering a wide variety of categories.
• AHD offers ninety distinct categories, making it well suited to text categorization.
• AHD offers over 808k distinct questions and answers, making it well suited to building healthcare systems and chatbots.
• In contrast with the few small datasets available, AHD's size makes it a suitable corpus for both classical and deep learning models.

Background
Progress in Natural Language Processing (NLP) for the Arabic language has been limited. Advancing it requires an emphasis on introducing large datasets and research methodology. Therefore, the authors have constructed a large Arabic Healthcare Dataset (AHD). The main objective of AHD is to contribute to healthcare systems and chatbots. AHD is created from original Arabic content, which can help to develop practical applications in Arabic healthcare.

Data Description
Language-centric research on healthcare is growing steadily. To address the shortcomings of Arabic natural language generation models, the authors introduce a large Arabic Healthcare Dataset (AHD) of textual data. For this reason, the authors named the dataset 'AHD' [3].
To the authors' knowledge, AHD is the largest Arabic healthcare dataset; it was collected from a medical website. The AHD consists of more than 808k rows across 90 categories. AHD adopted the annotation of each question as it appeared on its source website, Altibbi. Table 1 summarizes the distribution of questions and answers per category.
Fig. 1 shows the distribution of questions and answers per category. The questions and answers in the AHD have different lengths. The average lengths (average number of characters) of the questions and answers are 115 and 152, respectively. The authors also determined the maximum and minimum lengths: the maximum character counts for questions and answers are 348 and 32,767, and the minimum counts are 3 and 2, respectively. Besides character counts, the authors also computed word counts. All of this information is given in Table 2 and was determined from the raw data.
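The length statistics described above can be reproduced with a few lines of Python. The following is a minimal sketch, not the authors' script: the function name `length_stats`, the toy rows, and the field names `question` and `answer` are hypothetical, and words are assumed to be whitespace-separated tokens, which the paper does not specify.

```python
from statistics import mean


def length_stats(texts):
    """Return character and word length statistics for a list of strings."""
    char_counts = [len(text) for text in texts]
    # Assumption: a "word" is a whitespace-separated token.
    word_counts = [len(text.split()) for text in texts]
    return {
        "avg_chars": mean(char_counts),
        "max_chars": max(char_counts),
        "min_chars": min(char_counts),
        "avg_words": mean(word_counts),
        "max_words": max(word_counts),
        "min_words": min(word_counts),
    }


# Toy rows standing in for the raw data; they are not taken from AHD.
rows = [
    {"question": "ab cd", "answer": "xyz"},
    {"question": "abc", "answer": "one two three"},
]

question_stats = length_stats([row["question"] for row in rows])
answer_stats = length_stats([row["answer"] for row in rows])
print(question_stats)
print(answer_stats)
```

Applied to the full question and answer columns of the raw file, the same function would yield figures comparable to those in Table 2, provided the counting conventions match.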
The data is kept in raw format as Excel files; no cleaning, stemming, or any other type of pre-processing was applied after scraping. The AHD contains some English symbols, punctuation, and digits, and almost no Arabic diacritics. The file AHD_englishe.xlsx provides raw data that includes questions, answers, and healthcare categories translated from Arabic to English.

Experimental design, Materials and Methods
Arabic text resources in the healthcare domain are scarce. To address this problem and to facilitate the training of natural language generation models on correct Arabic healthcare texts, it is necessary to construct a large dataset dedicated to Arabic healthcare. Fig. 2 shows the experimental design, materials, and methods.

• Website selection process
The website selection process required the following conditions to be met:
- The website must be specialized in medicine, especially healthcare.
- The website must be used mainly in Arabic and not as a translation into Arabic.
- The website should allow membership only for specialists who are experts in the medical field, such as doctors, nurses, and pharmacists, after reliably proving their experience.
- The website should allow questions to be answered only by members.
The authors chose the Altibbi website [2] after examining it and confirming that it fulfills the above conditions.

• Web-scraping
The dataset was retrieved from the website using web-scraping tools written in Python, which has many packages that support retrieving data from the web, including Requests and BeautifulSoup. The authors ran the scraping in Google Colab, Google's cloud-based notebook environment.
Requests is a widely used Python package for handling HTTP requests, with methods such as GET, POST, DELETE, and HEAD. The GET method was used to retrieve HTML pages from the specified sources. BeautifulSoup, the other package used, extracts information from HTML pages; for this dataset, the only information required is the question, the answer, and the category.
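The retrieval step described above can be sketched as follows. This is an illustrative outline only, not the authors' scraper: the CSS selectors, the function names `fetch_page` and `parse_record`, and the sample HTML are hypothetical, because the paper does not describe the markup of the Altibbi pages.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors: the real page structure is not given in the paper.
QUESTION_SELECTOR = "div.question"
ANSWER_SELECTOR = "div.answer"
CATEGORY_SELECTOR = "span.category"


def fetch_page(url, timeout=30):
    """Download one HTML page with an HTTP GET request."""
    import requests  # imported lazily so the parser also works offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def parse_record(html):
    """Extract the question, answer, and category from one HTML page."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node is not None else None

    return {
        "question": text_of(QUESTION_SELECTOR),
        "answer": text_of(ANSWER_SELECTOR),
        "category": text_of(CATEGORY_SELECTOR),
    }


# Made-up page used to exercise the parser without network access.
SAMPLE_HTML = """
<html><body>
  <span class="category">Dermatology</span>
  <div class="question"> What causes dry skin? </div>
  <div class="answer">Low humidity is a common cause.</div>
</body></html>
"""

record = parse_record(SAMPLE_HTML)
print(record)
```

Keeping the download and the parsing in separate functions lets the parser be checked against saved pages before any requests are sent to the live site.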

• Data construction and curation
In the collection process, several criteria were considered when retrieving data from the medical website (Altibbi):
- The medical website is retrievable, as some websites are strictly unretrievable.
- The structure of the medical website is based on pages that loop according to date.
- The medical website addresses one of the targeted categories.

• Comparison of the dataset with other datasets
The authors kept the dataset in raw format. No cleaning, stemming, or any other type of pre-processing was applied after scraping. AHD contains some English symbols, punctuation, and digits, and almost no Arabic diacritics.
Table 6 compares the AHD with other Arabic healthcare question-and-answer datasets described in the relevant literature (Abdelhay et al., 2023) [1].
The comparison indicates that AHD is the largest Arabic dataset in the healthcare domain. AHD can be used for several tasks, such as text classification or text generation.
Tables 3 and 5 show the AHD information and a sample of the AHD translated into English [3].

Limitations
The AHD has several limitations that need to be acknowledged. Firstly, the AHD was collected from a single website. Secondly, the AHD is unbalanced: some categories contain a large number of questions and answers, while others contain only a few.
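Users of the dataset may want to quantify this imbalance before training a classifier. The following is a minimal sketch under stated assumptions: the function name `category_distribution` and the toy labels are hypothetical, and the ratio of the largest to the smallest category is only one of several possible imbalance measures.

```python
from collections import Counter


def category_distribution(categories):
    """Summarize how many rows fall into each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {
        "counts": dict(counts),
        # Share of the whole dataset held by each category.
        "shares": {name: n / total for name, n in counts.items()},
        # Size of the largest category relative to the smallest one.
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
    }


# Toy labels standing in for the category column; not taken from AHD.
labels = ["A"] * 6 + ["B"] * 3 + ["C"]

summary = category_distribution(labels)
print(summary)
```

A high ratio suggests that stratified splitting, class weighting, or resampling may be worth considering for classification experiments.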

Fig. 1. Distribution of question and answer per category.

Table 1
Distribution of question and answer per category.

Table 2
All numeric information for the Arabic Healthcare Dataset (AHD).

Fig. 2. Experimental design, materials and methods.

Table 3
All numeric information for the Arabic Healthcare Dataset (AHD), translated into English.

Table 4
Sample reading comprehension of the Arabic Healthcare Dataset (AHD).

Table 5
Sample reading comprehension of the Arabic Healthcare Dataset (AHD), translated into English.

Table 6
Comparison of the Arabic Healthcare Dataset (AHD) with other datasets.