AFND: Arabic fake news dataset for the detection and classification of articles credibility

The news credibility detection task has started to gain more attention recently due to the rapid increase of news on different social media platforms. This article provides a large, labeled, and diverse Arabic Fake News Dataset (AFND) that is collected from public Arabic news websites. This dataset enables the research community to use supervised and unsupervised machine learning algorithms to classify the credibility of Arabic news articles. AFND consists of 606912 public news articles that were scraped from 134 public news websites of 19 different Arab countries over a 6-month period using Python scripts. The Arabic fact-check platform, Misbar, is used manually to classify each public news source into credible, not credible, or undecided. Weak supervision is applied to label news articles with the same label as the public source. AFND is imbalanced in the number of articles in each class. Hence, it is useful for researchers who focus on finding solutions for imbalanced datasets. The dataset is available in JSON format and can be accessed from Mendeley Data repository.


a b s t r a c t
The news credibility detection task has started to gain more attention recently due to the rapid increase of news on different social media platforms. This article provides a large, labeled, and diverse Arabic Fake News Dataset (AFND) that is collected from public Arabic news websites. This dataset enables the research community to use supervised and unsupervised machine learning algorithms to classify the credibility of Arabic news articles. AFND consists of 606912 public news articles that were scraped from 134 public news websites of 19 different Arab countries over a 6-month period using Python scripts. The Arabic fact-check platform, Misbar, is used manually to classify each public news source into credible, not credible, or undecided. Weak supervision is applied to label news articles with the same label as the public source. AFND is imbalanced in the number of articles in each class. Hence, it is useful for researchers who focus on finding solutions for imbalanced datasets. The dataset is available in JSON format and can be accessed from Mendeley Data repository.

Value of the Data
• AFND is a large, diverse, single-labeled Arabic news dataset that is suitable for training deep learning models to detect Arabic fake news. • Different natural language processing (NLP) techniques that focus on Arabic language processing can use the dataset to find errors and noise for correction. • The dataset is convenient for both deep learning and conventional machine learning algorithms due to its size and diversity. • Unsupervised machine learning algorithms can use the dataset to process and classify undecided news articles into credible or not credible. • The dataset is imbalanced in the number of articles that belongs to each class. Also, there is a high variation in the number of tokens/words within articles. Therefore, different NLP techniques and machine learning algorithms can use the dataset to solve these problems.

Data Description
Although the 134 news websites are free and public, we have replaced the identity (Uniform Resource Locator URL) of each website with "source_1", "source_2", "source_3", etc. in the JSON objects in a file named "sources.json" under the main directory to anonymize the news source websites. Each object has an anonymous name of the news source and a label for the news source (credible, not credible, or undecided). Fig. 1 shows a portion of the contents of the file "sources.json".
The scraped public Arabic articles are stored in 134 sub-directories within the directory named "Dataset". Each sub-directory is named according to the anonymous names (e.g. "source_1") of the 134 news sources. Each sub-directory has a file named "scraped_articles.json" which contains the information of the articles that were scraped from the public news source and stored as an array of JSON objects. Each object stores the title, text, and the publication date of the article. Fig. 2 shows an example of a JSON object for a scrapped article from an anonymous news source. Table 1

Experimental Design, Materials and Methods
Here, we present the issues that we encountered during the collection of the dataset and the methods that we used to handle them.
Many of the public news websites have separate link pages that are related to local news of some Arab countries. The local pages were selected rather than the main page for each source to avoid international news and to scrape more public news articles. Out of the 134 news sources, 82 public websites support the RSS feed of local news. Thereby, the public news articles for these sources were scraped from the RSS feed and the local page links.
The public news articles were scraped for six months starting from February 1, 2021 to July 30, 2021. During this period, we have continuously checked the collected public articles to ensure the soundness of the data scraping. Duplicated articles, unrelated articles of the public source (external links), articles in non-Arabic languages, articles supported by videos rather than text, and articles that are unrelated to events or news were all eliminated. Moreover, no preprocessing methods were performed on the articles text. It is worth mentioning that all these articles are available in the public domain.
Misbar, the public Arabic fact check platform, uses a rating system that labels published articles in public news websites. There are eight labels which are: True, Fake, Misleading, Selective, Suspicious, Myth, Commotion, and Satire. We used Misbar and searched manually for all Arabic news that were already labeled in the previous 9 months. Also, during the scraping phase, Misbar was continuously checked for new articles from any of the 134 public news websites. Some of these news articles were published by one or more of the chosen news websites. When all public news articles (based on Misbar) of a news website were classified as True, then the website was classified as credible. On the other hand, when all public news articles (based on Misbar) of a news website were not-True (belong to one of the remaining seven categories), then the website was classified as not credible. Finally, public news websites that have articles that belong to both categories (True and not-True) were classified as undecided. After that, all scraped articles from a credible websites were labeled as credible, scraped articles from not credible website were labeled as not credible, and scraped articles from an undecided website were labeled as undecided. This distant labeling method can result in errors and noise and our analysis has shown that. Hence, natural language processing techniques can be used by researchers to find errors and correct them.
We have investigated the AFND dataset and it is found that it is imbalanced in terms of the number of news websites and the number of articles in each class. The percentage of public websites within the undecided class is less than credible and not credible classes. However, the number of public articles that belong to the undecided class is the largest. This observation opens a research problem on using natural language processing techniques to mine the articles text to find correlations and semantic similarities that can automatically detect public Arabic fake news. The dataset was used in a research paper, titled "Detecting Arabic Fake News Using Machine Learning" [4] , were different machine learning models were applied for Arabic fake news detection and classification.

Ethics Statements
This work involved no human subjects, no animal experiments, and data was not collected from social media platforms.
• Terms of service (ToS): The public and free Arabic news websites are available for any one and allow the news articles to be scraped. We are not using the data for any fraudulent activities such as making profit (e.g.business), DDoS, data theft, or any other sort of bad intention. • Copyright: The data does NOT belong to users (e.g. social media). The data are published on free and public news sites that can be consumed by any one who has access to the Internet. This is similar to search engines that use bots to index web pages. • Privacy: Although the data is free and published in the public domain, we have replaced the identity (Uniform Resource Locator URL) of each website with "source 1", "source 2", "source 3", etc. to anonymize the websites and the news articles. • Scraping policies: There is no special scraping policy.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.