Labeled entities from social media data related to avian influenza disease

This dataset is composed of spatial (e.g. location) and thematic (e.g. diseases, symptoms, virus) entities concerning avian influenza in social media (textual) data in English. It was created from three corpora: the first one includes 10 transcriptions of YouTube videos and 70 tweets, manually annotated. The second corpus is composed of the same textual data but automatically annotated with Named Entity Recognition (NER) tools. These two corpora have been built to evaluate NER tools and apply them to a bigger corpus. The third corpus is composed of 100 YouTube transcriptions automatically annotated with NER tools. The aim of the annotation task is to recognize spatial information such as the names of cities and epidemiological information such as the names of diseases. An annotation guideline is provided in order to ensure a unified annotation and to help the annotators. This dataset can be used to train or evaluate Natural Language Processing (NLP) approaches such as specialized entity recognition.


Corpora 1 and 2, including the transcriptions and tweets, were automatically extracted from the platforms YouTube (December 2020-January 2021) and Twitter (March-April 2021). Corpus 3, including 100 transcriptions, was automatically extracted from the YouTube platform (February 2019 to October 2021). These three corpora were preprocessed after the extraction. Preprocessing programs are included in the dataset. Manual annotations were performed in accordance with the annotation guideline available within this dataset. To respect privacy rights, the content of tweets is not available in the dataset; only the id of each tweet is provided. On the other hand, transcription texts are included.


Data format

Raw and Standardized


Parameters for data collection

The collection of texts respects two conditions: the texts have to be written in English and the term 'avian influenza' must be present in the text body for the tweets and in the title for the YouTube videos (i.e. primary data sources). The textual transcriptions collected from YouTube have been normalized by adding punctuation and capitalization, and by correcting transcription and lexical errors for specific terms (e.g. disease names). For Twitter data, non-Unicode characters (emoticons, etc.) have been removed and lexical errors of specific terms have been corrected (i.e. secondary data sources).
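As an illustration of these collection conditions, a minimal filter is sketched below (the record structure and field names are assumptions; the actual collection scripts are the ones distributed with the dataset):

```python
def keep_tweet(tweet: dict) -> bool:
    """Keep a tweet only if it is written in English and mentions the term in its body."""
    return tweet.get("lang") == "en" and "avian influenza" in tweet.get("text", "").lower()


def keep_video(video: dict) -> bool:
    """Keep a YouTube video only if the term appears in its title (its transcript must also be in English)."""
    return "avian influenza" in video.get("title", "").lower()
```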


Description of data collection

The dataset is constituted of five table files (i.e. three table files for YouTube transcription data and two table files for Twitter data). The table files describe each data item through a set of features.


Data source location

The data are hosted on the INRAE Dataverse. The data were manually collected within the UMR TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE.


Value of the Data

• This dataset contributes to the available resources for Natural Language Processing (NLP) in specialized domains, more precisely in the field of epidemiology.

• This dataset is useful for computer scientists for NLP and data mining tasks.

• This dataset can be used for evaluation or training purposes for the entity recognition task.

• The annotators have identified a variety of entities (e.g. diseases, symptoms, viruses, hosts). These entities are relevant to recognize epidemiological information.


Data Description

After an extraction process from YouTube and Twitter (i.e. primary data sources), the dataset is constituted of five table data files (.tab), normalized (i.e. secondary data sources) and annotated (i.e. final data sources). One annotation guide (.pdf) details the instructions to the annotators as well as the choices that were made while designing the study.

The Python code to reproduce the transformation of the documents for the annotation step is available on GitHub. The five data files are distributed as follows:


Manual annotation of spatial and thematic entities - Corpus 1 (small)

• A table data file containing the YouTube transcription data, manually annotated, from corpus 1, named corpus 1Y (datafile_1Y_manual_annotation.tab);
• A table data file containing the Twitter data, manually annotated, from corpus 1, named corpus 1T (datafile_1T_manual_annotation.tab).


Automatic annotation of spatial and thematic entities - Corpus 2 (small)

• A table data file containing the YouTube transcription data, automatically annotated, from corpus 2, named corpus 2Y (datafile_2Y_automatic_annotation.tab);
• A table data file containing the Twitter data, automatically annotated, from corpus 2, named corpus 2T (datafile_2T_automatic_annotation.tab).


Automatic annotation of spatial and thematic entities - Corpus 3 (big)

• A table data file containing the YouTube transcription data, automatically annotated, from corpus 3, named corpus 3Y (datafile_3Y_automatic_annotation.tab).

The files from the three corpora include data organized in tables. An example of a YouTube transcription data row is given in Table 1 and an example of a Twitter data row is reported in Table 2.

The YouTube transcription data of the three corpora are described through the following set of features:

- id: the id of the data on the social media;
- publication_date: the date of publication of the data on the social media;
- raw_text: the raw text (no transformation) of the data;
- normalized_text: the normalized text (after applying transformations) of the data;
- spatial_entity: a list of annotated spatial entities, with their labels (GPE, LOC, FAC);
- thematic_entity: a list of annotated thematic entities, with their labels (Disease or Syndrome, Virus, Sign or Symptom).

The Twitter data of corpora 1 and 2 are described through the following set of features:

- id: the id of the data on the social media;
- publication_date: the date of publication of the data on the social media;
- spatial_entity: a list of annotated spatial entities, with their labels (GPE, LOC, FAC);
- thematic_entity: a list of annotated thematic entities, with their labels (Disease or Syndrome, Virus, Sign or Symptom).
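A minimal sketch of how one of these table files could be loaded is given below (the tab separator and the serialization of the entity lists are assumptions to be checked against the files themselves):

```python
import ast

import pandas as pd

# Load one of the annotated table files (assuming tab-separated values).
df = pd.read_csv("datafile_1Y_manual_annotation.tab", sep="\t")

# If the entity columns are stored as Python-style list literals,
# they can be parsed back into lists of (entity, label) pairs.
for column in ("spatial_entity", "thematic_entity"):
    df[column] = df[column].apply(
        lambda value: ast.literal_eval(value) if isinstance(value, str) else value
    )

print(df[["id", "publication_date", "spatial_entity", "thematic_entity"]].head())
```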

The annotation guideline (annotation_guide_spatial_thematic_entities.pdf) presents the instructions to the annotators. These instructions and the choices made are summarized in the next section. The annotation framework defines several tags to annotate the texts. Spatial concepts are annotated with three tags:

• GPE (Geopolitical entity) is used to annotate entities representing countries, cities, states, etc.;

• LOC (Non-GPE locations) is used to annotate entities representing mountain ranges, bodies of water, etc.;

• FAC (Facility) is used to annotate entities representing buildings, airports, highways, bridges, etc.

Thematic concepts are annotated with three tags:

• Disease or Syndrome is used to annotate entities representing a disease or a syndrome;

• Virus is used to annotate entities representing a virus;

• Sign or Symptom is used to annotate entities representing a sign or symptom.
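For the spatial tags, the labels GPE, LOC and FAC match those used by general-purpose NER models; the sketch below shows how such entities could be extracted with spaCy's pretrained English pipeline (an illustration only, not necessarily the tool used to annotate corpora 2 and 3):

```python
import spacy

# Pretrained English pipeline with an NER component
# (requires `python -m spacy download en_core_web_sm` beforehand).
nlp = spacy.load("en_core_web_sm")

SPATIAL_LABELS = {"GPE", "LOC", "FAC"}

text = "Avian influenza outbreaks were reported near Lake Erie and in several cities of France."
doc = nlp(text)

# Keep only the entities whose labels correspond to the spatial tags of the guideline.
spatial_entities = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in SPATIAL_LABELS]
print(spatial_entities)  # expected output similar to [('Lake Erie', 'LOC'), ('France', 'GPE')]
```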

The number and distribution of annotated information in the corpora are given in Tables 4 and 5.


Experimental Design, Materials and Methods


Acquisition and pre-processing

The corpora were obtained automatically from the platforms YouTube and Twitter, thanks to their dedicated APIs. The texts from the web were stored in .txt files, with the aim of obtaining distinct files to annotate. The pre-processing consists of the steps detailed below.

First, in order to optimize the recognition of entities, the raw text of these files has been normalized with the automatic addition of punctuation, by using the Python library punctuator, and with the automatic addition of capital letters, by using the POS tagging provided by the Python library spaCy.

Then, the text has been cleaned by deleting non-Unicode symbols and possible noise originating from the transcription process. As a final normalization step of the raw text, a correction of specific terms (e.g. disease names) is applied by using regular expressions [1].
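A simplified sketch of this normalization step is given below (punctuation restoration with the punctuator library is omitted, and the regular expressions and the capitalization heuristic are illustrative assumptions; the exact pre-processing programs are the ones included in the dataset):

```python
import re

import spacy

# Hypothetical corrections of specific terms; the dataset's own regular expressions may differ.
TERM_FIXES = {
    r"\bbird flue\b": "bird flu",
    r"\bh5 n1\b": "H5N1",
}

nlp = spacy.load("en_core_web_sm")  # provides the POS tagging used for capitalization


def normalize(raw_text: str) -> str:
    # 1. Remove emoticons and other non-ASCII symbols (noise from tweets and transcriptions).
    text = raw_text.encode("ascii", errors="ignore").decode("ascii")
    # 2. Correct lexical errors for specific terms (e.g. disease names) with regular expressions.
    for pattern, replacement in TERM_FIXES.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    # 3. Restore capitalization of proper nouns detected by the POS tagger.
    doc = nlp(text)
    tokens = [
        token.text_with_ws.capitalize() if token.pos_ == "PROPN" else token.text_with_ws
        for token in doc
    ]
    return "".join(tokens)
```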

After the normalization task, the manual and automatic annotation of the spatial and thematic entities can be performed on the text.


Manual data labeling

In order to have a unified annotation of the spatial and thematic entities, a guideline was created. This annotation guide was written by a specialist in NLP. Three other persons (specialists in epidemiology) validated and adjusted these choices. By following this guide, one person (an NLP specialist) manually annotated the spatial entities on the raw text data of 70 tweets and 10 transcriptions. In the same way, one person (a data scientist specialized in epidemiology) manually annotated the thematic entities on the same data. This process results in corpus 1 (1Y and 1T), which represents the ground truth data.


