COVID-19 and Media datasets: Period- and location-specific textual data mining

The vocabulary used in news about a disease such as COVID-19 changes according to the period [4]. This aspect is discussed on the basis of MEDISYS-sourced media datasets via two studies. The first focuses on terminology extraction and the second on predicting the period from the textual content using machine learning approaches.


Description of data collection
These datasets contain a set of news articles in English, Spanish and French extracted from MEDISYS (i.e. advanced search) according to dedicated criteria. One corpus (i.e. textual data) per location (UK, Spain, France) and period (March, May, July 2020) was collected from MEDISYS. The corpora are provided in formats suited to BioTex (*.txt) and Weka (*.arff). The terms extracted from these corpora with the BioTex system are also available (*.csv).

Value of the Data
• This dataset is important for spatiotemporal analysis of media content regarding COVID-19. The methodology is generic and could be implemented for other case studies based on MEDISYS data.
• These data could be used by computer scientists (NLP and data mining domains) and for social science and humanities research.
• The formats of these datasets are suitable for NLP approaches (e.g. BioTex) and data-mining tools (e.g. Weka).
• Other corpora could also be collected with the method outlined in this data paper. The code (Perl) enables the conversion of other textual data from MEDISYS into formats suitable for NLP and data-mining tools.
Other corpora can be collected with the same method. The code (Perl) required to preprocess textual data for terminology extraction (with BioTex) and classification (with Weka) tasks is available on Dataverse: https://doi.org/10.18167/DVN1/ZUA8MF. The textual data acquisition phase is done manually by querying MEDISYS. The other tasks described in this paper (i.e. pre-processing, terminology extraction, classification) are automatic. In future research, we plan to use the RSS feeds provided by MEDISYS to collect data automatically.
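The conversion step can be illustrated with a minimal Python sketch. It is not the Perl code distributed on Dataverse: the function names, the attribute names (`text`, `period`), the relation name and the normalization choices are all assumptions made for illustration. It only shows how articles labelled with their month could be written as an ARFF file with one string attribute and one nominal class attribute.

```python
import re

# Class labels: one per collection period described in this paper.
MONTHS = ["March", "May", "July"]


def normalize(text):
    """Lowercase, collapse whitespace and escape for an ARFF quoted string.

    These normalization choices are assumptions, not the author's pipeline.
    """
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text.replace("\\", "\\\\").replace("'", "\\'")


def to_arff(articles, relation="MOOD_Corpus"):
    """Build an ARFF document from a list of (text, month) pairs.

    The relation and attribute names are hypothetical.
    """
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute period {" + ",".join(MONTHS) + "}",
        "",
        "@data",
    ]
    for text, month in articles:
        if month not in MONTHS:
            raise ValueError(f"unknown period label: {month}")
        lines.append(f"'{normalize(text)}',{month}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Invented example articles, for illustration only.
    sample = [
        ("Masks are  mandatory\nin shops", "July"),
        ("It's a lockdown", "March"),
    ]
    print(to_arff(sample))
```

A file in this form still holds raw text. In Weka, a filter such as StringToWordVector would typically be applied afterwards to obtain a bag-of-words representation.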

Experimental Design, Materials and Methods
With these datasets, two experiments were conducted per location:
• Terminology extraction tasks using NLP approaches to highlight specific terms. Terminology extraction is based on the BioTex system [2]. Several measures are implemented in BioTex for term ranking. The F-TFIDF-C criterion, based on a combination of TF-IDF [3] and C-Value [1], was used for these datasets. TF-IDF highlights discriminative terms, while C-Value favors phrase (i.e. multiword term) extraction. The extraction results are available in the Dataverse repository associated with this study (BioTex parameters used: F-TFIDF-C measure, number of syntactic patterns: 10, 3 languages). Table 1 presents a sample of these results, i.e. terminology selected with the word mask from the UK_MOOD_Terms, SP_MOOD_Terms and FR_MOOD_Terms raw data. We found that the vocabulary could be period- and location-specific, but similar trends were also noted for some aspects, such as mandatory mask-wearing in July for all locations (e.g. mandatory mask, mandatory mask-wearing, máscara obligatoria en comercios, máscara obligatoria, masque obligatoire).
• Classification tasks using machine learning approaches for period prediction. Based on a vector space model representation (i.e. bag-of-words), the objective is to predict periods using machine learning approaches. Note that each month represents the class to predict with supervised learning techniques (i.e. in the ARFF files, each article has a label associated with the month).

Table 1 Terms obtained with BioTex filtered with the word mask (mask, máscara, masque) and the associated rank for the 3 locations.

Table 2 Classification scores (i.e. precision (P), recall (R) and F-measure (F)) using 10-fold cross-validation for period prediction at 3 locations (NB: Naïve Bayes, SVM: Support Vector Machine, RF: Random Forest).

Table 2 presents the results obtained with 3 supervised algorithms (i.e. Support Vector Machine (SMO with PolyKernel), Random Forest (bagSizePercent = 100, maxDepth = 0, numIterations = 100) and Naive Bayes) with the raw data UK_MOOD_Corpus_WEKA, SP_MOOD_Corpus_WEKA and FR_MOOD_Corpus_WEKA. Other Weka algorithms can also be used with these corpora.
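The term ranking and keyword filtering behind Table 1 can be sketched as follows. This is not the BioTex implementation. It assumes that F-TFIDF-C combines the two measures as a harmonic mean, an assumption about the form of the combination that the text above does not state. The function names and all score values are invented for illustration.

```python
def f_tfidf_c(tfidf, c_value):
    """Harmonic-mean combination of two non-negative scores.

    Assumed form of the combination; not confirmed by this data paper.
    """
    if tfidf + c_value == 0:
        return 0.0
    return 2 * tfidf * c_value / (tfidf + c_value)


def rank_terms(scores):
    """Rank terms by combined score, best first.

    scores maps each term to a (tfidf, c_value) pair.
    Returns a list of (rank, term, combined_score) with ranks starting at 1.
    Ties are broken alphabetically so the output is deterministic.
    """
    combined = {term: f_tfidf_c(t, c) for term, (t, c) in scores.items()}
    ordered = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return [(i + 1, term, score) for i, (term, score) in enumerate(ordered)]


def filter_by_keyword(ranked, keyword):
    """Keep the (rank, term) pairs whose term contains the keyword."""
    return [(rank, term) for rank, term, _ in ranked if keyword in term.lower()]


if __name__ == "__main__":
    # Invented scores, for illustration only.
    toy_scores = {
        "mandatory mask": (0.8, 0.4),
        "lockdown": (0.9, 0.0),
        "face mask": (0.5, 0.5),
        "social distancing": (0.6, 0.6),
    }
    ranked = rank_terms(toy_scores)
    print(filter_by_keyword(ranked, "mask"))
```

Under this assumed combination, a term needs a reasonable score on both measures to rank highly, and the filter keeps each term's rank in the full list, as Table 1 does.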
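The classification scores reported in Table 2 (P, R and F) can be computed from gold and predicted period labels, as in the sketch below. It is not Weka's code. The function names and the toy labels are invented, and the support-weighted averaging is an assumption about how per-class scores are summarized. In a 10-fold cross-validation setting, the predictions would be those obtained on the held-out fold of each article.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f_measure)} for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f_measure)
    return scores


def weighted_average(scores, gold):
    """Average P, R and F over classes, weighted by class frequency in gold."""
    support = Counter(gold)
    n = len(gold)
    return tuple(
        sum(scores[label][i] * count for label, count in support.items()) / n
        for i in range(3)
    )


if __name__ == "__main__":
    # Invented labels, for illustration only.
    gold = ["March", "March", "May", "May", "July", "July"]
    pred = ["March", "May", "May", "May", "July", "March"]
    scores = per_class_scores(gold, pred)
    for label, (p, r, f) in scores.items():
        print(f"{label}: P={p:.2f} R={r:.2f} F={f:.2f}")
    print("weighted:", weighted_average(scores, gold))
```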

Ethics Statement
The author confirms compliance with the ethical policies of the journal, as noted on the journal's author guidelines page. No ethical approval was required because this study did not involve any experimental protocol on humans or animals, and only open source online data were used. This work is based on a sample of data from MEDISYS (public site), an open access system available to all users.

Declaration of Competing Interest
The author declares no conflicts of interest.