Event-Dataset: Temporal information retrieval and text classification dataset

Recently, Temporal Information Retrieval (TIR) has attracted considerable attention in the information retrieval community. TIR exploits temporal dynamics in the retrieval process and combines textual relevance with temporal relevance to satisfy the temporal information needs of a user (Ur Rehman Khan et al., 2018). The focus time of a document is an important temporal aspect, defined as the time to which the content of the document refers (Jatowt et al., 2015; Jatowt et al., 2013; Morbidoni et al., 2018; Khan et al., 2018). To the best of our knowledge, no publicly available standard benchmark dataset exists that can comprehensively evaluate the performance of focus time assessment strategies. We have therefore produced the Event-Dataset, which comprises 35 queries and a set of news articles for each query. Formally, $C = \{Q_s, D_s\}$, where $C$ represents the dataset, $Q_s = \{q_1, q_2, q_3, \ldots, q_{35}\}$ is the query set, and $D_s$ is the corresponding set of documents. Each query $q_i$ has a set of news articles $\{d_r, d_{nr}\}$, where $d_r$ and $d_{nr}$ are the sets of relevant and non-relevant documents, respectively. Each query in the dataset represents a popular event. To label the articles as relevant or non-relevant, we employed a user-study-based evaluation method in which a group of postgraduate students manually annotated the articles. We believe that this dataset gives information retrieval researchers a benchmark for evaluating focus time assessment methods in particular and information retrieval methods in general.


Data
The Event-Dataset contains relevant and non-relevant news articles for 35 popular events (Table 1). The news articles are labeled by human annotators as relevant or irrelevant (Fig. 1, Fig. 2), and the annotators provide a rationale for each judgment.

Value of the data
The Event-Dataset presents relevant news articles related to 35 popular events from the past and future, and it has the following applications:
It can be used for information retrieval tasks.
It can be used for temporal information retrieval tasks.
It can be used to evaluate methods of estimating the focus time of documents [5].
It can be used for text classification purposes.

For each event, news articles are collected using Google News search. A web scraper is developed to extract the documents by submitting each query to Google News search. The queries consist of the events discussed in the following section. The extracted documents are then annotated by postgraduate students. In the annotation process, each article is examined by two participants, who judge whether it is relevant to the event or not. The annotators are then asked to justify their judgment. This dataset is developed with the intention of determining the focus time of news articles [1–4]. The Event-Dataset can also be used for general information retrieval and text classification tasks (Table 4). The dataset contains 35 temporal queries and a set of relevant and non-relevant news documents.

Experimental design, materials and methods
Thirty-five popular events from the past and future, occurring during the years 1997 to 2022, are selected. The events were widely reported all over the world and were selected randomly. The website www.brainyhistory.com is used to verify the events. This website maintains a list of popular events from year 1 to 2015 AC. A couple of the future events are sports events, such as the Football and Cricket World Cups. The Google Trends tool (www.trends.google.com) can also be used to verify the popularity of events from the year 2008 to the current year. The events, the corresponding years and brief descriptions are presented in Table 1.
Queries: In order to retrieve the most relevant documents, we use explicit temporal queries $Q_t$. A query $Q_t = \{Q_{text}, Q_{time}\}$ comprises two parts, a textual part $Q_{text}$ and a temporal part $Q_{time}$, where $Q_{text} = \{W_1, W_2, W_3, \ldots, W_n\}$ and $Q_{time} = \{Year_e\}$. The textual part $Q_{text}$ consists of the query terms (i.e., the event name) and the temporal part $Q_{time}$ is the year in which the event occurred. Such queries are normally referred to as explicit temporal queries. Explicit temporal queries capture the real-world meaning of time [6]. For instance, to collect relevant news documents pertaining to the wedding of Prince Charles, the query is "Prince Charles Wedding 2005". Multiple related queries are used to extract the news articles related to an event. For example, for the BP oil spill event the queries were "BP oil spill 2010", "Deepwater Horizon oil spill 2010", "BP oil disaster 2010", "Gulf of Mexico oil spill 2010" and "Macondo blowout 2010".
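As an illustration of the query structure described above, the following minimal Python sketch builds explicit temporal query strings from alternative event names and a year. The class name `TemporalQuery` and the function `build_queries` are hypothetical names introduced here for illustration; they are not part of the dataset or of the original collection scripts.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TemporalQuery:
    """Explicit temporal query with a textual part and a temporal part."""
    text_terms: Tuple[str, ...]  # textual part: the query words W_1 .. W_n
    year: int                    # temporal part: year in which the event occurred

    def as_search_string(self) -> str:
        # Join the query words and append the year, e.g. "BP oil spill 2010".
        return " ".join(self.text_terms) + f" {self.year}"


def build_queries(event_names: List[str], year: int) -> List[TemporalQuery]:
    """Create one explicit temporal query per alternative event name."""
    return [TemporalQuery(tuple(name.split()), year) for name in event_names]


# Alternative names for the BP oil spill event, as listed in the text.
bp_names = [
    "BP oil spill",
    "Deepwater Horizon oil spill",
    "BP oil disaster",
    "Gulf of Mexico oil spill",
    "Macondo blowout",
]
bp_queries = [q.as_search_string() for q in build_queries(bp_names, 2010)]
```

Keeping the textual and temporal parts in separate fields makes it possible to reuse the year later, for example when comparing it with an estimated focus time.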
Platform and Process: The Google News API is used by a crawler to extract the document collection. The queries for each event are submitted to Google News through the API. The crawler is designed so that the top 100 news articles are extracted for each individual query. As a single event has multiple related queries, the probability of retrieving duplicate documents is high. To address this problem, the crawler discards all duplicate documents and downloads only unique documents. The crawler extracts three types of information: the title of a news story, its creation date and the text content of the original story. For pages where the creation time is not available, the publication time or update time is used as the creation time. The top k news articles (k = 100) ranked by Google News search are extracted for each event. After removing the duplicate documents, a total of 2926 news documents for the 35 queries are collected for the dataset.
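The merging of ranked results across the related queries of one event can be sketched as follows. This is only an illustrative sketch operating on in-memory lists; it does not call any real search API. The dictionary keys (`url`, `title`, `date`, `text`) and the choice of the URL as the duplicate key are assumptions, since the text does not state how duplicates were identified.

```python
def collect_unique(results_per_query, k=100):
    """Merge ranked result lists for one event, dropping duplicates.

    results_per_query: list of ranked lists of dicts with the hypothetical
    keys "url", "title", "date" and "text".
    """
    seen = set()
    unique = []
    for ranked in results_per_query:
        for doc in ranked[:k]:   # keep only the top-k hits of each query
            key = doc["url"]     # assumption: the URL identifies a story
            if key in seen:
                continue         # already collected via an earlier query
            seen.add(key)
            unique.append(doc)
    return unique


# Toy example: two related queries that share one result.
results = [
    [{"url": "a", "title": "A", "date": "2010-04-21", "text": "..."},
     {"url": "b", "title": "B", "date": "2010-04-22", "text": "..."}],
    [{"url": "b", "title": "B", "date": "2010-04-22", "text": "..."},
     {"url": "c", "title": "C", "date": "2010-04-23", "text": "..."}],
]
merged = collect_unique(results, k=100)
```

In this toy example the story with URL `b` is returned by both queries but kept only once, and the order of first appearance is preserved.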
Preprocessing: Standard preprocessing methods are used to clean the data. These methods include removing unwanted text, converting text to lower case, removing duplicate documents, removing documents that have a wrong creation time, and removing hyperlinks and image descriptions.
Annotation: A gold standard is developed by relying on human judgments to identify the actual focus time of news documents. A total of 2926 news documents (related to 35 events) are distributed among 70 postgraduate students. Each participant is assigned the news documents related to a specific event (query) and asked to label each document as relevant or non-relevant to the given event. Thus, for each event, the extracted news documents are labeled by two participants. The relevance of a document to the query ensures that the document relates to the corresponding event (i.e., the event presented in the query). If the annotators found that the document content is dominated by the main event [7], they marked it as relevant, otherwise as non-relevant. The event-relevant documents in the dataset follow the notion of a "news-peg" [8], which is defined as an event that prompted the author to write the article. The news-peg serves as a measure of noteworthiness, estimating the role of event importance in prompting the author to write an article.
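Two of the cleaning steps named under Preprocessing, hyperlink removal and conversion to lower case, can be sketched as below. This is a minimal sketch under stated assumptions: the regular expression and the function name `clean_text` are illustrative choices, not the authors' actual cleaning code, and the removal of unwanted text, image descriptions and documents with wrong creation times is not covered because the text does not specify the criteria.

```python
import re

# Assumption: hyperlinks start with http://, https:// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_text(text):
    """Remove hyperlinks, lower-case the text and normalize whitespace."""
    text = URL_PATTERN.sub(" ", text)         # drop hyperlinks
    text = text.lower()                       # convert to lower case
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


cleaned = clean_text("Read MORE at https://example.com/story about the Oil Spill")
```

Removing links before lower-casing keeps the pattern simple; the final whitespace pass tidies the gaps left where links were removed.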
The participants were asked to provide the rationale for each judgment. The annotator rationale takes the form of a short excerpt (2 or 3 sentences) explaining why the annotator thinks a document is relevant or irrelevant. This method of annotation has proved to be efficient in Information Retrieval (IR) tasks and incurs no additional time, as the annotator might already be doing so implicitly [9,10]. Table 2 shows an example of rationales for two documents. Finally, we consider a document to be relevant when both participants agreed with respect to the temporal annotation. A document is considered irrelevant if both annotators mark it as irrelevant or if they have conflicting remarks. For 771 (26.34%) documents the annotators have conflicting remarks, whereas for 2155 (73.65%) documents the annotators have the same remarks. Out of these 2155 documents, 810 (27.68%) are marked as relevant and 1345 (45.96%) are marked as irrelevant. Table 3 shows the Event-Dataset description. The relevant documents for each individual event in the dataset are presented in Fig. 1, where $Q_1, Q_2, \ldots, Q_{35}$ on the X-axis denote the events in alphabetical order as described in Table 1. Fig. 2 presents the statistics of relevant and non-relevant documents in the dataset.
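The agreement rule described above can be sketched in a few lines of Python: a document receives the final label "relevant" only when both annotators marked it relevant, and every other combination yields "non-relevant". The function names and label strings are hypothetical and chosen for illustration; the counts are those reported in the text.

```python
def gold_label(label_a, label_b):
    """Return "relevant" only when both annotators chose "relevant"."""
    if label_a == "relevant" and label_b == "relevant":
        return "relevant"
    # Agreement on non-relevance and conflicting remarks both end up here.
    return "non-relevant"


def share(count, total):
    """Percentage of `count` in `total`."""
    return 100.0 * count / total


# Counts reported in the text.
total = 2926
agreed_relevant = 810
agreed_irrelevant = 1345
conflicting = 771

# Sanity check: the three groups cover the whole collection.
assert agreed_relevant + agreed_irrelevant + conflicting == total

# Under the rule above, conflicting documents count as non-relevant.
final_non_relevant = agreed_irrelevant + conflicting
relevant_share = share(agreed_relevant, total)
```

Treating conflicts as non-relevant is a conservative choice: it keeps the relevant class restricted to documents on which both annotators agreed.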
The Event-Dataset can also be used for classification tasks. The selected events can be grouped into high-level categories. Table 4 groups the 35 events into four high-level classes: environmental, political, violence, and entertainment and sports. Event detection and topic modeling techniques can also be evaluated using the Event-Dataset. For this purpose, the relevant documents for each event can be used to test topic modeling and event detection methods.

Transparency document
The transparency document associated with this article can be found in the online version at https://doi.org/10.1016/j.dib.2019.104048.