A dataset for connecting similar past and present causalities

In this data article, we present a dataset that includes past causalities and categories to connect similar past and present causalities. First, we collect past causalities by referencing certain well-known Japanese high-school textbooks. Subsequently, we select 138 causalities that are useful for analogizing from the causalities to considering solutions for confront present social issues. To enhance the analogy, we describe each causality in three contexts: background including problems, solution methods, and their results. We define 13 categories based on the selected causalities and Encyclopedia of Historiography. The past causalities belong to more than one category. In addition, to train machine learning models including classifier, we collect 900 past events from Wikipedia, and assign one or more categories to the past event data. We perform statistical analyses to understand the quality of the dataset. The proposed applications of the dataset include training machine learning models such as classifiers for past causalities and information retrieval for ranking present social issues according to the similarities between the present and past causalities.


a b s t r a c t
In this data article, we present a dataset that includes past causalities and categories to connect similar past and present causalities. First, we collect past causalities by referencing certain well-known Japanese high-school textbooks. Subsequently, we select 138 causalities that are useful for analogizing from the causalities to considering solutions for confront present social issues. To enhance the analogy, we describe each causality in three contexts: background including problems, solution methods, and their results. We define 13 categories based on the selected causalities and Encyclopedia of Historiography. The past causalities belong to more than one category. In addition, to train machine learning models including classifier, we collect 900 past events from Wikipedia, and assign one or more categories to the past event data. We perform statistical analyses to understand the quality of the dataset. The proposed applications of the dataset include training machine learning models such as classifiers for past causalities and information retrieval for ranking present social issues according to the similarities between the present and past causalities.

Data Description
The published dataset [3] (see metadata in Table 1) consists of seven types of data. The first type includes 138 past causalities in historical_causalities_data.csv file. In this dataset, all causalities include their backgrounds and results. The second type includes the categories that are defined for causalities stored in historical_causalities_categories.csv. This file includes 13 categories: Reign (Rg), Diplomacy (Dp), War (Wr), Production (Pr), Commerce (Cr), Study (St), Religion (Rl), Literature and Thought (LT), Technology (Tc), Popular Movement (PM), Community (Cn), Disparity (Ds) and Environment (En). The third type includes 900 past event data to support training machine learning models. In the dataset, all events include only descriptions of the events; in other words, the event data excludes both backgrounds and results of the events. The past_events_wikipedia.tsv file contains the descriptions of the 900 past events and their categories. The three files (historical_causalities_regions.tsv, causali-ty_regional_distribution.tsv, and causality_temporal_distribution.tsv) provide additional information of the causalities included in the historical_causalities_data.tsv file. The historical_causalities_regions.tsv includes region where each causality occurred. The other two files (causality_regional_distribution.tsv and causality_temporal_distribution.tsv) Specifications Table   Subject Information Systems Specific subject area Data mining, Digital History, Labeled Dataset for Machine Learning Type of data  Fig. 1e6 and Table 4 are stored in: causality_regional_distribution.tsv (Fig. 1), causality_temporal_distribution.tsv (Fig. 2), and Statistics.tsv ( Fig. 3e6 and Table 4). Parameters for data collection Description of data collection The collection processes were performed by manual inspections of experts who have Ph. D. degrees in related research fields.

Data source location
As for describing data of past causalities, we referenced past causalities in Japanese textbooks: Shosetu Sekaishi B (Se B 304) [1] and Sekaishi B (Se B 301) [2]. As for data collected Wikipedia texts, we used following links: https://www.en. wikipedia/* where "*" is replaced by one of from 1 to 1999.

Value of the Data
The dataset is useful for training machine learning models as a labeled dataset to predict causalitydcausality similarity over time.
The dataset can be beneficial for HistoInformatics researchers and history education researchers who are developing new teaching tools to bridge the gap between the past and present, and computer scientists working on temporal data mining including information retrieval. Once event detections and classifiers are developed by the dataset, they can use them to assign causality categories to other new data. Through this process, the dataset is able to re-examine how accurately categories are defined, identify which categories are (in) dependent of time, or add new categories to the dataset. We provide the scores of statistical analyses to measure/evaluate dataset quality for estimating which machine learning models can be effectively trained on our dataset. In addition, from these scores, it is able to develop other dataset such as translation to Chinese, adding new temporal events with detail comparisons between our scores and new dataset to identify changes of accuracies.
include distributions of causalities by regions and centuries, respectively. Last, Statistics.tsv file includes scores of statistical analyses, which are described in "Statistical Analysis" section. This file provides all raw scores for estimating which machine learning models are useful for several kinds of applications, e.g., classification and information retrieval (IR) algorithms that can bridge the past and present.

Causality Data Collections
The 138 past causalities were created by authors in three steps. First, we collected over 700 past causalities by referencing well-known Japanese high-school textbooks: Shosetu Sekaishi B (Se B 304) [1] and Sekaishi B (Se B 301) [2]. Second, we selected the causalities if they could be useful for considering solutions for present social issues. Finally, we described each causality in three contexts: background including problems, solution methods, and their results.

Category Definition
The causality categories are defined to organize the causalities with the useable historical framework [5] as described in Ref. [4]. In Ref. [5], Lee claims that causalities over different times can be The first column includes centuries. BC centuries are represented with minus ("-") in this file.The second column includes the number of causalities when occurred in the century of the raw.

causality_temporal_distribution.tsv
Numbers of causalities for all centuries.
The first column includes region names.
The second column includes the number of causalities where occurred in the region of the raw.

Statistics.tsv
Scores of statistical analyses described in this paper This file contains all scores of statistical analyses described in the "Statistical Analysis" section. This file provides raw data of Fig. 3e6. bridged if they belong to the framework because the framework is an overview of the long-term patterns of change and not a mere outline story skimming a few peaks of the past. Under this idea, the category definition processes comprised 2 steps. First, we reviewed Encyclopedia of Historiography [6] to define categories for connecting past and present causalities. In the review process, we listed all the main topics from the encyclopedia and subsequently selected a topic only if it included the longterm patterns of change. Second, we evaluated if each topic included causalities independent of time.
As we extracted causalities from history textbooks, we divided them into three temporal periods: ancient, medieval and modern periods. If a topic included causalities from the three temporal periods, we used it as a category in the dataset. Moreover, we added some new categories in the dataset if we found new topics that included causalities from the three temporal periods. Finally, we defined the 13 categories described in Data Description section.

Event Data Collections
The 900 past event data were crawled from Wikipedia articles whose tiles were the years from 1 to 1999, for example, http://en.wikipedia.org/1. The collection process was as follows: 1) All events were crawled from yearly Wikipedia articles. 2) It was manually reviewed whether the crawled events could be useful to consider any solutions for present social issues. 3) At most, 50 events per century were randomly sampled to cover a wide range of durations.

Basic Statistics
Tables 2 and 3 summarize the statistics of the entire published dataset and the number of causalities for each category, respectively.
Figs. 1 and 2 plot distribution of the numbers of causalities by region and centuries. These figures help us to understand tendency of the published dataset because temporal and spatial features are the most important features of history. Fig. 1 plots the number of causalities per century. Naturally, the distribution curve increases near the present. This indicates that the closer  the causalities are located to the present on the temporal axis, the greater is their usability for considering solutions for present issues. Fig. 2 plots the distribution of the number of causalities where they occurred. We can see that most causalities occurred in China and Europe, as they have long-term histories.

Statistical Analysis
In addition to the basic statistics, the published dataset provides scores of similarity between data points and statistics of clusters to help to train machine learning models. The provided scores are results of the following five analyses.  1. Calinski and Harabasz (CH) [7]. This measure estimates how close all data to each other in a cluster and how far data in different clusters locate. Thus, the higher score of this measure indicates the high quality of the given clusters. The formal equation is defined as follows: CHðkÞ ¼ ðn À kÞ BðkÞ ðk À 1Þ WðkÞ where B(k) and W(k) are intra-and inter-cluster sums of squares for k clusters, respectively, and n is the number of clustered data.

Mutual information (MI)
. This measure evaluates the similarity of two categories A and B as an information-theoretic approach. Let P(a) and P(b) are marginal probabilities, and P(a, b) is a joint probability. MI calculates volumes of information a given set generates about the other set. This is done by the following equation.
MIðA; BÞ ¼ X b2B X a2A Pða; bÞlog Pða; bÞ PðaÞ PðbÞ The provided scores of this dataset are generated from adjusted MI (AMI) [8] that is a variant of MI defined as follows: AMIðA; BÞ ¼ MIðA; BÞ À EðMIðA; BÞÞ maxðHðAÞ; HðBÞÞ À EðMIðA; BÞÞ where EðMIðA; BÞÞ is the expected MI between two given categories, max is a function that returns the largest value among given values, and HðAÞ is the entropy of category A.
3. Jaccard index (Jaccard). This measure employs an assumption that if two sets S A and S B for two categories A and B have many common data, then the two sets are similar to each other. As the size of given sets affects the common numbers, this measure normalizes the score by taking account of the total sizes of given two sets. In other words, if a given set is huge compared with other sets, the huge set tends to include several elements of other sets. This idea is represented as follows: This is an entropy-based measurement; the lower this score, the more similar the two given probability distributions are. As our dataset includes texts written in natural language, we first apply TF-IDF to convert the text into numbers. Let w, d and WðdÞ are a word, a document, and a word set of d. The TF-IDF estimates the importance of w in d by counting the numbers of occurrences of the word in the document and by the numbers of documents including the word. Once all data can be converted to vectors whose elements are scores of the importance of the words, the similarity between two data can be measured by JS divergence that is an extension of KullbackdLeibler (KL) divergence. These approaches are defined as follows: TF À IDFðw; dÞ ¼ tf w; d jDj jfd 0 2D j w i 2Wðd 0 Þgj 5. Meta-data similarity. This measure counts the number of common categories shared by two causalities. Similar to the Jaccard index, the higher the score, the more similar the two causalities are. Thus, given two causalities, the measure is represented as the sum of the common categories. This is formally defined as follows: where AND is the logical AND. It is 1 if both operands are 1; otherwise, it is 0. This measure considers the feature vectors (FV i and FV j ) of two causalities that are defined from the categories of the two causalities. If causality C i has the kth category, then FV ik is 1; otherwise, it is 0. We applied the above measures for all combinations of causalities within each category (intracategory) and within two different categories (inter-category), which are described as follows: 1. Intra-category Similarity. This similarity represents the average score of similarity between all combinations of two causalities in a category. 2. Inter-category Similarity. This similarity represents the average score of similarity between all combinations of two causalities from two different categories.

Scores of Statistical Analysis
The CH score for the past causality data was 1.0829. This indicates that the intra-cluster and intercluster sums of squares for the k clusters are almost the same. Table 4 shows all scores of all intra- Fig. 3. Inter-category meta-data similarity. category measurements. Overall, these scores indicate that all texts in the same category are not similar to each other. These scores indicate that the published dataset covers several kinds of causality topics. If it is necessary to train machine learning models only on the causality texts, it is better to use simple IR algorithms, for example, the query word matching method, or to employ transfer learning using the categorized past events. Fig. 3 shows all scores of the inter-category meta-data similarities on the causality data. Three combinations of two categories, RgdDp, CrdSt and DsdPM contain more common categories compared to other combinations. Fig. 4 plots the MI scores for all combinations of categories on the causality data. This figure indicates that three categories (St, Rl and LT) are more similar to each other compared with other category combinations. Fig. 5 plots the Jaccard index for the inter-category analysis. Similar to the  scores of MI, three categories (St, Rl and LT) are more similar to each other compared to others. However, the Jaccard similarity scores between the three categories are lower than the MI scores. Fig. 6 shows the TF-IDF þ JS scores. Two combinations of categories are similar to each other. The scores for the St, Rl and LT categories are smaller compared to other combinations. In addition, the Dp, Wr, Pr and Cr categories have relatively lower scores compared to other combinations except the combination of St, Rl and LT. Thus, these four categories are more similar to each other compared with other category combinations.

Summary of the Statistical Analyses
All statistical analyses for causalities indicated that the similarities between intra-and intercategory data tended to be low. Thus, if it is necessary to use only past causality data in machine learning study such as IR specialized for history, it is better to use simple techniques, e.g., simple wordbased pattern matching and counting common categories. In contrast, if it is able to use both past causality and event data together, using more sophisticated machine learning models such as SVM, naive Bayes classification, and random forests is a good choise as the published data includes 1038 categorized data.