Infodemiological data of high-school drop-out related web searches in Canada correlating with real-world statistical data in the period 2004–2012

The present data article describes high-school drop-out related web activity in Canada, from 2004 to 2012, obtained by mining Google Trends (GT) using "high-school drop-out" as the keyword. The search volumes were processed, correlated and cross-correlated with statistical data obtained at national and provincial level and broken down by gender. Further, an autoregressive moving-average (ARMA) model was used to model the GT-generated data. From a qualitative point of view, GT-generated relative search volumes (RSVs) reflect the decrease in drop-out rate. The peak in Internet-related activity occurred in 2004 (56.35%, normalized value), after which it gradually declined to 40.59% (normalized value) in 2007. Thereafter, it remained substantially stable until 2012 (40.32%, normalized value). From a quantitative standpoint, the correlations between the Canadian high-school drop-out rate and GT-generated RSVs in the study period (2004–2012) were statistically significant using both the drop-out rate per academic year and the 3-year moving average. Examining the data broken down by gender, the correlations were higher, and statistically significant, in males than in females. GT-based data for drop-out were best modeled by an ARMA(1,0) model. Considering the cross-correlations for Canadian regions, all of them were statistically significant at lag 0, except for New Brunswick, Newfoundland and Labrador, and Prince Edward Island. A number of cross-correlations were also statistically significant at lag −1 (namely, Alberta, Manitoba, New Brunswick and Saskatchewan).


Experimental features
Validation of Google Trends-based data with "real-world" data taken from the Canadian Statistical Office was performed by means of correlational and cross-correlational analyses
Data source location
Canada (regional and provincial levels)
Data accessibility
Data are within this article

Value of the data
Google Trends (GT)-based data (infodemiological data) show good correlation with "real-world" data obtained from the Canadian Statistical Office and can be used or re-used by the scientific community for surveys and studies concerning high-school drop-out in Canada and/or other countries.
These data could be further statistically processed and a mathematical predictive model could be designed and refined.
To the best of our knowledge, this is the first application of GT in the field of education. These data could be used to complement existing monitoring programs about educational attainment and may inform new strategies and policies.

Data
This paper contains infodemiological data on high-school drop-out in Canada in the period 2004–2012 obtained from Google Trends (GT) (Table 1, Figs. 1 and 2). These data were correlated (Table 1) and cross-correlated (Tables 2 and 3) with "real-world" data obtained from the Canadian Statistical Office for the same study period. Further, these data were modeled using a predictive approach (Fig. 3), whose fitting and quality parameters are reported in Table 4.

Experimental design, materials and methods
GT (Google Inc, Menlo Park, CA, USA; http://www.google.com/trends/), a freely available, online tracking system of Internet hit-search volumes that recently merged with its sister project Google Insights for Search, was used to explore Internet activity related to high-school drop-out rates in Canada. In order to make comparisons between terms and countries easier, GT does not provide users with raw, absolute figures, but adjusts them by calculating the proportion of searches for the user-specified term(s) or keyword(s) among all searches carried out using Google. GT then provides users with the so-called relative search volume (RSV), which is the query share of a particular term (or terms) for a given location and time period, normalized by the highest query share of that term (or terms) over the time series and presented on a scale from 0 to 100. In other words, each point of the graph generated by GT is a share of all queries on all topics, divided by the highest such share in the series, which is conventionally set at 100. For this reason, two countries with the same absolute volume of queries for a given term (or terms) may not have the same RSV.
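The normalization described above can be illustrated with a minimal sketch. It assumes that per-period counts of matching and total searches are available, which GT does not actually expose; the function name `relative_search_volume` and all counts are hypothetical, and GT's sampling and rounding are not reproduced.

```python
def relative_search_volume(term_counts, total_counts):
    """Return RSVs on a 0-100 scale from per-period query counts.

    Illustrative only: GT does not publish raw counts, and its
    internal sampling and rounding are not modeled here.
    """
    if len(term_counts) != len(total_counts):
        raise ValueError("series must have the same length")
    # Query share: proportion of all searches that match the term.
    shares = [t / n for t, n in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0.0 for _ in shares]
    # Divide by the highest share, which is conventionally set to 100.
    return [100.0 * s / peak for s in shares]


# Hypothetical counts: both regions have identical absolute term volumes
# but different total search traffic, so their RSV series differ.
term = [50, 40, 30]
region_a_total = [1000, 1000, 1000]
region_b_total = [1000, 500, 1500]

rsv_a = relative_search_volume(term, region_a_total)
rsv_b = relative_search_volume(term, region_b_total)
```

In this toy case region A peaks in the first period and region B in the second, although the absolute number of matching queries is the same in both regions.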
GT enables users to search using two different search strategies, namely the "search term" and the "search topic" options. In the first case, GT searches exactly the term or string of terms provided by the user, whilst in the second case it searches not only the typed term(s) but also all those related to the provided keyword(s). Since the second search strategy generally results in a broader search, the latter option was preferred.
From a methodological point of view, in order to ensure accountability and reliability of the data, this data-article was carried out following the checklist proposed by Nuti and colleagues [1].
The most recent available drop-out rate data were collected and downloaded from the Canadian Statistical Office for the period 2004–2012 [2][3][4][5]. In particular, the data from the Labour Force Survey (LFS) were utilized: LFS, besides estimating employment and unemployment rates, also collects demographic and education information, such as educational achievement of the population and average school attendance rate. These data are coupled with the age of the respondents and combined to obtain the drop-out rate. LFS assumes that, by the age of 20–24 years, high school education should usually have been completed. Further, in order to correct possible biases and errors, provincial drop-out rates are averaged over a span of 3 years, whilst national estimates, being more statistically robust and reliable, are not averaged [2][3][4][5].

Table 1 Correlational analysis between Google Trends-generated high school drop-out-related data and official Canadian drop-out statistical data, broken down by year and gender.

[Figure: heat map at sub-regional/town level.]
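The 3-year averaging of provincial rates can be sketched as follows. Whether the official figures use a trailing or a centered window is not stated here, so the trailing window below is an assumption; the function name `moving_average` and the rates are hypothetical.

```python
def moving_average(values, window=3):
    """Trailing moving average.

    Returns len(values) - window + 1 points; the i-th point averages
    values[i : i + window]. The trailing alignment is an assumption.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical yearly drop-out rates (percent) for one province.
rates = [10.0, 9.0, 8.0, 8.5, 7.5]
smoothed = moving_average(rates, window=3)
```

Averaging shortens the series by `window - 1` points, which matters when the smoothed rates are later aligned year by year with the GT series.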
Correlational and cross-correlational analyses were carried out between the GT-generated search volumes and the statistical data on the Canadian drop-out rate. GT-generated data were modeled using an autoregressive moving-average (ARMA) predictive model, which provides a parsimonious description of a time series in terms of two polynomials, one for the autoregression and the other for the moving average. Different ARMA models were run, and the best one was chosen on the basis of quality parameters, such as R-squared, sum of squares error, mean square error, and root mean square.
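As a rough illustration of a cross-correlation at a given lag, the sketch below computes a Pearson coefficient on the overlapping part of two shifted series. This is a simplification and not the estimator used by SPSS or NCSS; the lag sign convention (`x[t]` paired with `y[t + lag]`), the function names and the data are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cross_correlation(x, y, lag):
    """Correlate x[t] with y[t + lag] over the overlapping observations."""
    if lag >= 0:
        a, b = x[:len(x) - lag], y[lag:]
    else:
        a, b = x[-lag:], y[:len(y) + lag]
    return pearson(a, b)


# Hypothetical series: y is x delayed by one step, so y[t + 1] == x[t].
x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
y = [0.0] + x[:-1]

r_lag0 = cross_correlation(x, y, 0)
r_lag1 = cross_correlation(x, y, 1)
```

With only nine yearly observations, as in the study period, each additional lag removes one pair from the overlap, so coefficients at non-zero lags rest on very few points.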
GT was last accessed on 7th July 2016. All statistical analyses were performed using the commercial software Statistical Package for Social Science (SPSS, version 23.0, IL, USA) and NCSS Data version 11.0 (NCSS, LLC). Results with a p-value < 0.05 were considered statistically significant.
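An ARMA(1,0) model has no moving-average term and is therefore a first-order autoregression. The sketch below fits it by ordinary (conditional) least squares and reports the quality parameters named above; it is not the estimation routine of SPSS or NCSS. The function name `fit_ar1`, the divisor used for the mean square error, and the synthetic series are assumptions.

```python
import math


def fit_ar1(series):
    """Fit y[t] = c + phi * y[t-1] + e[t] by ordinary least squares."""
    x = series[:-1]  # lagged values y[t-1]
    y = series[1:]   # current values y[t]
    n = len(y)
    if n < 3:
        raise ValueError("need at least 4 observations")
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    phi = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    c = my - phi * mx
    residuals = [b - (c + phi * a) for a, b in zip(x, y)]
    sse = sum(e * e for e in residuals)    # sum of squares error
    sst = sum((b - my) ** 2 for b in y)    # total sum of squares
    mse = sse / n  # assumption: divide by the number of residuals
    return {
        "c": c,
        "phi": phi,
        "sse": sse,
        "mse": mse,
        "rms": math.sqrt(mse),
        "r_squared": 1.0 - sse / sst,
    }


# Synthetic, noise-free series generated by y[t] = 20 + 0.5 * y[t-1];
# these are NOT the published GT values.
synthetic = [56.0, 48.0, 44.0, 42.0, 41.0, 40.5]
fit = fit_ar1(synthetic)
```

Competing ARMA orders would be compared on the same quality parameters, with lower error measures and a higher R-squared favoring a model.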

Transparency document. Supporting information
Transparency data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.dib.2016.09.032.