Improving the estimation of educational attainment: New methods for assessing average years of schooling from binned data

Background
The accurate measurement of educational attainment is of great importance for population research. Past studies measuring average years of schooling rely on strong assumptions to incorporate binned data. These assumptions, which we refer to as the standard duration method, have not been previously evaluated for bias or accuracy.

Methods
We assembled a database of 1,680 survey and census datasets, representing both binned and single-year education data. We developed two models that split bins of education into single-year values. We evaluate our models, and compare them to the standard duration method, using out-of-sample predictive validity.

Results
Our results indicate that typical methods used to split bins of educational attainment introduce substantial error and bias into estimates of average years of schooling, as compared to new approaches. Globally, the standard duration method underestimates average years of schooling, with a median error of -0.47 years. This effect is especially pronounced in datasets with a smaller number of bins or higher true average attainment, leading to irregular error patterns between geographies and time periods. Both models we developed resulted in unbiased predictions of average years of schooling, with smaller average error than previous methods. We find that the approach using a metric of distance in space and time to identify training data had the best performance, with a root mean squared error of mean attainment of 0.26 years, compared to 0.92 years for the standard duration algorithm.

Conclusions
Education is a key social indicator and its accurate estimation should be a population research priority. The use of a space-time distance bin-splitting model drastically improved the estimation of average years of schooling from binned education data. We provide a detailed description of how to use the method and recommend that future studies estimating educational attainment across time or geographies use a similar approach.



Introduction
As a key marker of global progress, the accurate measurement of education is of great importance for population research. Education has been prioritized as a global development indicator, especially in the Millennium Development Goal 2 and Sustainable Development Goal 5 targets [1,2]. Educational attainment has also been linked to numerous health outcomes at both the individual and national level. The association between maternal education and child mortality and morbidity is especially robust, and has been the focus of much study [3][4][5][6][7][8][9][10][11][12]. Increases in education among women have also been strongly linked to reduced fertility, and subsequent decreases in maternal mortality [13][14][15][16]. Due to its predictive power across many domains, education is routinely used as a covariate in studies estimating demographic or disease trends in the absence of complete data [17][18][19][20][21] or seeking to control for confounding by socio-economic status [22][23][24]. The accurate measurement of educational attainment is therefore of paramount importance for facilitating reliable and unbiased population research, and for monitoring development targets.
Information about educational attainment is available from a large number of survey and census data sources [25][26][27], either in single years of attainment or in binned levels, such as "achieved primary education," or "some secondary education completed." Calculating average years of schooling is a straightforward arithmetic mean when data are available in single years of attainment. However, when data are only available in binned form, assumptions about the distribution of individuals within each bin must be used. In their 1986 work, Psacharopoulos and Arriagada were among the first to propose a simple approach, which we refer to here as the "standard duration method" [28].
They calculate the average years of schooling $\bar{S}$ using $L_i$, the proportion of the population who have completed the $i$th level of school, and $S_i$, the standard duration of school completed by people finishing the $i$th level, as shown in Eq 1:

$$\bar{S} = \sum_i L_i S_i \tag{1}$$
The standard duration method therefore simply assumes a single, time-invariant number of completed years for all individuals in each bin, which is either equal to the typical duration of the level of schooling in question, or equal to the midpoint of the bin when the educational level is reported as incomplete. Therefore, for an educational system where primary education represents years one to five of educational attainment, an individual reporting completed primary education receives five years, and an individual reporting incomplete primary receives three. This same general formula has been used by a large number of studies, including those of Barro and Lee, the UNESCO Institute for Statistics, and others [12][28][29][30][31][32][33][34][35][36].
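To make the arithmetic of Eq 1 concrete, the following is a minimal Python sketch of the standard duration calculation. The schooling system (primary as years 1 to 5, secondary as years 6 to 11), the bin labels, and the population shares are hypothetical values chosen for illustration; they are not data from this study.

```python
# Hypothetical schooling system: primary = years 1-5, secondary = years 6-11.
# Each bin maps to (population share L_i, assigned years S_i).
# Completed levels receive the full duration of the level; incomplete levels
# receive the midpoint of the level's range, as the standard duration method assumes.
bins = {
    "no education":         (0.10, 0.0),
    "incomplete primary":   (0.20, 3.0),   # midpoint of years 1-5
    "complete primary":     (0.30, 5.0),
    "incomplete secondary": (0.25, 8.5),   # midpoint of years 6-11
    "complete secondary":   (0.15, 11.0),
}


def standard_duration_mean(bins):
    """Average years of schooling as the sum over bins of L_i * S_i (Eq 1)."""
    total_share = sum(share for share, _ in bins.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(share * years for share, years in bins.values())


mean_years = standard_duration_mean(bins)
print(round(mean_years, 3))
```

Because every person in a bin is assigned the same number of years, any systematic difference between the assumed value and the true within-bin distribution carries straight through to the mean.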
The assumptions involved in the standard duration method are strong for a number of reasons. Information about incomplete attainment is often provided inconsistently, and so individuals who have completed some portion of a bin are coded down to completion of the level of education below. All individuals with tertiary education are often represented with a single code for "university," obscuring individuals who completed tertiary education of a non-standard length, such as a Master's or Doctoral degree. These assumptions represent a possible source of differential bias if the inaccuracies they induce are not consistent between multiple geographies and time periods. Differences in true drop-out patterns or education binning schemas between data sources could lead to attainment estimates which are not directly comparable. Furthermore, the number of bins in each data source can vary, and it is reasonable to assume increased bias would be observed in surveys with fewer bins. Although there is an existing body of literature describing sources of bias in estimates of educational attainment, previous work has not addressed the standard duration method explicitly. Instead it has focused on identifying data that have been recorded incorrectly [37,38] or adjusting average years of schooling for the inherent quality of the education [31,39]. In this paper we seek to provide the first characterization of the bias and accuracy involved in using the standard duration method to estimate educational attainment. We also propose two models to split binned education data into single-year values and evaluate all three methods using out-of-sample predictive validity. We amassed a large database describing the distribution of years of educational attainment for a large number of countries and years. This training database, and the proposed methods, allow practitioners who need to work with binned educational attainment data to split their binned data to single years of education to minimize the bias in their estimates. The practitioner can find a list of countries and years in our training database in the S1 File, and all code is made available in the S2 File.
provider has a straightforward registration process involving completing a data usage agreement. After the completion of this registration process, any investigator has the same access to the files that the authors used in this study. This data repository is located on the Global Health Data Exchange, and it can be found here: http://ghdx.healthdata.org/search/site/education. The authors confirm that they have no additional privileges in accessing these data beyond any other user who has registered with the relevant data providers.

Data sources
We compiled a database of n = 1,680 survey and census datasets that provide education in either single-year (n = 1,518) or binned (n = 162) format (see Table 1 for a summary by data provider and S1 File for a complete source list). The single-year datasets were acquired from a number of survey and census data providers, while the binned data were all obtained from Integrated Public Use Microdata Series (IPUMS) datasets [27]. All single-year datasets were used to calculate the proportion of each sex-specific, five-year age group with each single year of educational attainment from 0 to 18. We used 18 years of schooling as the top code in our analysis as it is a common choice among providers of single-year education data [40], and it is reasonable to assume that the importance of education for health diminishes greatly after the completion of 18 years, which represents 2 to 3 years of graduate education in most educational systems. All binned datasets were used to calculate the proportion of each sex-specific, five-year age group within each bin of educational attainment provided in the survey. Table 2 shows the total count of binning schemas by number of bins present in the data.

Models to split bins into single year values
We developed two models to split binned education values to single years of attainment. Instead of mapping each bin to a single average number of years of attainment, as is done in the standard duration method, we probabilistically split the bin into the proportion of individuals with each single year contained within the bin. Therefore, a bin containing individuals with at least one, but not more than four years of education, would be split into the proportion with one, two, three, and four years of attainment. This approach allows for more nuanced estimation of average years of schooling, the study's primary objective, and also preserves more precise information about the distribution of education within each population being measured. The first model uses nested hierarchical mixed effects, and the second uses a metric of space-time distance to determine an optimal training dataset before using a simple averaging process. Both models are run separately for each survey and each sex-specific five-year age group being split. In both cases, the first step is to take the binning schema present in the survey data being split and apply it to each single-year dataset in the training sample, inducing the same bins. Next, one of the models delineated below is used to predict what proportion of each bin should be allocated to each single year of attainment within that bin. Finally, the predicted proportions are normalized to ensure internal consistency for each bin. This step is necessary to adjust for the independent estimation of the proportion within each single year, guaranteeing that the final estimates will add up to 100% of the bin of educational attainment being allocated. All code used to run the models is made available in the S2 File.
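The first and last steps described above (inducing a binning schema on a single-year dataset, and normalizing independently predicted within-bin proportions) can be sketched in Python as follows. This is an illustrative sketch rather than the code in the S2 File: the function names, the five-bin schema, the uniform training distribution, and the raw predictions are all hypothetical.

```python
import numpy as np


def apply_schema(single_year_props, schema):
    """Collapse single-year proportions (index = years 0..18) into bins.

    `schema` is a list of inclusive (low, high) year ranges covering 0..18.
    """
    return np.array([single_year_props[lo:hi + 1].sum() for lo, hi in schema])


def split_bin(bin_total, predicted_within_bin):
    """Rescale independently predicted within-bin proportions so they sum to 1,
    then allocate the bin's population share across its single years."""
    predicted = np.asarray(predicted_within_bin, dtype=float)
    normalized = predicted / predicted.sum()
    return bin_total * normalized


# Hypothetical training dataset: equal population share at each of 0..18 years.
props = np.full(19, 1 / 19)
# Hypothetical five-bin schema.
schema = [(0, 0), (1, 4), (5, 8), (9, 12), (13, 18)]
binned = apply_schema(props, schema)

# Hypothetical raw model predictions for the 1-4 year bin; they sum to 0.95,
# so normalization is needed before allocating the bin.
raw = [0.20, 0.25, 0.25, 0.25]
allocated = split_bin(binned[1], raw)
print(allocated, allocated.sum())
```

After normalization, the single-year shares for the bin add up exactly to the bin's share of the population, which is the internal consistency property described above.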
Nested hierarchical mixed effects model. The first model uses the full set of all available training data, after the binning schema being split has been applied to each dataset, and a set of nested hierarchical mixed effects to capture the temporal and spatial trends in the distribution of proportions within each bin. The model is shown in Eq 2.
$p_{b,y}$ represents the logit-transformed proportion of individuals in bin $b$ who are assigned to the single year of education $y$. $\phi_l$ represents random intercepts for each country, which are nested within $\phi_r$, random intercepts for each region used in the Global Burden of Disease 2013 Study [21]. $\phi_a$ are random five-year age-group-specific intercepts, nested within $\phi_s$, sex-specific random intercepts. $\beta_0$ is a global intercept, and $\beta_1$ captures any overall secular trend.
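One way to write a model consistent with these definitions is given below. This is a reconstruction under stated assumptions rather than the published form of Eq 2: we assume the secular trend enters as a linear term in calendar year $t$, that a residual term $\epsilon$ is present, and that the random intercepts are denoted by $\phi$; the symbols $t$ and $\epsilon$ in particular are our own notation.

```latex
p_{b,y} = \beta_0 + \beta_1 t + \phi_r + \phi_l + \phi_s + \phi_a + \epsilon
```

Here $\phi_l$ (country) is nested within $\phi_r$ (region), and $\phi_a$ (five-year age group) is nested within $\phi_s$ (sex), following the description in the text.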
Table 2. Number of binning data sources used in the analysis. Shown as counts of unique country-years by the number of bins present in the binning schema of the survey.
Number of bins   5   6   7   8   9  10  12  13  14  15  17  18  Total
Sources         26  25  37   8   2   1   3   4  29   6   3  18    162
https://doi.org/10.1371/journal.pone.0208019.t002

Space-time distance model. The second model tests the hypothesis that geography and time are the most important determinants of the proportions within each bin, and that given the large amount of data available for use in this analysis, a more accurate prediction may be achieved by strategically subsetting the training dataset to only the most proximate data. The model first determines the distance between the survey being split and all available datasets in the training set, using a metric of space-time distance on a 0 to 1 scale. Distance in time is defined as the difference in years between the two datasets, divided by the total range of years present in the training dataset. Distance in space is created using a set of region and super-region groupings established by the Global Burden of Disease 2013 study [21]. Data from the same country have a spatial distance of 0, data in the same region have a distance of 0.33, data in the same super-region have a distance of 0.66, and all other data have a distance of 1. The space and time distances are combined using the formula shown in Eq 3:

$$D_F = \pi D_S + (1 - \pi) D_T \tag{3}$$
$D_F$ represents the final distance between two datasets, $D_S$ represents the distance in space, $D_T$ is the distance in time, and π is the space-time weight. If π has a value of 1, the full value of $D_F$ comes from spatial distance. If π is 0, distance in time has the full weight. For all intermediate values between 0 and 1, space and time distances are weighted according to π. The model then keeps the η closest datasets and calculates the mean proportion within each bin-year using the simple formula shown in Eq 4:

$$p_{b,y} = \beta_0 + \epsilon \tag{4}$$
$\beta_0$ is a global intercept used to calculate the average $p_{b,y}$, the proportion of individuals in bin $b$ with $y$ years of educational attainment. The hyper-parameters in this model, π and η, are optimized using a grid search and out-of-sample predictive validity, as detailed below. As a sensitivity analysis, we also tested a space-time distance crosswalk approach that weights training data by distance, as opposed to weighting all training data points equally (S3 Table).
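The space-time distance model described above can be sketched in Python as follows. This is a simplified illustration, not the code from the S2 File: the `Dataset` class, its field names, and the three toy training datasets are hypothetical, and we assume the within-bin proportions are averaged on the natural scale and then renormalized. The default values of π = 0.6 and η = 12 are the optimum reported in the grid search section.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    country: str
    region: str
    super_region: str
    year: int
    # Proportion of the bin at each single year (empty for the survey being split).
    within_bin_props: list = field(default_factory=list)


def spatial_distance(a, b):
    """Spatial distance on a 0-1 scale based on nested geographic groupings."""
    if a.country == b.country:
        return 0.0
    if a.region == b.region:
        return 0.33
    if a.super_region == b.super_region:
        return 0.66
    return 1.0


def combined_distance(target, other, pi, year_range):
    """Weighted combination of spatial and temporal distance."""
    d_s = spatial_distance(target, other)
    d_t = abs(target.year - other.year) / year_range
    return pi * d_s + (1 - pi) * d_t


def predict_props(target, training, pi=0.6, eta=12):
    """Average the within-bin proportions of the eta closest training datasets."""
    years = [d.year for d in training]
    year_range = (max(years) - min(years)) or 1  # avoid division by zero
    ranked = sorted(training, key=lambda d: combined_distance(target, d, pi, year_range))
    nearest = ranked[:eta]
    n = len(nearest[0].within_bin_props)
    mean = [sum(d.within_bin_props[i] for d in nearest) / len(nearest) for i in range(n)]
    total = sum(mean)
    return [m / total for m in mean]  # normalize so the bin sums to 1


target = Dataset("A", "R1", "S1", 2000)
training = [
    Dataset("A", "R1", "S1", 1990, [0.1, 0.2, 0.3, 0.4]),
    Dataset("B", "R1", "S1", 2000, [0.2, 0.2, 0.3, 0.3]),
    Dataset("C", "R9", "S9", 2000, [0.7, 0.1, 0.1, 0.1]),
]
pred = predict_props(target, training, pi=0.6, eta=2)
print(pred)
```

In this toy example the dataset from an unrelated super-region is the most distant and is excluded when η = 2, so the prediction is the average of the two more proximate datasets.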

Comparison to standard duration algorithm
To compare the standard duration algorithm to the two bin-splitting models defined above, we used the typical normative assumptions used by Barro and Lee [29][30][31][32], De la Fuente and Doménech [34][35][36], and others. These include assuming that a) any individual who reports completing a level of education dropped out immediately after that level's completion and b) individuals who report starting but not finishing a level of education drop out on average at the exact midpoint of the range of possible values.
After the space-time distance model was optimized, we used the model with the best-performing parameter set, as well as the nested hierarchical mixed effect model and the standard duration method to split the actual binned data to single year values and produce average years of schooling estimates. Differences in the estimates were analyzed to understand the implications of model selection on global trends.

Out-of-sample predictive validity
In order to assess the performance of our two bin-splitting models, and compare them to the standard duration method, we use out-of-sample predictive validity testing [41][42][43][44][45][46]. We take the approach of artificially binning a large quantity of single-year surveys in our dataset, and comparing the performance of each model in splitting bins into their component single years to the true single-year proportions. We use ten-fold cross validation, which entails apportioning our database of single-year education data sources into ten equally sized sections. We iteratively 'knock out' each 10% section of the data by randomly selecting and applying a binning schema from our set of binned education datasets to the single-year data. We then use the remaining 90% to predict the newly binned testing data, for which we know the true single-year proportions, and evaluate the model performance. This process was repeated three times to ensure the results are not a function of idiosyncrasies in one particular test-train split, and the results were averaged [47]. Knock-outs were randomly chosen in a country- and data-source-specific fashion, to mimic the true pattern of observed data availability, where binning patterns are almost always specific to a country and data source [46]. For the space-time distance model, three iterations of validation were completed for each parameter set in a grid search of pairs of π and η values. We iterated over all combinations of π values from 0 to 1 in 0.1 increments, as well as η values of 1, 2, 4, 6, 8, 12, 16, 20, 40, 60 and 80.
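The knock-out design and hyper-parameter grid described above can be sketched in Python as follows. This is an illustrative sketch only: the function name `assign_folds`, the round-robin allocation of groups to folds, and the example country and source labels are our own assumptions about one reasonable way to keep each country and data source together in a single fold.

```python
import itertools
import random


def assign_folds(groups, n_folds=10, seed=0):
    """Assign every (country, source) group to exactly one fold, so that all
    datasets from the same country and data source are knocked out together."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    fold_of = {g: i % n_folds for i, g in enumerate(unique)}
    return [fold_of[g] for g in groups]


# Grid of hyper-parameters described in the text.
pi_values = [round(0.1 * i, 1) for i in range(11)]        # 0.0, 0.1, ..., 1.0
eta_values = [1, 2, 4, 6, 8, 12, 16, 20, 40, 60, 80]
grid = list(itertools.product(pi_values, eta_values))

# Hypothetical dataset labels as (country, data source) pairs.
groups = [("KEN", "DHS"), ("KEN", "DHS"), ("PER", "census"), ("IND", "DHS")]
folds = assign_folds(groups, n_folds=2)
print(len(grid), folds)
```

Each of the 121 parameter pairs would then be evaluated on every held-out fold, repeated three times with different random splits, and the error metrics averaged.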
The outcome measures used to evaluate out-of-sample predictive validity included root mean squared error (RMSE), which represents the average model error, and median error, which highlights any prediction bias. Each statistic was calculated with respect to the models' prediction of both average years of schooling, as well as the standard deviation of educational attainment. As the primary metric of educational attainment, average years of schooling is of special importance in establishing model performance [28]. The standard deviation of education has been used in a number of studies to assess inequality of education, and was used to evaluate the various models' performance in estimating the full distribution of education [48][49][50]. As the difficulty of predicting within-bin proportions should decrease with a greater number of bins present in a schema, all predictive validity measures for each model were assessed both globally and with respect to the number of bins in the schema being split. Trends in bias and accuracy were also analyzed with respect to the magnitude of the true population mean being estimated to capture ways in which the effect of binning may vary between differently educated populations.
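The outcome measures above can be computed from predicted and true single-year distributions as in the following Python sketch. The function names and the toy numbers are hypothetical; we assume error is defined as predicted minus true, so that a negative median error indicates underestimation, and that the standard deviation is the population standard deviation of years of schooling.

```python
import numpy as np


def mean_and_sd(props):
    """Mean and standard deviation of years of schooling, given the proportion
    of the population at each single year (index 0 = zero years)."""
    props = np.asarray(props, dtype=float)
    years = np.arange(len(props))
    mean = (years * props).sum()
    sd = np.sqrt((props * (years - mean) ** 2).sum())
    return float(mean), float(sd)


def rmse(pred, true):
    """Root mean squared error across datasets."""
    diff = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def median_error(pred, true):
    """Median of (predicted - true); negative values indicate underestimation."""
    diff = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
    return float(np.median(diff))


# Toy example: true and predicted mean attainment for three held-out datasets.
true_means = [5.0, 8.0, 10.0]
pred_means = [4.5, 7.0, 10.0]
print(rmse(pred_means, true_means), median_error(pred_means, true_means))
```

The same two functions apply unchanged to the standard deviation of attainment by passing predicted and true standard deviations instead of means.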

Comparative model performance
Accuracy and bias. Overall, the standard duration method shows the highest level of bias among the three methods, globally underestimating educational attainment values, as well as increased error relative to the two other models tested. Table 3 shows overall RMSE and median error for both average years of schooling and the standard deviation of education for all bin-splitting models. The standard duration model showed a median error of -0.47 years, which represents a substantial downward bias in predictions of mean attainment. We also observed an upward bias in estimates of the standard deviation of attainment, with a median error of 0.14 years. The standard duration method also had the highest average error in both average years of schooling and the standard deviation of attainment.
The space-time distance model produced the most accurate predictions, shown by the lowest RMSE, in both mean years of schooling (0.26 as compared to 0.92 for the standard duration method) and the standard deviation of educational attainment (0.28 compared to 0.61 for the standard duration method), followed by the nested mixed effects model. The space-time distance model produced effectively unbiased estimates of both average attainment and the standard deviation of attainment. The nested mixed effects model predicted nearly unbiased estimates of mean attainment but showed a small upward bias in the prediction of the standard deviation of attainment, with a median error of 0.10.
Performance over number of bins. Fig 1 shows the RMSE and median error in average years of schooling and the standard deviation of attainment across the number of bins in the binning schema applied. Generally, as the number of bins in the binning schema increased, all models showed increased accuracy. This is unsurprising, as the difficulty of accurately predicting the proportion of individuals within each year rises directly with the width of the bins being split to single-year values. This effect was more pronounced for the standard duration model, which generally had the highest RMSE, and had markedly higher RMSE in mean attainment compared to both other models when the binning schema being split had fewer than 14 bins. The space-time distance model had the best performance in RMSE of mean attainment, ranging from an RMSE of 0.12 years for 15-18 bins to 0.32 years for 6 bins. As the number of bins decreased, both the space-time distance and nested mixed effects models showed slight increases in the RMSE of mean attainment. The space-time distance model had the most accurate predictions of the standard deviation of education across all numbers of bins, with an RMSE ranging from 0.14 for 15-18 bins to 0.36 for 6 bins, while the nested mixed effects model performed intermediately at higher numbers of bins and similarly to the standard duration method with fewer bins. The space-time distance model produced nearly unbiased predictions of both the standard deviation and average years of education across all numbers of bins, with a highest absolute median error value of 0.001, as compared to 0.19 for the nested mixed effects model, and 1.10 for the standard duration method. The nested mixed effects model showed less biased predictions of mean attainment compared to the standard duration method, but both showed similar levels of upward bias in predictions of the standard deviation of attainment as the number of bins decreased, with a maximum median error for the standard duration method of 0.33 for 8-13 bins, and 0.26 for the nested mixed effects model at 4-5 bins. The standard duration method produced downward biased estimates of mean attainment and upward biased predictions of the standard deviation of attainment, with both phenomena inversely related to the number of bins being split.
Fig 1. Space-time distance model results shown using the hyper-parameter set with optimal RMSE in mean attainment. Overall the space-time distance model has the lowest error and bias, the nested mixed effects model performs intermediately, and the standard duration method has the poorest performance. All metrics of performance tend to improve as bin number increases, and the predictive task becomes easier. https://doi.org/10.1371/journal.pone.0208019.g001
Performance over true mean. Fig 2 shows the RMSE and median error in both average years and the standard deviation of education by the true mean attainment of the data being split in one-year increments. The RMSE of mean attainment estimates produced using the standard duration method increased as the true mean attainment of the population increased, up to a maximum of 1.68 for a true mean of 17-18 years. The median error of mean attainment from the standard duration method also became increasingly negative with higher true mean values, to a minimum value of -1.72 for a true mean of 17-18 years. Together these results show that the standard duration method underestimates mean attainment among more highly educated populations. The standard duration method also overestimates the standard deviation of attainment with higher true average attainment, although the RMSE in the standard deviation seems less clearly related to true average education.
The nested mixed effects model has poor accuracy in predicting the standard deviation of education, especially in lower true mean attainment populations. The model does perform better than the standard duration model in RMSE of mean attainment, although worse than the space-time distance model. It also generally produces unbiased predictions of both average years and the standard deviation of education, although it shows some increased bias in higher true mean populations. The space-time distance model performs the most accurately of the three models, showing the smallest RMSE in the mean and standard deviation of attainment across all levels of true mean attainment. For example, at a true mean attainment of 9-10 years, the RMSE in mean attainment for the space-time distance model is 0.24, compared to 0.43 for the nested mixed effects model and 1.00 for the standard duration method. The model shows almost no bias until higher true levels of attainment, where a slight increase in downward bias is present, as evidenced by a median error in mean attainment of -0.33 for a true mean of 17-18. The model has relatively stable RMSE in both the mean and the standard deviation of attainment across true mean attainment values.

Space-time distance model hyper-parameter grid search
Fig 3 shows the RMSE and median error in both average years and the standard deviation of education across the grid of hyper-parameters used in the space-time distance model that were tested using out-of-sample predictive validity. All outcome measures proved to be remarkably planar near their optima, with large areas of virtually identical RMSE and median error values near optimal parameter values. All predictive validity measures were also generally smooth with respect to hyper-parameters, and there were few local extrema.
The bias in the mean and standard deviation of attainment was close to zero across the vast majority of the grid of hyper-parameters, so optimal parameter values were chosen based on RMSE. The best hyper-parameter set with respect to RMSE in average attainment is a π of 0.6 and an η of 12. This parameter set also had very close to optimal predictive validity values in all other aspects tested, and was therefore chosen as the single best set for use in the model.

Impact of standard duration method on global education trends
In order to explore the potential impact of the use of the standard duration model on global education trends, we used each model to produce average years of schooling estimates using our binned education data. Fig 4 shows a scatterplot of the average years of schooling estimated using both the standard duration method and the space-time distance model, which was chosen due to better performance than the nested mixed effects model in out-of-sample validity testing. A line of equality shows the point at which the two models produce identical estimates for the same country-year-sex and five-year age group. In general, similar trends are seen in the global education estimates to those observed in the out-of-sample predictive validity exercise. The standard duration model tends to produce lower estimates of average years of schooling, although this is not observed in all circumstances. There are also large differences in the deviations between the predictions of the two models, with some regions tending to have predictions further from the line of equality. This suggests that, as indicated by out-of-sample predictive validity testing, the standard duration method is biased to a differing degree depending on context, therefore affecting the comparability of results across different geographies and time periods.

Conclusions
This study represents the first evaluation of the standard duration method for mapping binned education data onto single-year values and calculating average years of schooling. This method has been used widely in all major attempts to measure educational attainment, from the early estimates of Psacharopoulos and Arriagada to the more recent work by Barro and Lee and others. Average years of schooling estimates produced in this manner have been used widely, from being an essential component of the Human Development Index, to a key indicator for UNESCO and the World Bank, and a key measure of socio-economic status in numerous multi-country health studies. Nevertheless, as we show in this analysis, the standard duration method produces inaccurate and biased estimates of average years of schooling. This effect is uneven across the number of bins of attainment present in binning schemas and the level of true population mean years of schooling, which nearly assures that use of the method will produce less comparable results between geographies and time periods being measured. We show that the standard duration method is generally less accurate than models attempting to split bins into single years and tends to substantially underestimate the average attainment of more highly educated populations. The use of this method has likely deflated global estimates of educational attainment, and induced inconsistencies between countries, in the most widely used education estimates at the time of publication. It is therefore clear that although the method is a very convenient approximation, it should only be used very cautiously, and comprehensive global estimates of educational attainment should employ a more nuanced approach.
Fig 3. Hyper-parameter grid search predictive validity. Predictive validity metrics for the mean and standard deviation of attainment predictions of the space-time distance model, over the grid of hyper-parameters tested using out-of-sample predictive validity. The grid search was remarkably planar in the middle of the search space, as the model does not seem to be that sensitive to hyper-parameter selection. The best hyper-parameter set with respect to RMSE in average attainment is a π of 0.6 and an η of 12, which was chosen as the hyper-parameter set to split the binned data. https://doi.org/10.1371/journal.pone.0208019.g003
This study demonstrates that an effective way to split bins of educational attainment into single-year values is to match the data source in question with a number of other survey or census datasets that are close in terms of geography and time, and take the average of their school dropout pattern. While our analysis operationalizes this process with parameters, the specific number of surveys used and the relative emphasis of space and time seem to have a limited effect on the results. Therefore, the method could likely be replicated or possibly even improved upon with expert opinion, especially when splitting a small number of surveys. Given the current widespread availability of single-year education datasets, most researchers should have no trouble finding suitable datasets to match with their binned education data sources.
Beyond improvements in accuracy and reductions in bias, the use of a space-time distance model represents a highly flexible approach to combining education data. The model allows for the use of any kind of binned and single-year data in concert. For many geographies it is only possible to find binned education data while for others only single-year data are available. The use of a space-time distance model allows for the combination of these data sources without concerns of compositional bias. Furthermore, the space-time distance approach produces highly unbiased single-year proportions with only minimal inaccuracy. Any lingering inaccuracy can be propagated through to final education estimates with the use of uncertainty intervals. The other advantage of the approach outlined in this analysis is that by using collapsed forms of person-level survey and census data to train the model, it is possible to split education bins for any custom age or sex groupings found in a given data source.
Fig 4. Average years of schooling estimated using the standard duration method (X axis), and the space-time distance model (Y axis) using best performing hyper-parameters. A line of equality added to show the point at which the two models produce identical estimates. Each point is a country-year-sex and five-year age group. Overall the predictions from the standard duration model are lower, with substantial differences between super-region groupings used in the GBD 2015 study, suggesting differential bias. https://doi.org/10.1371/journal.pone.0208019.g004
A main limitation of this study is that the results are representative only of the single-year training data and binning schemas used for out-of-sample predictive validity. There may be specific data sources that some researchers may wish to use that were not included in our analysis. Nevertheless, given that we have included 1,680 country-year datasets in this analysis, which to our knowledge represents the most comprehensive collection of education data to date, our results should generalize robustly to most geographies and time periods. Another factor for consideration in interpreting our results is that although we have characterized the variation in accuracy and bias over the principal dimensions that could affect model performance (the overall average level of education in the population and the number of bins present in the binning schema used), it is possible that other dimensions we are not aware of exist that should be considered in model selection. Similarly, although we have optimized several key hyper-parameters used in the proposed methods, other parameters and model combinations were not tested due to limits of computational burden. We hope that future studies may corroborate our findings and further refine the proposed methods. Furthermore, it is important to note that although average years of schooling is the most ubiquitously used measure of educational attainment, it does entail other sources of bias that may complicate its use [31][37][38][39]. We do not argue here that average years of schooling is the best metric of educational attainment; however, given its prevalence in social science and health research, we recognize that evaluating and ameliorating bias in its estimation remains paramount.
Given the importance of educational attainment as a social determinant of health, indicator of socio-economic status, and metric of development, the accurate measurement of education is an important aim. We have made available the code used in the analysis, and linked to a large number of data sources, so that other researchers may find it straightforward to implement a space-time distance model to produce unbiased and more accurate single-year education values. The best-performing method explored in this study is relatively easy to emulate, requiring only arithmetic and publicly available data from spatially and temporally proximate populations to implement. Therefore, we hope that its adoption in academic endeavors using measures of educational attainment can reduce bias, as well as improve the monitoring and scientific understanding of education.