Calculating fertility and childhood mortality rates from survey data using the DHS.rates R package

The Demographic and Health Surveys (DHS) are a major source for many demographic and health indicators in developing countries. Although these indicators are well defined in the literature, using survey data to calculate some of these indicators has never been an easy task for data users. This paper presents the DHS.rates software, a user-friendly R package developed to calculate fertility indicators, such as the total fertility rate, general fertility rate, and age-specific fertility rates, and childhood mortality indicators, such as the neonatal mortality rate, post-neonatal mortality rate, infant mortality rate, child mortality rate, and under-5 mortality rate, from the DHS data. The package allows for national and subnational indicators. In addition, the package calculates sampling error indicators such as standard error, design effect, relative standard error, and confidence interval for each demographic indicator. The package can also be used to calculate the same indicators from other population surveys such as the Multiple Indicator Cluster Survey (MICS).


Introduction
For the last 35 years, the Demographic and Health Surveys (DHS) Program has conducted more than 300 national surveys in more than 90 countries in Africa, Asia, and South America. The DHS surveys are based on nationally representative samples that allow for national and regional estimates. Standard DHS surveys are designed to provide information about fertility, family planning, maternal and child health, and childhood mortality levels. Most of the DHS surveys follow a two-stage sampling design, where census Enumeration Areas (EAs) are selected in the first stage as Primary Sampling Units (PSUs). In the second stage, a sample of households is selected from each selected PSU. From the selected households, all women age 15-49 who slept in the household the night before the survey are eligible to complete a questionnaire designed for women. In addition, in all households or in a subsample, all men of reproductive age (typically 15-49, 15-54, or 15-59) might be eligible to complete a questionnaire designed for men.
DHS surveys can be used in many ways. In addition to the data tables published in the DHS full reports and the key indicators reports, data users can download the survey datasets in PLOS

Fertility and childhood mortality indicators
The direct estimation methods are broadly used to estimate fertility and childhood mortality indicators based on household survey data. These methods were adopted by the World Fertility Survey (WFS), which ran from 1972 to 1984, and subsequently by the DHS Program. The WFS/DHS approach is well documented in several publications [8,9,10,11,12,13]. The WFS/DHS approach has also been used by other household survey programs, such as the MICS. In the direct estimation methods, data about the retrospective birth history are collected and used for the computation of fertility and mortality indicators [14]. In all DHS surveys, in addition to the background information collected from women age 15-49, information about birth history is collected from each interviewed woman. The birth history includes month and year of birth of each child, sex of each child, survival status of each child, age of each surviving child, and age at death of each deceased child. Such information allows for fertility rates and childhood mortality rates to be calculated from the DHS survey data. Since the rates are calculated based on birth histories collected in a survey, all rates are calculated as occurrences/exposure rates, where the numerator is the number of occurrences of some event of interest and the denominator is the population exposed to risk of the event within a reference period. In this section we will briefly introduce the definition of these fertility and childhood mortality rates and will introduce the survey variables required for calculating these rates. These definitions and methods are in line with those of the DHS, as detailed in the Guide to DHS Statistics [8].

Fertility indicators
Age-specific fertility rates (ASFR). The ASFR is the number of births occurring during a given year or a reference period per 1,000 women-years of exposure for a specific age group, in single years or five-year age groups. In the DHS surveys, ASFR is calculated as the number of births during a reference period of three years preceding the survey divided by the womenyears of exposure to childbearing. It is calculated for seven age groups of five years each (15-19, 20-24, 25-29, 30-34, 35-39, 40-44, and 45-49). For each age group, where B a denote the number of births to women in age group a during the reference period, and E a denote the number of women-years of exposure in age group a during the same reference period, the ASFR in age group a can be written as follows: From the DHS data, the information on the exact date of birth of each child can be used to directly calculate the numerator B a . To calculate the denominator E a , the exact date of birth of each woman can be used to count the number of women-years of exposure in each age group, taking into consideration that a woman can contribute to more than one age group during the reference period. For illustrations and examples about the calculation of the women-years of exposure, see [5] and [9]. In addition to measuring the fertility rate in different age groups, calculating the ASFR is crucial for measuring the TFR, as explained below.
Total fertility rate (TFR). The TFR is a hypothetical measure of women's fertility. It can be defined as the number of children who would be born per woman if she were to pass through the childbearing years bearing children according to a current schedule of age-specific fertility rates and not subject to mortality. In standard DHS surveys, ASFR a is calculated for a reference period of three years preceding the survey for the seven five-year age groups. Thus the TFR can be written as follows: General fertility rate (GFR). The GFR is the average number of children currently being born to women of reproductive age, calculated as the total number of births during a reference period divided by the total number of women-years of exposure, based on all women age 15-44 during the same reference period. In standard DHS surveys, the GFR can be written as follows:

Childhood mortality indicators
A childhood mortality rate can be defined as the number of deaths per 1,000 live births during a given year or reference period. In DHS surveys, five childhood mortality rates are calculated using a synthetic cohort life table approach, in which mortality probabilities for small age segments are combined into the more common age segments [8,14]. The five indicators are calculated based on a reference period of five years or ten years preceding the survey and can be defined as follows: 1. Neonatal mortality rate (NNMR): the probability of dying between birth and exact age 1 month; 2. Post-neonatal mortality rate (PNMR): the probability of dying between exact ages 1 month and 1 year, usually calculated as the difference between IMR and NNMR; 3. Infant mortality rate (IMR): the probability of dying between birth and exact age 1 year; 4. Child mortality rate (CMR): the probability of dying between exact ages 1 and 5 years; 5. Under-5 mortality rate (U5MR): the probability of dying between birth and exact age 5 years.
In the DHS approach, as documented in the Guide to DHS Statistics [8], the calculations of the five mortality rates defined above start with calculating the component death probabilities for eight age segments 0, 1-2, 3-5, 6-11, 12-23, 24-35, 36-47, and 48-59 months of completed age. Each component death probability is defined by a time period and an age interval, within which three birth cohorts of children can be defined as indicated in Fig 1. As the figure shows, where the lower and upper limits of the age interval are given by a l and a u , respectively, and the lower and upper limits of the time period are given by t x and t y , respectively, cohort 2 (children born between dates t x −a l and t y −a u ) fully contributes to the deaths and the children-years of exposure (defined as ABCD), whereas cohort 1 (children born between dates t x −a u and t x −a l ) and cohort 3 (children born between dates t y −a u and t y −a l ) partially contribute to the deaths and the exposure. For example, where the reference period is the five years that ended in September 2018 (t y ) and started in September 2013 (t x ), for the age segment 3-5 (a l = 3 and a u = 6 as defined in Table 1), the three cohorts can be defined as follows: Note that children born between July 2018 and September 2018 are not part of the three cohorts and therefore they do not contribute to the deaths and the exposure of the component death probability for the age segment 3-5.
As Fig 1 indicates, each of cohorts 1 and 3 is divided into two halves by AB and CD, respectively. Therefore, an assumption can be made that cohorts 1 and 3 are exposed to one-half of the total exposure and one-half of the deaths between ages a l and a u during time period t x to t y . After the three cohorts are defined, the component death probabilities p i can be calculated where i2{1,2,3,4,5,6,7,8} corresponds to the eight age segments 0, 1-2, 3-5, 6-11, 12-23, 24-35, 36-47, and 48-59 months of completed age, where the age segments bounds can be defined as in Table 1.
Taking the partial and full exposure into account, the component death probabilities can be calculated as follows: where D i1 denotes the number of deaths for children in age segment i (between ages a l and a u ) during a time period between t x and t y , and E i1 denotes the number of survivors at the lower where an assumption is made that all the deaths reported in the survey for cohort 3 for a time period that ends with the date of the survey represent one-half of the deaths that will have occurred to the cohort between ages a l and a u . Once the component probabilities p i are calculated for each age segment, the childhood mortality rates can be calculated as follows: Example: The contribution of children to cohorts. In this example, the contribution of seven children to the different cohorts is illustrated within different age segments. To simplify the illustration, the lexis diagram in Fig 2 is limited to the first four age segments and the reference period is limited to only one year where the reference period ends on December 1 st of 2016, a date that precedes the date of the survey. For each of the first three age segments, the three cohorts defined in whereas surviving children are indicated by arrows. All children were born on the first of the event month, and all children who died passed away on the first of the event month, except for child F who died during the first month of age. The seven children contribute to the component probabilities of the different age segments as follows: 1. Child A, born in August 2015 and died in May 2016, belongs to cohort 1 of age segment 3 and cohort 2 of age segment 4, and therefore this child partially contributes to the exposure of p 3 and fully contributes to the exposure and deaths of p 4 .
2. Child B, born in September 2015 and died in February 2016, belongs to cohort 1 of age segment 2 and cohort 2 of age segments 3 and 4, and therefore this child partially contributes to the exposure of p 2 , fully contributes to the exposure and deaths of p 3 , and fully contributes to the exposure of p 4 .
3. Child C, born in October 2015, belongs to cohort 1 of age segment 2 and cohort 2 of age segments 3 and 4, and therefore this child partially contributes to the exposure of p 2 and fully contributes to the exposure of p 3 and p 4 .
4. Child D, born in December 2015, belongs to cohort 2 of age segments 1, 2 and 3 and cohort 3 of age segment 4, and therefore this child fully contributes to the exposure of p 1 , p 2 and p 3 and partially contributes to the exposure of p 4 .

Child E, born in February 2016 and died in
October 2016, belongs to cohort 2 of age segments 1, 2 and 3 and cohort 3 of age segment 4, and therefore this child fully contributes to the exposure of p 1 , p 2 and p 3 and partially contributes to the exposure and deaths of p 4 .
6. Child F, born in April 2016 and died before the end of the same month, belongs to cohort 2 of age segments 1, 2 and 3 and cohort 3 of age segment 4, and therefore this child fully contributes to the exposure of p 1 , p 2 and p 3 , partially contributes to the exposure of p 4 , and fully contributes to the deaths of p 1 .
7. Child G, born in May 2016 and died in October 2016, belongs to cohort 2 of age segments 1, 2 and 3 and cohort 3 of age segment 4, and therefore this child fully contributes to the exposure of p 1 , p 2 and p 3 , partially contributes to the exposure of p 4 , and fully contributes to the deaths of p 3 .

Survey variables needed to calculate fertility and childhood mortality rates
As described in the definitions above, fertility rates and childhood mortality rates are calculated based on information about number of births and deaths for each woman during a reference period. This information is collected in DHS surveys under the birth history section. In DHS surveys, the birth history data are collected from all de-facto women age 15-49, or from de-facto women age 15-49 who have ever been married, where only ever-married women are eligible for the women's interview. The birth history variables are released in two types of datasets, the women dataset (IR) and the births dataset (BR).
In the majority of DHS surveys, a full birth history is collected from interviewed women about each live-born child to whom the woman has ever given birth. The birth history section includes variables for the date of birth (month and year) of each live birth, sex of each child, survival status of each child, age of each surviving child, and age at death of each deceased child. (For more details about the DHS birth history, see [8] and [10].) In addition to the birth history variables, date of birth of each woman and date of interview are captured in the women's questionnaire. In DHS surveys where data is collected only from ever-married women (ever-married women surveys), "all-women factors" are calculated to adjust the ever-married samples to estimate statistics based on all women. These factors have to be accommodated in calculations where the indicator is reported for all women. All dates are coded in month and year in separate variables, which are then recoded into a Century Month Code (CMC) format that corresponds to the number of months since January 1900, calculated as follows, where Y denotes the year of the event and M denotes the month of the event.
In addition to the variables needed to calculate the rates, other variables are needed to account for the sampling design and for estimating the standard error, such as the women's survey weight, sampling cluster, and sampling strata. All information and variables needed to calculate fertility and childhood mortality rates are listed in Table 2.

The DHS.rates R package
In this section we introduce the DHS.rates R package that was developed to calculate fertility and childhood mortality rates from survey data. In addition to replicating the fertility and childhood mortality rates published by the DHS Program in the survey reports or in the STATcompiler, the DHS.rates package allows data users to calculate the rates: 1. for different reference periods, more or less than three years for fertility rates or more or less than five years for childhood mortality rates; 2. based on calendar years so that the end of the reference period is not necessarily the date of the survey; 3. for different sub-populations or domains other than the ones produced by the DHS Program; and 4. based on surveys other than the DHS as long as the required variables are available.
The four attributes above provide data users with great flexibility in dealing with a large number of research problems related to fertility and childhood mortality rates based on survey data. In addition, the DHS.rates package was developed to be a user-friendly tool; while the user needs only to read the survey dataset in the R environment, all the complicated calculations of the fertility and childhood mortality rates are performed in the backend of the package away from the user.

The package functions and parameters
The current version of the DHS.rates, version 0.7.0, includes the following three functions: 1. fert to calculate the fertility rates (TFR, GFR, and ASFR) and to estimate standard error, design effect, relative standard error, and confidence interval for each rate.
2. chmort to calculate the childhood mortality rates (NNMR, PNMR, IMR, CMR, and U5MR) and to estimate standard error, design effect, relative standard error, and confidence interval for each rate.

The functions outputs
As described in the previous section, along with the fertility and mortality rates, the fert and the chmort functions estimate standard error (SE), relative standard error (RSE), and confidence interval (CI) for each rate. Moreover, the two functions estimate design effect (DEFT) for each rate as the ratio between the standard error using the given sample design and the standard error that would result if a simple random sample had been used. In addition the two functions provide the weighted and the unweighted women-years and children-years of exposure, (WN) and (N). The methods of calculating the standard errors in the DHS.rates package are in line with the DHS approach detailed in the DHS Sampling and Household Listing Manual [15]. Since the ASFR and GFR can be classified as ratio estimates, where a numerator y can be B a or ∑ a2A B a and a denominator x can be E a or (∑ a2A E a )−E 45−49 (see Eqs 1 and 3), the Taylor linearization method is used to estimate the SE for the ratio r = y/x as follows: seðrÞ ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 1 where m h denotes the total number of clusters selected from stratum h, y hi denotes the sum of the weighted values of variable y (B a or ∑ a2A B a ) in cluster i in stratum h and x hi denotes the sum of the weighted exposures (E a or (∑ a2A E a )−E 45−49 ) in cluster i in stratum h.
Unfortunately, using the Taylor linearization method is not possible in the case of complicated indicators, such as the TFR and childhood mortality rates. In this case, replication methods, such as the Jackknife repeated replication method, can be used to estimate the standard error based on replications from the survey sample. According to the DHS approach, a nonstratified Jackknife repeated replication is used, where each replication considers all but one cluster in the calculation of the estimates. Therefore, the total number of replicates should be the same as the total number of sample clusters in the survey. Where a rate r can be the TFR or Table 3. Functions, parameters, and description. DHS.rates.

Function Parameters Description
Fert Data.Name The DHS women (IR) dataset or data from other survey with the same format Indicator Type of fertility rate to be calculated, "tfr", "gfr", or "asfr" Calculating fertility and childhood mortality rates from survey data any of the childhood mortality rates, the standard error of a rate r is calculated as follows: seðrÞ ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 1 kðk À 1Þ X k i¼1 ðkr À ðk À 1Þr ðiÞ À rÞ where r (i) denotes the rate r calculated based on the replicate i where cluster i is excluded, and k denotes the total number of replicates or clusters. The standard errors in Eqs (12) or (13) are used to calculate RSE, the ratio between se(r) and rate estimate r, and confidence interval (CI) as r ± Za 2 se(r).

Illustrations
In this section, several examples will illustrate the use of DHS.rates functions (fert, chmort and chmortp) to calculate the fertility rates (ASFR, TFR, and GFR), the childhood mortality rates (NNMR, PNMR, IMR, CMR, and U5MR), and the childhood component death probabilities.

Examples of fert
The following example shows how to calculate fertility rates (ASFR, TFR, and GFR) based on data from a DHS survey. In the code below, in addition to installing and calling the DHS. rates package, it is recommended to install and call the haven package that will allow for reading the data into the R environment [16]. In this example, the read_dta function was used to read a Stata file of the women (IR) dataset of the 2015 Zimbabwe DHS (ZDHS) that is defined as ZWIR71 in the R environment. In the code below, the fert function is used to calculate the ASFR, where the Indicator argument is declared as "asfr". In addition to the rates for the seven age groups, the   The code above is equivalent to the code below, where several arguments are explicitly identified and set to the default values in the function. Some of these arguments are very helpful, especially where datasets other than the DHS are used.
> fert(ZWIR71,Indicator = "asfr", CL = 95, Cluster = "v021", + Strata = "v022", Weight = "v005", + Date_of_interview = "v008", The function results can be saved in separate matrix, ZW_asfr, as follows: > ZW_asfr <-fert(ZWIR71,Indicator = "asfr") Similarly, the fert function can be used to calculate the GFR or the TFR, where the Indicator argument is declared as "gfr" or "tfr" respectively, as below. Unlike the ASFR and GFR, to produce the standard error and the other precision indicators for the TFR, the JK argument should be declared as JK = "Yes" so that the Jackknife repeated replication method will be used to estimate the standard error, as below. Note that in addition to the standard output, the function produces the number of Jackknife replicates as the number of iterations reported below. In the above codes, the confidence level of the confidence intervals was set to the default 95%. The CL argument can be used to change the confidence level to 90, as below. The fert function allows for class level rates by declaring the Class argument. In the code below, the TFR is calculated for each wealth quintile. All the fertility rates above are calculated based on a reference period of 36 months (3 years) preceding the survey. The rates can also be calculated for different time periods; in the code below the TFR is calculated for 60 months (5 years) preceding the survey by declaring the Period argument, as below. Moreover, the rates can be calculated for any reference period by declaring the Period-End argument. In the code below, the TFR is calculated for a period that ended in September 2010 by declaring the PeriodEnd argument, as below. The package also is set to warn the user if the specified PeriodEnd extends beyond the dates of the survey. In the example below, the PeriodEnd was set to October 2016, although the survey fieldwork was in July-December 2015.
> fert(ZWIR71,Indicator = "tfr", PeriodEnd = "2016-10") Note: You specified a reference period that ends after the survey fieldwork dates. . . . . All the functions above can be used in case of data for ever-married women. In the example below, the TFR is calculated based on the ever-married women 2014 Bangladesh DHS (BDHS) by declaring the EverMW = "Yes" argument. Moreover, the AWFact argument can be used to declare the relevant all-women factor variable especially with the Class argument, as in the code below where AWFact = "awfactu" is used in the case of calculating the TFR for urban/rural areas. Note that in the previous example the AWFact argument is not declared, since the default all-women factor variable is set to the national factor, awfactt.  Similar to the fert function, several arguments can be used in the chmort function to declare variables with names other than the standard names used in the DHS surveys, such as Strata, Cluster, Weight, Date_of_interview, Date_of_birth, and Age_at_death. Also, arguments like CL, Class, Period, and PeriodEnd can be used in the chmort function to produce subnational rates and rates based on different reference periods and with a non-default 99% confidence interval, as below.  Examples for using other surveys: MICS. The following example shows how to use the DHS.rates package with other surveys such as the MICS. In this example, the read_sav function was used to read SPSS files of the women dataset (wm) and the births dataset (bh) of the 2017 Sierra Leone MICS. In the MICS, unlike the DHS, the birth history variables are not defined in the wm dataset. Therefore in the code below, the date of birth of child (BH4C) was merged from the bh file to the wm file using the child line number (BHLN) and reformatted the same way the birth history variables are formatted in the DHS women's datasets, where a separate variable is reserved for each birth up to maximum 20 children. In addition, the survey weight wmweight is defined. The data manipulation below requires the dcast function of the reshape2 package [17]. In the code below, the fert function is used to calculate the TFR, where the function arguments are used to identify the relevant names for the sampling strata, sampling clusters, survey weight, date of interview, and women's date of birth. In addition to calculating the TFR, the function produces a warning message to notify the user that not all the birth history variables b3_01:b3_20 existed in the data and that the missing variables were generated by the function. Warning message:

Make
In fert(SLwm06, Indicator = "tfr", Strata = "HH7", Cluster = "HH1",: Birth History variables b3_01:b3_20 are not complete; the missing variables were created Similarly in the code below, the chmort function is used to calculate the five childhood mortality rates based on the 2017 Sierra Leone MICS.  The DHS.rates web-application. All the DHS.rates functions are available for R nonusers through the DHS.rates Shiny, an interactive web-based application developed using the Shiny R package [18]. The DHS.rates Shiny is hosted on a Linux server that runs the Shiny Server program. The application can be found on the DHS Program website (https://rshiny. dhsprogram.com/apps/dhs.rates/). For more illustrations see S1-S3 Figs in S1 Appendix.

Discussion and future directions
In this paper we introduced the DHS.rates R package. The package was developed to be a useful tool for calculating fertility and childhood mortality rates based on survey data. The package is user-friendly, with several arguments to enable researchers to dig into different research topics. Although the package was designed to deal with the DHS surveys, where standard formats are used across all the datasets, the package arguments also accommodate other surveys as long as they collected the necessary data. In addition to fertility and childhood mortality rates, the package estimates precision indicators, such as standard error, design effect, relative standard error, and confidence interval.
As well as introducing the DHS.rates package functions in this paper, we presented several examples to illustrate the capabilities of the package.
The package functions are written in the R environment and are downloadable from http:// www.r-project.org. The first version of the DHS.rates package (version 0.1.0) was uploaded to the CRAN in March 2018. Since then, several versions have been released where minor edits were made or new functions were introduced. All the DHS.rates functions are available in full capacity for R non-users through the DHS.rates Shiny web-application. The DHS. rates package and the DHS.rates Shiny will continue to evolve as needed with upgrades to existing functions or new functions.

Availability
The DHS.rates package can be downloaded from the package page on the CRAN website (https://CRAN.R-project.org/package=DHS.rates) The DHS.rates Shiny can be found on the DHS Program website (https://rshiny. dhsprogram.com/apps/dhs.rates/).
The 2015 Zimbabwe DHS datasets can be downloaded from the data page (https:// dhsprogram.com/data/dataset/Zimbabwe_Standard-DHS_2015.cfm?flag=0) The 2014 Bangladesh DHS datasets can be downloaded from the data page (https:// dhsprogram.com/data/dataset/Bangladesh_Standard-DHS_2014.cfm?flag=0) The 2017 Sierra Leone MICS datasets can be downloaded from the data page (http://mics. unicef.org/surveys) Supporting information S1 Fig. DHS.rates Shiny. The web application is located in https://rshiny.dhsprogram.com/ apps/dhs.rates/ and composed of three tabs: Introduction, fert and chmort. In the introduction tab a brief background is provided and instructions are outlined. thank Bernardo Lanza Queiroz and two reviewers, Tim Riffe and another anonymous reviewer, for their helpful and constructive comments and suggestions that contributed to improving the final version of the paper.