
Stratum-specific health outcome estimation in Pakistan using double goal CART

Abstract

Post-stratification is applied when subpopulation membership is observed only for sampled values and the goal is to estimate stratum-specific parameters. This leads survey statisticians to two primary goals: classification of non-sampled units into the different strata and prediction of the values of the study variable. Regression models, on one side, optimize the prediction of the study variable's non-sampled values, while classification algorithms, on the other side, assign non-sampled cases to the different strata. Hence, it is crucial to address these two goals simultaneously when estimating stratum-specific parameters. This study introduces a double-goal classification and regression trees (CART) approach for estimating stratum-specific parameters. Theoretical properties of the total estimator are derived. An application to the estimation of health outcomes in different domains is given to delineate the practical significance as well as the efficiency of the proposed CART-based method. The proposed estimator of the population total performs better than the existing stratum-specific estimator in terms of relative efficiency for all choices of parameters. As an ensemble model, the random forest CART outperforms the other competing tree-based models and the homogeneous population model, which uses no auxiliary variable.

1. Introduction

Survey statisticians have made major advances in the science of probability sampling, but most practitioners oppose the use of uncontrolled sampling due to the large variation in sampling units. Stratified random sampling controls this diversification with respect to the key study characteristics while maintaining the sample's probabilistic nature. Numerous studies have been published on the modification and enhancement of stratification methods following Neyman's (1938) groundbreaking work [1]. Stratified sampling is justified only when the stratification variable is known prior to sample selection. However, because such variables fluctuate over time and the sampling frame is built using census data from several years earlier, it is difficult to obtain updated information about stratum indicators such as household size, socioeconomic status, and education in the majority of health-related surveys.

Post-stratification, on the other hand, refers to observing the values of the study variable and the stratum membership variable after the sample has been selected. For instance, a demographic survey typically cannot stratify by age, because the ages of individuals are not available until the sample is collected. Post-stratification is the practice of using auxiliary data in finite population parameter estimation to improve the precision and accuracy of the estimates. [2] showed that post-stratification can lead to an impressive reduction in the sampling effort required to obtain reasonable population estimates. Using a sampling strategy known as multiple inverse sampling, [3] attempted to overcome the challenge of post-stratification with large sample sizes, as post-stratification is as effective as ordinary stratification with proportional allocation. Moreover, [4] focused on the effects of the post-stratification procedure used in the labor force survey (LFS) and investigated whether a more precise estimator of the parameters can be obtained by using registered auxiliary information in post-stratification. Predictive modeling has also been used to examine post-stratification, with population values assumed to be random variables produced by a model and population quantities inferred using those models' predictive capabilities [5].

Aside from design-based estimators, model-based approaches rely on the model relationship between the response variable and the predictors to enhance the precision of estimates. Initially, [6] used a straightforward regression model with auxiliary variables to predict the totals of the non-sampled units as unknown random quantities. [7,8] predicted the non-sampled values of the study variable using a smooth function when estimating the finite population total. Moreover, [9] introduced a model-based estimator that works with a penalized spline regression function to obtain a model-assisted estimator of the population total using classical local polynomial regression (CLPR). After that, [10] employed a model-based approach to estimate the unknown parameters of the study variable using local linear regression (LLR). Similarly, [11] analyzed data from complex surveys by considering nonparametric estimation methods. Later, [12] proposed a novel method for estimating a finite population parameter that considers a linear combination of population values in a super-population scenario with a known basis function regression (BFR) model. [13] discussed applying linear, mixed, nonparametric, and machine learning techniques to estimate finite population parameters using complex survey data and auxiliary information. Under commonly used feature selection criteria in machine learning, the suggested estimator's prediction error variance was computed. In order to apply machine learning predictions to unobserved data, [14] suggested an active sampling technique for data subsampling; by overcoming design constraints, this method enhances performance in virtual simulation-based safety assessment of advanced driver assistance systems. Moreover, [15] suggested a method that incorporates data from several sources with accepted practices to obtain estimates that are precise and reliable.
They added that Big Data, which yields more rapid and detailed statistics, offers a solution to diminishing response rates and rising survey expenses. To make the transition from intended data to data-oriented statistics, it is necessary to understand the prerequisites for reliable inference. This goal is concretized through a number of statistical frameworks; however, these are broad approaches.

In machine learning, tree-based methods are favored for their ease of application and their ability to capture linear and/or non-linear relationships between variables without assuming a specific functional form. The Classification and Regression Tree (CART) algorithm predicts target variable values from covariates and provides easily interpretable results. CARTs perform categorization and prediction based on observed data and can be employed to predict the values of unobserved data. [16] examined the prediction performance of decision trees such as CART and compared the results with those of other tree-based methods. Further, [17] studied the prediction performance of decision trees such as CART, with a comparison also made between various tree-based algorithms. The performance of three non-parametric tree-based approaches was later examined by [18] for general forest mapping with high-resolution SPOT-HRG data, because traditional methods such as field surveys are time- and money-consuming. After that, [19] examined the effectiveness of software defect prediction as a research area in software engineering and the prediction capabilities of seven tree-based ensembles. Similarly, [20] stated that the decision tree algorithm is among the most important and efficient machine learning methods.

[21] investigated automatic diabetes prediction using random forest and gradient boosting classifiers; these tree-based ensemble methods, with proper data processing, hyper-parameter tuning, and oversampling, can achieve above 90% accuracy. Recently, [22] developed a model-assisted technique based on random forests and estimated the functional relationship between the survey variable and the auxiliary variables. They also established the theoretical properties of the procedure and derived the associated estimator. Additionally, a model calibration process for dealing with multiple survey variables was covered.

When separate estimates are needed in different study domains, model-based approaches may be used for two purposes: first, a model is used for the classification of units into the different study domains; second, a specific model is applied for the prediction of non-sampled values.

The main contribution of this paper is the use of two tree-based machine learning algorithms for obtaining separate estimates in different subpopulations, called domains, where domain membership is observable only in the sample. The method proposed in this study for the estimation of finite population parameters (the population total) uses a classification tree-based algorithm for classifying non-sampled units (the units not selected in the sample) into different strata (domains) and a regression tree-based algorithm for predicting the values of the study variable for the non-sampled units. To evaluate the performance of the suggested estimator and to assess the applicability of the method, we use bootstrap studies for two situations, taking different health-related variables as the variables of interest.

In Section 2, we provide an overview of the classical model-based stratum-specific estimator of the population total with its finite sample properties. Section 3 presents the proposed tree-based algorithm for estimating the stratum-specific total. Section 4 comprises bootstrap studies for two different cases to evaluate the performance of the stratum-specific total estimator. Section 5 concludes the study with some future recommendations.

2. Existing model-based estimation method

Let U = {1, 2, 3, …, N} be the set of serial numbers attached to the units in a finite population of size N. Further, let Y and X be the study and auxiliary variables with values yi and xi corresponding to the ith population unit for all i ∈ U. The population consists of H mutually exclusive and exhaustive strata whose membership is assumed to be unknown prior to the survey. The stratum membership variable for the hth stratum is defined as Ahi, which takes the value ahi = 1 if the ith unit belongs to the hth stratum and ahi = 0 otherwise, such that (1) Σ_{h=1}^{H} Ahi = 1 for all i ∈ U.

The stratum membership variables Ahi for h = 1, 2, …, H are modelled as independently distributed Bernoulli random variables with λh = E(Ahi), while Yi, given membership of the hth stratum, has mean μh and variance σh². The mean and variance of the product AhiYi can then be obtained as (2) E(AhiYi) = λhμh and (3) Var(AhiYi) = λhσh² + λh(1 − λh)μh².

The covariance between AhiYi and AhjYj for i ≠ j, i, j ∈ U is zero, as the Yi (conditionally) and the Ahi are independent random variables. Following Chambers and Clark (2012) [23], the expansion estimator for the hth stratum total Th = Σ_{i∈U} AhiYi is given by: (4) T̂h = N λ̂h ȳh, where ȳh is the sample mean for the hth stratum and λ̂h = nh/n is the estimator of λh. Taking the expectation of the prediction error gives (5) E(T̂h − Th) = (N/n) E(Σ_{i∈s} AhiYi) − E(Σ_{i∈U} AhiYi) = Nλhμh − Nλhμh = 0, with the unbiasedness condition E(λ̂h ȳh) = λhμh. The model variance of the prediction error can be obtained as (6) Var(T̂h − Th) = ((N − n)/n)² · n[λhσh² + λh(1 − λh)μh²] + (N − n)[λhσh² + λh(1 − λh)μh²]. The expansion estimator therefore has variance (7) Var(T̂h − Th) = (N(N − n)/n)[λhσh² + λh(1 − λh)μh²].
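The moments of the product AhiYi can be checked numerically. The sketch below is illustrative only: it assumes Ahi ~ Bernoulli(λ) independent of a normally distributed Yi with mean μ and variance σ², and verifies by Monte Carlo that Var(AY) matches the standard identity λσ² + λ(1 − λ)μ²; the parameter values are arbitrary.

```python
import random
import statistics

def simulate_var_AY(lam, mu, sigma, n_draws=200_000, seed=1):
    """Monte Carlo estimate of Var(A*Y), where A ~ Bernoulli(lam)
    and Y ~ Normal(mu, sigma), drawn independently of A."""
    rng = random.Random(seed)
    vals = []
    for _ in range(n_draws):
        a = 1 if rng.random() < lam else 0  # stratum membership indicator
        y = rng.gauss(mu, sigma)            # study variable
        vals.append(a * y)
    return statistics.pvariance(vals)

# Illustrative parameter values
lam, mu, sigma = 0.4, 5.0, 2.0
theory = lam * sigma**2 + lam * (1 - lam) * mu**2  # standard variance identity
empirical = simulate_var_AY(lam, mu, sigma)
```

With 200,000 draws the empirical variance agrees with the closed form to well within 5% relative error.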

The expansion estimator has two attractive features: it is the BLUP with respect to the model, and it compensates for the unknown stratum sizes. However, the expansion estimator does not exploit known auxiliary information. Classical model-based estimation approaches with auxiliary variables have filled this gap (see [13]). When the model relationship between the study variable and the auxiliary variables is non-linear, we can no longer rely on the classical model-based estimators. The tree-based estimation procedure fills this gap and aids in efficiency improvement for the estimation of finite population parameters.

3. Proposed model-based estimation method

Decision trees are non-parametric methods that screen the data into smaller, more "pure" or homogeneous groups known as nodes. An easy way to define "purity" is by increasing accuracy or by decreasing misclassification error. Decision tree models are suitable when there is good reason to suspect non-additive interaction among variables or when there are far too many variables under study. In general, a decision tree proceeds by evaluating whether a statement is true or false. CART is better at detecting such relationships than the use of interaction terms in linear models. Tree-based methods are favored for their ease of application, their ability to capture linear and/or non-linear relationships between variables without assuming a specific functional form, and because they do not assume that all study variables contribute equally. To categorize the non-sampled units into strata and to predict the values of the study variable for the non-sampled part, two tree-based methods are used simultaneously; we call this the double-goal classification and regression tree (DGCART) approach. Here, we modify the CART method to fulfil the dual objectives of stratification and prediction. The DGCART-based estimation algorithm is summarized in Table 1 and Fig 1.
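The double-goal idea can be illustrated with a minimal, self-contained sketch. Assuming a single auxiliary variable x and depth-one trees (stumps) for simplicity, one tree is grown on the 0/1 stratum indicator and another on the study variable, and both are then applied to non-sampled x values; the data and function names below are illustrative, not the authors' implementation.

```python
def sse(vals):
    """Sum of squared deviations from the mean (CART least-squares criterion)."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def stump(x, target):
    """Depth-one regression tree: pick the threshold on x that minimises
    the total within-child sum of squares of `target`."""
    best = None
    for t in sorted(set(x))[1:]:
        left = [v for xi, v in zip(x, target) if xi < t]
        right = [v for xi, v in zip(x, target) if xi >= t]
        if left and right:
            s = sse(left) + sse(right)
            if best is None or s < best[0]:
                best = (s, t, sum(left) / len(left), sum(right) / len(right))
    _, t, mean_left, mean_right = best
    return lambda xi: mean_left if xi < t else mean_right

# Sampled data (illustrative): auxiliary x, stratum indicator A_h, study variable y
xs     = [1, 2, 3, 4, 6, 7, 8, 9]
strata = [0, 0, 0, 0, 1, 1, 1, 1]
ys     = [10, 11, 10, 12, 20, 21, 19, 22]

classify = stump(xs, strata)  # goal 1: classify units into strata
predict  = stump(xs, ys)      # goal 2: predict the study variable

x_nonsampled = [2.5, 7.5]
pred_strata = [round(classify(xi)) for xi in x_nonsampled]  # -> [0, 1]
pred_values = [predict(xi) for xi in x_nonsampled]          # node means
```

In the full DGCART algorithm the trees are grown to many nodes with variance-reduction stopping rules, but the mechanics at each node are the same: classify by node composition, predict by node mean.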

Table 1. Illustration of DGCART for estimation of finite population total.

https://doi.org/10.1371/journal.pone.0294736.t001

We construct classification trees for the sampled data in Step 2; as a result, we obtain L nodes numbered 1, 2, …, l, …, L, represented by a set of classes C = {C1, C2, …, Cl, …, CL}. We then classify the nodes into the different strata according to the majority vote, i.e., the lth node is classified to the hth stratum if nhl = max {n1l, n2l, …, nHl}.

The stopping rule for the classification tree is based on the reduction in the variance of Ah (h = 1, 2, …, H). We define a reduction-in-variation function as (8) ΔV = V(CF−1) − V(CF), where CF is the set of classes at the final level of the classification tree, CF−1 is the set of classes at the level preceding the final level, and V(·) denotes the pooled within-node variance of Ah. At the final node, λh is estimated using the node-specific data, i.e., (9) λ̂h = nhF/nF, where nF is the total number of units at the final node and nhF is the number of those units belonging to the hth stratum.

The process continues as long as ΔV does not fall below a pre-specified value Δo. At this stage, one should ensure that the sample size at node t for a given h is at least 2, i.e., nth ≥ 2. Once the stopping criterion is met, we obtain the classified data C = {C1, C2, …, Cl, …, CL}. After classification of the non-sampled units into the different domains, we grow a regression tree from the sampled data, i.e., [y : x1, x2, x3, …, xp], for prediction of the values of the study variable.
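The stopping rule can be sketched as follows. This is an illustrative interpretation, assuming ΔV is the drop in pooled within-node variance of the 0/1 stratum indicator from one tree level to the next and that the threshold value is arbitrary; the function names are not from the paper.

```python
def within_node_variance(nodes):
    """Pooled within-node variance of a 0/1 indicator, where `nodes` is a
    list of lists of A_h values, one inner list per tree node."""
    n_total = sum(len(nd) for nd in nodes)
    total = 0.0
    for nd in nodes:
        lam = sum(nd) / len(nd)            # node-level estimate of lambda_h
        total += len(nd) * lam * (1 - lam) # Bernoulli variance, size-weighted
    return total / n_total

# One split of a mixed node into two purer children (illustrative data)
parent   = [[1, 1, 0, 0, 1, 0]]
children = [[1, 1, 1], [0, 0, 0]]

delta = within_node_variance(parent) - within_node_variance(children)
delta0 = 0.01                    # illustrative pre-specified threshold
keep_splitting = delta >= delta0 # stop once the variance gain drops below delta0
```

Here the split is perfectly informative, so the variance reduction (0.25) far exceeds the threshold and splitting would continue.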

Moreover, consider the set of nodes of the regression tree constructed on the sampled units. The values of the non-sampled units are predicted as the mean of the sampled values in a given node; for example, at the tth node, the value of yi is predicted as: (10) ŷi = ȳt = (1/nt) Σ_{j∈st} yj, where st is the set of sampled units in node t and nt is its size.

The stopping rule for the regression tree is based on the reduction in the variance of the study variable y at a given node. We define a reduction-in-variation function as: (11) ΔVy = V(C′F−1) − V(C′F), where C′F is the set of classes at the final level of the regression tree and C′F−1 is the set of classes at the level preceding it. At the final node, the mean and the variance of y are estimated from the node-specific data for the hth domain.

The predictive estimation problem starts by partitioning the total of the hth stratum into sampled and non-sampled parts. The ith value of the study variable for the non-sampled part is then predicted using the mean value of its class. The resulting tree-based estimator of the domain total is given by (12) T̂h = Σ_{i∈s} ahiyi + Σ_{i∈r} Âhiŷi, where s and r denote the sampled and non-sampled parts of the population, and, after some simplification, we have (13).

An estimate of rh can be obtained as: (14)

Inserting the estimated value of rh in (13), we get (15)

When the classification does not divide the data in a meaningful way, the combined stratum mean coincides with the overall stratum-specific mean and, as a result, the tree-based total estimator gives the same result as the expansion estimator. The prediction error of the tree-based total estimator can be written as (16)
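The reduction to the expansion estimator in the uninformative case can be verified with a small numerical example. This sketch is illustrative: it assumes that an uninformative tree assigns every non-sampled unit the overall membership probability λ̂h and predicted value ȳh, and the data are made up.

```python
# Sampled data: stratum-h indicator a_i and study values y_i (illustrative)
a = [1, 1, 0, 1, 0, 1]
y = [12.0, 15.0, 7.0, 14.0, 6.0, 13.0]
n, N = len(a), 20

n_h = sum(a)
lam_h = n_h / n                                           # estimated stratum proportion
ybar_h = sum(yi for ai, yi in zip(a, y) if ai) / n_h      # stratum sample mean

# Expansion estimator of the stratum total
T_expansion = N * lam_h * ybar_h

# Tree-based estimator when the tree is uninformative: every non-sampled
# unit receives membership probability lam_h and predicted value ybar_h
T_tree = sum(yi for ai, yi in zip(a, y) if ai) + (N - n) * lam_h * ybar_h
```

Algebraically, nhȳh + (N − n)(nh/n)ȳh = (N nh/n)ȳh, so the two estimators coincide exactly, as the text states.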

Applying the model expectation to Eq (16), the expected prediction error given the set of classes Cl is (17)

Inserting this in Eq (17), we get the conditional bias as follows (18)

The model bias term shrinks as the class-specific means approach the overall population mean, which is, however, the worst situation in terms of efficiency. To gain efficiency, some amount of bias must be accepted in the prediction process.

Similarly, we obtain (19)

Further, the variance of the prediction error of the proposed estimator is given as: (20)

4. Bootstrap studies

We conduct a bootstrap study using the Pakistan Maternal Mortality Survey dataset, 2019 [24], for two cases: (1) taking pregnancy losses as the study variable, and (2) taking delivery duration as the study variable. The dataset consists of N = 634 observations, after omitting rows with missing responses, with 28 variables (see details of the variables in Appendix A). Considering this dataset as the population, simple random samples of size n = 20, 30, 40, 50, 65, and 75 are drawn. Two separate trees are grown, one for the prediction problem and the other for the classification of the non-sampled units, using five different CART models and a random forest model. The models used in this study are described in Table 2.

Table 2. Details of DGCART Models used in the study (hyper-parameters).

https://doi.org/10.1371/journal.pone.0294736.t002

We used different decision-tree tuning parameters (hyper-parameters) to tune the trees. There are five CART models with different hyper-parameter values and one random forest model. The tree parameters include the maximum depth, which is intended to prevent overfitting; "min split", the minimal number of observations needed in a node for a split to be attempted; and "min bucket", the number of observations permitted in a terminal node. The value "None" indicates that no value was utilized for the relevant hyper-parameter in that model.

The expected absolute prediction error (EAPE) of the stratum-specific total estimator is obtained under the different models, i.e., k = 1, 2, 3, 4, 5, rf, as (21) EAPE(T̂h(k)) = (1/Q) Σ_{q=1}^{Q} |T̂h(k,q) − Th|, where h = 1, 2 and Q denotes the number of simulations. Further, the mean square prediction error (MSPE) of the stratum-specific total estimator is obtained under the different models as (22) MSPE(T̂h(k)) = (1/Q) Σ_{q=1}^{Q} (T̂h(k,q) − Th)², where h = 1, 2 and Q denotes the number of simulations.
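The bootstrap error measures can be sketched as follows. This is an illustrative stdlib-only version, assuming the expansion estimator as the estimator being evaluated and a synthetic toy population; the function name and parameter values are not from the paper.

```python
import random

def bootstrap_errors(pop_y, pop_a, n, Q=500, seed=7):
    """Monte Carlo EAPE and MSPE of the expansion estimator of the
    stratum total, over Q repeated simple random samples of size n."""
    rng = random.Random(seed)
    T_h = sum(yi for yi, ai in zip(pop_y, pop_a) if ai)  # true stratum total
    N = len(pop_y)
    abs_err, sq_err = 0.0, 0.0
    for _ in range(Q):
        idx = rng.sample(range(N), n)
        s_a = [pop_a[i] for i in idx]
        s_y = [pop_y[i] for i in idx]
        n_h = sum(s_a)
        if n_h == 0:
            continue  # no stratum-h units drawn; skip this replicate
        ybar_h = sum(yi for yi, ai in zip(s_y, s_a) if ai) / n_h
        T_hat = N * (n_h / n) * ybar_h  # expansion estimator
        abs_err += abs(T_hat - T_h)
        sq_err += (T_hat - T_h) ** 2
    return abs_err / Q, sq_err / Q

# Toy population: membership indicator and a stratum-dependent study variable
rng = random.Random(0)
pop_a = [1 if rng.random() < 0.4 else 0 for _ in range(200)]
pop_y = [rng.gauss(15 if ai else 8, 2) for ai in pop_a]
eape, mspe = bootstrap_errors(pop_y, pop_a, n=40)
```

By construction MSPE ≥ EAPE² (Cauchy–Schwarz), which provides a quick sanity check on any implementation of (21) and (22).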

In R, simple tree-based algorithms with choices of tree size, splitting criteria, the number of trees to be produced, etc., are obtained using the rpart package. Like other partitioning algorithms, rpart employs a metric to choose the optimum rule for dividing the data; for classification, it uses the Gini index as its splitting criterion.
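The Gini criterion that rpart uses to score candidate splits is easy to state directly. The sketch below implements the standard definition in Python for illustration (rpart itself is an R package); the labels are illustrative stratum names.

```python
def gini(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2 over class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_gain(parent, left, right):
    """Impurity reduction of a candidate split, weighting children by size."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# Illustrative node with two strata
parent = ["h1", "h1", "h1", "h2", "h2", "h2"]

# A perfectly informative split versus an uninformative one
gain_pure = gini_gain(parent, ["h1"] * 3, ["h2"] * 3)
gain_none = gini_gain(parent, ["h1", "h2"], ["h1", "h1", "h2", "h2"])
```

The pure split attains the maximum possible gain for a balanced two-class node (0.5), while the uninformative split gains nothing; the splitting algorithm simply chooses the rule maximizing this gain at each node.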

Case 1

Among South Asian nations, Pakistan has the highest rate of pregnancy loss (30.6 pregnancy losses per 1,000 total births) [25]. There is a paucity of literature on the lived experiences of Pakistani women who have experienced multiple stillbirths, despite the well-documented psychological effects of stillbirths on bereaved women [26]. Multiple stillbirths have a severe effect on women's emotional and social welfare. In Case 1, therefore, the usage of contraceptive methods is taken as the stratum membership variable, h = 1, 2 (Stratum 1 and 2), and the number of pregnancy losses (which ranges from 1 to 20) is taken as the study variable.

Table 3 shows the bootstrap study results for Case 1. There are five CART models with different tree parameters and one random forest (rf) model. The table reports the estimated hth stratum proportion, the mean of the respective stratum, the expected absolute prediction error (EAPE), the mean square prediction error (MSPE), and the relative efficiency (RE) of the estimators for different choices of sample size. Table 3 shows that the mean number of pregnancy losses is higher for Stratum 2, i.e., mothers who have ever used contraceptive methods, than for women who have not used any birth control. The mean pregnancy losses for mothers who ever used a contraceptive method lie in the range [1.5299, 1.5745] and for those who did not use a contraceptive method in [1.3895, 1.4301].

The average absolute deviation of the predictions from the true values of the parameter is measured by the expected absolute prediction error (EAPE) for the different models. No significant change is observed in EAPE values with a change in tree parameters; however, EAPE values show a slightly increasing trend with an increase in sample size, while the estimator remains approximately unbiased for larger sample sizes. Further, EAPE values are higher for Stratum 2 as compared to Stratum 1.

Similar to EAPE, there is no significant change in MSPE values with a change in tree parameters. However, the MSPE values corresponding to the random forest are significantly smaller than those of all the single-tree models. The relative efficiency (RE) values are greater than one for all combinations of tree parameters, showing the superiority of the tree-based total estimators over the corresponding estimators under the homogeneous model. The RE value of the random forest model is the highest among all competing models, due to the ensemble technique applied in random forest (rf) algorithms for building the classification and regression models on the observed data. The competing models used in this study are compared visually in Figs 2 and 3 for the Case 1 bootstrap study.

Fig 2. Comparison of relative efficiency of the mean estimator with different DGCART models for different sample sizes under Case 1.

https://doi.org/10.1371/journal.pone.0294736.g002

Fig 3. Comparison of relative efficiency of the mean estimator with different DGCART models for different sample sizes under Case 1.

https://doi.org/10.1371/journal.pone.0294736.g003

Fig 2 graphically compares the relative efficiency of the five single-tree CART models and the random forest model. With n = 50, all single CART models exhibit comparable relative efficiency. With n = 65, the relative efficiency differs across the single CART models, with Models 2 and 3 performing better. For n = 75, the relative efficiencies change again, with Model 4 having lower efficiency. The relative efficiency of the random forest is higher than that of the individual CART models for all sample sizes.

Fig 3 compares the relative efficiency of the CART models at various sample sizes for the number of pregnancy losses per woman who did not use any kind of contraception before or throughout her pregnancy. Comparing the single CART models at n = 75 with those at n = 50 and n = 65 suggests that larger sample sizes yield more accurate population estimates. The relative efficiency is about the same for all single CART models at n = 50, and likewise at n = 65. The random forest model shows superior relative efficiency compared to the single CART models because random forests aggregate many decision trees. That the relative efficiency is even higher for the random forest model shows that the proposed tree-based strategy is particularly successful when applied as an ensemble model.

Case 2

The duration of delivery is a unique experience; sometimes labor is over in a matter of hours. Delivery duration is the time of the procedure that results in the birth of the child [27]. As delivery duration is also an important variable that merits study, in Case 2 the usage of iron tablets during pregnancy is taken as the stratum membership variable and the duration of delivery as the study variable.

Table 4 shows the bootstrap study results for Case 2, i.e., with "delivery duration" as the study variable and "usage of iron tablets" as the stratum membership variable. There are five CART models with different tree parameters and one random forest (rf) model. The table reports the estimated stratum proportions for h = 1 and h = 2 and the mean of the respective stratum.

Table 4 shows that the mean estimated delivery time is almost equal in the two strata, i.e., Stratum 1 and Stratum 2. The relative efficiency of the random forest model is greater than that of all single-tree models in all results, because the random forest aggregates many trees (here 500) rather than relying on a single tree.

We assessed five classification and regression models and one random forest model in our study and computed the EAPE for each model. No significant change is observed in EAPE values with a change in tree parameters; however, EAPE values show a slightly increasing trend with an increase in sample size. Further, EAPE values are higher in the smaller stratum (h = 1) as compared to the larger one (h = 2).

Table 4 also provides the mean square prediction error for the five single classification and regression tree models and the random forest model. As with EAPE, there is no significant change in MSPE values with a change in tree parameters. However, the MSPE values corresponding to the random forest are significantly smaller than those of all single-tree models. The relative efficiency (RE) values are greater than one for all combinations of tree parameters, showing the superiority of the tree-based total estimators over the corresponding estimators that do not utilize any tree. The results given in Table 4 are visualized in Figs 4 and 5.

Fig 4. Comparison of relative efficiency of the mean estimator with different DGCART models for different sample sizes under Case 2.

https://doi.org/10.1371/journal.pone.0294736.g004

Fig 5. Comparison of relative efficiency of the mean estimator with different DGCART models for different sample sizes under Case 2.

https://doi.org/10.1371/journal.pone.0294736.g005

Fig 4 provides a graphical representation of the relative efficiency of the mean estimator when delivery duration is used as the study variable. For n = 20, 30, and 50, we have five single CART models and one random forest model. The relative efficiencies of all models exceed 1.20, with that of the random forest model exceeding 1.65. In comparison to the larger sample sizes, the relative efficiency of all single CART and random forest models is relatively low for n = 20.

Fig 5 illustrates the relative efficiency of the mean estimator when delivery time is used as the variable of interest. The random forest model is more efficient than all single CART models, with a relative efficiency of more than 1.40. It is evident that the relative efficiency of each single CART model varies with its hyper-parameter values.

As evidenced by relative efficiencies greater than 1 for all single-tree models, and higher still for the random forest model, the proposed tree-based method is more effective than the existing method: relative efficiency measures the improvement of the proposed method over the existing one, and a value larger than 1 indicates greater efficiency. From both the figures and the tables, we infer that the simultaneous application of classification and regression trees for the stratification of non-sampled units aids efficiency improvement when appropriate hyper-parameters are set for the training task. Ensembling different trees for these two tasks provides the best performance of the total estimator for the variable of interest in the different domains.

5. Conclusion

This study focused on classifying the non-sampled units into different strata using a classification tree algorithm and predicting the value of the study variable for the unobserved part of the population using a regression tree algorithm. Due to their ease of interpretation and visualization, tree-based algorithms are considered good alternatives to classical regression and classification models. Tree-based algorithms also handle prediction and classification problems when the parametric relationship between the study variable and the predictors is ambiguous. Due to these attractive features, the DGCART method is proposed for estimating stratum-specific parameters. With random forests, one can make predictions from different random samples of covariates rather than selecting the single best one, enhancing the precision of the proposed estimators. Bagging in random forests also provides a direct estimate of the prediction variance, which can be considered in future studies. Similar studies where stratum-specific estimates are needed can benefit from the current study's demonstration of how various input factors might be used to predict a target value and utilized in the estimation stage.

The DGCART algorithm is especially useful for obtaining estimates of different indicators in specific demographic, socio-economic, and geographic subpopulations in health-related surveys where the indicator of interest has a high proportion of missing observations, since the missing part of the actual sample can be treated as the non-sampled part.

Acknowledgments

The authors are grateful to the handling editor and the reviewers for their valuable comments, which led to significant improvement of the previous version of the manuscript.

References

  1. Neyman J. (1938). Contribution to the theory of sampling human populations. Journal of the American Statistical Association, 33(201), 101–116.
  2. Chang K.-C., Liu J.-F., & Han C.-P. (1998). Multiple inverse sampling in post-stratification. Journal of Statistical Planning and Inference, 69(2), 209–227.
  3. Breidt F. J., & Opsomer J. D. (2008). Endogenous post-stratification in surveys: Classifying with a sample-fitted model.
  4. Djerf K. (1997). Effects of post-stratification on the estimates of the Finnish Labour Force Survey. Journal of Official Statistics, 13, 29–40.
  5. Lennert-Cody C. (2001). Effects of sample size on bycatch estimation using systematic sampling and spatial post-stratification: summary of preliminary results. In IOTC Proceedings (Vol. 4, pp. 48–53).
  6. Godambe V. P. (1995). Estimation of parameters in survey sampling: Optimality. The Canadian Journal of Statistics/La Revue Canadienne de Statistique, 227–243.
  7. Onsongo W. M. (2018). Nonparametric Estimation of Finite Population Total (Doctoral dissertation, JKUAT-PAUSTI).
  8. Deville J. C., Särndal C. E., & Sautory O. (1993). Generalized raking procedures in survey sampling. Journal of the American Statistical Association, 88(423), 1013–1020.
  9. Zheng H., & Little R. J. (2003). Penalized spline model-based estimation of the finite population total from probability-proportional-to-size samples. Journal of Official Statistics, 19(2), 99.
  10. Mienye I. D., Wang Z., & Sun Y. (2019). Prediction performance of improved decision tree-based algorithms: A review.
  11. Imberg H., Yang X., Flannagan C., & Bärgman J. (2022). Active sampling: A machine-learning-assisted framework for finite population inference with optimal subsamples. arXiv preprint arXiv:2212.10024.
  12. Ahmed S., & Shabbir J. (2021). A novel basis function approach to finite population parameter estimation. Scientia Iranica.
  13. Breidt F. J., & Opsomer J. D. (2017). Model-assisted survey estimation with modern prediction techniques. Statistical Science, 32(2), 190–205. https://doi.org/10.1214/16-STS589
  14. Righi P., Bianchi G., Nurra A., & Rinaldi M. (2019). Integration of survey data and Big Data for finite population inference in official statistics: Statistical challenges and practical applications. Statistica & Applicazioni, 135–158.
  15. Kikechi C. B., Simwa R. O., & Pokhariyal G. P. (2017). On local linear regression estimation in sampling surveys.
  16. Ünal M., & Dağdeviren H. N. (2019). Geleneksel ve tamamlayıcı tıp yöntemleri [Traditional and complementary medicine methods]. Eurasian Journal of Family Medicine, 8(1), 1–9.
  17. Mienye I. D., Sun Y., & Wang Z. (2019). Prediction performance of improved decision tree-based algorithms: a review. Procedia Manufacturing, 35, 698–703.
  18. Fallah A., Kalbi S., & Shataee S. (2013). Forest stand types classification using tree-based algorithms and SPOT-HRG data. Forest, 1(3).
  19. Aljamaan H., & Alazba A. (2020, November). Software defect prediction using tree-based ensembles. In Proceedings of the 16th ACM International Conference on Predictive Models and Data Analytics in Software Engineering (pp. 1–10).
  20. Pujara P., & Chaudhari M. B. (2018). Phishing website detection using machine learning: A review. International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 3(7), 395–399.
  21. Kumar Das S., Kumar Mishra A., & Roy P. (2019). Automatic diabetes prediction using tree based ensemble learners. International Journal of Computational Intelligence & IoT, 2(2).
  22. Dagdoug M., Goga C., & Haziza D. (2023). Model-assisted estimation through random forests in finite population sampling. Journal of the American Statistical Association, 118(542), 1234–1251.
  23. Chambers R. L., & Clark R. (2012). An Introduction to Model-Based Survey Sampling with Applications. Oxford University Press.
  24. Maternal mortality 2019 [updated 19 September 2019; cited 2021-10-02]. Available from: https://www.who.int/news-room/fact-sheets/detail/maternal-mortality.
  25. Casterline J. B. (1989). Collecting data on pregnancy loss: a review of evidence from the World Fertility Survey. Studies in Family Planning, 20(2), 81–95. pmid:2655191
  26. Asim M., Karim S., Khwaja H., Hameed W., & Saleem S. (2022). The unspoken grief of multiple stillbirths in rural Pakistan: an interpretative phenomenological study. BMC Women's Health, 22(1), 45. pmid:35193576
  27. Macfarlane A. (1977). The Psychology of Childbirth (Vol. 16). Harvard University Press. National Institute of Population Studies (NIPS) [Pakistan] and ICF (2020). Pakistan Maternal Mortality Survey 2019.