SamplingStrata: An R Package for the Optimization of Stratified Sampling

When designing a sampling survey, constraints are usually set on the desired precision levels of one or more target estimates (the Y's). If a sampling frame is available that contains auxiliary information related to each unit (the X's), it is possible to adopt a stratified sample design. For any given stratification of the frame, in the multivariate case it is possible to solve the problem of the best allocation of units in strata by minimizing a cost function subject to precision constraints (or, conversely, by maximizing the precision of the estimates under a given budget). The problem is to determine the best stratification of the frame, i.e., the one that ensures the overall minimal cost of the sample necessary to satisfy precision constraints. The X's can be categorical or continuous; continuous ones can be transformed into categorical ones. The most detailed stratification is given by the Cartesian product of the X's (the atomic strata). One way to determine the best stratification is to explore exhaustively the set of all possible partitions derivable from the set of atomic strata, evaluating each one by calculating the corresponding cost in terms of the sample required to satisfy precision constraints. This is unaffordable in practical situations, where the dimension of the space of partitions can be very high. Another way is to explore the space of partitions with an algorithm that is particularly suitable in such situations: the genetic algorithm. The R package SamplingStrata, based on a genetic algorithm, makes it possible to determine the best stratification for a population frame, i.e., the one that ensures the minimum sample cost necessary to satisfy precision constraints, in the multivariate and multi-domain case.


Introduction
Let us suppose we need to design a sample survey, having available a complete frame containing information on the target population (identifiers plus auxiliary information). If our sample design is a stratified one, we need to choose how to form strata in the population in order to take maximum advantage of the available auxiliary information. In other words, we have to decide how to combine the values of the auxiliary variables (henceforth, the X variables) in order to determine a new categorical variable, called stratum. To do so, we have to take into consideration the target variables of our sample survey (henceforth, the Y variables): if, to form strata, we choose the X variables most correlated with the Y variables, the efficiency of the samples drawn from the resulting stratified frame may be greatly increased.
In order to handle all the auxiliary information in a homogeneous way, we have to reduce continuous data to categorical data (by calculating equal-frequency intervals, by using a k-means clustering technique, or by other suitable methods). Then, for every set of candidate auxiliary variables X, we have to decide (i) which variables to consider as active variables in strata determination, and (ii) for each active variable, which set of values (in general, which aggregation of atomic values) has to be considered. Each choice determines a particular stratification of the target population, i.e., a possible solution to the problem of best stratification. Here, by best stratification, we mean the stratification that ensures the minimum cost of the sample necessary to satisfy precision constraints set on the estimates of the target variables Y (constraints expressed as maximum expected coefficients of variation in different domains of interest). Therefore, the validity of a particular stratification can be measured by the associated minimum cost of a sample whose estimates are expected to satisfy given precision levels.
In general, the number of possible alternative stratifications for a given population frame may be very high, in some cases so high that it is not possible to enumerate them in order to find the best one. Instead, adopting the evolutionary approach, a genetic algorithm makes it possible to explore the range of possible solutions and to find a near-optimal solution after an affordable number of iterations. The implementation of the genetic algorithm in the package SamplingStrata (Barcaroli, Pagliuca, and Willighagen 2014) makes use of a modified version of the functions available in the genalg package (Willighagen 2014).
The remainder of the paper is organized as follows: Section 2 briefly describes the general approach followed for the implementation of the package (for a more detailed illustration of methodological aspects see Ballin and Barcaroli 2013); Section 3 illustrates how to employ the package in practical situations; Section 4 evaluates the performance of the genetic algorithm method by comparing it, in the univariate case, to the methods implemented in package stratification (Baillargeon and Rivest 2012a); Section 5 concludes the paper.
The total cost associated with a specific sample (which depends on the allocation of units in the strata and on the per-unit interviewing cost, which may vary from stratum to stratum) and the expected precision related to each target estimate can be associated to the first two components in an interchangeable way: it is possible to minimize the total cost under a set of precision constraints, or to maximize precision levels under given budget constraints. In both cases, optimization is performed on the basis of input parameters such as the variances of the target variables and the number of population units in each stratum. While in general it is not difficult to assign population units to the different strata, as this only depends on the auxiliary information in the frame (which is available by definition), it is much more complex to obtain the information necessary to estimate the variability of target variables in each stratum. In fact, this information is not available at unit level (otherwise we would not carry out a specific survey on these variables), so it is necessary to estimate their variability by harnessing different possible sources:
1. census data;
2. data from previous rounds of the same survey;
3. data on proxy variables.
In the first case, the shorter the time gap, the higher the reliability of the estimate. In the second, the sampling errors of the estimates should be taken into account together with the time gap. Finally, in the third case the correlation between target and proxy variables must also be considered when evaluating the quality of the estimates. An effort should be made to model the relationships between target variables and all the available information, including the auxiliary information in the sampling frame, in order to increase this quality. Henceforth we assume that estimates of acceptable reliability of the variability of target variables in strata are available. Should this assumption not hold, the method proposed here would not be applicable.
A first important distinction can be made between the case where the precision of only one target variable is taken into consideration in the objective function or in the constraints (univariate case) and the case where more than one is considered (multivariate case). A further complication may arise from the need to consider different domains to which estimates (and related precision levels) have to refer (multi-domain case).
A second, more important, distinction is related to the decision variables. Many optimization methods are based on decision variables that state how many population units have to be selected in each stratum: in other words, strata in the population are assumed as given, and the optimization consists in determining the best allocation of sampling units in population strata. Under this setting, the optimization problem can be solved using an application of the Cauchy-Schwarz inequality (Cochran 1977) or Lagrange multipliers (Varberg and Purcell 1997). Well-known solutions in the multivariate case are the ones given by Bethel (1989) and Chromy (1987).
However, the way the population is stratified is of the greatest importance with respect to the optimization of the sample design. The relationships between the survey target variables and the stratification variables are the basis of stratified sampling. In order to take maximum advantage of these relationships, choices regarding the way we define population strata should enter into the optimization process together with the allocation choices. Until recently, on the contrary, optimization of stratified sampling has been considered as a two-step process: first, a stratification is chosen, by exploiting all the auxiliary information available on sampling units, or only a subset, selected on the basis of known correlations between target and stratification variables; then, given the chosen stratification, the problem of allocation is solved (Dalenius and Hodges 1959). The Lavallée and Hidiroglou method for the stratification of a population which is skewed with respect to a unique stratification variable (Lavallée and Hidiroglou 1988) can be considered an exception, as it determines both strata boundaries and best allocation, but only in the univariate case. The method proposed by Keskinturk and Er (2007), which makes use of the genetic algorithm approach, suffers from the same limitation.
The approach implemented in the R (R Core Team 2014) package SamplingStrata, available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=SamplingStrata, permits a joint optimization of both population stratification and sampling units allocation, in the multivariate and multi-domain case. It is based on the following assumptions:
1. the optimal stratification of a population frame depends on the particular sample survey that has to be planned;
2. the optimality of the solution can be measured against its cost, expressed in terms of the number of units to be sampled (together with the per-unit interview cost) required to satisfy precision constraints set on estimates of totals or means of the target variables;
3. the multivariate and multi-domain case must be contemplated in order to ensure generality;
4. the availability of auxiliary information in the population frame makes it possible to define a space of alternative stratifications: this space should be rigorously generated and, in principle, the best solution could be found by exhaustively evaluating each stratification with regard to the cost of the associated sample;
5. as in practical situations it is not possible to enumerate the space of stratifications (because of its dimension), a heuristic is necessary in order to explore this space without (or with a negligible) loss of optimality: from this point of view genetic algorithms have proven to be particularly efficient.

Best allocation for a given stratification
In this section we briefly recall the approach proposed by Bethel (1989) to find the minimum cost and the best allocation of a sample in given strata, in the multivariate case. Given $H$ strata, let $N_h$ and $S^2_{h,g}$ (with $h = 1, \ldots, H$ and $g = 1, \ldots, G$) be, respectively, the population size and the variances of the $G$ different target variables Y in each stratum $h$. Assuming simple random sampling of $n_h$ units without replacement in each stratum, the variance of the Horvitz-Thompson estimator of the total ($\hat{Y}_g$) is given by
$$\mathrm{Var}(\hat{Y}_g) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S^2_{h,g}}{n_h}, \qquad g = 1, \ldots, G. \qquad (1)$$
Let us now consider the following cost function:
$$C(n_1, \ldots, n_H) = C_0 + \sum_{h=1}^{H} C_h n_h. \qquad (2)$$
In this function we can distinguish a fixed component ($C_0$, not dependent on the sample size and its allocation) and a variable one (the sum of the products between the cost of interviewing one unit in the stratum, $C_h$, and the allocation of units in that stratum, $n_h$). If we define upper limits $U_g$ on the expected sampling variances defined by Equation 1 by setting
$$\mathrm{Var}(\hat{Y}_g) \le U_g, \qquad g = 1, \ldots, G, \qquad (3)$$
then finding the best solution to the problem of sample allocation in a stratified design is equivalent to finding the vector of allocations $n_h$ that minimizes Equation 2 given Equation 3.
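As a numerical illustration of the variance of the estimated total and of the cost function described above, a minimal base R sketch follows. The function and argument names (var_total, sample_cost, N, S2, n, Ch, C0) are illustrative and are not part of SamplingStrata; the variance is the standard expression for stratified simple random sampling without replacement, and all input values are made up.

```r
# Variance of the estimated total under stratified simple random sampling
# without replacement. N, S2 and n hold one value per stratum.
var_total <- function(N, S2, n) {
  sum(N^2 * (1 - n / N) * S2 / n)
}

# Cost function: fixed component C0 plus per-unit cost times allocation.
sample_cost <- function(n, Ch, C0 = 0) {
  C0 + sum(Ch * n)
}

N  <- c(100, 200, 50)   # stratum population sizes (made up)
S2 <- c(4, 9, 25)       # stratum variances of one target variable (made up)
n  <- c(10, 20, 50)     # candidate allocation; the third stratum is take-all

v    <- var_total(N, S2, n)
cost <- sample_cost(n, Ch = c(1, 1, 2), C0 = 5)
```

Note that a take-all stratum (n equal to N) contributes nothing to the variance, which is why the third stratum affects the cost but not the precision.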
An algorithm that is proved to converge to the solution was provided by Bethel (1989).
The package SamplingStrata provides the function bethel, which determines the best allocation given two different inputs:
1. information on the distribution of the Y's in the strata;
2. precision constraints on the Y's.
In this particular implementation, precision constraints are expressed in terms of the maximum expected coefficient of variation (CV) for each Y:
$$CV(\hat{Y}_g) = \frac{\sqrt{\mathrm{Var}(\hat{Y}_g)}}{\mathrm{E}(\hat{Y}_g)} \le \delta_g, \qquad g = 1, \ldots, G, \qquad (4)$$
where $\delta_g$ denotes the maximum CV accepted for the $g$-th target variable. By so doing, we are able to remove the dependence on the scale (or range) of the values associated with the different Y's.
So the problem becomes:
$$\min \; C = C_0 + \sum_{h=1}^{H} C_h n_h \quad \text{subject to} \quad CV(\hat{Y}_g) \le \delta_g, \qquad g = 1, \ldots, G. \qquad (5)$$
The Bethel algorithm implemented in the bethel function can be used not only for its original purpose (to determine the best allocation on the basis of the precision constraints), but also to evaluate any given stratification of the population frame. In other words, given two different stratifications adopted for the same population frame, we should prefer the one for which the solution identified by the Bethel algorithm has the smaller cost.
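To make the CV constraint concrete, the following base R sketch computes the expected CV of an estimated total for a given allocation and checks it against a 5% limit. It is only an illustration under stated assumptions: the function name expected_cv and all input values are hypothetical, the CV is taken as the standard error of the estimated total divided by the population total, and the sketch does not reproduce the Bethel algorithm itself (it evaluates one allocation rather than searching for the cheapest one).

```r
# Expected CV of the estimated total of one target variable.
# N: stratum sizes, M: stratum means, S: stratum standard deviations,
# n: allocation of sample units to strata.
expected_cv <- function(N, M, S, n) {
  v <- sum(N^2 * (1 - n / N) * S^2 / n)  # variance of the estimated total
  sqrt(v) / sum(N * M)                   # relative to the population total
}

N <- c(100, 200, 50)   # made-up stratum sizes
M <- c(10, 20, 40)     # made-up stratum means
S <- c(2, 3, 5)        # made-up stratum standard deviations
n <- c(10, 20, 50)     # allocation to be evaluated

cv_y <- expected_cv(N, M, S, n)
meets_constraint <- cv_y <= 0.05
```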

Space of stratifications for a given frame
Let us consider a sampling frame, that is, a set of N records containing information related to N units belonging to the reference population. The available information can be grouped into two distinct sets of variables: the variables allowing the identification of units, in order to be able to contact them while carrying out the survey; and the variables useful to optimize the sample design (the auxiliary variables X).
In our setting, we assume that:
1. a set of $M$ auxiliary variables $X_m$ ($m = 1, \ldots, M$) is available;
2. only categorical (nominal or ordinal) auxiliary variables are considered: when continuous variables are present in the set, they have to be converted into categorical ones by means of a transformation algorithm (for example, by applying a k-means clustering algorithm, see Hartigan and Wong 1979);
3. with each (categorical) auxiliary variable we can associate a domain set given by the vector $d_m = \{x_1, \ldots, x_{k_m}\}$, where an integer value is assigned to each value in the domain.
Under these assumptions, the most detailed stratification for the frame is given by considering the Cartesian product
$$CP = d_1 \times d_2 \times \cdots \times d_M.$$
The maximum number of strata is equal to $K = \prod_{m=1}^{M} k_m - I^*$, where $I^*$ is the number of non-valid or missing combinations in the frame.
We call atomic strata the result of the Cartesian product, i.e., the strata obtained by cross-classifying the units using all the values of all the auxiliary variables, and indicate the corresponding set as $L = \{l_1, l_2, \ldots, l_K\}$.
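The construction of atomic strata can be sketched in base R on a toy frame. The data, the "*" separator used to build the stratum label, and the object names (n_cartesian, K, I_star) are assumptions made for illustration only.

```r
# Toy frame with two categorical auxiliary variables (made-up values).
frame <- data.frame(
  id = 1:8,
  X1 = c(1, 1, 2, 2, 3, 3, 1, 2),
  X2 = c(1, 2, 1, 1, 2, 2, 1, 1)
)

# Label of the atomic stratum: concatenation of the values of the X's.
frame$stratum <- paste(frame$X1, frame$X2, sep = "*")

# Size of the full Cartesian product versus combinations actually present.
k <- c(length(unique(frame$X1)), length(unique(frame$X2)))
n_cartesian <- prod(k)                  # product of the k_m
K <- length(unique(frame$stratum))      # number of atomic strata
I_star <- n_cartesian - K               # combinations missing in the frame

table(frame$stratum)
```

In this toy frame the combinations (2, 2) and (3, 1) never occur, so the number of atomic strata is smaller than the size of the Cartesian product.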
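Starting from this set of atomic strata, it is possible to derive the set of all possible partitions $P_1, P_2, \ldots, P_B$, where each partition is defined as a collection of sets $T_1, T_2, \ldots, T_q$ ($1 \le q \le K$) where each $T_i$ is a subset of $L$.
According to the theory of partitions (Hankin and West 2007) the following conditions must hold:
$$T_i \ne \emptyset \;\; (i = 1, \ldots, q), \qquad T_i \cap T_j = \emptyset \;\; (i \ne j), \qquad \bigcup_{i=1}^{q} T_i = L.$$
Each partition is equivalent to a given stratification of the frame. The set of all possible partitions can be considered as the space of stratifications.
For example, let us consider a set of atomic strata $L = \{l_1, l_2, l_3\}$. The different partitions that can be generated by this set are:
$$\{\{l_1\}, \{l_2\}, \{l_3\}\}, \quad \{\{l_1, l_2\}, \{l_3\}\}, \quad \{\{l_1, l_3\}, \{l_2\}\}, \quad \{\{l_1\}, \{l_2, l_3\}\}, \quad \{\{l_1, l_2, l_3\}\}.$$
The cardinality of the set of all the possible partitions is given by the Bell number:
$$B_K = \frac{1}{e} \sum_{i=0}^{\infty} \frac{i^K}{i!}. \qquad (7)$$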

Choosing the best stratification by applying the genetic algorithm
In principle, having fixed precision constraints on a set of target estimates for a given sample survey, it is possible to choose the best stratification for a population frame where a set of auxiliary variables is available, by executing the following steps:
1. determine the set of atomic strata by cross-classifying sampling units using all the values of all auxiliary variables;
2. calculate distributional parameters (mean and variance) of target variables in atomic strata, if information related to the Y's is available for each unit in the frame (or, if not, by using proxy information from other sources);
3. solve the allocation problem for the atomic strata and associate the cost of the solution found;
4. generate all possible partitions from the set of atomic strata (to generate all the possible partitions it is possible to use the package partitions; Hankin 2013);
5. for each generated partition: calculate distributional parameters (mean and variance) of target variables in current strata by aggregating the corresponding information available in atomic strata; solve the allocation problem for the current partition and associate the cost of the solution;
6. choose the best stratification as the one given by the partition with the minimal associated cost.
Unfortunately, this procedure, which is based on an exhaustive enumeration of all possible partitions, is in most cases not feasible, as the number of partitions to be evaluated is too high. In fact, considering the formula for the calculation of the Bell number reported in Equation 7, this number grows very rapidly with the dimension of the set of atomic elements (for example, $B_3 = 5$, $B_4 = 15$, $B_{10} = 115{,}975$ and $B_{100} \approx 4.76 \times 10^{115}$).
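The growth of the number of partitions can be checked with a few lines of base R. The sketch below is an illustration only: the function name bell is hypothetical, and it relies on the standard recurrence $B_{n+1} = \sum_{k=0}^{n} \binom{n}{k} B_k$ with $B_0 = 1$, which is an assumption of this sketch and not necessarily the form of Equation 7. Values are stored as doubles, so very large Bell numbers are only approximate.

```r
# Bell numbers via the recurrence B_{n+1} = sum_k choose(n, k) * B_k, B_0 = 1.
bell <- function(n) {
  B <- numeric(n + 1)   # B[i + 1] stores B_i
  B[1] <- 1
  if (n >= 1) {
    for (m in 0:(n - 1)) {
      B[m + 2] <- sum(choose(m, 0:m) * B[1:(m + 1)])
    }
  }
  B[n + 1]
}

bell(3)    # number of partitions of 3 atomic strata
bell(4)
bell(10)
```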
The function optimizeStrata available in package SamplingStrata makes it possible to explore the space of stratifications without having to enumerate it exhaustively, by using a search technique known as the genetic algorithm.
Genetic algorithms belong to the class of evolutionary algorithms that make use of techniques based on concepts derived from biology, such as inheritance, mutation, crossover, fitness and selection (DeJong 2006).
In order to apply the genetic algorithm to the problem of finding the best stratification, the following setting has been adopted:
1. a given stratification is considered as an individual in a population (or generation of individuals);
2. an individual is characterized by a genome that is optimized in the course of the evolution;
3. the genome is represented by a vector whose dimension is given by the number of atomic strata (K): with each position in this vector an atomic stratum is associated;
4. to each element in the vector an integer value lying in the interval [1, K] is assigned randomly: atomic strata that share the same integer value collapse into an aggregate stratum;
5. the fitness of each individual is evaluated by solving the system reported in Equation 5 (using the Bethel algorithm);
6. in the passage from one generation to the next, the fittest individuals are privileged: a percentage of those with highest fitness are directly moved to the next generation, the others are randomly selected with probability proportional to their fitness, in order to let them procreate children;
7. each child is procreated by applying crossover to its parents (a swap of the genes contained in the two genomes), and applying mutation to the resulting genome.
At the end of the evolution (the chain of generations), the individual with the absolute best fitness is chosen: the genome of this individual represents a stratification in which all or some of the atomic strata have been aggregated.
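The way a genome maps atomic strata to aggregated strata can be sketched in base R as follows. This is a simplified illustration, not the package's code: the function name aggregate_strata and the data are hypothetical, only one target variable is handled, and the within-stratum variances are assumed to be population variances (denominator N), so that they can be pooled exactly.

```r
# Atomic strata summaries for one target variable (made-up values).
# S2 is assumed to be the population variance (denominator N).
atomic <- data.frame(
  N  = c(10, 20, 30, 40),
  M  = c(5, 6, 20, 22),
  S2 = c(1, 2, 4, 3)
)

# A genome: one integer per atomic stratum; equal values collapse together.
genome <- c(1, 1, 2, 2)

aggregate_strata <- function(atomic, genome) {
  groups <- split(atomic, genome)
  do.call(rbind, lapply(groups, function(g) {
    N <- sum(g$N)
    M <- sum(g$N * g$M) / N                    # size-weighted mean
    S2 <- sum(g$N * (g$S2 + g$M^2)) / N - M^2  # pooled population variance
    data.frame(N = N, M = M, S2 = S2)
  }))
}

agg <- aggregate_strata(atomic, genome)
agg
```

Each candidate genome yields a table of this kind, which can then be passed to an allocation routine to obtain the cost that serves as fitness.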

A general procedure for the use of package SamplingStrata
The optimization of the sampling design starts by considering the available population frame, defining the target estimates of the survey and establishing precision constraints on them.
It is then possible to determine the best stratification and the optimal allocation. Finally, the sample can be drawn from the frame stratified according to the optimal stratification. Formally, these are the required steps:
1. analysis of the frame data: identification of available auxiliary information;
2. manipulation of auxiliary information: in case auxiliary variables are of continuous type, they have to be transformed into categorical variables;
3. construction of atomic strata: on the basis of the categorical auxiliary variables available in the sampling frame, the set of atomic strata can be obtained by cross-classifying the units using all the values of all the auxiliary variables;
4. characterization of each atomic stratum with the information related to the target variables (mean and standard deviation for each Y, estimated by using available information: census, previous surveys or proxy variables data);
5. choice of the precision constraints for each target estimate, possibly differentiated by domain;
6. optimization of stratification and determination of required sample size and allocation;
7. analysis of the resulting optimized strata;
8. association of new labels to sampling frame units, each of them indicating the new strata resulting from the optimal aggregation of the atomic strata;
9. selection of units from the sampling frame with a stratified random sample selection scheme;
10. analysis of the solution found.
In the following, we will illustrate each step starting from a real sampling frame, the data frame swissmunicipalities that is available in the R package sampling (Tillé and Matei 2012).

Analysis of the frame data and manipulation of auxiliary information
As a first step, we have to define a frame data frame containing the following information:
1. a unique identifier of the unit (no restriction on the name, for instance id);
2. the (optional) identifier of a pre-defined stratum to which the unit belongs;
3. the values of M auxiliary variables (named from X1 to XM);
4. the (optional) values of G target variables (named from Y1 to YG);
5. the values of the domain of interest for which we want to produce estimates (named domainvalue).
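The layout just described can be illustrated with a small base R sketch on simulated data. Everything in it is hypothetical (the raw variables area, type, pop and region, and the choice of three equal-frequency classes); it only shows one possible way to turn a continuous auxiliary variable into a categorical one and to arrange the columns with the names listed above.

```r
set.seed(1)
# Simulated raw data: one continuous and one nominal auxiliary variable,
# one target variable known from a past source, and a domain code.
raw <- data.frame(
  name   = paste0("unit", 1:12),
  area   = runif(12, 1, 100),
  type   = rep(c("a", "b", "c"), 4),
  pop    = rpois(12, 500),
  region = rep(1:2, each = 6)
)

# Equal-frequency binning of the continuous variable into 3 classes.
breaks <- quantile(raw$area, probs = seq(0, 1, length.out = 4))

frame <- data.frame(
  id = raw$name,
  X1 = as.integer(cut(raw$area, breaks = breaks, include.lowest = TRUE)),
  X2 = as.integer(factor(raw$type)),   # integer codes for the categories
  Y1 = raw$pop,
  domainvalue = raw$region
)
head(frame)
```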
By executing the following statement in the R environment:
R> data("swissmunicipalities", package = "sampling")
we get the swissmunicipalities data frame, which contains 2,896 observations (each observation refers to a Swiss municipality in 2003). Among others, we can find the following variables:
First, we define the identifier of the frame:
Let us suppose we plan a survey whose target estimates are the totals of the population by age class in each Swiss region. In this case, the Y variables will be:
Y1: number of men and women aged between 0 and 19,
Y2: number of men and women aged between 20 and 39,
Y3: number of men and women aged between 40 and 64,
Y4: number of men and women aged 65 and over.
Consequently, the following statements are executed:
We suppose that the values of these variables in the frame have been obtained from past surveys (for instance, from a census), or from administrative data: it should always be taken into account that they could be out of date, or not completely reliable.
Finally, we have to set the values of the domainvalue variable, which is mandatory. As we want to obtain estimates for each one of the seven regions, we set:
R> frame$domainvalue <- swissmunicipalities$REG
R> frame <- as.data.frame(frame)
Now, the frame data frame looks like:
This is the format required by the package.

Construction of atomic strata
The strata data frame reports information regarding each stratum in the population. There is one row for each stratum. The total number of strata is given by the number of different combinations of X values in the frame. For each stratum, the following information is required:
1. the identifier of the stratum (named STRATO), the concatenation of the values of the X variables;
2. the values of the M auxiliary variables (named X1 to XM) corresponding to those in the frame;
3. the total number of units in the population belonging to the stratum (named N);
4. a flag (named CENS) indicating if the stratum is to be censused (= 1) or sampled (= 0);
5. a variable indicating the cost of interviewing a single unit in the stratum (named COST);
6. for each target variable Y, its estimated mean and standard deviation (named respectively Mi and Si);
7. the value of the domain of interest to which the stratum belongs (named DOM1 and corresponding to the variable domainvalue in the frame data frame).
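To show what such a table looks like, the base R sketch below builds one by hand from a toy frame. It is not the package's implementation: the data are made up, the "*" separator in STRATO is an assumption, and the standard deviation is computed with denominator N (a population standard deviation), which is also an assumption of this sketch.

```r
# Toy frame in the layout described for the frame data frame (made-up values).
frame <- data.frame(
  id = 1:8,
  X1 = c(1, 1, 1, 2, 2, 2, 2, 1),
  X2 = c(1, 1, 2, 1, 1, 2, 2, 2),
  Y1 = c(10, 12, 30, 20, 24, 50, 54, 34),
  domainvalue = 1
)

# Population standard deviation (denominator N), assumed for this sketch.
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# One key per unit; tapply orders its results by the sorted unique keys.
key <- paste(frame$X1, frame$X2, sep = "*")
first <- function(x) x[1]

strata <- data.frame(
  STRATO = sort(unique(key)),
  N      = as.vector(tapply(frame$Y1, key, length)),
  M1     = as.vector(tapply(frame$Y1, key, mean)),
  S1     = as.vector(tapply(frame$Y1, key, pop_sd)),
  COST   = 1,   # same interviewing cost in every stratum
  CENS   = 0,   # no take-all strata
  DOM1   = as.vector(tapply(frame$domainvalue, key, first)),
  X1     = as.vector(tapply(frame$X1, key, first)),
  X2     = as.vector(tapply(frame$X2, key, first))
)
strata
```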
If the values of the target Y variables (from a census, or from administrative data) are also present in the frame data frame, it is possible to generate the strata data frame automatically by invoking the buildStrataDF function. Let us consider again the frame data frame that we have built in the previous steps. We can apply the buildStrataDF function to it:
R> strata <- buildStrataDF(frame)
It is worth noting that the total number of different atomic strata is lower than the expected dimension of the Cartesian product of the X's (which is 4,374): this is due to the fact that not all combinations of the value of the auxiliary variables are present in the sampling frame.
Variables COST and CENS are initialized to 1 and 0, respectively, for all strata. It is possible to give them different values:
1. for variable COST, it is possible to differentiate the cost of interviewing per unit by assigning real values;
2. for variable CENS, it is possible to set it equal to 1 for all strata that are of the 'take-all' type (i.e., all units in those strata must be selected).
On the contrary, if there is no information in the frame regarding the target variables, it is necessary to build the strata data frame starting from other sources, for instance a previous round of the same survey, or from other surveys.

Choice of the precision constraints for each target estimate
The cv data frame contains the precision constraints that are set on target estimates. This means defining a maximum coefficient of variation for each variable and for each domain value. Each row of this data frame is related to precision constraints in a particular domain of interest, identified by the DOM1 value. In the case of the Swiss municipalities, we have chosen to define the following constraints:
In this way, we have set a maximum of 5% for the coefficients of variation expected for variables Y1, Y2, Y3 and Y4, in each of the 7 different domains (Swiss regions) in domain level DOM1.
Of course, we could differentiate the precision constraints region by region. It is important to underline that the values of domainvalue are the same as those in the frame data frame, and correspond to the values of variable DOM1 in the strata data frame.
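A data frame expressing the constraints just described (a 5% maximum CV for four target variables in seven domains) could be built as in the following sketch. The column names DOM and CV1 to CV4 do not appear in the text above and are assumptions here, chosen so that the first five columns hold the domain level and the four CVs, with domainvalue as the last column.

```r
# Precision constraints: one row per domain value, one CV column per Y.
# Column names DOM and CV1..CV4 are assumed for illustration.
cv <- data.frame(
  DOM = rep("DOM1", 7),
  CV1 = rep(0.05, 7),
  CV2 = rep(0.05, 7),
  CV3 = rep(0.05, 7),
  CV4 = rep(0.05, 7),
  domainvalue = 1:7
)
cv
```

Differentiating constraints by region would then amount to changing individual cells, for example cv$CV1[cv$domainvalue == 3] <- 0.10.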
If we want to determine the total size of the sample required to satisfy these precision constraints, considering the current stratification of the frame (the 641 atomic strata), we can do this by simply using the function bethel (it is worth noting that the format of the constraints data frame for the bethel function is different from the one accepted by the optimizeStrata function, as in bethel it is not possible to differentiate precision levels in the various subdomains):
R> errors <- cv[1, 1:5]
R> allocation <- bethel(strata, errors)
R> length(allocation)
[1] 641
R> sum(allocation)
[1] 893
This is the number of units that must be sampled when the frame stratification is the most detailed one. In general, after the optimization, this number is greatly reduced.

Optimization of frame stratification
Once the cv and the strata data frames have been prepared, it is possible to apply the function that optimizes the stratification of the frame, that is optimizeStrata. This function operates on all subdomains, identifying the best solution for each one of them. Among the parameters to be passed to optimizeStrata, the most important are:
1. cv: the (mandatory) data frame containing the precision levels expressed in terms of maximum acceptable coefficients of variation that refer to the estimates on the target variables Y of the survey;
2. strata: the (mandatory) data frame containing the information related to atomic strata;
3. initialStrata: the initial upper limit on the number of strata for each solution. Default value is nrow(strata), i.e., the number of atomic strata;
4. minnumstr: the minimum number of units that must be allocated in each stratum. Default is 2, that is the minimum value necessary to calculate sampling variance;
5. iter: the number of iterations (= generations) to be performed by the algorithm. Default is 20;
6. pops: the dimension of each generation in terms of individuals. Default is 50;
7. mut_chance (mutation chance): for each new individual, the probability that the value of a given chromosome (i.e., one element in the solution vector) is changed. Default is 0.05;
8. elitism_rate: this parameter indicates the rate of fittest solutions that must be transferred from one generation to another. Default is 0.2.
In the case of the Swiss municipalities, optimizeStrata is performed with the following values for the parameters:
2. solution$aggr_strata: the data frame containing information on the optimal aggregated strata.
As we have set the cost of interviewing a unit in each atomic stratum to 1, the cost of the best solution is given by the total size of the sample required to satisfy precision constraints.
In our case:

R> sum(ceiling(solution$aggr_strata$SOLUZ))
[1] 365
It can be seen that there has been a noticeable reduction in the size of the sample (365 units against 893) compared to the solution found in the case of atomic strata.
The execution of optimizeStrata implies an independent optimization in each one of the 7 different domains (regions): the optimization run for region 3 is reported in Figure 1.

Analysis of results
A given solution represents an aggregation of the atomic strata. In order to analyze how atomic strata have been aggregated, it is possible to apply the function updateStrata, which assigns the labels of the new strata to the initial ones in the data frame strata, and produces:
1. a new file named 'newstrata.txt' containing all the information in the strata data frame
1. to update the frame units with new stratum labels (combination of the new values of the auxiliary variables X);
2. to select the sample from the frame stratified according to the solution found.
To do the first, we execute the following command:
R> framenew <- updateFrame(frame, newstrata, writeFiles = TRUE)
The function updateFrame receives, as arguments, the data frame in which the frame information is saved, and the data frame produced by the execution of the updateStrata function. The execution of this function produces a data frame (framenew), and also a file (named 'framenew.txt') containing, for each unit, the label indicating to which aggregated stratum the unit belongs. The allocation of units is contained in the SOLUZ variable in the data frame solution$aggr_strata. It is now possible to select the sample from this new version of the frame:
R> sample <- selectSample(framenew, solution$aggr_strata, writeFiles = TRUE)
*** Sample has been drawn successfully ***
365 units have been selected from 183 strata
==> There have been 33 take-all strata
from which have been selected 43 units
The function selectSample produces two datasets:
1. 'sample.csv', containing the units of the frame that have been selected, together with the weights that have been calculated for each one of them;
2. 'sample.chk.csv', containing information on the selection: for each stratum, the number of units in the population, the planned sample, the number of selected units, and the sum of their weights (which must equal the number of units in the population).
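The selection scheme and the weight check described above can be mimicked in base R on a toy frame. This sketch does not call selectSample; the data, the column names LABEL and WEIGHT, and the allocation are hypothetical, and the weight is assumed to be the stratum population size divided by the stratum sample size.

```r
set.seed(42)
# Toy frame with one stratum label per unit (made-up data).
framenew <- data.frame(
  id = 1:20,
  LABEL = rep(c("A", "B", "C"), c(10, 7, 3)),
  stringsAsFactors = FALSE
)
# Planned allocation per stratum; stratum C is a take-all stratum.
alloc <- c(A = 4, B = 3, C = 3)

# Simple random sampling without replacement within each stratum.
pick <- unlist(lapply(names(alloc), function(s) {
  ids <- framenew$id[framenew$LABEL == s]
  ids[sample.int(length(ids), alloc[[s]])]
}))
smp <- framenew[framenew$id %in% pick, ]

# Weight = stratum population size / stratum sample size.
N_h <- table(framenew$LABEL)
smp$WEIGHT <- as.numeric(N_h[smp$LABEL]) / unname(alloc[smp$LABEL])

# Check: within each stratum the weights sum to the population size.
check <- tapply(smp$WEIGHT, smp$LABEL, sum)
check
```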

Evaluation of the found solution
In order to be confident about the quality of the solution found, the function evalSolution makes it possible to run a simulation based on the selection of a desired number of samples from the frame to which the stratification identified as the best has been applied.
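The idea behind such a simulation can be sketched in base R: draw many stratified samples from a frame with known Y values, compute the estimate of the total for each, and take the ratio between the standard deviation and the mean of the estimates as the simulated CV. The sketch below does not call evalSolution; the population, the allocation and all object names are made up for illustration.

```r
set.seed(123)
# Toy stratified population with one target variable (simulated values).
pop <- data.frame(
  stratum = rep(1:3, c(200, 100, 50)),
  y = c(rnorm(200, 10, 2), rnorm(100, 50, 5), rnorm(50, 200, 20))
)
alloc <- c(20, 15, 10)                   # units to select in strata 1, 2, 3
N_h <- as.vector(table(pop$stratum))     # stratum population sizes

# One replicate: stratified simple random sample, then the expansion
# estimate of the total (stratum size times stratum sample mean).
one_estimate <- function() {
  sum(sapply(1:3, function(h) {
    y_h <- pop$y[pop$stratum == h]
    N_h[h] * mean(sample(y_h, alloc[h]))
  }))
}

estimates <- replicate(1000, one_estimate())
cv_sim   <- sd(estimates) / mean(estimates)                 # simulated CV
rel_bias <- (mean(estimates) - sum(pop$y)) / sum(pop$y)     # relative bias
```

Comparing cv_sim with the precision constraint gives an empirical check of whether the chosen allocation delivers the planned precision.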
The user can invoke this function also indicating the number of samples to be drawn:
R> evalSolution(framenew, solution$aggr_strata, nsampl = 1000,
+ writeFiles = TRUE)
For each drawn sample, the estimates related to the Y's are calculated. Their means and standard deviations are also computed, in order to produce the CV related to each variable in every domain. These CVs are written to an external CSV file:
stratification, together with the genetic algorithm in package SamplingStrata, and verify the performance of each of them. The goal is to minimize the total sample size n under a given constraint on the CV of the unique target variable. For instance, considering the dataset USbanks,
R> library("stratification")
R> data("USbanks", package = "stratification")
R> LHkozak <- strata.LH(x = USbanks, CV = 0.01, Ls = 5,
+ alloc = c(0.5, 0, 0.5), takeall = 0, algo = "Kozak")
R> LHkozak
The application of the LH method (making use of Kozak's algorithm) yields a total sample size of 91. It is the best solution compared to the others, obtained by applying the geometric and the cumulative root frequency methods. The results obtained by the different methods are reported in Table 1.
We now apply the genetic algorithm to the same problem:
be underlined that the GA is characterized by a general applicability to the multivariate and multi-domain cases, while all the other methods are strictly limited to the univariate single-domain case.

Conclusions
The approach implemented in the R package SamplingStrata makes it possible to determine the best stratification of a population frame, i.e., the one that ensures the minimization of the sample cost. When the data collection cost is the same in each stratum, the total cost is directly proportional to the number of units in the sample, and in this case what is minimized is the sample size. Its application is convenient whenever the following conditions occur:
1. a number of different auxiliary variables X are available in the population frame, so that different alternative solutions can be defined;
2. the number of different domains is not too high, and for each domain the first condition above is satisfied;
3. information directly or indirectly related to the target variables Y is available for each unit in the population frame.
For instance, the above conditions hold in the case of agricultural surveys in the Italian National Institute of Statistics.In this statistical domain, the sampling frame contains the Decennial Agricultural Census data, and in many cases the survey target variables are a subset of those contained in the frame.In other cases, we may have a yearly survey on a census basis, and a monthly survey is carried out on a subset of the units surveyed yearly.
In these cases the Y variables are the same, and the method implemented in the package works ideally.If the Y variables are not directly available, their values in the frame could be estimated by means of a predictive approach, using previous rounds of the same survey to estimate models linking target and auxiliary variables.
In any case, strong assumptions based on (explicit or implicit) models are made, and this should be taken into account when designing the sample.
number of men and women aged between 0 and 19,
Pop2040: number of men and women aged between 20 and 39,
Pop4065: number of men and women aged between 40 and 64,
Pop65P: number of men and women aged 65 and over,
POPTOT: total population.

Figure 1: This graph illustrates the convergence of the solution in the course of the iterations. Along the x-axis the executed iterations are reported, from 1 to the maximum, while on the y-axis the cost of the sample required to satisfy precision constraints is reported. The upper (red) line represents the average sample size in each iteration, while the lower (black) line indicates the best solution found up to the ith iteration.

Figure 3: Strata resulting from the execution of the genetic algorithm.

Table 1: Sample sizes obtained by different methods applied to the USbanks dataset.

Table 3: Results of the application of all methods to the four datasets.