A combinatorial optimization approach to the selection of statistical units

In the case of some large statistical surveys, the set of units that will constitute the scope of the survey must be selected. 
We focus on the real case of a Census of Agriculture, where the units are farms. 
Surveying each unit has a cost and brings a different portion of the whole information. 
In this case, one wants to determine a subset of units producing the minimum total cost for being surveyed and representing 
at least a certain portion of the total information. 
Uncertainty aspects also occur, because the portion of information corresponding to each unit is not perfectly known 
before surveying it. 
The proposed approach is based on combinatorial optimization, 
and the arising decision problems are modeled as multidimensional binary knapsack problems. 
Experimental results show the effectiveness of the proposed approach.

1. Introduction. The scope of a statistical survey is the set of statistical units that should be surveyed. In the case of many large surveys, the scope cannot be the list of all possible units, because otherwise the cost or the complexity of the survey would be prohibitive. On the contrary, the scope should be selected, also on the basis of economical consideration. A typical example of this situation is the case of a Census of Agriculture, where, differently from a more traditional censuary vision, one needs to exclude from the survey the farms that are too small or otherwise irrelevant to the survey itself. A main problem, in similar cases, is establishing criteria or tracing somehow boundaries for dividing what should be included in the survey from what should be excluded from it before surveying the units.
From a conceptual point of view, there is a very large set of statistical units (e.g. farms, companies, etc.) that could be surveyed. Surveying each of them has a cost and represents a different portion of the whole statistical information under investigation (e.g., the state of the agriculture, the industrial production, etc.). Also, some coverage levels on that whole statistical information are assigned. Hence, one would like to choose only a subset of the whole set of units. This subset should have the minimum total cost for being surveyed but should represent a portion of the whole information large enough for respecting the above coverage levels.
An important additional difficulty is that the portion of information carried by each unit is not perfectly known before surveying it. This often happens because the units may have been last surveyed only during a previous Census, typically held several years before, and their characteristics may have changed in the meanwhile. We therefore need to establish reliable criteria for deciding whether to include or not a unit in the survey on the basis of the (possibly outdated) available data describing the unit.
This problem is often called Scope Selection, or statistical Universe Selection, see e.g. [9,10], and evidently contains a Combinatorial Optimization structure, with the optimal solution being one of the feasible subsets of the ground set of all the units. Defining a scope has connections with the general statistical task of Population Definition, see e.g. [13,22]. The problem of the selection of statistical units is also studied in [11], where the authors consider the measurement of the lowest night temperature and select with an efficient heuristic the subset of units that allows the best linear prediction of the temperature in the excluded units. That procedure can also be used for other types of measures, but there are a number of structural differences between the problem in [11] and our case. Indeed, in that work, strong correlation hypothesis should hold on the values of the measure under prediction (a covariance matrix constant in time should be obtainable), and the decision of including or excluding each unit is based on one single type of measures. On the contrary, in our case, each statistical unit is described by a set of different values that are generally not correlated to those of other units, and the decision of including or excluding each unit must consider all the different values simultaneously. On the other hand, other optimization models arising from the treatment of agricultural data are described in [3,4,19], or in [7,8] for population data. The problem of selecting units from a list can also be viewed as a particular case of the very important problem of quota sampling, for which several approaches and techniques of purposive sampling have been proposed, see also [16,18,23]. After the selection of a set of units, if uncertainty occurs in the data, one should also evaluate the risk of undercoverage [2]. While similar problems of statistical units selection have been solved in practice by means of a variety of ad hoc techniques, whose features typically depend on the specific application but which in any case heavily rely on the contribution of experts of the field, we propose a more general approach based on the use of a binary linear model solved by means of Combinatorial Optimization techniques. This innovative approach to the problem overcomes the particularity of ad hoc approaches, and moreover allows to take advantage of effective algorithms already developed in that field.
More precisely, by using binary variables associated with the above units, the described selection problem is here modeled as a multidimensional binary knapsack problem (see e.g. [21,25]). Since those models may reach in many cases very large dimensions, a Branch&Cut approach using a separation procedure based on covers [1,15,24] has been used. A solution to the above knapsack model will be referred to as an Optimal Selection. However, due to the above described uncertainty aspects, not just one but a sequence of Optimal Selection problems must be solved for selecting the Scope in practice. Indeed, in order to develop inclusion criteria based on thresholds, we need to evaluate some safety margins with respect to the risk of undercoverage for different inclusion thresholds. The procedure has been implemented in c++ and tested in cooperation with the Italian National Statistic Office (Istat) on real data from the Italian Census of Agriculture 2010. Results are very encouraging both from the computational and from the statistical point of view. The main contribution of this work is therefore an innovative and effective approach based on Combinatorial Optimization for solving a challenging large-sized and economically important real-world problem.
The work is organized as follows. Section 2 describes the basic model proposed for the selection of an optimal subset of statistical units and techniques to improve this formulation and solve the overall model by means of a Branch&Cut approach. Section 3 explains how the solution of the Optimal Selection problem under different conditions leads to the determination of the inclusion criteria based on thresholds. Finally, in Section 4, we provide extensive results on real-world data from the Italian Census of Agriculture.
2. Solving the Optimal Selection Problem. The model proposed for the above problem will be hereinafter explained by referring to the specific case of a Census of Agriculture. This is probably the most important case, because it has a great economic relevance and a very large dimension. Moreover, in the case of EU countries, gathered information must be published and provided to the EU level, where it constitutes a basis for assigning financial resources, for planning production, and for several other economical European policies. However, the proposed model is not intrinsically limited to that case, but can be used for similar cases of Scope Selection problems.
In a Census of Agriculture, there is a very large list U = {u 1 , . . . , u n } of all the existing statistical units that could be surveyed. Each unit u i represents a farm, which is described by the areas used for every cultivation (plus other data not relevant for this work). Units have therefore the following structure: where a ij is the area that farm i uses for cultivation j, with i = 1, . . . , n and j = 1, . . . , m, and a iT is the total area of farm i used for cultivations, technically called Utilized agricultural Area (UA). Unfortunately, at least for the majority of the cases, the data in the list U are those that were surveyed during the last census, often held several years before, or were obtained by other possible sources that may still be outdated. Hence, the available data may very well be different from the current farm situation. In particular, the single cultivation areas a ij may easily have changed, while total cultivation area a iT of the farm is more stable.
For every cultivation j that we are interested in, including some types of livestock, a certain coverage level q j ≥ 0 and ≤ 1 is required, with j = 1, . . . , m. Value q j represents the minimum portion of the total area of cultivation j that must be surveyed: if this total area is n i=1 a ij , we need to survey at least an area q j n i=1 a ij of cultivation j (e.g. survey at least 0.8 of the total cultivation of oranges, at least 0.5 of the total cultivation of apples, etc.). A required coverage level q T for the total cultivation areas is also given. For the case of EU countries, coverage levels are generally assigned by European regulations, e.g. [9]. The set of coverage levels is denoted by Surveying unit u i has a cost w i (that can be evaluated either in terms of expense, or complexity added to the whole survey, or other) and produce, for each cultivation j, an amount of statistical information that, in absence of further elements, is estimated being equal to the available value of the cultivation area a ij . By defining the cost of a set of units to be the sum of their individual costs w i , we want to choose a subset S ⊆ U of units producing the minimum total cost for being surveyed and simultaneously respecting all the above defined m + 1 coverage levels. Hence, the problem cannot be decomposed in subproblems addressing a single cultivation type at a time. In order to represent whether to include or not a unit, we introduce a set of binary decision variables {x i }, with i = 1, . . . , n, such that Now, cost minimization can be expressed by maximizing the total cost of the units that we do not survey (i.e. the saving). Respecting the coverage levels, on the other hand, can be expressed by imposing that the area that is excluded cannot be more than the maximum area we are allowed to exclude. This latter condition should be imposed both for each cultivation and for the total area. The described Optimal Selection problem can be modeled as the following multidimentional binary knapsack problem.
Multidimentional binary knapsack is a well-known combinatorial optimization problem [21]; in its optimization version it is NP-hard. Note that a complementary choice for the meaning of the x i variables (1 if unit u i is included in the scope, 0 otherwise), that may appear more straightforward, would have led to a multidimensional packing problem [21], that has the same complexity level. However, the proposed modeling choice will have less variables at 1, with consequent computational advantages. Model (1) has a number of variables equal to the number n of units in list U and a number of constraints equal to the number of coverage levels m + 1, so it may reach in practical cases a very large dimension. Therefore, solving such an integer linear program by means of a simple Branch&Bound approach can be excessively time consuming (see e.g. [12]), and we use a Branch&Cut approach based on the generation of covering inequalities, as follows. Given a single knapsack constraint (1), a set C ⊆ {1, . . . , n} is a cover if i∈C a ij > e j .

OPTIMAL SELECTION OF STATISTICAL UNITS 5
Given a cover C, we can write the following covering inequality expressing that not all variables x i , for i ∈ C, can be simultaneously one: Covering inequalities of the above type are valid for a problem containing the knapsack constraint 1,15,24]). Therefore, an inequality of type (2) can be added to model (1) for improving the formulation given by its linear relaxation. In the case of the described Census, the meaning of the above inequality is that we simply cannot exclude from the survey a set C of units having a total area that is too large. It is evidently impracticable to generate all inequalities in the form (2), since their number can be too large. Therfore, we generate covering inequalities within a Branch&Cut scheme (see e.g. [21]). This requires to solve recursively the so called separation problem, that is, given a pointx ∈ R n and a polytope K, either to prove thatx ∈ K or find an inequality, called cut or cutting plane, that is valid for K but cutsx away from K.
Denote by c the incidence vector of the generic subset C of {1, . . . , n}, i.e. the binary n-vector whose i-th element is 1 if i ∈ C, 0 otherwise. By recalling that a ij ∈ R and ≥ 0, set C is a cover for the j-th knapsack constraint of problem (1) if and only if its incidence vector c satisfies the following condition that, by introducing > 0 equal to the minimum possible difference in the a ij values, can be rewritten as Among all possible covers, we want a cover C such that the components ofx corresponding to elements of C sum to a value > |C| − 1, that meansx can be cut away by the cutting plane generated by C. Cover C represents in practice a set of units that cannot be simultaneously excluded from the survey, but are actually excluded in solutionx. The condition of cutting awayx can be expressed as follows: Putting together condition (3) and (4) we have the following optimization problem encoding our separation procedure for the j-th knapsack constraint of problem (1).
When model (5) is solved to optimality, we obtain vector c and the corresponding If v is < 1, there exists a covering inequality that is valid for K but cuts awayx from K, and c is the incidence vector of the cover C generating that cover inequality. On the contrary, if v is ≥ 1, we cannot obtain from the j-th knapsack constraint a covering inequalities cutting awayx.
In this latter case, we must try to obtain it from another one of the knapsack constraints of (1). If such a covering inequality cannot be obtained after all the knapsack constraints have been tested, it does not exist. Therefore, solving several problems in the form (5) may be needed. Although (5) is a binary problem with n variables, it can be solved quite easily, since any feasible solutionĉ of of (5) such that the corresponding objective valuev is < 1, even if sub-optimal, is the incidence vector of a coverĈ whose cover inequality cuts awayx from K. Therefore, we may accept those kind of solutions, and search them with the following procedure. Recall that every variable c i has a cost given by (1 −x i ) and a value a ij . We denote by RHS the right-hand side of the only constraint in (5).
Greedy Algorithm for the solution of problem (5) Input An instance of separation problem (5), defined by the cost n-vector (1 −x), the values n-vector a j and a value RHS. Output A binary feasible (and possibly optimal) solutionĉ to (5) 1 Order by increasing cost/value ratio the indices of the binary variables. 2 Following the above greedy order, putĉ i = 1 until the left-had side of the constraint becomes ≥ RHS (i.e. we have a feasible solution), andĉ h = 0 for all the rest of the indices. If the value of this solution isv < 1, there exists a covering inequality that is valid for K but cuts awayx from K, andĉ is the incidence vector of the cover generating it.
The above heuristic solution could be evaluated by using the lower bound given by the solution r of the linear relaxation of problem (5): we put r i = 1 until the left-had side of the constraint remains ≤ RHS (i.e. we have a maximal infeasible solution), then r (i+1) = (RHS − LHS)/a (i+1)j , and finally r h = 0 for all the rest of the indices. If the objective value corresponding to r is v r ≥ 1, the value v of the integer solution of (5) is ≥ v r , so we know that the j-th knapsack constraint cannot provide a covering inequalities cutting awayx. On the contrary, when v r < 1 but v ≥ 1, a covering inequalities cutting awayx may exist, but was not found by the procedure. Nonetheless, we try to obtain it from another one of the knapsack constraints of (1), and if none of them can provide it, we simply branch following the mentioned Branch&Cut scheme. This technique guarantees to reach the optimal binary solution of model (1).
3. Determining Reliable Inclusion Criteria. Since data are uncertain, an optimal solution x of model (1) cannot guarantee providing a set of units really respecting the required coverage levels. Indeed, as a trivial example, if the real cultivation areas of some of the selected farms have become smaller than what described by the available data a ij , the risk of undercoverage (i.e. failing the required coverage levels q j ) is present. Hence,we need to distinguish between: • solving the Optimal Selection problem, that is solving to optimality model (1); • solving the Scope Selection problem, that is finding the set of units that we use as scope in practice.
For solving the Scope Selection problem, we need to determine a priori inclusion criteria for selecting the set of units respecting the coverage levels. A priori means here criteria that, for each unit u i , could be checked before surveying u i . A basic and mostly adopted criterion is using thresholds. Given a threshold value t j , for each unit u i one could determine whether to survey it or not: we survey u i if a ij ≥ t j , we do not survey it otherwise. Since in the analyzed case the total utilized area (UA) is the more reliable among the available informations, it was preferred to establish a threshold t only on that value. The coverage levels, initially required by EU [9] and assigned for the whole Nation, were modified and slightly increased so as to determine more specific coverage levels assigned for each Region. Those new levels were determined by experts of the field according to specific features of the different regions, whose description goes beyond the aim of this work. The final established regional coverage levels, for Citrus plantations, Fruit trees cultivation, Olive cultivations, Arable land, Vineyard cultivation and UA, are reported in Table 1. A'-' denotes that the value is irrelevant because that cultivation is not used in that region. In order to satisfy the above regional coverage levels, one may in general consider different options. A first one could be solving the regionals Optimal Selection problems, in order to determine, for each region, a selected set of units U . After this, use U to determine the value a iT of UA corresponding to the smallest included farm of the region, then compute how reliable that value is, possibly modify it for having a safety margin, and use it as inclusion threshold (at the regional level).
Another alternative would be fixing, according to some predetermined decision, a number of threshold values on the total utilized area (UA), and solve the regional Optimal Selection problems for each of them. Given a threshold t, denote by U t the set {u i ∈ U : a iT ≥ t} and by U t the result of the Optimal Selection on U t (the set of units u i ∈ U t having x i = 0). Define R t = U t − U t the set of units that satisfy threshold t but are not in the solution of the Optimal Selection (i.e. the set of units u i ∈ U t having x i = 1). Sets U t and R t can now be used for computing statistical indicators that evaluate the safety margin with respect to the risk of undercoverage obtained by using t as inclusion threshold. After this, the threshold value t that, among the predetermined ones, corresponds to the best compromise between risk estimation and list reduction for the region, is selected and adopted as inclusion criterion (again at the regional level). The solution to the Scope Selection problem would therefore be U t This last option was preferred in the analyzed case, because it could provide more robustness in the procedure, in the sense of making it more stable and less prone to the changes that may have occurred in the data.
We also remark that a third alternative approach could be based on the computation, for each unit u i , of some upper bound UB ij and lower bound LB ij of the amount of information (i.e., the currently used cultivation area) that unit u i brings for cultivation j. When LB ij , we could reasonably expect that the coverage level q j is respected. Determining upper and lower bounds for the probability of an event logically related to a set of other events is a well-known problem in Statistics and Probability theory for which a number of techniques have been proposed (see, e.g., the seminal work [14], and the recent [6]). However, this option was not selected because of the heavy assumptions required for those computations.
We now describe the statistical indicators that were built by experts of the field for evaluating the safety margin from the risk of undercoverage corresponding to each threshold t. A basic index number is the percentage of farms taken in addition to U t when using threshold t, denoted by β(t) and computed as follows.
The larger the value of β(t), the more threshold t is able to provide a set U t that is bigger than the minimum set respecting the coverage levels, and consequently the more secure is the selection by using threshold t. Another index number, for each cultivation j, is the percentage of cultivation area taken as safety margin when using threshold t, denoted by γ j (t) and computed as follows.
Again, the larger the value of γ j (t), the more threshold t is able to provide an area ui∈Ut a ij that is bigger than the minimum area respecting the coverage levels, and consequently the more secure is the selection by using threshold t. Clearly, the higher the values of t, the smaller the above safety margins become, but the larger the savings in the survey are. Therefore, we need to chose the higher t still having acceptable values for the above indicators, so as to obtain the maximum savings with negligible risk.
Given a set of farms U t , define S j (U t ) = {u i ∈ U t : a ij > 0} to be the subset of farms in U t having cultivation j. Denote now by µ{S j (U t )} the average area of cultivation j over set S j (U t ). A third indicator, for each cultivation j, is the average number of units in R t that are needed to obtain a portion of information that is equivalent to the portion of information given by an average unit of U t , or, in other words, the average number of units in R t needed to replace a unit in U t (for example in case the latter one does not exist anymore). That will be denoted by ω j (t) and computed as follows.
The ω(t) is clearly ≥ 1, and the smaller the values, the more robust is the choice of threshold t. Some values of β, γ and ω for the considered real-world case are reported in the following Table 4. 4. Experimental Results. The described procedure has been implemented in C++ and tested for the treatment of data from the Italian Census of Agriculture 2010 (VI Censimento Generale dell'Agricoltura 2010). The experiments were conducted on a 16 cores server having 128Gb of RAM under Linux Operating System. The linear relaxations of (1) are solved by means of the open source solver Clp (Coin-or linear programming, available from https://projects.coin-or.org/Clp), which is a very good implementation of primal and dual simplex and barrier methods, written in C++ by a research group headed by Dr. John J. Forrest, from the IBM Watson Research Center, within a joint project among IBM, Maximal and Schneider called COIN-OR (COmputational INfrastructure for Operations Research, http://www.coin-or.org/index.html). This solver was selected because it appeared the most suitable open source LP solver in a previous study [5]. As described in the previous Section, a number of predetermined threshold levels on the total utilized area (UA) have been used. Their values were 0.0 (meaning that all farms are included, even the smallest ones); 0.1; 0.2; 0.3; 0.4 hectares. Table 2 reports the detail of this analysis for one sample Italian region (Marche). Evidently, when increasing the inclusion threshold t, the cardinality of U t decreases consistently (less farms satisfy that threshold). On the other hand, the cardinality of U t does not decrease. It generally remains the same, even if it may occasionally slightly increase. This happens because, intuitively, when searching an optimal solution U t within U t , the smaller is U t , the less choices we have for selecting the minimum-cost solution satisfying the coverage levels.
Consequently, the differences between U t and U t tend to become smaller, and the described β, γ tend to decrease. Therefore, when increasing t, there is a tradeoff between the reduction in the cost and complexity of the Census, caused by the decreasing of |U t |, and the rise in the risk of undercoverage, evaluated by the β and γ index numbers. In the case of a Census, the priority is having an acceptable risk level, so we are interested in maximizing the savings obtainable in correspondence to acceptable values of β and γ. Acceptable values for the β and γ indicators have been considered those respectively above 10% and 0.5%. On the contrary, values for the ω indicator should be as small as possible. Hence, for the case of Marche region, the best compromise is t = 0.4, corresponding to a very acceptable risk level (β and γ are considerably above the lowest acceptable values) but producing considerable savings: the cost of surveying 66,536 -60,309 = 6,254 farms, that is about 10%.  (1) to optimality; the value of threshold t selected among the predetermined values as the best compromise between list reduction and risk of undercoverage; the number |U t | of existing statistical units that are above threshold t; the number |U t | of units selected from U t when solving model (1) to optimality; computational time in seconds for the overall treatment of the region, including the solution of the five Optimal Selection problems and the evaluation of the described indicators (Time All); computational time in seconds for solving to optimality the single Optimal Selection problem (1) corresponding to t .  Table 4 reports, for each Italian Region, the values of some of the described statistical indicators corresponding to the selected threshold t . In particular, we report β, γ vineyard , ω vineyard , γ olive , ω olive , γ T .
Vineyard and olive were selected because they are particularly important, being probably the two most typical Italian cultivations, and being subject to several EU regulations. In the real Census application several other cultivations were also considered.  Table 5, finally, summarizes the Italian situation. It reports, for the case of threshold t = 0.0 (all U ) and for the threshold t reported for each region in Table 3, the total number of farms, their total utilized area UA, the number of farms obtained for the optimal solution, and their total UA. As showed in this last Table, when using as inclusion criterion the t values of Table 3, we have a reduction in the number of farms of 206,679, corresponding to a saving of 7.97%, that is worth of note (considering the large cost of a Census), and a reduction in the cultivation area (Total UA) of about 50,948 hectares, corresponding to a loss of only 0.39% of the total information, and this with a negligible risk of failing the required coverage levels, as observable from the values of β and γ in Table 4. From the same Table 5 we also observe that, if the cultivation data were updated and reliable, the set U 0 could have been surveyed directly, with a reduction in this case of 950,506 farms, corresponding to an even larger saving of 36.63%, and a reduction in the cultivation area (Total UA) of about 204,848 hectares, corresponding to a loss of only 1.55% of the total information, with the guarantee of respecting the coverage levels. These results were possible because of the existence, in the Italian territory, of a large number of very small farms that however constitute only a very small portion of the total national cultivation area. Such a set of farms would be quite expensive to survey but would not provide an amount of information justifying that spending, so their detection has allowed considerable savings. We moreover stress that the computational times required to solve those large-sized problems with the proposed procedure are extremely moderate.

5.
Conclusions. We proposed an innovative approach to the Scope Selection problem based on Combinatorial Optimization. The proposed multidimensional knapsack model can be solved to optimality in short times by means of a Branch&Cut algorithm based on the generation of cover inequalities. The procedure has been implemented and tested on the real-world case of an Italian national agricultural Census. By solving the Optimal Selection problem in different conditions, statistical indicators for the determination of reliable inclusion criteria based on thresholds have been computed. The proposed approach allows to considerably reduce costs and complexity of the survey while ignoring only a very small portion of the whole information that can be surveyed. The risk of failing the required coverage levels, i.e., the risk that such ignored portion was larger than the maximum admissible portion that we are authorized to ignore, is negligible.