Random Forests and the measurement of super-efficiency in the context of Free Disposal Hull



Introduction
Technical efficiency determination with respect to the performance of any kind of private firm or public organization has been a topic of interest for company managers, production engineers and policy makers since the beginning of the 20th century (Cobb & Douglas, 1928). In this framework, the objective is to evaluate the efficiency level of a group of observations, known as Decision Making Units (DMUs), which consume inputs to yield outputs in a black-box analysis, since how the outputs are obtained from the inputs, i.e., the state of the underlying technology, is usually unknown to any external observer (Cook & Zhu, 2014; Lozano & Khezri, 2021). The modern techniques for gauging technical efficiency are all based on evaluating the distance from each DMU to the (upper) boundary of the technology, where the technology represents the set of producible points in the input-output space. The points that belong to this boundary, also called the production frontier, are associated with the best (real or virtual) practices in performance and characterize the technology. The determination of the frontier is therefore usually a key step in measuring technical efficiency from a data sample of DMUs.
At the present time, standard techniques for determining production frontiers tend to be based on parametric and non-parametric methods. While the parametric approach needs to specify the mathematical expression of the production function, depending on a series of parameters to be estimated, and adopt a set of assumptions regarding the error and efficiency terms, the non-parametric counterpart relies only on a few postulates and is more data-driven. Additionally, non-parametric techniques have the advantage of coping with multi-output situations in an easier way. Whereas non-parametric methods are naturally able to consider all the outputs in the model at the same time, parametric techniques need to resort to the concept of distance function and a series of additional assumptions (Orea & Zofio, 2019). In this paper, we focus our attention on non-parametric methods. In this respect, some of the best-known non-parametric approaches for estimating production frontiers are Data Envelopment Analysis (DEA) (Banker et al., 1984; Charnes et al., 1978) and Free Disposal Hull (FDH) (Deprins et al., 1984). Under Data Envelopment Analysis, the production frontier is piecewise-linear, whereas under Free Disposal Hull, the production frontier is a step function. DEA is grounded on the definition of a production possibility set from the satisfaction of a few axioms. Convexity is one of them. Another important postulate is free disposability. Free disposability establishes that if an input-output vector is feasible, then any input-output vector that has a lower value for outputs and a greater value for inputs is also feasible. In some respects, it means that doing worse is always feasible. Additionally, DEA is assumed to be deterministic. This means that the input and output values of the DMUs that belong to the data sample have been observed without noise. Consequently, the production frontier associated with the DEA production possibility set always (upper) envelops all the observations in the data sample. However, many estimators satisfy all these axioms and, consequently, more requirements are needed to define a suitable estimator. In particular, 'minimal extrapolation' is usually assumed. This postulate requires the estimator to envelop the data cloud as closely as possible to the observations. Of course, this last postulate is also responsible for the overfitting problem associated with DEA: the estimated piecewise-linear surface fits the data sample like a glove. Deprins et al. (1984) introduced the alternative approach known as Free Disposal Hull (FDH), which is based only on free disposability and minimal extrapolation, while DEA assumes these two postulates plus convexity. These features lead to an FDH estimator of the production frontier with a staggered shape that, similarly to Data Envelopment Analysis, suffers from overfitting.
The overfitting problem exhibited by DEA and FDH is related to the ineffectiveness of these techniques when the objective is inferential in nature. DEA and FDH are descriptive tools of the observed data sample at a frontier level. It can be assumed that a Data Generating Process (DGP) yielded the observations (the DMUs). For many scientists, the ability to predict accurately is a crucial measure of a model's worth. If the considered approach cannot predict well, the model does not capture the essential patterns of the original DGP. Such a model may have little value, unless the research objective is merely descriptive for the observations. Indeed, a model that fits the data sample well, like DEA or FDH, will not necessarily estimate well for units not observed in the sample. In statistical learning, the so-called out-of-sample performance is as important as the in-sample performance of the considered model. The prediction or generalization error of a model ought to be evaluated by resorting to a different test data sample, as opposed to using the same data employed in the model's fitting process (for example, by applying cross-validation; see Browne, 2000).
The endowment of DEA and FDH with certain features linked to prediction and inferential techniques has been a relevant topic in the literature over the last decades. Banker & Maindiratta (1992) and Banker (1993) proved some relationships between the DEA estimator and Maximum Likelihood in statistics. Simar & Wilson (1998, 2000a, 2000b) adapted the bootstrapping methodology by Efron (1979) to the framework of FDH and DEA. Kuosmanen & Johnson (2010) and Kuosmanen & Johnson (2017) defined Corrected Concave Non-Parametric Least Squares. Hall & Huang (2001) and Du et al. (2013) estimate production frontiers by applying kernel-based approaches and local regression techniques. Parmeter et al. (2014) estimate technical efficiency by resorting to local linear generalized kernel regression. See also Henderson and Parmeter (2009). However, none of these approaches solve the problem through machine learning methods, in spite of machine learning being one of the most natural theoretical frameworks to consider, given the data-driven nature of Free Disposal Hull and Data Envelopment Analysis. Two recent exceptions are Esteve et al. (2020) and Valero-Carreras et al. (2021). Esteve et al. (2020) adapted Classification and Regression Trees (CART) to a standard production context through step functions, thereby introducing an alternative to FDH. Valero-Carreras et al. (2021) introduced Support Vector Frontiers (SVF) based on Support Vector Regression (SVR), competing against DEA.
One weakness of DEA and FDH is a complete lack of discriminatory power between observations rated as efficient. This drawback is more evident when the relationship between the number of observations and the number of variables (inputs and outputs) is inadequate (i.e., the typical problem of the 'curse of dimensionality'; see, for example, Charles et al., 2019), because many units obtain the same relative status as technically efficient. The contribution by Andersen & Petersen (1993) was the first in the literature to address this question: if all these units are apparently equal regarding their performance, is there any way to distinguish between them? In particular, these authors introduced the notion of super-efficiency. The idea behind this concept is to evaluate each DMU with respect to all other units in the data sample, i.e., the evaluated observation itself is excluded from the analysis. In other words, super-efficiency determines efficiency when the assessed observation is removed from the estimated technology. This fact can yield infeasibility for the observations under evaluation, depending on the assumptions made on the definition of the technology and the efficiency measures used (see Angulo-Meza & Lins, 2002). In addition to ranking technically efficient units, the notion of super-efficiency has also been applied to detect outliers (Banker & Chang, 2006), to guarantee an adequate incentive system in regulation theory (Agrell & Bogetoft, 2017), or, more recently, to discriminate between cost efficient observations (Kerstens et al., 2022).
In this paper, we build a bridge between machine learning and the measurement of super-efficiency by adapting Random Forest (Breiman, 2001) to the production context. Random Forest (RF) is an ensemble method in statistical learning for classification (for a categorical response variable) or regression (for a continuous response variable) that creates a list of decision trees, each fitted on a random subsample of the original database and considering a random selection of predictor variables. When the method is used for prediction, the individual predictions of each decision tree are aggregated through the arithmetic mean. RF usually yields a better predictive performance than any of the constituent individual decision trees. In our approach, we will resort to the extension of the model by Esteve et al. (2020), who adapted the well-known Classification and Regression Trees technique (Breiman et al., 1984) for estimating production frontiers in microeconomics, giving rise to a technique known as Efficiency Analysis Trees (EAT) that outperforms the standard FDH approach. See also Badiezadeh et al. (2018), Charles et al. (2020), Lee & Cai (2020), Olesen & Ruggiero (2022), Zhu (2019), Zhu et al. (2018) and Misiunas et al. (2016) to appreciate the current interest of the Data Envelopment Analysis community in the relationship between efficiency and machine learning.
Nevertheless, as is well known in machine learning (Berk, 2016), estimators generated from (individual) decision trees frequently present low bias but high variance. For this reason, techniques based on decision trees can be improved through more complex approaches, such as Random Forest (Breiman, 2001). In general, considering many randomized models and aggregating them through ensembles decreases the error of the estimator by reducing its spread. Accordingly, our intention in this paper is to adapt RF to determine super-efficiency by aggregating the predictions of hundreds of trees fitted by the EAT technique, exploiting the randomization of data and input variables.
The main contributions of this paper are, therefore, associated with the implications of adapting Random Forest to determine super-efficiency for a set of DMUs. First, we propose a new technique that outperforms the standard super-efficiency model under FDH regarding both the discrimination between efficient units and the typical problem of infeasibility, as will be shown through a computational experiment. This result is a consequence of the robustness of the new approach to the resampling of data and input variables. Second, a method for calculating the importance of the input variables, when the determination of super-efficiency is the focus of the model, is suggested. It is based on the formulation usually utilized in standard RF. Third, we will also show that the adaptation of Random Forest can be a solution for the well-known curse of dimensionality problem established in the literature.
The structure of the paper is as follows. In Section 2, we briefly introduce the background: the Free Disposal Hull (FDH) methodology, the notion of super-efficiency, the Efficiency Analysis Trees (EAT) technique for estimating multi-output production frontiers and, finally, the standard Random Forest in machine learning. Section 3 is devoted to adapting Random Forest to the framework of super-efficiency evaluation by resorting to EAT through randomization on data and input variables. The new technique will be called RF + EAT. In this same section, the performance of the adapted Random Forest is checked through a computational experiment. Section 4 is devoted to studying the curse of dimensionality problem and showing how RF + EAT can be seen as a possible solution in practice. In Section 5, we introduce the algorithm for determining a ranking of the importance of the model's input variables. Section 6 shows an empirical illustration of the new method. Section 7 establishes the main conclusions and future research lines of this paper.

Free Disposal Hull (FDH) and super-efficiency
We assume that $n$ Decision Making Units (DMUs) have been observed, each consuming $m$ inputs $\mathbf{x}_i = (x_{i1}, \ldots, x_{im}) \in \mathbb{R}^m_+$ to yield $s$ outputs $\mathbf{y}_i = (y_{i1}, \ldots, y_{is}) \in \mathbb{R}^s_+$ (non-bold symbols denote scalars and bold symbols denote vectors). A main notion in production theory is the technology, also called the production possibility set. The technology is the set of technically feasible combinations of $(\mathbf{x}, \mathbf{y})$ and is defined in general as follows:

$$\Psi = \{ (\mathbf{x}, \mathbf{y}) \in \mathbb{R}^{m+s}_+ : \mathbf{x} \text{ can produce } \mathbf{y} \}.$$

Free disposability of inputs and outputs is usually assumed on the set $\Psi$.

Definition 1.
$\Psi$ satisfies free disposability if $(\mathbf{x}, \mathbf{y}) \in \Psi$ implies $(\mathbf{x}', \mathbf{y}') \in \Psi$ for any $(\mathbf{x}', \mathbf{y}')$ with $\mathbf{x}' \geq \mathbf{x}$ and $\mathbf{y}' \leq \mathbf{y}$.

In light of Definition 1, free disposability establishes that if $(\mathbf{x}, \mathbf{y})$ is feasible, then any $(\mathbf{x}', \mathbf{y}')$ that has a greater value for inputs and a lower value for outputs is also feasible. For other postulates also assumed in microeconomics, see Färe and Primont (1995).
The technical efficiency level of a DMU $i$ is defined as the distance between the vector $(\mathbf{x}_i, \mathbf{y}_i)$ and the production frontier $\partial(\Psi)$, defined as

$$\partial(\Psi) = \{ (\mathbf{x}, \mathbf{y}) \in \Psi : (\mathbf{x}, \lambda \mathbf{y}) \notin \Psi, \ \forall \lambda > 1 \}.$$

The Data Generating Process (DGP) behind this production context is as follows (see Daraio & Simar, 2007). It is assumed that we observe a (learning) sample $\aleph = \{ (\mathbf{x}_i, \mathbf{y}_i) \}_{i=1}^{n}$ of an identically and independently distributed random vector $(X, Y)$ with a certain unknown joint distribution with support $\Psi$, the technology, where $X$ represents the inputs and $Y$ represents the outputs. Two different approaches have mainly been developed in the literature regarding DGPs: the deterministic frontier model, which assumes that all the observations in $\aleph$ belong to $\Psi$, and the stochastic frontier model, where random noise allows some observations to be outside of $\Psi$. The standard FDH focuses on the deterministic frontier model. Consequently, in this paper, we will follow the same assumption.
A common technical efficiency measure in the literature is the output-oriented radial model, which coincides with the inverse of the well-known Shephard output distance function (Shephard, 1953). The output-oriented radial model defines technical efficiency for the vector $(\mathbf{x}_k, \mathbf{y}_k)$ as the maximum equiproportional increase in the outputs that is feasible while inputs are kept fixed:

$$\varphi(\mathbf{x}_k, \mathbf{y}_k) = \sup \{ \varphi > 0 : (\mathbf{x}_k, \varphi \mathbf{y}_k) \in \Psi \}. \qquad (2)$$

Nowadays, there are several methods for estimating $\Psi$ or its production frontier and, consequently, the value of $\varphi(\mathbf{x}_k, \mathbf{y}_k)$ may be estimated by plugging the estimation of $\Psi$ into expression (2). One of these methods is the Free Disposal Hull (FDH) technique by Deprins et al. (1984). The FDH estimator only assumes free disposability and the minimal extrapolation principle. In contrast, Data Envelopment Analysis also assumes convexity. However, the convexity assumption is not always valid in practice (see Kerstens et al., 2019). The FDH estimator of the technology is:

$$\hat{\Psi}_{FDH} = \{ (\mathbf{x}, \mathbf{y}) \in \mathbb{R}^{m+s}_+ : \exists i \in \{1, \ldots, n\} \text{ such that } \mathbf{x} \geq \mathbf{x}_i, \ \mathbf{y} \leq \mathbf{y}_i \}.$$

Fig. 1 shows an example of the FDH estimator, with its typical non-decreasing stepwise shape.
Note, in Fig. 1, that the frontier estimated by FDH satisfies free disposability and suffers from overfitting. This last feature is a consequence of the minimal extrapolation principle.
Resorting to the FDH estimator, the output-oriented radial efficiency score $\varphi_{FDH}(\mathbf{x}_k, \mathbf{y}_k)$ is determined by the following optimization model:

$$\varphi_{FDH}(\mathbf{x}_k, \mathbf{y}_k) = \max \ \varphi \quad \text{s.t.} \quad \sum_{i=1}^{n} \lambda_{ki} x_{ij} \leq x_{kj}, \ j = 1, \ldots, m; \quad \sum_{i=1}^{n} \lambda_{ki} y_{ir} \geq \varphi y_{kr}, \ r = 1, \ldots, s; \quad \sum_{i=1}^{n} \lambda_{ki} = 1; \quad \lambda_{ki} \in \{0, 1\}, \ i = 1, \ldots, n, \qquad (4)$$

with $\varphi_{FDH}(\mathbf{x}_k, \mathbf{y}_k) = 1$ indicating technical efficiency. Andersen & Petersen (1993) introduced a procedure to rank technically efficient DMUs (in our context, units satisfying the condition $\varphi_{FDH}(\mathbf{x}_k, \mathbf{y}_k) = 1$). The so-called super-efficiency model allows an efficient unit to achieve an efficiency score less than or equal to unity by removing this observation from the set of DMUs that define the production possibility set. Mathematically speaking, in model (4), it is enough to force $\lambda_{kk} = 0$. In this way, the unit $(\mathbf{x}_k, \mathbf{y}_k)$ no longer influences the left-hand side of the constraints of the optimization model. Accordingly, the super-efficiency scores obtained for the observations in the data sample can be used to rank the units and break ties.
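Model (4) and its super-efficiency variant can also be solved by simple enumeration, without a mixed-integer solver, since the binary intensity variables simply select one reference unit. The following Python sketch (our own illustration, not the authors' implementation; function names, variable names and the toy data are ours) computes both scores and returns infinity when the super-efficiency model is infeasible.

```python
import numpy as np

def fdh_output_score(X, Y, k, super_efficiency=False):
    """Output-oriented radial FDH score of unit k, model (4), by enumeration.

    X: (n, m) inputs, Y: (n, s) outputs (strictly positive outputs assumed).
    Returns phi; np.inf signals infeasibility of the super-efficiency model.
    """
    candidates = []
    for i in range(X.shape[0]):
        if super_efficiency and i == k:
            continue                      # remove the evaluated unit (lambda_kk = 0)
        if np.all(X[i] <= X[k]):          # unit i uses no more of any input than unit k
            candidates.append(np.min(Y[i] / Y[k]))  # largest radial expansion supported by i
    return max(candidates) if candidates else np.inf

# toy example: 4 units, 2 inputs, 1 output
X = np.array([[2.0, 3.0], [4.0, 2.0], [5.0, 5.0], [6.0, 6.0]])
Y = np.array([[4.0], [5.0], [6.0], [5.5]])
for k in range(4):
    print(k, fdh_output_score(X, Y, k), fdh_output_score(X, Y, k, super_efficiency=True))
```

In the toy data, the first two units have no other unit dominating them in inputs, so their super-efficiency scores are undefined (infeasible), which is precisely the drawback discussed later in the paper.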

Efficiency Analysis Trees (EAT)
Efficiency Analysis Trees (Esteve et al., 2020) is a machine learning method aimed at determining efficiency and production frontiers. It is based on Classification and Regression Trees (CART) by Breiman et al. (1984). The main idea behind EAT is the building of a tree structure that starts with a root node (containing all the observations), develops through intermediate nodes, which split the dataset, and ends at leaves (terminal nodes), which provide an estimate for the outputs of the production process. In EAT, a criterion is chosen to recursively generate binary partitions of the observations. At a parent node to be split, this criterion coincides with the minimization of the sum of the mean squared error (MSE) of the two generated child nodes.

Specifically, given a data sample $\{(\mathbf{x}_i, \mathbf{y}_i)\}$, $i = 1, \ldots, n$, EAT aims to estimate the value of the outputs through the inputs while satisfying free disposability. To do that, for each parent node to be split, EAT selects an input variable $j$, $j = 1, \ldots, m$, and a threshold $s_j \in S_j$, where $S_j$ is the set of possible thresholds for variable $j$ (usually defined as the set of observed values of input $j$ in the sample), such that the sum of the MSE of the two child nodes is minimized. Let $t$ be the parent node to be split and let $t_L$ and $t_R$ be the left and right child nodes of $t$, respectively. EAT chooses the best combination $(x_j, s_j)$ by minimizing:

$$\frac{1}{n} \sum_{(\mathbf{x}_i, \mathbf{y}_i) \in t_L} \sum_{r=1}^{s} \left( y_{ir} - y_r(t_L) \right)^2 + \frac{1}{n} \sum_{(\mathbf{x}_i, \mathbf{y}_i) \in t_R} \sum_{r=1}^{s} \left( y_{ir} - y_r(t_R) \right)^2,$$

where $n$ is the sample size and $y_r(t_L)$ and $y_r(t_R)$ are the estimates of the output $y_r$, $r = 1, \ldots, s$, for the data in nodes $t_L$ and $t_R$, respectively. In EAT, these estimates are calculated from the maximum observed values of each output in the corresponding node and from the estimates associated with the set of leaf nodes of the tree generated after executing the $k$-th split that Pareto-dominates node $t_L$. The Pareto dominance of nodes is a notion introduced by Esteve et al. (2020) to guarantee the satisfaction of free disposability during the growing process of the tree. By analogy, similar formulas may be derived for the right child node. Once EAT generates a split, the process is repeated by randomly selecting a new node to be split. The stopping rule of the growing process is that $n(t) \leq n_{min} = 5$ for all leaf nodes, where $n(t)$ is the sample size of node $t$. The set of leaf (terminal) nodes of a tree $T_{EAT}(\aleph)$, determined by the EAT algorithm from the dataset $\aleph$, is usually denoted as $\tilde{T}_{EAT}(\aleph)$.
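The split-selection step just described can be sketched as follows. This is a simplified illustration written by us, assuming each child's estimate is simply the per-output maximum of its observations; the full EAT algorithm additionally raises these estimates using the Pareto-dominance relation between nodes so that free disposability holds globally.

```python
import numpy as np

def best_eat_split(X_node, Y_node, n_total):
    """Choose the (input, threshold) pair minimizing the sum of the children's MSE.

    Simplification: each child's estimate is the per-output maximum of its observations;
    EAT additionally adjusts these estimates through Pareto-dominance between nodes.
    """
    best = None                                   # (mse, input index, threshold)
    for j in range(X_node.shape[1]):
        for s in np.unique(X_node[:, j]):
            left = X_node[:, j] < s
            right = ~left
            if not left.any() or not right.any():
                continue
            mse = 0.0
            for mask in (left, right):
                y_hat = Y_node[mask].max(axis=0)              # node estimate per output
                mse += ((Y_node[mask] - y_hat) ** 2).sum() / n_total
            if best is None or mse < best[0]:
                best = (mse, j, s)
    return best                                    # None if the node cannot be split

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(20, 2))
Y = (X.sum(axis=1) * rng.uniform(0.6, 1.0, size=20)).reshape(-1, 1)
print(best_eat_split(X, Y, n_total=20))
```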
Through the splitting process, each terminal node $t$ is also associated with a region of the input space of the form

$$\{ \mathbf{x} \in \mathbb{R}^m_+ : a_{tj} \leq x_j < b_{tj}, \ j = 1, \ldots, m \},$$

where $a_{tj}$ and $b_{tj}$ are the corners of this region, as illustrated in Fig. 2. Exploiting this definition, the estimation of the output vector corresponding to an input vector $\mathbf{x}$, not necessarily observed in the sample, is given by the estimate of the leaf node whose region contains $\mathbf{x}$. However, the process described above, linked to EAT, can also suffer from overfitting, as FDH does. The tree grown can be too large and the estimates can be useful only for descriptive purposes. To overcome this problem, Esteve et al. (2020) proposed to prune the deep tree by using cross-validation, as in Breiman et al. (1984). The pruning process avoids the problem posed by overfitting. Let $T^*_{EAT}(\aleph)$ be the 'optimal' tree obtained through the algorithm and let $d_{T^*_{EAT}(\aleph)}(\mathbf{x})$ be the corresponding multi-dimensional estimator. In this way, we define the technology induced by $d_{T^*_{EAT}(\aleph)}(\mathbf{x})$ as:

$$\hat{\Psi}_{T^*_{EAT}(\aleph)} = \{ (\mathbf{x}, \mathbf{y}) \in \mathbb{R}^{m+s}_+ : \mathbf{y} \leq d_{T^*_{EAT}(\aleph)}(\mathbf{x}) \}.$$

Under Efficiency Analysis Trees, the score $\varphi(\mathbf{x}_k, \mathbf{y}_k)$ may be determined by plugging $\hat{\Psi}_{T^*_{EAT}(\aleph)}$ into (2) in place of $\Psi$. In particular, Esteve et al. (2020) prove that $\varphi(\mathbf{x}_k, \mathbf{y}_k)$ can be determined by solving an optimization model analogous to (4), defined over the leaf nodes of $T^*_{EAT}(\aleph)$.
From a data visualization perspective, EAT is always able to graphically represent the output estimator by a tree structure, even under problems with a high number of inputs and outputs. Additionally, in low dimensions, it is possible to draw the estimator through a step function, like that shown in Fig. 3 . This characteristic shape makes the EAT estimator comparable with the traditional FDH estimator of production frontiers.
Comparing Figs. 1 and 3, we can also compare the estimators generated by FDH and EAT. Note how, although both techniques yield step functions as estimators of the production frontier, the EAT technique does not satisfy minimal extrapolation. Moreover, the computational simulations carried out in Esteve et al. (2020) showed that EAT outperforms FDH as an estimator of the true underlying production frontier.

Random Forest (RF)
Random Forest (RF) is an ensemble learning method that works by building a multitude of decision trees at training time and aggregating the information of the individual trees into a final estimation value (the mean in the regression problem) (Breiman, 2001). The training algorithm for RF applies a double technique of randomization: on the data (through bootstrapping) and on the selection of predictors. Given a learning sample ℵ of size n, RF repeatedly selects random samples of size n with replacement from the set ℵ. Then, the method fits trees to these samples but, to do that, it resorts to a modified tree learning algorithm that chooses, at each candidate split in the learning process, a random subset of the predictors. The reason for doing this is to decorrelate the individual trees that would be estimated from ordinary bootstrap samples: if a certain subset of predictors is very strong in predicting the response variable, these same predictors will be selected in many of the individual trees, causing them to become correlated. Aggregating (decorrelated) randomized models decreases the generalization error by reducing the variance. Mixing information from randomized models works better than fitting a single non-randomized model (Berk, 2016).
The typical steps that must be carried out in Random Forest are shown in Algorithm 1 (see Kuhn & Johnson, 2013). First, the researcher chooses the number of trees to be fitted (the parameter p). In practice, it is usually enough to consider 500 or 1000 trees. Second, p bootstrap samples must be generated, each one denoted as ℵ_q, q = 1, ..., p. For each bootstrap sample ℵ_q, a tree T_CART(ℵ_q) is fitted by the CART algorithm. However, in each split, the algorithm no longer considers all the possible predictors and their thresholds. Instead, mtry predictors are selected at random each time, and these become the variables considered for splitting the corresponding parent node into two child nodes. Although Random Forest is computationally intensive in nature, because it fits many trees, it is not necessary to prune each tree. The pruning process by cross-validation is substituted by the double technique of randomization, on the data and on the selection of predictors, and by the final aggregation of the predictions of all the trees. In general, larger trees may lead to less bias in the prediction, while aggregating decorrelated randomized decision tree models through ensembles reduces the variance of the predictor.
Once we have fitted all the trees by Algorithm 1, the prediction of the value of the response variable from a vector of values of the predictors (not necessarily an observed vector) is determined by averaging the individual predictions corresponding to each tree.
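As an illustration of the parameters of Algorithm 1 (p trees, mtry predictors per split, bootstrap resampling and OOB assessment), the following sketch fits a standard Random Forest for regression with scikit-learn on synthetic data; it is not the implementation used in the paper and all parameter values are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=9, noise=0.1, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,     # the p parameter (number of trees)
    max_features=1 / 3,   # mtry as a fraction of the predictors at each split
    bootstrap=True,       # samples of size n drawn with replacement
    oob_score=True,       # Out-Of-Bag estimate of the generalization performance
    random_state=0,
).fit(X, y)

print(rf.oob_score_)      # OOB R^2 of the ensemble
print(rf.predict(X[:1]))  # average of the 500 individual tree predictions
```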
An attractive feature of ensemble methods like Random Forest, which build models on bootstrap samples, is the use of the left-out samples ℵ\ℵ_q for estimating the generalization error in statistical learning. In particular, RF usually exploits the notion of Out-Of-Bag (OOB) error. The OOB estimate associated with (x_i, y_i) is calculated by assessing the prediction of the approach using only the individual trees fitted on bootstrap samples that do not contain (x_i, y_i).

From this definition, the generalization error is defined as the average of the OOB estimates calculated over all the observations in the learning sample ℵ:

$$err_{RF+CART}(\aleph) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \frac{1}{|K_i(\aleph)|} \sum_{q \in K_i(\aleph)} T_{CART}(\aleph_q)(\mathbf{x}_i) \right)^2,$$

where $K_i(\aleph) = \{ q \in \{1, \ldots, p\} : (\mathbf{x}_i, y_i) \notin \aleph_q \}$ and $|\cdot|$ denotes the cardinality of a set. $K_i(\aleph)$ represents the set of indices among the $p$ fitted trees such that the corresponding subsamples do not include $(\mathbf{x}_i, y_i)$.
The generalization error is useful, for example, for determining a measure of variable importance, which can be used to create a ranking of the predictors $x_1, \ldots, x_m$. To calculate the importance of predictor $x_j$, one can proceed as follows. Generate a new database, $\aleph^j$, identical to the original one ℵ, except that the values of variable $x_j$ are randomly permuted. Apply the Random Forest technique on this new 'virtual' learning sample $\aleph^j$. Determine the value of the generalization error linked to this last Random Forest, $err_{RF+CART}(\aleph^j)$. Finally, calculate the percentage by which the generalization error of the model increases when variable $x_j$ is shuffled, $100 \cdot (err_{RF+CART}(\aleph^j) - err_{RF+CART}(\aleph)) / err_{RF+CART}(\aleph)$.
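A minimal sketch of this importance calculation, following the retraining-on-permuted-data scheme described above (standard Random Forest on synthetic data; names and parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def oob_mse(X, y, seed=0):
    """OOB generalization error (MSE) of a standard Random Forest fitted on (X, y)."""
    rf = RandomForestRegressor(n_estimators=500, max_features=1 / 3,
                               oob_score=True, random_state=seed).fit(X, y)
    return np.mean((y - rf.oob_prediction_) ** 2)

def pct_increase(X, y, j, seed=0):
    """% increase of the OOB error after refitting on the sample with column j shuffled."""
    base = oob_mse(X, y, seed)
    X_perm = X.copy()
    X_perm[:, j] = np.random.default_rng(seed).permutation(X_perm[:, j])  # 'virtual' sample
    return 100.0 * (oob_mse(X_perm, y, seed) - base) / base

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
print([round(pct_increase(X, y, j), 1) for j in range(5)])
```

Large positive values indicate predictors whose shuffling degrades the model the most, i.e., the most important ones.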

A random forest of Efficiency Analysis Trees to determine super-efficiency
In this section, we extend the approach by Esteve et al. (2020) to the context of ensembles in order to provide super-efficiency scores with machine learning tools. In particular, we introduce how to adapt the standard Random Forest (RF) technique (Breiman, 2001) for measuring super-efficiency in the context of production theory, satisfying certain standard postulates, such as free disposability. Instead of assembling trees fitted by CART (Breiman et al., 1984), our approach will use trees fitted by EAT (Esteve et al., 2020). By the nature of RF, our proposal will be robust to data and input variable resampling.
Robustness has been a relevant topic in the literature on production theory. Regarding robustness with respect to the observed data, Simar & Wilson (1998, 2000a, 2000b) were the first to adapt bootstrapping to the DEA and FDH frameworks. In the case of these authors, the main goal was to approximate the sampling distribution of efficiency scores. In our case, we will resort to a technique called Bagging in statistical learning, which stands for Bootstrap Aggregation (Berk, 2016; LeBlanc & Tibshirani, 1996; Mojirsheibani, 1997, 1999). The algorithm behind Bagging works as follows. The algorithm uses bootstrap samples and, for each bootstrap sample, predictors are related to response variables through a certain statistical approach, for example, by fitting decision trees. The procedure linked to Bagging finishes when the predictions from all the individual samples are collected and aggregated to provide a final prediction of the response variables from a vector of the predictors. In regression problems, where the response variables are continuous in nature, the individual predictions are usually aggregated by applying the arithmetic average. Averaging over groups of estimations can augment their stability. The big difference between Random Forest, which also exploits Bagging, and 'pure' Bagging is that the former also utilizes resampling on the predictors, while the latter does not. In other words, the main difference between these two techniques is the choice of the predictor subset size at each node to be split. Indeed, if Random Forest is built using all the available predictors, then it amounts simply to Bagging.

As for the importance of robustness with respect to the input and output variables in non-parametric efficiency measurement, since the beginning of DEA and FDH, researchers have been aware that the selection of the inputs and outputs to be considered in the efficiency analysis is one of the crucial topics of model specification. In practice, the prior expertise of researchers may lead to the selection of some inputs and outputs considered as essential to represent the underlying technology. However, there can be other variables whose inclusion in the model the analyst is not always sure about (Pastor et al., 2002). This situation has been addressed in different ways in the literature. On the one hand, some approaches focus on proposing a mechanism for selecting the suitable inputs and outputs from a set of variables. The main idea is to balance the experience of the researchers with the information provided by the observations (see, for example, Banker, 1993, 1996; Pastor et al., 2002). On the other hand, a different method is based on determining efficiency scores which are robust against the selection of the variables. In this line, we can refer to a recent paper by Landete et al. (2017), where the authors systematically consider all the possible scenarios associated with all of the specifications of inputs and outputs that could be defined from the original set of inputs and outputs. The inclusion of an input/output in the set of selected variables is modeled through the probability of that variable being considered in the DEA model. In Landete et al. (2017), the robust efficiency score for a given unit is then defined as the expected value of that random variable. The consideration of all combinations of inputs/outputs gives rise to an exponential number of problems that must be solved.

As far as we are aware, this paper is the first to focus on introducing a methodology for determining super-efficiency that is robust against data resampling and, at the same time, against the specification of input variables. In our approach, this double robustness will be achieved by adapting Random Forest. In contrast, Bagging shares the same set of predictors for fitting each individual tree from each bootstrap subsample, which could lead to the averaging process linked to Bagging not being as effective as desired. Geman et al. (1992) were the first to show how the (expected) generalization error of a learning algorithm can be decomposed into pure noise, bias and variance of the predictor. As Breiman (2001) pointed out, a sensible approach for decreasing the generalization error consists in driving down the prediction variance, provided the respective bias can be kept fixed or, at least, not augmented excessively. In particular, ensemble methods, such as Random Forest, are a way to do just that. The idea behind RF is to add randomization into the learning mechanism to produce different models from the original sample ℵ and then aggregate the estimations of the models to provide a final prediction associated with the ensemble (Louppe, 2014; Louppe & Geurts, 2012).
As Louppe (2014) shows, when the number of elements in the ensemble grows arbitrarily large, the variance of the predictor tends to the product of the correlation between the predictions of any two of the randomized elements of the ensemble and the prediction variance of any randomized element, under the hypothesis of independence and identical distribution of the random variables that control the randomness of the learning algorithm. Moreover, the noise and the bias terms in the generalization error decomposition remain constant and identical to the noise and bias of any of the individual models. Therefore, the key in Random Forest is to resort to random perturbations that decorrelate the predictions of the individual models as far as possible since, in this way, the correlation between the predictions of any two randomized elements of the ensemble will take a value as low as possible, which means that the variance of the ensemble will be strictly less than the variance of any individual model. See also Amit et al. (1997), Cutler & Zhao (2001), Dietterich & Kong (1995), Geurts et al. (2006), Ho (1998), Rodriguez et al. (2006), Breiman (1994), and Kwok & Carter (1990), where different ensembles and decision trees have been proposed.
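A standard way of writing this result (assuming, as in Louppe, 2014, that the p randomized trees are identically distributed at a point $\mathbf{x}$, with prediction variance $\sigma^2(\mathbf{x})$ and pairwise correlation $\rho(\mathbf{x})$; $\hat{f}_q$ is notation introduced here for the prediction of the q-th tree) is

$$\operatorname{Var}\!\left( \frac{1}{p} \sum_{q=1}^{p} \hat{f}_q(\mathbf{x}) \right) = \rho(\mathbf{x})\,\sigma^2(\mathbf{x}) + \frac{1 - \rho(\mathbf{x})}{p}\,\sigma^2(\mathbf{x}) \;\xrightarrow{\; p \to \infty \;}\; \rho(\mathbf{x})\,\sigma^2(\mathbf{x}),$$

so any decorrelation achieved by the double randomization translates directly into a lower ensemble variance.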
Traditional Random Forest belongs to the category of supervised machine learning techniques, so researchers need to specify which are the dependent (target) variables and which are the independent variables. In our production context, the dependent variables are the outputs produced, while the independent variables are the inputs consumed in the production process.
Next, we introduce the algorithm associated with the adaptation of the Random Forest technique ( Breiman, 2001 ) to the world of super-efficiency assessment, called here RF + EAT.
At the first stage, select the number of trees that will make up the forest: the parameter p. Then, generate p (bootstrap) random samples from the original data sample with replacement. Apply the EAT algorithm to each subsample, without pruning (but using the stopping rule n (t) ≤ n min ) . In this algorithm, n min is treated as a parameter that could be tuned, as happens with p (see Flowchart 1 ).
The following flowchart illustrates the new algorithm.
During the execution of the EAT algorithm, a subset of input variables is randomly selected from the original set every time the splitting subroutine is invoked. To do that, one of the following five rules of thumb is used to fix the number mtry of inputs drawn at each node t (all values rounded down): (1) Breiman's Rule: mtry = m/3; (2) Rule DEA1: mtry = n(t)/2 − s; (3) Rule DEA2: mtry = n(t)/3 − s; (4) Rule DEA3: mtry = n(t)/(2·s); (5) Rule DEA4: mtry = min{n(t)/s, n(t)/3 − s}. Rule 1 was suggested by Breiman (2001) and is the usual value utilized in standard Random Forest for the regression problem: the number of randomly selected inputs must be a third of the total number of inputs. As usual, the value obtained is rounded down, and we apply the same for all the remaining rules. Note also that this formula does not depend on n(t), the sample size of the (parent) node to be split. In contrast, we suggest four other rules that do depend on n(t), all of which come from the literature on establishing a good relationship between the sample size and the number of variables in DEA. Indeed, the literature indicates some empirical rules regarding the number of DMUs versus the number of inputs and outputs. Rule DEA1 comes from Homburg (2001) and Golany & Roll (1989), who established that the number of DMUs must be at least twice the number of variables. Rule DEA2 was derived from the papers by Friedman & Sinuany-Stern (1998), Nunamaker (1985), Raab & Lichty (2002) and Banker et al. (1989), who suggested that the number of observations must be at least three times the number of variables. On the other hand, Dyson et al. (2001) pointed out that the number of units should be at least twice the product of the number of inputs and the number of outputs, which generated our Rule DEA3. Finally, another empirical rule of thumb is that by Cooper et al. (2007), who stated n ≥ max{m · s, 3 · (m + s)}, which yielded our Rule DEA4. Obviously, Rules 2-5 are new and have not been used previously in the literature on Random Forest in statistical learning.
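A small computational sketch of the five rules (written by us for illustration; the clipping of the value to the interval [1, m] is an implementation detail added here, and the numbers in the example are arbitrary):

```python
import math

def mtry_rules(m, s, n_t):
    """Candidate mtry values at a node with n_t observations, m inputs and s outputs.

    Rule 1 is Breiman's default for regression; Rules DEA1-DEA4 result from solving the
    DEA sample-size guidelines for the number of inputs, as explained in the text.
    Values are floored and clipped to [1, m].
    """
    clip = lambda v: max(1, min(m, math.floor(v)))
    return {
        "Breiman": clip(m / 3),                   # mtry = m / 3
        "DEA1": clip(n_t / 2 - s),                # from n >= 2 (m + s)
        "DEA2": clip(n_t / 3 - s),                # from n >= 3 (m + s)
        "DEA3": clip(n_t / (2 * s)),              # from n >= 2 m s
        "DEA4": clip(min(n_t / s, n_t / 3 - s)),  # from n >= max{m s, 3 (m + s)}
    }

print(mtry_rules(m=9, s=1, n_t=12))
# {'Breiman': 3, 'DEA1': 5, 'DEA2': 3, 'DEA3': 6, 'DEA4': 3}
```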
In a generic way, each rule comes from a relationship, previously published in the DEA literature, between the number of units to be evaluated and the number of inputs and outputs of the problem for an adequate analysis of the data. For example, in the case of Rule DEA1, the following relationship is used as a starting point (Golany & Roll, 1989, and Homburg, 2001): n ≥ 2 · (m + s). In words, the number of observations should be at least twice the number of inputs and outputs. In this way, at node t, where the number of units to evaluate is denoted by n(t), we can make use of the previous relation with n substituted by n(t) and m substituted by mtry, and solve for mtry, getting mtry ≤ n(t)/2 − s. Finally, we take the equality of both expressions to define the corresponding rule. Once we have executed Algorithm 2, we get p fitted trees to be aggregated in order to obtain a value for the output given an input vector x ∈ R^m_+. So, we have T_EAT(ℵ_1), ..., T_EAT(ℵ_p) tree structures derived from the application of the EAT algorithm on the p bootstrap subsamples ℵ_1, ..., ℵ_p. Given x ∈ R^m_+, the output level associated with x in each individual tree is given by the corresponding EAT estimator, d_{T_EAT(ℵ_q)}(x), q = 1, ..., p.

As usual in Random Forest, a final value for the output given a vector $\mathbf{x} \in \mathbb{R}^m_+$ is determined by averaging over the individual trees:

$$d_{RF+EAT}(\mathbf{x}) = \frac{1}{p} \sum_{q=1}^{p} d_{T_{EAT}(\aleph_q)}(\mathbf{x}). \qquad (9)$$

By (9), we can define an input-output set derived from the Random Forest technique, which will be used as the reference set to measure super-efficiency, as follows:

$$\hat{\Psi}_{RF+EAT} = \{ (\mathbf{x}, \mathbf{y}) \in \mathbb{R}^{m+s}_+ : \mathbf{y} \leq d_{RF+EAT}(\mathbf{x}) \}.$$

Invoking certain results in Esteve et al. (2020), it is not hard to prove that $\hat{\Psi}_{RF+EAT}$ satisfies the classical property of free disposability in production theory.

Proposition 1. $\hat{\Psi}_{RF+EAT}$ satisfies free disposability.
Proof. Following Definition 1, let $(\mathbf{x}', \mathbf{y}') \in \mathbb{R}^{m+s}_+$ and let $(\mathbf{x}, \mathbf{y}) \in \hat{\Psi}_{RF+EAT}$ be such that $\mathbf{x}' \geq \mathbf{x}$ and $\mathbf{y}' \leq \mathbf{y}$. Free disposability is satisfied if $(\mathbf{x}', \mathbf{y}') \in \hat{\Psi}_{RF+EAT}$. We know that the technology induced by the application of the EAT algorithm on each bootstrap subsample satisfies free disposability, i.e., each estimator $d_{T_{EAT}(\aleph_q)}(\cdot)$ is non-decreasing (Esteve et al., 2020). In this way, by (9), we have that $d_{RF+EAT}(\mathbf{x}') \geq d_{RF+EAT}(\mathbf{x}) \geq \mathbf{y} \geq \mathbf{y}'$, and therefore $(\mathbf{x}', \mathbf{y}') \in \hat{\Psi}_{RF+EAT}$. □

However, the adaptation of the standard Random Forest technique violates the property of minimal extrapolation, which is satisfied by standard FDH and is responsible for its classical problem of overfitting. Additionally, by its nature, EAT generates an estimator of the production frontier with a staggered shape and, consequently, the production possibility set induced from EAT does not meet convexity. The Random Forest adaptation introduced in this paper inherits the same feature.
Regarding the determination of super-efficiency, notice that the key of Random Forest is that individual trees are fitted to bootstrapped subsets of the observations. This means that, on average, each fitted tree makes use of around two-thirds of the total units (see, for example, James et al., 2013), the remaining one-third of the observations not being used. This random exclusion of units has consequences on the location of some observations with respect to the set $\hat{\Psi}_{RF+EAT}$. Indeed, some units will be located outside this input-output set, and the measurement of the 'distance' from them to $\hat{\Psi}_{RF+EAT}$ will give rise to a super-efficiency score. In particular, following the output-oriented radial model, the super-efficiency score under the new approach may be defined by plugging $\hat{\Psi}_{RF+EAT}$ into expression (2) in place of $\Psi$, i.e., $\varphi_{RF+EAT}(\mathbf{x}_k, \mathbf{y}_k) = \sup \{ \varphi > 0 : (\mathbf{x}_k, \varphi \mathbf{y}_k) \in \hat{\Psi}_{RF+EAT} \}$.

The following result states how to calculate the value of φ RF + EAT ( x k , y k ) .
Proposition 2. Let $\mathbf{x}_k \in \mathbb{R}^m_+$ and $\mathbf{y}_k \in \mathbb{R}^s_+ \setminus \{\mathbf{0}_s\}$. Then $\varphi_{RF+EAT}(\mathbf{x}_k, \mathbf{y}_k) = \min_{r = 1, \ldots, s} \dfrac{d_{r, RF+EAT}(\mathbf{x}_k)}{y_{kr}}$.

Proof. The result follows from the facts that (a) the set $\{ \mathbf{y} \in \mathbb{R}^s_+ : \mathbf{y} \leq d_{RF+EAT}(\mathbf{x}_k) \}$ is a polytope and (b) if a linear function is bounded from above on a polytope, then it achieves its supremum on the polytope (see Mangasarian, 1994, p. 130). □

Proposition 3. If $DMU_k$ dominates $DMU_l$, then $\varphi_{RF+EAT}(\mathbf{x}_k, \mathbf{y}_k) \leq \varphi_{RF+EAT}(\mathbf{x}_l, \mathbf{y}_l)$.
Proof. If $\mathbf{x}_k \leq \mathbf{x}_l$, then, by Theorem 1 in Esteve et al. (2020), we have that $d_{T_{EAT}(\aleph_q)}(\mathbf{x}_k) \leq d_{T_{EAT}(\aleph_q)}(\mathbf{x}_l)$ for all $q = 1, \ldots, p$ and, by (9), $d_{RF+EAT}(\mathbf{x}_k) \leq d_{RF+EAT}(\mathbf{x}_l)$. Since, additionally, $\mathbf{y}_k \geq \mathbf{y}_l$, Proposition 2 yields $\varphi_{RF+EAT}(\mathbf{x}_k, \mathbf{y}_k) \leq \varphi_{RF+EAT}(\mathbf{x}_l, \mathbf{y}_l)$. □

As for the typical infeasibility problems that occur when the standard super-efficiency model is applied (see, for example, Chen, 2005, and Lee et al., 2011), one of the advantages of the new approach is that it avoids infeasibility. By Proposition 2, $\varphi_{RF+EAT}(\mathbf{x}_k, \mathbf{y}_k) = \min_{r = 1, \ldots, s} d_{r, RF+EAT}(\mathbf{x}_k) / y_{kr}$, where $d_{r, RF+EAT}(\mathbf{x}_k) = \frac{1}{p} \sum_{q=1}^{p} d_{r, T_{EAT}(\aleph_q)}(\mathbf{x}_k)$ for all $r = 1, \ldots, s$ (by (9)). And, given that $d_{r, T_{EAT}(\aleph_q)}(\mathbf{x})$ is always well-defined from the data and the EAT algorithm (Esteve et al., 2020), we have that $\varphi_{RF+EAT}(\mathbf{x}_k, \mathbf{y}_k)$ always takes values in the interval $[0, +\infty)$. We will illustrate this interesting property by means of several numerical examples throughout the paper.
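Given the closed form above, the score of a unit only requires the per-tree output estimates at its input vector. The following sketch (our own illustration; the toy numbers are arbitrary) aggregates the tree predictions as in (9) and returns the radial score:

```python
import numpy as np

def rf_eat_super_efficiency(tree_preds_at_xk, y_k):
    """Radial output score of unit k against the RF + EAT reference set.

    tree_preds_at_xk: (p, s) array, the output estimate of each of the p EAT trees at x_k.
    y_k:              (s,) observed outputs of unit k (assumed strictly positive).
    """
    d_rf_eat = tree_preds_at_xk.mean(axis=0)   # aggregation over the ensemble, as in (9)
    return float(np.min(d_rf_eat / y_k))       # < 1 signals super-efficiency, never infeasible

# toy example: p = 4 trees, s = 2 outputs
preds = np.array([[5.0, 3.0], [4.5, 3.2], [5.5, 2.8], [5.0, 3.0]])
print(rf_eat_super_efficiency(preds, np.array([6.0, 2.5])))   # approx. 0.83 -> super-efficient
```

Because the averaged estimator is always well-defined, the ratio is always finite, which is the infeasibility-free behaviour discussed above.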
Next, we show the results of a computational experiment with the aim of comparing the standard super-efficiency model under FDH and RF + EAT. Accordingly, we resort to data simulation. In particular, a mono-output Cobb-Douglas function is used, based on different numbers of inputs (6, 9, 12 and 15) and different sample sizes (50, 75 and 100 units). A uniform distribution between one and ten was utilized for the simulation of the input data, while a truncated normal distribution with a zero mean and a standard deviation equal to 0.4 was used for the simulation of the technical inefficiency term. We perform 100 computational instances for each combination of number of inputs and sample size. We use the standard deviation of the super-efficiency scores to assess the results, since it gives an idea of the degree of discrimination between units that each method possesses. Additionally, the two parameters to be tuned in the RF + EAT algorithm are n_min and p. Regarding n_min, we checked different values, always greater than or equal to 5. As for the parameter p, we tried p = 500 and p = 1000 trees, obtaining similar results. Table 1 shows the main results of the simulations. The number of observations and the number of inputs appear in the first and second columns of the table, respectively. The next three columns show the standard deviation of the scores determined by the FDH technique, the super-efficiency FDH model and the RF + EAT approach. Moreover, we have also calculated the percentage of units that present infeasibility problems, showing the corresponding figures in the last three columns of the table. Additionally, seeking simplicity, we exclusively show the results derived from the application of Breiman's rule for defining mtry. We also checked the other four rules, obtaining similar figures (Fig. 5).
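A sketch of this data generating process (the exact Cobb-Douglas coefficients are not reported in the text, so the equal exponents used here are an assumption for illustration):

```python
import numpy as np

def simulate_cobb_douglas(n=50, m=6, sigma_u=0.4, seed=0):
    """Mono-output Cobb-Douglas DGP in the spirit of the experiment described above.

    Inputs are Uniform(1, 10); the inefficiency term u is a zero-mean normal truncated
    at zero (half-normal) with scale sigma_u; the observed output is the frontier output
    shrunk by exp(-u). The equal exponents summing to one are our own assumption.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(1.0, 10.0, size=(n, m))
    y_frontier = np.prod(X ** (1.0 / m), axis=1)    # Cobb-Douglas frontier output
    u = np.abs(rng.normal(0.0, sigma_u, size=n))    # technical inefficiency
    y = y_frontier * np.exp(-u)                     # observed (inefficient) output
    return X, y, y_frontier

X, y, y_front = simulate_cobb_douglas(n=100, m=9)
print(X.shape, np.round((y_front / y)[:3], 3))      # ratios >= 1: radial output slack
```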

From Table 1, we identify that the standard super-efficiency FDH model presents a higher discriminatory power between units (a higher standard deviation) than the classical FDH model, as expected. At the same time, the new method (RF + EAT) outperforms the standard super-efficiency FDH model, presenting the highest standard deviation values in all the scenarios considered. Additionally, given a certain number of observations, the new technique is not strongly affected by an increase in the number of inputs, showing similar values of standard deviation. This is something that does not happen in the case of the FDH and super-efficiency FDH models, where the discriminatory power decreases as the number of inputs increases. Moreover, the super-efficiency FDH model presents infeasibility problems, which become more intense as the number of inputs increases, regardless of the number of observations considered in the analysis. Fortunately, RF + EAT does not present this type of drawback.
However, a weakness of RF + EAT in comparison with the super-efficiency FDH model is the computing time required. The simulation experiment was performed on a PC with a 1.8 GHz dual-core Intel Core i7 processor and 8GB of RAM. For a numerical example with m = 9 and 100 observations, the super-efficiency FDH model needed 1.29 s to obtain the final model, whereas the RF + EAT technique, with five hundred trees (p = 500), required 118.34 s. Finally, we show the corresponding MSE (mean squared error) curves and contour graphs for some of the above simulations (see Fig. 4). The MSE curves show how the error of the model converges with respect to the parameter p.

Curse of dimensionality
Any practitioner might believe that as the number of predictors used to fit a model increases, the accuracy of the fitted model will increase as well. However, this is not necessarily the case. In general terms, adding variables that are highly correlated with the target variable will improve the quality of the fitted model by reducing the generalization error. However, including variables that are not truly associated with the response will lead to a deterioration in the fitted model and an increase in the generalization error. The reason is that adding 'noise' variables increases the dimensionality of the problem, aggravating the risk of overfitting (since noise variables may appear as relevant in the model merely by chance, given the structure of the values of the response variable and the predictors in the training data) without any potential benefit in terms of a real reduction of the generalization error. Thus, handling many variables may lead to improved models if the predictors are actually relevant to the problem, but will lead to worse results otherwise (James et al., 2013). In particular, under Data Envelopment Analysis or Free Disposal Hull, if the ratio of the sample size to the number of variables (inputs and outputs) is low or moderately low, the standard efficiency models may yield a lot of technically efficient DMUs. This fact is very usual when FDH is applied. This absence of discrimination is known in the literature as the 'curse of dimensionality'. Although the super-efficiency model may be a possible solution for this problem, in this section we will show that RF + EAT could be a better solution to overcome this type of weakness.
The curse of dimensionality has attracted the attention of many researchers over the last few years in non-parametric efficiency analysis since, in numerous empirical contributions, a lack of differentiation between the assessed units is observed in the results. It is usually caused by an excessive number of variables with respect to the total number of observations. In the literature, some authors have opted to apply Principal Component Analysis (PCA) in different ways to improve discriminatory power in DEA (see, for example, Adler & Berechman, 2001; Adler & Golany, 2002; Araujo et al., 2014; Ueda & Hoshiai, 1997). One possibility is reducing the number of inputs and outputs into factors by PCA. If most of the variance in the data may be attributed to the first few factors, then the original inputs and outputs in DEA may be replaced by these factors without much loss of information. Another possibility is using PCA as a means of weighting input and output variables and aggregating them. Other researchers prefer techniques associated with the direct reduction of variables rather than the aggregation of information by PCA. Probably the first contribution in this line was Pastor et al. (2002), who focused on analyzing the marginal role of a given 'candidate' variable with respect to the efficiency measured by means of a DEA model. These authors developed a specific statistical test that allows the significance of the candidate's observed efficiency contribution to be evaluated. Ultimately, this technique can give appropriate insight to decide whether to incorporate or delete a variable from the corresponding DEA model. Other related contributions are Ruggiero (2005) and Nataraja and Johnson (2011). More recently, Charles et al. (2019) have introduced a new approach for increasing the discriminatory power of DEA. See also Shen et al. (2016). Moreover, some empirical guidelines with respect to the number of variables and the number of observations have been introduced in the literature and are nowadays well known; for example, Golany and Roll (1989), Nunamaker (1985), Banker et al. (1989), Friedman & Sinuany-Stern (1998), Homburg (2001), Dyson et al. (2001) and Raab & Lichty (2002). Our rules of thumb 2-5 were defined from these last relationships between the sample size and the number of inputs and outputs.
In this section, we illustrate the usefulness of the RF + EAT approach for dealing with the curse of dimensionality through a database taken from the literature (see Sarkis, 2000). From the results (see Table 2), we can conclude that the discriminatory power of the new technique, RF + EAT, is clearly better than that of FDH. An extreme case can be seen in Table 2, where FDH identifies all the units as technically efficient (score equal to one). In contrast, in the case of the RF + EAT technique, up to twelve DMUs are identified as technically inefficient (a score greater than one), depending on the rule used for determining the parameter mtry in Algorithm 2. Additionally, the results obtained by applying the five rules of thumb for randomly selecting a subset of input variables show, in general, similar patterns. Furthermore, it is worth mentioning that, in the case of the super-efficiency FDH model, more than half of the observations present problems of indetermination of their score due to infeasibilities. This fact makes it impossible to establish a ranking of all the units in the data sample. In contrast, the RF + EAT technique identifies super-efficiency values for all the observations. Finally, we compare our method with the approaches by Ruggiero (2005) and Nataraja & Johnson (2011), once these techniques have been adapted to work under FDH. In this numerical example, the new method based on Random Forest seems to have a higher discriminatory power between the evaluated units compared to the other previously published techniques (see Table 2, last columns). However, the conclusions obtained from the numerical results in Table 2 should be interpreted with caution, since the results obtained with each of the methodologies are not directly comparable and, furthermore, they are based exclusively on a single empirical example taken from the literature, so a more detailed and exhaustive analysis is left for the future.

Evaluating the importance of the input variables
In this section, we introduce an input importance measure as determined from the RF + EAT method. We first present how the adaptation of Random Forest to the context of production theory may be utilized to evaluate the relevance of input variables. We then illustrate the use of this importance measure through an empirical example.
A relevant mission in machine learning is the estimation of one or several response variables based upon a set of predictor variables. However, another main objective is to recognize which predictors are the most relevant in the study, mainly with the objective of understanding the underlying process that generated the data (Louppe et al., 2013). In this respect, standard Random Forest is a technique that allows such a measure to be calculated. In our context, we will again adapt this measure to gauge the relevance of inputs in a production context. The measure will allow us to calculate a ranking of the relevance of the inputs.
In some respects, calculating a measure of the importance of inputs is linked to the determination of the statistical significance of these variables in the field of technical efficiency measurement based on parametric techniques, where certain probability distributions are considered and allow the calculation of p-values and confidence intervals. In the non-parametric world, as in machine learning, it is usual not to assume any particular probability distribution on the generation of the data, which consequently leads to the impossibility of determining the typical sampling error measure in Statistics. Instead, trying to mimic the parametric world, machine learning techniques may alternatively yield a ranking of the importance of the predictors (inputs in our production context) for estimating the response variables (the outputs).
In Classification and Regression Trees, a measure of the importance of predictor variables already exists. It is based upon the notion of surrogate splits for a certain variable. Surrogate splits were introduced to account for hidden relationships between the response variable and the predictors. For instance, it can occur that, in building a tree, some variable x_j is never selected at any node as the splitting variable because it yields a slightly worse MSE in comparison with another predictor x_j'. However, if x_j' is removed and a new tree is grown, then x_j can appear as the most relevant predictor for some nodes, yielding a tree structure no worse than the original tree regarding accuracy. The measure of importance introduced by Breiman et al. (1984) does take these situations into account through surrogate splits.
Next, we adapt the measure of importance usually utilized in standard Random Forest (see Section 2.3) to the production framework. Breiman (2001) introduced a way of evaluating the role of each predictor in Random Forest that is based on randomization rather than surrogate splits. In particular, it is grounded on the decrease in prediction accuracy. In the regression problem, the usual approach consists in utilizing the rise in MSE for the Out-Of-Bag data, which plays a key role in the assessment of the quality of RF. The double randomization in RF ensures that masking effects are removed.
Following Breiman (2001), we suggest assessing the importance of input $x_j$ by determining the mean increase in error of the ensemble of trees when the values of $x_j$ are randomly permuted. To do that, we will resort to the use of the OOB samples. In particular, to assess input $x_j$, we build the permuted sample $\aleph^j$ and compute the OOB generalization error of the corresponding random forest of EAT trees:

$$err_{RF+EAT}(\aleph^j) = \frac{1}{n} \sum_{i=1}^{n} \sum_{r=1}^{s} \left( y_{ir} - \frac{1}{|K_i(\aleph^j)|} \sum_{q \in K_i(\aleph^j)} d_{r, T_{EAT}(\aleph^j_q)}(\mathbf{x}^j_i) \right)^2. \qquad (11)$$

In (11), $\aleph^j$ denotes the learning sample in which the values of input $x_j$ have been randomly permuted (with $\aleph^j_q$, $q = 1, \ldots, p$, its bootstrap subsamples) and, as before, $K_i(\aleph^j)$ represents the set of indices among the $p$ trees fitted by EAT such that the corresponding subsamples do not include $(\mathbf{x}_i, \mathbf{y}_i)$. Additionally, in a similar way, one needs to determine the generalization error of the original random forest, $err_{RF+EAT}(\aleph)$. Finally, it is possible to calculate the percentage by which the generalization error of the model increases when input $x_j$ is randomly permuted:

$$\%IncErr(x_j) = 100 \cdot \frac{err_{RF+EAT}(\aleph^j) - err_{RF+EAT}(\aleph)}{err_{RF+EAT}(\aleph)}. \qquad (12)$$

Next, we are going to illustrate how this approach works through an empirical example. In particular, we base our illustration on a database of 102 warehouses operating in the Benelux area in 2017 (see Kaps & de Koster, 2019). Following Balk, de Koster et al. (2021), we use three inputs and four outputs. The inputs are: warehouse size in m² (floor space); the number of full-time equivalent employees (FTEs); and the number of stock keeping units (SKUs). The outputs are: the number of order lines (order lines shipped per day); the error-free order line percentage (error free %); order flexibility (per day); and the number of special processes (handled per day). See Aparicio & Zofío (2021) for a description of the data sample.
Regarding the importance of variables when the determination of super-efficiency is the concern, the RF + EAT approach is able to yield a ranking as discussed above. Fig. 6 graphically illustrates the importance value of each input, calculated from expression (12). According to the results, floor space and the number of stock keeping units are very similar regarding their level of importance, with the number of full-time equivalent employees being the most important input variable. In particular, the generalization error increases by more than 80% when this last input is randomly permuted, which, in some respects, is equivalent to removing it from the model.

Empirical illustration
In this section, we illustrate the results of the new approach by resorting to a data set from the banking sector. Throughout the analysis, we check whether the efficiency scores obtained through FDH and super-efficiency FDH differ from the results obtained through the application of the random forest-based model. With this objective, we utilize a database of 31 Taiwanese banks in the year 2010 (Juo et al., 2015), whom we thank for sharing the data with us. The production technology corresponds to the so-called intermediation approach, by which financial institutions, through labor and capital, collect deposits from savers to produce loans and other earning assets for borrowers. The production technology is represented by three inputs and two outputs. The inputs are FINANCIAL FUNDS (x_1), LABOR (x_2) and CAPITAL (x_3). The outputs are INVESTMENTS (y_1) and LOANS (y_2). Although it is common to use subsamples to validate the estimation results of a model, in the context of DEA and FDH it is usual to work with databases where the number of records is relatively low compared to the number of variables, as is the case in this empirical example. For this reason, it is not feasible in practice to make use of validation subsamples in the production context studied here, except when the number of DMUs (database records) is high compared to the number of variables of the problem, which is not very common in DEA and FDH.

Table 3 shows the results obtained when applying the standard FDH and super-efficiency FDH methods and the new methodology presented in this article. For simplicity, we exclusively use the first of the DEA rules to select the number of inputs that must be considered when performing the split procedure in each node of the trees that make up the Random Forest. As can be seen, the standard FDH technique identifies 30 of the 31 banks considered in the analysis as technically efficient. Only bank number 16 is identified as technically inefficient, indicating that its two outputs could be increased at the same time by approximately 27%. In contrast, the new Random Forest-based methodology identifies 20 of the 31 banks as technically inefficient. The rest of the banks are identified as super-efficient and, in particular, the efficiency score determined by the new method generates a ranking of units. Thus, bank 26 is identified as the most super-efficient unit among those analyzed under the RF + EAT technique (the same happens in the case of the standard super-efficiency FDH model). This type of ranking could be useful from a benchmarking point of view, since it allows identifying which units are the most interesting to study in detail and from which to learn their internal functioning and the management practices applied in the production process. Finally, a result that seems completely coherent to us is that bank number 16, which was the only one identified as technically inefficient by the standard methodology (FDH), obtains the highest score (1.577) of all those determined with the new methodology. Notice that, in this example, the standard super-efficiency FDH model again suffers from the infeasibility problem (see DMU #1).

Conclusions and future works
Random Forest (RF) is an effective machine learning technique for estimating one or several response variables from a set of predictors. Injecting randomness into the learning algorithm, regarding both data and predictors, produces accurate estimators (Breiman, 2001). In comparison with CART (Classification and Regression Trees), where an individual tree is grown, RF aggregates hundreds of randomized trees through ensembles, reducing the generalization error by decreasing the variance term while keeping the bias more or less constant. In particular, RF overcomes certain drawbacks of individual decision trees: (1) individual trees generally do not present a high level of predictive accuracy, and (2) trees may be very non-robust, i.e., a small change in the data can cause a large change in the final structure of the fitted tree. RF demonstrates that combining randomized trees achieves a better performance than individual trees (James et al., 2013).
In the context of production frontier estimation, Esteve et al. (2020) recently introduced Efficiency Analysis Trees (EAT), an adaptation of CART to estimate upper enveloping surfaces of the data cloud that, at the same time, yields production possibility sets that meet the free disposability postulate in microeconomics. This new approach competes against the standard Free Disposal Hull (FDH) given the natural shape of the estimator, which has a staggered form like FDH. However, although EAT does not suffer from the overfitting problem exhibited by FDH, it can inherit the weaknesses typical of decision trees: namely, room for improvement in its generalization error and a lack of robustness.
In this paper, we adapted the Random Forest technique to the production frontier estimation field by adjusting the standard RF algorithm, based on CART, to deal with EAT. To do that, certain new rules for selecting the parameter associated with the number of input variables to be considered in each split were introduced, based upon well-known rules of thumb from the Data Envelopment Analysis literature. Our analysis showed that this adaptation of RF (with acronym RF + EAT) can be useful to determine the super-efficiency score of a set of production units and, accordingly, rank the observations. In the case of RF + EAT, and as happens with standard super-efficiency models, the output-oriented radial measure can take values less than one for some units. In our approach, this result arises because the frontier associated with the input-output set derived from the new method may not envelop all the observations from above. The process of bagging, which is part of RF + EAT, implies that many individual models (trees) are fitted using different data samples, randomly selected from the original dataset. Consequently, each DMU only appears in a certain number of these random samples and the individual frontiers may not always envelop all the units. Hence, the final aggregation process applied in RF + EAT can yield a score below one, especially in the case of units that show a pattern related to good performance.
Moreover, in this paper, we compared the standard super-efficiency FDH model with the new approach, showing that the random forest-based technique has greater discriminatory power and avoids infeasibilities. These facts support the validity of the new approach (RF + EAT) for determining the super-efficiency of a group of firms in a production context with multiple inputs and multiple outputs. Additionally, we showed, through a numerical instance taken from the literature, that RF + EAT is a possible solution to the curse of dimensionality problem that affects standard FDH.
Finally, we point out some possible avenues for further research. The first research line could be the extension of the RF + EAT approach to deal with alternative super-efficiency measures, for instance, the directional distance function (DDF) (see Chambers et al., 1998; Ray, 2008). Another line of research, from a more computational perspective, would be tuning the parameters that control the RF + EAT algorithm. In particular, different settings could be checked in the context of the estimation of production frontiers (node size, number of trees, number of inputs sampled). A third research line could be the application of the RF + EAT technique to more real databases in different empirical frameworks. Another interesting line of research is how to adapt the new method so that it can incorporate different types of returns to scale. Nowadays, it is not clear how to adapt the estimation of the outputs in each node during the growing process of the trees in order to consider different returns to scale. Additionally, in machine learning, data mining, filtering, pre-processing and clustering algorithms, increasing the accuracy of the model and eliminating unrelated attributes and data are usual topics of interest. An analysis of these aspects in our production context could also be a relevant future research line.