Analyzing the accuracy of variable returns to scale data envelopment analysis models

The data envelopment analysis (DEA) model is extensively used to estimate efficiency, but no study has determined the DEA model that delivers the most precise estimates. To address this issue, we advance the Monte Carlo simulation-based data generation process proposed by Kohl and Brunner (2020). The developed process generates an artificial dataset using the Translog production function (instead of the commonly used Cobb-Douglas) to construct well-behaved scenarios under variable returns to scale (VRS). Using different VRS DEA models


Introduction
In order to save resources and to detect inefficient performers, efficiency evaluations are a central component of management decision-making. There are two main classes of efficiency analysis methods in the literature: parametric and non-parametric. Parametric approaches usually use the econometric ordinary least squares method, which shifts the regression towards the more efficient units to estimate the efficient frontier. This approach is primarily hampered by the assumption about the form of the production function. Contrary to this, non-parametric methods measure efficiency as the distance to an empirical frontier function whose shape is determined by the most efficient decision-making units (DMUs) of the observed dataset. This approach is, without a doubt, dominated by DEA. The lack of robustness in results and the ambiguity regarding the precision of DEA models' estimates are deemed to be the major quality-related issues. Within the DEA literature, the accuracy and quality analysis of different DEA models have become an attractive area of research over the last two decades. To evaluate the quality of DEA estimates, the first challenge is the absence of true efficiency values; DEA estimates in real applications therefore cannot be validated against such values. Researchers have applied Monte Carlo simulations to create artificial datasets based on certain assumptions and regimes (Cordero et al., 2015) to address this issue. A random distribution function cannot be directly used to derive the scale effect values that reflect the VRS property, so generating well-behaved data is a complicated task. In the following, we summarize, in the interest of brevity, the studies conducted over the last two decades on the assessment of the quality of DEA models using Monte Carlo simulations. We also discuss the main characteristics of these studies, including the production function used, the number of scenarios, the number of replications, inputs, and outputs. Cobb-Douglas (CD) production functions were most
frequently employed by previous studies in the Data Generation Process (DGP) (Holland & Lee, 2002; López, Ho & Ruiz-Torres, 2016; Resti, 2000; Ruggiero, 2005; Simar & Wilson, 2002; van Biesebroeck, 2007). The reason for this can be attributed to the complexity of imposing microeconomic regularity conditions, such as monotonicity and convexity, on the alternatives. The limitations of the CD, which imposes an input substitution elasticity of one and fixed scale economies, have been pointed out by several researchers, such as Siciliani (2006) and Perelman and Santín (2009). The Translog (transcendental logarithmic) production function has emerged as a generalization of the CD that allows the generation of more testable production data.
Most studies only use one adjustment to account for the number of inputs (López et al., 2016; Ruggiero, 2005). Generally, scenario generation has not been given sufficient attention: most studies vary three or fewer characteristics of the employed DGP. Next, previous studies have mainly focused on the properties of the basic DEA models, i.e., CCR and BCC, and on comparisons between them and (in some cases) parametric methods (Santín & Sicilia, 2017). However, evaluations of models other than the basic ones are rather scarce. So far, only about one-third of previous studies have considered alternative DEA models, and none has utilized more than one such model (Kohl & Brunner, 2020). Another concern is the robustness of the results obtained in previous studies. Since the DEA estimations rely on randomly generated data, it is unquestionable that each scenario must be replicated. In this context, Krüger (2012) criticizes the low replication rate of many studies, which ranges from 5 to 1,000. To our knowledge, the study by Kohl and Brunner (2020) represents the only attempt to date to assess the quality of DEA models by developing meaningful production scenarios using Translog production functions in a constant returns to scale (CRS) setting. The authors develop a sophisticated DGP that allows them to formulate some general statements regarding the parameters that affect the quality of DEA models by defining a set of performance indicators. Their results show that the Assurance Region (AR) and Slacks Based Measurement (SBM) models outperform the CCR model under the CRS setting. Kohl and Brunner (2020) primarily discuss the CRS setting, even though the BCC model remains widely used in most DEA applications (Kaffash, Azizi, Huang & Zhu, 2020; Kohl et al., 2019; Mahmoudi, Emrouznejad, Shetab-Boushehri & Hejazi, 2020).
Last but not least, the literature on DEA sits mostly within operations research, where DEA is viewed as a non-econometric or non-statistical approach (Banker, Natarajan & Zhang, 2019; Simar & Wilson, 2015). Thus, a DEA model constructed for assessment needs to move beyond simply explaining and predicting data in the most effective way possible. In the same way that statistical tests validate a statistical model developed to accurately reproduce the underlying data generation process, basic properties of production economics, such as economies of scale and convexity, free disposability, the engineering logic of the production structure, and the importance of identified peers to industry participants, serve to validate the model (Banker & Natarajan, 2011; Bogetoft & Otto, 2011b). By identifying conditions under which DEA estimators are statistically consistent and likelihood-maximizing, Banker (1993) provided a formal statistical basis for DEA. Accordingly, DEA estimates are capable of providing interesting insights without heavily relying on statistical testing. However, most of the literature ignores the statistical properties of the estimators and lacks consistent statistical tests to compare the efficiencies between two samples. Researchers proposing refined DEA models typically compare their improvements to the basic model and highlight properties such as a shift in the average efficiency scores or a better discrimination power. Even if a certain problem can be solved through such a development, there is no guarantee that the overall results (from a quality perspective, for example) will also improve. The main flaw here is comparing differences in DEA estimations through the mean value of the efficiency scores rather than through their distribution. In cases where the distribution of efficiency scores is skewed, the mean value becomes an ineffective measure of central tendency (Weisberg, 1992).
Several studies have compared differences in the distributions of DEA estimates for two groups of DMUs by developing statistical tests, both parametric and non-parametric. For example, Cummins, Weiss and Zi (1999) use a regression-type parametric test with a dummy variable indicating the groups, regressing the efficiency scores on the dummy variable. However, many researchers (e.g., Golany and Storbeck (1999) and Lee, Park and Choi (2009)) believe that non-parametric tests such as the Mann-Whitney and Kruskal-Wallis tests are more appropriate, since they make no assumptions about the distribution of efficiency scores. One pioneering study in this direction has been conducted by Banker, Zheng and Natarajan (2010). They develop two parametric and three non-parametric tests and compare them against the F-tests introduced by Banker (1993). They show that their tests outperform the F-tests of Banker (1993) when noise plays an important role in the data generating process; however, the F-tests of Banker (1993) remain effective if inefficiency dominates noise. In our study, we integrate the idea of comparing two groups of DMUs with the performance indicators.
The purpose of this study is to address these issues by providing a method for evaluating the accuracy of DEA models under the VRS assumption. A sophisticated DGP must be designed to create well-behaved data for the DMUs in order to study the quality of DEA models. In the next step, we generate artificial data so that the true efficiency of each DMU can be compared with the estimates obtained from the different DEA models. Through this, we are able to evaluate the DEA models' quality. We then consider a variety of scenarios to arrive at generally sound conclusions. With these characteristics, it is possible to generate meaningful data through Monte Carlo simulations. We use two aggregated benchmark values: the benchmark value (B-Value) and the benchmark rank (B-Rank). Combined with multiple performance indicators, these benchmark values cover all relevant properties of an efficiency estimator, such as identifying efficient and inefficient units and ranking the efficiency score of each unit in a set of DMUs. The B-Value and B-Rank provide additional insight into the performance of the procedure by using SBM, AR, the basic CCR DEA, and uniformly distributed random numbers (Rand). Based on our findings, we conclude that the environment of a DEA application influences its results significantly. We do this by casting doubt on the reliability of DEA results and analyzing the efficiency assessment process of the DEA model. We analyze the VRS setting, the most prevalent setting in the literature for DEA applications, and try to find out whether the predominant position of the BCC model is justified. Our study addresses the statistical properties of DEA estimators by applying a consistent statistical test to compare the estimates calculated with different DEA models against the true efficiencies. The details of our analysis will be presented in subsequent sections. In summary, this paper contributes the following to the pertinent literature: I. The main question this study seeks to answer is whether the BCC model's
dominant position is indeed vindicated. II. We employ performance indicators based on Pedraja-Chaparro et al. (1999), such as identifying the most efficient DMUs and ordering their efficiency scores within a sample. We acknowledge the need for a statistical foundation for DEA, as pointed out by Banker (1993), Banker et al. (2010), and Simar and Wilson (2015), and test the estimates of the DEA models against the actual efficiencies by running statistical tests. III. In order to improve the general validity of our results, we advance the scenario variation significantly. In our study, each generated scenario represents an arrangement of varying values for different characteristics of the DGP (e.g., the number of inputs, the number of DMUs, the importance of inputs).
With 7776 scenarios generated under the VRS setting, we attain the highest level of validity in the quality assessment of VRS DEA models in comparison to the literature. To determine whether the environment of the DEA study influences the accuracy of results, we also consider the coverage of different characteristics. By utilizing ten different characteristics with varying levels, we provide another significant contribution to the literature. IV. In its general form, the Translog, unlike the CD, is neither monotonic nor globally convex. For generating well-behaved data under the VRS setting, we need to impose the necessary curvature requirements on the Translog, which is a challenging problem (Greene, 2008). We therefore propose a mathematical model that directly enforces the monotonicity and curvature requirements and generates valid scenarios with VRS properties. Using our methodology, one can modify the input substitution in order to ensure a more sensible DGP.
According to the literature, only a handful of studies, like Krüger (2012), consider different input substitutions using Constant Ratio of Elasticity of Substitution Homothetic or Constant Elasticity of Substitution production functions. Through several adjustable parameters, the Translog production function offers greater control over setting input substitutions. Setting these parameters to generate valid scenarios (or well-behaved data), however, is a complicated process. As a result, only a few studies use it, and then only in a limited form, to generate the data: for example, Cordero et al. (2015) focus on generating data under decreasing returns to scale (DRS), and Perelman and Santín (2009) define the parameters arbitrarily. We advance the approach used by Kohl and Brunner (2020) for the CRS setting so that realistic scenarios under the VRS regime can be generated systematically. V. By decomposing the input substitution into two terms, substitutability and the distribution of substitutions, we are able to guarantee the generation of realistic and well-behaved DMUs under the VRS regime, along with a variety of scenarios. From the perspective of the robustness of the results, we find a high correlation between the number of replications needed for each scenario and the number of DMUs. A scenario with 450 DMUs may need 50 replications, while a small scenario (e.g., 50 DMUs) might need over 200 replications. We therefore define an elastic stopping condition for the replications of each scenario based on the moving standard deviation (StD) of the benchmark value. Finally, we examine the impact of the characteristics considered in the generation of the distinctive scenarios (e.g., sample size) on the quality of the estimates calculated using the different DEA models.
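The elastic stopping rule described above can be sketched as follows. The scenario function, window length, and tolerance are illustrative assumptions, not the authors' exact settings:

```python
import random
import statistics

def replicate_until_stable(run_scenario, window=30, tol=1e-3,
                           min_reps=50, max_reps=1000):
    # Replicate the scenario until the moving StD of the benchmark
    # value (over the last `window` replications) drops below `tol`,
    # with hard lower and upper bounds on the replication count.
    values = []
    for _ in range(max_reps):
        values.append(run_scenario())
        if len(values) >= max(min_reps, window):
            if statistics.stdev(values[-window:]) < tol:
                break
    return values

# Toy scenario: a benchmark value fluctuating mildly around 0.9.
random.seed(1)
reps = replicate_until_stable(lambda: 0.9 + random.gauss(0.0, 0.0005))
```

A small scenario with noisier benchmark values would simply keep replicating until the moving StD settles, reproducing the size-dependent replication counts noted above.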
The rest of this study is structured as follows. Section 2 describes in detail the steps of developing a DGP, the statistical tests, the performance indicators, and the study design. In Section 3, the results of the comparisons are presented and discussed in detail. Finally, the paper is concluded in Section 4.

Methodology
We describe all steps within the proposed framework thoroughly in the following subsections, in order to compare and analyze the accuracy of DEA models within a VRS context. Fig. 1 depicts the eight steps of the DGP for every DMU.

Performance indicators
Following the purpose of evaluating and comparing different DEA models, we utilize five performance indicators defined by Kohl and Brunner (2020) (see Appendix A), based on Pedraja-Chaparro et al. (1999), for Monte Carlo DEA analyses. The DEA estimates are the core of any judgment on quality. Therefore, in defining the performance indicators, we address the four main purposes of a DEA analysis: recognizing inefficient DMUs, ranking the efficiency of DMUs, assessing efficiencies and room for improvement, and investigating the overall efficiency of a company/organization.

Hypothesis tests for comparing efficiency
We compare the efficiency distributions of two groups of DMUs using DEA-based hypothesis tests in addition to the performance indicators. Constructing statistical tests allows us to evaluate the null hypothesis of no difference between the distributions of the true efficiency (θ) and the estimated efficiency (θ̂) obtained from the DEA models. This null hypothesis can be tested using the procedure proposed by Banker (1993). The first step of this method is to determine whether the efficiency scores are normally or exponentially distributed. The true efficiency in our DGP is normally distributed. Now suppose both θ and θ̂ are normally distributed with parameters ρ1 and ρ2, respectively. Then, the test statistic can be calculated as (Σ_j θ_j² / n) / (Σ_j θ̂_j² / n) under the null hypothesis of no difference between them (i.e., H0: ρ1 = ρ2), and compared with the critical value of the F distribution with (n, n) degrees of freedom at the significance level of 5%. Banker et al. (2010) evaluate the performance of this test against the other parametric (e.g., T-test) and non-parametric (e.g., Mann-Whitney U test) tests traditionally used in the DEA literature (Banker & Natarajan, 2011). Their simulation results indicate that this test is adequate for detecting deviations from the efficiency frontier caused by a single inefficiency term.

Data generation process under VRS setting
This paper extends the sophisticated DGP proposed by Kohl and Brunner (2020) for the CRS setting to generate well-behaved production data under the VRS setting. The DGP produces a single output (y) based on the generated meaningful inputs (x_i, i ∈ M = {1, ..., m}) and true efficiency values (θ_j) for each DMU, such that the regularity conditions are met. According to this information, the technology can be represented by the graph set T = {(x, y): x can produce y}. The hyperbolic output distance function can be introduced as the maximum equiproportionate expansion of an output vector and reduction of an input vector that places an observation within the boundary of a technology T, i.e., D_O(x, y) = inf{θ: θ > 0, (x, y/θ) ∈ T}, if the graph production possibility set satisfies the axioms described in Coelli, Prasada Rao, O'Donnell and Battese (2005). We generate the well-behaved dataset by using the (logarithmic) Translog production function presented in Eq. (1): ln y_j = α_0 + Σ_{i∈M} α_i ln x_{ij} + (1/2) Σ_{i∈M} Σ_{h∈M} β_{ih} ln x_{ij} ln x_{hj} (1). This technology has become the gold standard for Monte Carlo simulations (Bogetoft & Otto, 2011a).
where α_0 is the efficiency parameter (which can be set to 0), y is the initial output, and the parameters α_i and β_ih show, respectively, the importance of an input i and the substitution possibilities of the production process between two inputs i and h. These parameters are defined to acquire a well-behaved production function within the boundaries imposed by the inputs (x_i). We develop a seven-step DGP for each DMU under the VRS setting (depicted in Fig. 1) by ensuring adherence to the properties defined by Coelli et al. (2005) for well-behaved VRS data. In our DGP, apart from generating the parameters α and β, the true efficiency (θ), the input vector x (including the number of inputs (m), the input range, and the input correlation), and the regularity conditions (monotonicity, curvature, and quasi-convexity) are meticulously taken into consideration to generate valid scenarios.
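The DGP's core computation can be sketched as follows; the parameter values are illustrative, and the (1/2) factor reflects the standard Translog parameterisation:

```python
import numpy as np

def translog_output(x, alpha0, alpha, beta):
    # ln y = a0 + sum_i a_i ln x_i + 0.5 * sum_i sum_h b_ih ln x_i ln x_h
    lx = np.log(np.asarray(x, dtype=float))
    ln_y = alpha0 + alpha @ lx + 0.5 * (lx @ beta @ lx)
    return float(np.exp(ln_y))

# Illustrative two-input parameters: alpha sums to 1.2 (omega = 0.2),
# beta is symmetric as the VRS construction requires.
alpha = np.array([0.6, 0.6])
beta = np.array([[-0.02, 0.01],
                 [0.01, -0.02]])
y_raw = translog_output([100.0, 100.0], 0.0, alpha, beta)
```

A DMU's final output is then obtained by multiplying `y_raw` by the drawn true efficiency θ.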
The value of the true efficiency (θ) is drawn from a truncated normal distribution and then multiplied by the raw output value. We include different true efficiency distributions in our DGP as an adjustable characteristic to examine whether the true efficiency level influences the accuracy of VRS DEA models. The upper truncation point is always set at 1.0, while different lower bounds can be set to imitate diverse economies. By adjusting the mode and standard deviation (StD) of the true efficiencies, a comparable distribution shape can be preserved. We then calculate the final output ȳ by multiplying the initial output by the true efficiency value: ȳ_j = θ_j · y_j.
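Drawing the true efficiencies can be sketched with `scipy.stats.truncnorm`; the mode, StD, and lower-bound values shown correspond to the paper's "medium" level (see the study design section):

```python
import numpy as np
from scipy.stats import truncnorm

def draw_true_efficiency(n, mode=0.80, std=0.25, lower=0.40, upper=1.0,
                         seed=None):
    # truncnorm expects the bounds in standardised units.
    a = (lower - mode) / std
    b = (upper - mode) / std
    rng = np.random.default_rng(seed)
    return truncnorm.rvs(a, b, loc=mode, scale=std, size=n,
                         random_state=rng)

theta = draw_true_efficiency(1000, seed=42)
```

All draws are guaranteed to fall in [0.40, 1.0], mirroring the truncation described in the text.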
Adjusting the number of inputs, the range of inputs, and the correlation among inputs together yields the input vector x. The adjustments are generally straightforward, for example, changing the number of inputs or the parameters of the uniform distribution used for the level of inputs. A wide range of inputs indicates a more heterogeneous production environment; a small range, in contrast, suggests a very homogeneous dataset with entities of similar sizes. A correlation between the input values also seems logical, as larger entities usually use more inputs than smaller ones. The Cholesky decomposition method described in Hazewinkel (1992) accounts for this fact when generating inputs.
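A Gaussian-copula sketch of correlated uniform inputs via a Cholesky factor (the paper follows Hazewinkel (1992); the exact transformation used there may differ):

```python
import numpy as np
from scipy.stats import norm

def correlated_inputs(n, m, rho, low=100.0, high=1100.0, seed=None):
    # Equicorrelation matrix -> Cholesky factor -> correlated normals,
    # then a probability-integral transform back to U[low, high].
    rng = np.random.default_rng(seed)
    cov = np.full((m, m), rho) + (1.0 - rho) * np.eye(m)
    z = rng.standard_normal((n, m)) @ np.linalg.cholesky(cov).T
    u = norm.cdf(z)
    return low + (high - low) * u

x = correlated_inputs(500, 3, rho=0.8, seed=7)
sample_corr = np.corrcoef(x, rowvar=False)
```

Larger `rho` produces visibly correlated input columns while each marginal stays (approximately) uniform on the requested range.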
Authentic VRS production data require that the scale effects change with the size of the DMU. Therefore, an optimal size must be defined within the economically feasible region of production, at which the average product is maximized. For example, in the case of a single-input single-output production function, the average product is y_1/x_1, which graphically represents the slope of the line (ray) that passes through the origin and that point. This point is known as the point of optimal scale (of operations), where units exhibit CRS; smaller units operate under increasing returns to scale (IRS) and bigger ones under the DRS setting (Coelli, Rao & Battese, 1998). We denote units that have exactly the optimal scale of operations by x^CRS. Then, the necessary conditions of the VRS setting for returns to scale can be written as Eq. (2) by straightforward operations on Eq. (1) (Balk, 2001). The scale elasticity of DMU j for the output distance function defined in Eq. (1) (Balk, 2001) is: φ_O^j(x_j, y_j) = Σ_{i∈M} α_i + Σ_{i∈M} Σ_{h∈M} β_{ih} ln x_{hj} (2), where φ_O^j(x_j, y_j) represents the (output distance function based) scale elasticity of DMU j at the point (x_j, y_j). If this value is greater than, equal to, or lower than 1, we respectively have IRS, CRS, and DRS. The scale elasticity in Eq. (2) is decomposed into two terms: the first term represents the importance of the inputs, and the second sets the input substitutability and the substitution distribution. According to these two terms, we can define sufficient conditions for satisfying the global VRS regime that still allow the implementation of substitution effects for each DMU j: Σ_{i∈M} α_i > 1 and Σ_{i∈M} Σ_{h∈M} β_{ih} ln x_{hj} ≤ 0 must hold. For the data generation process, we want to test different optimal sizes as well as different extents of the scale effects.
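Eq. (2) can be evaluated directly. Below, the β matrix is chosen (illustratively) so that CRS occurs at x^CRS = 300, with IRS below it and DRS above it:

```python
import numpy as np

def scale_elasticity(x, alpha, beta):
    # eps = sum_i alpha_i + sum_i sum_h beta_ih * ln x_h   (Eq. (2))
    lx = np.log(np.asarray(x, dtype=float))
    return float(np.sum(alpha) + np.sum(beta @ lx))

def returns_to_scale(eps, tol=1e-9):
    if eps > 1.0 + tol:
        return "IRS"
    if eps < 1.0 - tol:
        return "DRS"
    return "CRS"

# Illustrative parameters: sum(alpha) = 1 + omega with omega = 0.2,
# and the beta entries sum to -omega / ln(x_CRS), so the second term
# of Eq. (2) equals -omega exactly at x_CRS = 300.
omega, x_crs = 0.2, 300.0
alpha = np.array([0.6, 0.6])
beta = np.full((2, 2), -omega / (4.0 * np.log(x_crs)))
```

Evaluating the elasticity at small, optimal, and large input bundles reproduces the IRS/CRS/DRS classification described in the text.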
For that reason, we can reformulate Σ_{i∈M} α_i > 1 as Σ_{i∈M} α_i = 1 + ω, ω > 0, which satisfies the first sufficient condition needed to guarantee the global VRS regime. The parameter ω can be used to adjust the extent of the scale effects: a small ω implies weak scale effects, while the reverse is true for a large value. We can implement different adjustments of the input importance by altering the values of α. We here apply two different adjustments, equal and equidistant importance. In both settings, Σ_{i∈M} α_i = 1 + ω, ω > 0 must hold to guarantee the implementation of the VRS regime. In the first adjustment (hereafter referred to as SYM), every input is identically important in the production function. This can be achieved by Eq. (3), i.e., α_i = (1 + ω)/m, ∀ i ∈ M. The definition provided in Eq. (3) for α_i fulfills the condition Σ_{i∈M} α_i > 1, as proven in Appendix B (Proposition 1).
The second setting (hereafter referred to as ASYM) generates a production function with inputs of varying yet equidistant importance (see Eq. (4)). In this adjustment, the first input (x_1) is always the one with the lowest influence on production, and the importance of the other inputs increases with their indices. Consider three inputs x_1, x_2, and x_3: since x_1 has the smallest importance (smallest index) in the production process, a one-unit increase in it leads to a smaller rise in the output level than a one-unit increase in either x_2 or x_3 does; of these, x_3 leads to the largest growth in output. Since we only consider abstract inputs that can be rearranged, this regularity does not misrepresent the results. The definition provided in Eq. (4) fulfills the condition of the VRS setting (i.e., Σ_{i∈M} α_i > 1), as proven in Appendix B (Proposition 2).
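The two importance settings can be sketched as follows. `alphas_sym` follows Eq. (3), while `alphas_asym` is one natural equidistant construction consistent with the description of Eq. (4); the paper's exact formula may differ:

```python
import numpy as np

def alphas_sym(m, omega):
    # SYM: every input equally important, sum = 1 + omega (Eq. (3)).
    return np.full(m, (1.0 + omega) / m)

def alphas_asym(m, omega):
    # ASYM (assumed form): importance proportional to the input index,
    # giving equidistant steps while still summing to 1 + omega.
    idx = np.arange(1, m + 1, dtype=float)
    return (1.0 + omega) * idx / idx.sum()

a_sym = alphas_sym(3, 0.2)
a_asym = alphas_asym(3, 0.2)
```

Both vectors sum to 1 + ω > 1, so either choice satisfies the first sufficient condition of the VRS regime.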
The second term of Eq. (2), i.e., Σ_{i∈M} Σ_{h∈M} β_{ih} ln x_{hj}, which deals with the β parameters, should be less than or equal to zero to ensure the VRS regime. β represents the substitution between two inputs and must satisfy the symmetry condition (Coelli et al., 1998). Note that the condition of linear homogeneity of degree +1 in outputs is automatically satisfied in a single-output case (Coelli et al., 1998). Having in mind Σ_{i∈M} α_i = 1 + ω, the second term of Eq. (2) must be exactly equal to −ω at x^CRS, i.e., the optimum technically efficient size, in order to achieve CRS there. This property can be fulfilled by Eq. (5), where it is assumed that the optimum technically efficient size is at the same point, x^CRS, for all inputs (i.e., x_i^CRS = x^CRS, ∀ i ∈ M): Σ_{i∈M} Σ_{h∈M} β_{ih} ln x^CRS = −ω (5). The β parameters are responsible for satisfying two main economic regularity properties: monotonicity (non-decreasing) and concavity (non-increasing marginal products) in all inputs (Coelli et al., 2005). Taking these properties into account, β cannot be set freely. We decompose β into two terms, the substitution distribution (σ_ih) and the substitutability (ν); mathematically, β_ih ∝ σ_ih · ν, ∀ i, h. This decomposition allows us to adjust the two characteristics, substitutability and substitution distribution, separately in our DGP, as well as to examine their possible effects on the accuracy of the DEA estimates. The substitution distribution (σ_ih) deals with the fact that the substitution need not be identical between all inputs and is responsible for the distribution of β. The substitutability (ν) determines the magnitude of β so as to capture varying capabilities to substitute inputs. Since the final magnitude of β should be regulated by the substitutability (ν), the substitution distribution (σ_ih) is normalized between −1 and 1.
Referring to the symmetry condition, it must hold that σ_ih = σ_hi, ∀ i, h. We can reflect the possible effects of the substitution distribution (σ_ih) by defining two different settings: equal, where the substitution between all inputs is equal (Eq. (6)); and unequal, where we advance the pattern proposed by Kohl and Brunner (2020) to generate unequal yet symmetric values for β_ih, ∀ i, h. In both the equal and the unequal setting, we need to satisfy the condition presented by Eq. (5) as well as the symmetry condition to guarantee the implementation of the VRS setting through the substitution distribution (σ_ih).
For the equal substitution distribution, all σ_ih are identical by construction; as a result, we can rewrite Eq. (5) accordingly. The definitions provided in Eq. (6) respect Eq. (5), as proven in Appendix B (Proposition 3).
Imposing the equal (identical) substitution distribution is simple and can be accomplished by setting all σ_ih to the same value; the definitions of β provided in Eq. (6) can then be rewritten as Eq. (7). For modeling the unequal substitution scenario, we develop the pattern presented by Kohl and Brunner (2020) to create symmetric but unequal values for β via the formulas presented in Eq. (8). These definitions also respect Eq. (5), as shown in Appendix B (Proposition 4).
Now we turn to the substitutability of the inputs, controlled by the parameter ν. The admissible range of substitutability differs for certain inputs.
Again, the monotonicity of the production function is the source of the substitutability conditions. For a single-output multi-input technology, monotonicity implies constraints on the partial derivatives of the distance function; these constraints can be expressed by Eq. (9). The mandatory curvature and monotonicity conditions of the production function are key characteristics of well-behaved production data (Cordero et al., 2015; Perelman & Santín, 2009).
The partial derivatives of the distance function must satisfy one condition for monotonicity: with D_O representing a single output, all marginal products (f_i) must be non-negative for all inputs (x_i), as outlined by Eq. (10). For the Translog in Eq. (1), f_i = (y/x_i)(α_i + Σ_{h∈M} β_{ih} ln x_h), so this reduces to requiring α_i + Σ_{h∈M} β_{ih} ln x_h ≥ 0 for every input i.
Curvature guarantees that all marginal products are declining, i.e., the law of diminishing marginal productivity (Coelli et al., 2005). The condition can be satisfied by fulfilling Eq. (11), which involves the second partial derivative obtained by applying the chain rule to Eq. (1): writing s_i = α_i + Σ_{h∈M} β_{ih} ln x_h, we have f_ii = (y/x_i²)(s_i² − s_i + β_ii) ≤ 0, which reduces to s_i² − s_i + β_ii ≤ 0.
For quasi-convexity in inputs, the corresponding bordered Hessian matrix F(x_i) (Eq. (12)) on the inputs needs to be evaluated, where the cross-derivatives f_ih, ∀ {i, h | i ≠ h}, together with f_i and f_ii, which have already been defined by Eqs. (10) and (11), respectively, form its entries. The isoquants are strictly quasi-convex in inputs if this bordered Hessian matrix is negative definite (Coelli et al., 2005).
F(x_i) is negative definite if the successive principal minors alternate in sign, with the (i + 1)-th principal minor defined accordingly. The expressive DGP should ensure that an increase in inputs does not lead to a decline in output despite changes in the substitutability of the inputs. This echoes the concept of free disposability of inputs found in the vast majority of DEA models. Keeping the curvature and monotonicity constraints satisfied depends critically on the magnitude of β. Therefore, we present the mathematical programming approach in Model (13) to derive the optimal value of ν that allows modifying the substitutability between the inputs. The minimum value of ν gives a nearly flat substitution curve, resulting in high substitutability, while the maximum value of ν results in low substitutability.
min / max ν (13a) The values of the first and second partial derivatives, i.e., s_i and f_ii, fluctuate with the input levels, so we cannot generally guarantee that the isoquants are strictly convex (Coelli et al., 1998). However, as explained by Coelli et al. (1998), there are areas in the input space where Eqs. (10) and (11) are satisfied. Provided that these conditions can be satisfied for every data point of any proposed Translog function, the well-behaved area may be large enough to adequately represent the corresponding production function. Note that the constraints of Model (13) change with the number of inputs, as the bordered Hessian matrix changes. The curvature and quasi-convexity inequalities (Eqs. (13c) and (13d)) are quadratic and nonlinear, respectively; these constraints make solving the optimization problem considerably more difficult. In the two-input single-output case (i = 1, 2), the model and the bordered Hessian matrix in the quasi-convexity constraint (the third constraint, i.e., (13d)) can be rewritten by considering the definitions of β_ii and β_ih provided in Eq. (6), as follows: min / max ν (14a) We reformulate the model to transform the nonlinear constraints into a minimal number of conjunctive linear constraints that admit the same feasible region as the nonlinear ones. The first quasi-convexity condition (Eq. (14f)) is fulfilled since the first principal minor, |F_1|, is always negative. For i = 2, the second principal minor |F_2| (Eq. (14g)) can be written as |F_2| = 2 f_1 f_2 f_12 − f_1² f_22 − f_2² f_11. This expression should be positive to guarantee the necessary and sufficient condition of quasi-convexity in inputs. The term −f_1² f_22 − f_2² f_11 is always non-negative by construction, since f_11 and f_22 are non-positive. Consequently, one sufficient condition to fulfill Eq. (14g) is that the term f_1 f_2 f_12 be non-negative. From Eqs.
(14b) and (14c), we know that f_1 and f_2 are non-negative. Therefore, one sufficient condition to assure quasi-convexity is f_12 ≥ 0. The imposed α and β values play the main role in the design of the scale elasticity as well as in the computation of the scale efficiency scores. A well-behaved production function can be obtained with the proposed model by imposing the desired assumptions. There is no doubt that increasing the number of inputs also increases the number of regularity conditions to which the proposed mathematical model must be subjected. Nevertheless, the procedure described for the two-input example can be adapted to cases of higher input dimensions. By scaling up the dimension of the problem, the proposed model can be used to generate well-behaved data whose construction would otherwise become cumbersome. Now that all the characteristics are adjustable, a well-behaved DMU can be generated under the VRS setting.
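A brute-force stand-in for Model (13) in the two-input case: scan candidate ν values and keep those whose implied β maintains monotonicity and curvature at every sampled input bundle. The mapping `beta_of` and all numbers are illustrative assumptions; the paper solves this as a mathematical program rather than by grid search:

```python
import numpy as np

def regularity_ok(x, alpha, beta):
    # s_i = alpha_i + (beta @ ln x)_i >= 0   (monotonicity, Eq. (10))
    # s_i^2 - s_i + beta_ii <= 0             (curvature, Eq. (11))
    lx = np.log(np.asarray(x, dtype=float))
    s = alpha + beta @ lx
    return bool(np.all(s >= 0.0) and
                np.all(s ** 2 - s + np.diag(beta) <= 0.0))

def feasible_nu(beta_of, alpha, x_grid, candidates):
    # min/max over this list bound the admissible substitutability.
    return [nu for nu in candidates
            if all(regularity_ok(x, alpha, beta_of(nu)) for x in x_grid)]

alpha = np.array([0.6, 0.6])
sigma = np.ones((2, 2))                  # equal substitution distribution
beta_of = lambda nu: -nu * sigma         # hypothetical beta(nu) mapping
grid = [(a, b) for a in (100.0, 600.0, 1100.0)
               for b in (100.0, 600.0, 1100.0)]
ok = feasible_nu(beta_of, alpha, grid, np.linspace(0.001, 0.1, 50))
```

The smallest and largest surviving candidates play the role of the min/max ν of Model (13), bounding how flat or curved the substitution surface may be.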

Study design
The characteristics used in this study are listed in Table 1 along with their values/levels. After creating one scenario as an example, the obtained dataset is assessed using four different output-oriented DEA models: CCR (Charnes et al., 1978), BCC (Banker et al., 1984), VRS AR (Pedraja-Chaparro, Salinas-Jimenez & Smith, 1999), and SBM. The number of DMUs in the generated scenarios can be modified by simply running the DGP for one DMU n times. The true efficiency score θ, as mentioned before, is drawn from the truncated normal distribution and multiplied by the raw output y_j for each DMU. Using the true efficiency distribution as a characteristic, we examine whether the level of the true efficiencies influences the accuracy of the DEA models. In the true efficiency score distributions, the upper bound is always set at 1.0, but the lower bound can be customized based on three different values: low (0.25), medium (0.40), and high (0.55). These levels reflect the reality that poor-efficiency DMUs cannot survive. Changing the modes and StDs of the true efficiencies accordingly results in similar curves; we therefore use modes of 0.75 (low), 0.80 (medium), and 0.85 (high) and StDs of 0.27 (low), 0.25 (medium), and 0.23 (high). For each DMU, the values of the m inputs are randomly selected from one of two uniform distributions: U[100; 1,100] and U[100; 10,100]. The ranges used here have been derived from the study conducted by Kohl and Brunner (2020), who compared various ranges to determine the most meaningful ones. In addition, the Cholesky decomposition is applied to impose correlation coefficients of 0.0, 0.4, and 0.8 between the raw inputs, as described in Hazewinkel (1992).
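A full factorial scenario grid can be enumerated with `itertools.product`. Several of the level choices below are hypothetical placeholders (the actual values are given in Table 1), picked so that the product of the ten characteristics matches the paper's 7776 scenarios:

```python
from itertools import product

characteristics = {
    "n_dmus": [50, 150, 450],                 # sample sizes (assumed)
    "n_inputs": [2, 3, 4],                    # assumed levels
    "input_range": ["U[100;1,100]", "U[100;10,100]"],
    "input_corr": [0.0, 0.4, 0.8],
    "true_eff_level": ["low", "medium", "high"],
    "alpha_setting": ["SYM", "ASYM"],
    "subst_distribution": ["equal", "unequal"],
    "substitutability": ["min", "max"],
    "omega": [0.2, 0.4],                      # extent of scale effects
    "x_crs": [300, 450, 600],                 # optimal size (assumed)
}
# One scenario = one combination of all characteristic levels.
scenarios = [dict(zip(characteristics, combo))
             for combo in product(*characteristics.values())]
```

Each resulting dictionary parameterises one run of the DGP, and replications of that scenario are then generated until the elastic stopping condition is met.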

Results and discussions
Our main objective is to evaluate the accuracy of four main DEA models and to determine the scale efficiency of the generated scenarios based on the defined characteristics. The results are divided into three parts. First, we make the results more accessible by introducing some numerical illustrations explaining the characteristics used for generating scenarios. Next, we present the results of our main computational study, which enable us to determine which DEA models perform best under the VRS setting and to explore the driving factors. The final section provides guidelines on how to apply DEA models in VRS settings based on our computational results.

Numerical illustrations
For the two-input single-output case, we generate the well-behaved production function based on the Translog output distance function described before. Considering the settings given in Table 2, we calculate the values of α_i, ν, and β_ih using Eq. (3), Model (14), and Eq. (6), respectively. For a given input vector (e.g., x = [100; 1,100]), the obtained values are presented in Table 3. In Appendix D, we provide the dataset generated for this instance. If we set x_i^CRS close to the minimum of our input range (100), the change of scale effects according to the size of the DMU starts at the beginning of the production function. This effect of x_i^CRS is shown in Fig. 2(a), in which we depict the production function of 1000 DMUs for the two values 300 and 600, with the same settings for the other characteristics as reported in Table 2. The effect of ω, which is responsible for adjusting the extent of scale effects, is shown in Fig. 2(b) for the two values 0.2 and 0.4. As the value of ω increases, the curvature of the production function also increases. The minimum and maximum of ν allow the adjustment of the substitutability, corresponding to high and low substitutability between inputs, respectively. Fig. 2(c) shows the effect of substitutability on the production function: the minimum value of ν produces an almost level surface without large raised areas or indentations, while the maximum value produces a curved surface.
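The Translog form underlying these illustrations can be evaluated as in the sketch below. This is a generic two-input Translog production function; the parameter names alpha0, alpha, and beta are ours, and the values used do not reproduce the paper's calibrated settings from Eq. (3), Model (14), and Eq. (6):

```python
import numpy as np

def translog_output(x, alpha0, alpha, beta):
    """Generic Translog production function for an m-input vector x:
    ln y = alpha0 + sum_i alpha_i ln x_i
           + 0.5 * sum_i sum_h beta_ih ln x_i ln x_h,
    with alpha a length-m vector and beta a symmetric m x m matrix."""
    lx = np.log(x)
    return np.exp(alpha0 + alpha @ lx + 0.5 * lx @ beta @ lx)

# With beta = 0 the Translog collapses to a Cobb-Douglas form,
# here y = x1^0.6 * x2^0.6 (illustrative parameter values)
y = translog_output(np.array([np.e, np.e]), 0.0,
                    np.array([0.6, 0.6]), np.zeros((2, 2)))
```

The second-order β terms are what bend the production surface; setting them to zero removes the curvature that distinguishes the Translog from the commonly used Cobb-Douglas.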

Results of analyzing the accuracy of DEA models
In the following sections, we discuss the results of the evaluation of the four VRS DEA models and the Rand data gathered from 7776 scenarios. In Table 4, we report the minimum (Min), maximum (Max), mean, and StD values of the performance indicators over all scenarios. In addition, boxplots depict the main descriptive statistics of the B-Values and B-Ranks for each model in Fig. 3. We use the Rand model as a lower bound for our benchmarks. The average, maximum, and minimum numbers of replications required per scenario are 111, 270, and 50, respectively. We define the stopping criterion for the replications based on the moving StD of the B-Value of the DEA models: if the moving StD of the B-Value of all four DEA models falls below 0.001, the replication terminates. There are over 434,000 replications in total, and each replication is evaluated with all four DEA models. By construction, we impose VRS technology on the DGP, so the efficiency scores calculated with DEA models under the VRS setting should be better than those calculated with CRS DEA models. To compute scale inefficiency, as well as to evaluate the potential bias associated with computing efficiency scores under CRS when the true technology is VRS, we also run the CCR model. The CCR results emphasize the importance of assuming an accurate returns to scale regime before conducting a practical DEA efficiency analysis. Consider, for instance, the mean B-Value of the CCR model, which is equal to 0.295, and that of its VRS counterpart (BCC), which is almost double, 0.574.
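The replication loop with the moving-StD stopping rule can be sketched as follows. The window size of 10 replications is our assumption; the source only states the 0.001 threshold and the observed bounds of 50 and 270 replications:

```python
import numpy as np

def run_until_stable(run_replication, min_reps=50, max_reps=270,
                     window=10, tol=1e-3):
    """Replicate one scenario until the moving StD of the B-Value is below
    tol for every model. run_replication() is assumed to return a dict
    mapping model name -> B-Value for a single replication."""
    history = []
    while len(history) < max_reps:
        history.append(run_replication())
        if len(history) < max(min_reps, window):
            continue
        recent = history[-window:]
        # moving StD over the last `window` replications, per model
        stds = {m: np.std([r[m] for r in recent]) for m in recent[0]}
        if all(s < tol for s in stds.values()):
            break
    return history

# Usage with a stand-in replication function returning constant B-Values,
# which stops as soon as the minimum replication count is reached
history = run_until_stable(lambda: {"CCR": 0.3, "BCC": 0.6,
                                    "AR": 0.9, "SBM": 0.9})
```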
A small value of the Mean Absolute Error (MAE) suggests that the estimated efficiency scores are on average close to their true counterparts; therefore, high 1 − MAE values are preferred. According to Table 4, however, the MAE cannot provide information about the deviation, because the mean value of this indicator for Rand (0.824) is very close to that of the VRS DEA models. On the basis of Fig. 3, the accuracy of the VRS DEA models can be explained as follows. In the first place, the AR and SBM models perform significantly better than the BCC model, even though the latter is the most popular model in DEA applications. BCC has a mean and StD of 0.649 and 0.201, respectively, indicating superior quality to CCR (mean of 0.574 and StD of 0.238), which is not surprising since our DGP is implemented under the VRS setting. Nevertheless, it is a clear indication of the reliability of the DGP results and provides insight into the mechanism by which it operates. In terms of the StD of the B-Values, the SBM and AR models exhibit less dispersion around their mean values than the basic CCR and BCC DEA models. In light of the high B-Values for AR and SBM, which are close to 1.0, these two models can be said to provide (nearly) accurate estimates. This result becomes even more significant considering that these values represent the average over at least 50 replications of each scenario. Conversely, an examination of the minimum B-Values sheds some light on the vulnerable performance of all four models in some scenarios; still, the SBM and AR models, with a minimum B-Value of 0.150, perform better than the basic DEA models. The B-Rank, whose best value is 1.0, agrees with the main findings indicated by the B-Value. This indicator is not only a measure of dominance at the average level of scenarios but also takes into account every performance indicator in each replication. Overall, the AR model (with a B-Rank mean and StD of 1.479 and 0.354, respectively) performs marginally better than the SBM model (mean and StD of 1.877 and 0.568) and significantly better than the basic DEA models.

Results of hypothesis tests for comparing efficiency
This section presents the results of the statistical tests evaluating the null hypothesis that there is no difference between the distribution of true efficiency and the distributions of estimated efficiency determined by the four VRS DEA models. The test statistic and critical value are calculated for each scenario, and if the test statistic is greater than the critical value, the null hypothesis is rejected. In Table 5, we report the distribution of the rejected scenarios. The value of 0.0 reported for Rand can serve as a valid indicator of the robustness of the hypothesis tests: it equals 0 because both the true efficiencies and those estimated by Rand are generated from the same distribution function. These findings also corroborate the main conclusions drawn from analyzing the performance indicators. The total number of rejected scenarios for the AR and SBM models (870 and 829, respectively) is considerably lower than for the basic DEA models. Moreover, only 10% (11%) of scenarios have efficiency scores that differ from their true efficiency as calculated by the SBM (AR) model. Examining the rejected scenarios in more detail (see Tables E4 and E5 in Appendix E), it is apparent that the majority of them have fewer DMUs and more inputs. Moreover, these results underscore the importance of selecting the right RTS: on average, the CCR DEA model fails to estimate the efficiency scores of 50% of the scenarios generated under the VRS setting. BCC, which has been widely used in the DEA literature, is unquestionably outperformed by the AR and SBM models under the VRS setting.
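As an illustrative stand-in for the DEA-based hypothesis tests used here (whose exact form is not reproduced in this sketch), a two-sample Kolmogorov-Smirnov test of the null hypothesis that the true and estimated efficiencies share the same distribution could look like this:

```python
import numpy as np
from scipy.stats import ks_2samp

def reject_scenario(theta_true, theta_hat, alpha=0.05):
    """Reject the scenario if the true and estimated efficiency scores
    appear to come from different distributions (two-sample KS test;
    this is a generic substitute, not the paper's DEA-based test)."""
    statistic, p_value = ks_2samp(theta_true, theta_hat)
    return bool(p_value < alpha)

# Hypothetical data: identical distributions are not rejected,
# a clearly shifted estimate is rejected
rng = np.random.default_rng(1)
theta_true = rng.uniform(0.55, 1.0, size=200)
```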
In addition to k = 2, we set k = 3, 4 in the AR model to study the effect of the AR weight restrictions on the quality of the efficiency estimates. The B-Values and B-Ranks obtained from the AR model with k = 3, 4 against k = 2 are presented in Fig. 4. The medians are all at the same level; therefore, it can be concluded that the AR models perform almost identically under different settings of k. However, the box plots show relatively different distributions of B-Values and B-Ranks. The total numbers of rejected scenarios for the AR models with k = 3, 4 are 944 (12.1%) and 930 (11.9%), respectively, which are still considerably lower than those obtained from the basic DEA models. The main reason for the better performance of the AR model is that the weights of inputs and outputs obtained from the basic DEA models are freely chosen. Therefore, they can become zero, which means the corresponding factors are excluded from the production possibility set. This issue is avoided in AR models by restricting the weights.

Analysis of characteristics considered in the DGP
The purpose of this section is to identify trends and patterns prompted by the ten different characteristics considered in the DGP. In Appendix E, we provide the descriptive statistics of the aggregated performance indicators and hypothesis tests according to the various values/levels defined for each characteristic. Based on the main drivers of these results, several consistent patterns emerge. Previous studies indicate that the size of the dataset, i.e., the number of DMUs and inputs, has a significant effect on the accuracy of DEA models. As reported subsequently, the results of our study confirm that using more inputs and a low number of DMUs both reduce the mean B-Values and lead to more rejected scenarios. These two characteristics, however, are not the only ones responsible for the observed differences. For example, the mean B-Value of the BCC DEA model (see Table E3 in Appendix E) is reduced by 25%, from 0.750 to 0.560, when we use 7 inputs instead of 2, and the number of rejections almost doubles, from 529 to 1134.
The lower bounds of 0.25 (low), 0.40 (medium), and 0.55 (high) for the true efficiency levels reflect the fact that units with extremely poor efficiency cannot survive in the real world. The B-Values and the numbers of rejected scenarios reported in Appendix E can be used to determine how the true efficiency levels affect the quality of the DEA models. Increasing the lower bounds of the true efficiencies causes a slight decline in the mean B-Values and a slight rise in the number of rejections for the DEA models. The quality of the DEA models thus diminishes marginally when a larger share of DMUs is allocated close to the true efficiency frontier (efficiency score of 1.0). This may be partly explained by the fact that raising the lower bound of the true efficiency narrows the range of scores, so that more DMUs move closer to the efficiency frontier. As a result, the discrimination power of the DEA models is reduced, although the negative effects are only marginal. When the importance of every input differs (ASYM), the mean B-Values of all DEA models are somewhat lower than when all inputs have equal importance in the production function. Accordingly, fewer scenarios are rejected under the SYM setting than under ASYM. Overall, however, DEA estimations are not affected significantly by input importance.
Turning to the input substitution distribution, it is evident that when input substitution is unequal, the performance of all DEA models is significantly better than when it is equal. In reality, the substitution between all inputs utilized by DMUs need not be identical. The AR and SBM DEA models are almost insensitive to variations in substitutability, whereas high input substitutability adversely affects the performance of the basic DEA models (CCR and BCC). Two further characteristics that are crucial to the form of the production function are the efficient size (x_i^CRS) and the extent of scale effects ω. In Appendix E, we demonstrate that when the efficient size is near the lower bound of the input range, i.e., 300, the performance of the VRS DEA models is marginally reduced, since the scale effect starts at the beginning of the production function. As expected, this reduction in performance is more apparent in the CCR model. When the extent of the scale effect is increased, the performance of the basic DEA models CCR and BCC is diminished: the B-Values decrease and the number of rejected scenarios increases substantially. Once again, the AR and SBM models perform better when the curvature of the production function is increased by raising the extent of the scale effects. Across all models, it is evident that larger input ranges lead to less satisfactory results, which is clearly reflected in the substantial increase in rejected scenarios. The results also reveal the trivial influence of the correlation between inputs on the results of all DEA models. In real life, a strong correlation between inputs is likely, and a complete lack of correlation is improbable.
In summary, this set of results leads to a clear ranking of the DEA models: AR ≈ SBM > BCC > CCR. Comparing the superior SBM and AR models shows that, despite almost identical B-Values and numbers of rejections, some differences exist at the level of the individual performance indicators. The B-Rank results confirm the dominance of the AR model over the SBM model, although the SBM model shows almost the same performance. The usage of both the AR and SBM models as standard VRS DEA models can therefore be endorsed. However, the quality of the AR model might (strongly) depend on the setting of the weight restrictions, so special attention should be given to this choice.

Conclusions
In this paper, we propose a method based on Monte Carlo simulation to assess the quality of DEA model estimates. Our method generates data using a flexible technology (the Translog production function) that satisfies microeconomic regularity conditions such as convexity and monotonicity. Prior studies have lacked diversity in their DGPs, which is a serious handicap when evaluating the quality of DEA model estimations. We generate 7776 distinct scenarios under the VRS setting by defining a variety of characteristics. Our evaluations of the quality of the estimates obtained from DEA models are based on five performance indicators as well as DEA-based hypothesis tests. Furthermore, we demonstrate how a valid range of characteristics and parameters can be derived when all the necessary and sufficient microeconomic conditions are met.
To our knowledge, this is the first study to date that compares the quality of VRS DEA models. We show that the BCC model, which is the most commonly used VRS DEA model in the literature, is outperformed by the AR and SBM models. According to the results of the hypothesis tests, more than 30% of the BCC model estimations differ from the distribution of the true efficiency, whereas this rejection percentage is 11% for the AR and 10% for the SBM model. It is noteworthy that the AR model emerged at the top without any special tuning of the virtual weight restrictions. However, it may be too complex to explicitly articulate weights in some applications. We therefore endorse the establishment of the SBM model, which requires no prior conditions on the weights, as the standard VRS DEA model, since its performance is almost equal to that of the AR model. From our perspective, the dominance of the AR and SBM models can be explained by their treatment of slacks. While the BCC model ignores slacks entirely when reporting the efficiency score, the SBM model calculates the efficiency score directly from the slacks, and the AR model prevents the emergence of slacks by assigning boundaries to the weights. We also examine the impact of the characteristics used for generating scenarios on the quality of the DEA estimates. According to the results, the most important factors affecting the quality of VRS DEA models are the number of inputs, the range of inputs, the distribution of input substitution, and the scale effects. Our results may also serve decision-makers as a guideline for their own DEA studies to ensure acceptable results and accuracy.
Consideration of the single-output case is one of the limitations of our DGP. The methodology may therefore be generalized to meaningful multi-input multi-output cases in the future. The goal of our DGP is to generate artificial datasets that fulfill the monotonicity and curvature conditions for every single generated scenario. In order to secure a similar guarantee in the multi-output case, the formulation of the Translog production function needs to be updated, as the importance of outputs and their substitutions must be considered. In particular, this means that the conditions necessary for guaranteeing the VRS setting with respect to outputs, such as imposing homogeneity of degree +1 and convexity in outputs, need to be modeled and considered.
Furthermore, the proposed DGP captures the deviation of the output from the efficiency frontier as a single inefficiency term. Another method of extending the DGP is a stochastic framework, in which the deviation is defined as the sum of two terms: inefficiency and noise. Another line of investigation would be to extend our method to panel data with a time trend. This would allow assessing and improving the accuracy of Malmquist productivity index calculations and their decomposition. In addition, we believe the methodology presented here can also be used to investigate other multi-input multi-output production functions, such as the one presented by Färe, Grosskopf, Noh and Weber (2005). All of this may eventually make DEA models more practical by increasing their reliability and showing decision-makers how accurate their estimations are.

Appendix A. Performance indicators
In Table A, θ_j and ˆθ_j denote the true efficiency and the efficiency score calculated by the DEA model for the jth DMU (j ∈ {1, …, n}), respectively. There are two important points to consider when defining performance indicators. First, DEA estimates ˆθ_j = 1 for some DMUs while their corresponding true efficiency scores obtained from the DGP might be less than (but close to) one, e.g., θ_j = 0.90 < 1.0, since they are based on a random continuous function. Second, in small samples, it is expected that only a few DMUs (or none) with a true efficiency score of 1.0 are produced. These two points preclude using a simple indicator that only evaluates whether DMUs with an estimated efficiency score of 1.0 (ˆθ_j = 1.0) also have a corresponding true efficiency score of 1.0 (θ_j = 1.0). Our objective is therefore to determine whether the DEA models are capable of identifying the top-performing DMUs in a sample, although not all of them have a true efficiency score of exactly 1.0. In light of these two points, TOP and INEFF are performance indicators based on the quantiles of the best- and worst-performing DMUs, respectively. This study defines an efficient DMU as one whose true efficiency value is at least as high as a specific quantile (Q(ε)) of the distribution of true efficiency.
In the same manner, a DMU is inefficient if and only if its true efficiency is less than or equal to Q(1 − ε). For example, consider 50 DMUs (n = 50) with ε = 0.8. In the ascending order of true efficiencies, Q(ε) = θ_j where j = 40. The same logic applies to Q(1 − ε). In this way, we can handle multiple efficiency distributions in the DGP as well as compare different scenarios. Ideally, parameter ε should be large enough to serve as a satisfactory limit for efficient DMUs. We also employ the CORRI indicator to track the mean value of the estimates within certain corridors around the true efficiencies, since the MAE cannot provide information on the deviation.
The parameters δ and γ determine the tightness and the number of the corridors, respectively. As in Kohl and Brunner (2020), we use a corridor of δ = 0.05 to test whether an estimated efficiency score deviates by at most 5 percentage points from the corresponding true score. Having generated the data (inputs, outputs, and a true efficiency score) of a scenario and calculated the efficiency scores with the DEA models, we construct the performance indicators. To aggregate and represent all the performance indicators with a single score, we use the B-Value. To capture the influence of dominance, we also introduce a second aggregated indicator called the B-Rank. In Table A, the last two rows give the formulas for these two aggregated indicators.
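The MAE and the corridor idea can be sketched as follows. The CORRI implementation below uses a single corridor of half-width delta; the exact aggregation over γ corridors in Table A is not reproduced here:

```python
import numpy as np

def mae(theta_true, theta_hat):
    """Mean absolute error between true and estimated efficiency scores."""
    return np.mean(np.abs(theta_true - theta_hat))

def corri(theta_true, theta_hat, delta=0.05):
    """Share of DMUs whose estimate lies within +/- delta of the true score
    (single-corridor sketch of the CORRI indicator)."""
    return np.mean(np.abs(theta_true - theta_hat) <= delta)

# Hypothetical scores: one estimate within the 5%-point corridor, one outside
theta_true = np.array([0.90, 0.80])
theta_hat = np.array([0.92, 0.70])
```

Because a random estimator can achieve a deceptively good MAE on a narrow efficiency distribution, the corridor share is the more informative of the two.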

Table A
Performance indicators used for quality evaluation of DEA models ( Kohl & Brunner, 2020 ).

Indicator | Symbol | Formula
Mean absolute error | MAE |

Appendix B. Propositions and proofs

Proposition 1. The definition provided in Eq. (3) for α_i fulfills the condition ∑_{i∈M} α_i > 1.
Proof. We need to prove that the definition provided for α_i in Eq. (3) guarantees the first condition of the VRS regime, i.e., ∑_{i∈M} α_i > 1.

The definition provided in Eq. (4) for α_i fulfills the condition ∑_{i∈M} α_i > 1.

Proof. We show that the definition provided for α_i in Eq. (4) respects the first condition of the VRS.

Proposition 4. The definitions provided in Eq. (8) fulfill Eq. (5).
Proof. We recall the unequal substitution distribution defined by Kohl and Brunner (2020) for all {i, h | i ≠ h}. From the proof provided by them, we know that σ_ii + ∑_{h ∈ M\{i}} σ_ih = 0 for all i. Now, by substituting these two expressions into Eq. (5) and simplifying, the result follows.

Appendix C. DEA models under evaluation
The basic DEA model (known as CCR) was introduced by Charnes et al. (1978). They define a measure of efficiency by maximizing the ratio of the weighted sum of outputs over the weighted sum of inputs for each DMU. Consider the input vector X = (x_1o, …, x_mo) and the output vector Y = (y_1o, …, y_so); then the relative efficiency of DMU j, ∀ j = 1, …, n, can be formulated as TE_j = ∑_{r=1}^{s} u_r y_rj / ∑_{i=1}^{m} v_i x_ij, where u_r and v_i are the weights of output r and input i, respectively. The mathematical formulation of the CCR DEA model can be presented as follows, where θ_o denotes the technical efficiency of DMU o and the λ_j are the intensity variables. The CCR DEA model is built on the assumption of CRS. The BCC (Banker-Charnes-Cooper) model (Banker et al., 1984) is the most prominent extension of the CCR DEA model, accommodating VRS technology. The BCC DEA model is formed by adding the convexity constraint ∑_{j=1}^{n} λ_j = 1 to the CCR DEA model (C).
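For illustration, the output-oriented envelopment forms of the CCR and BCC models can be solved as small linear programs. The sketch below uses scipy's linprog; the function name is ours, and the efficiency score is reported as the reciprocal of the output expansion factor φ:

```python
import numpy as np
from scipy.optimize import linprog

def dea_output_oriented(X, Y, vrs=False):
    """Output-oriented envelopment DEA. X: (n, m) inputs, Y: (n, s) outputs.
    Maximizes the output expansion factor phi for each DMU o; the reported
    efficiency score is 1/phi. vrs=True adds the BCC convexity constraint."""
    n, m = X.shape
    s = Y.shape[1]
    scores = np.empty(n)
    for o in range(n):
        # decision variables: [phi, lambda_1, ..., lambda_n]; maximize phi
        c = np.zeros(1 + n)
        c[0] = -1.0                                   # linprog minimizes
        A_in = np.hstack([np.zeros((m, 1)), X.T])     # sum_j l_j x_ij <= x_io
        A_out = np.hstack([Y[o].reshape(-1, 1), -Y.T])  # phi y_ro <= sum l_j y_rj
        A_ub = np.vstack([A_in, A_out])
        b_ub = np.concatenate([X[o], np.zeros(s)])
        A_eq = b_eq = None
        if vrs:                                       # BCC: sum_j lambda_j = 1
            A_eq = np.hstack([[[0.0]], np.ones((1, n))])
            b_eq = [1.0]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * (1 + n))
        scores[o] = 1.0 / res.x[0]
    return scores

# Hypothetical two-DMU example: equal input, DMU 0 produces twice the output
X = np.array([[1.0], [1.0]])
Y = np.array([[2.0], [1.0]])
eff = dea_output_oriented(X, Y)  # CRS (CCR) scores
```

Under CRS the second DMU can expand its output by a factor of 2 relative to the frontier, so its score is 0.5.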
There may be many zeros among the optimal weights of the CCR and BCC models, indicating that the evaluated DMU has a weakness in the corresponding factors (inputs and outputs) compared to the efficient DMUs. The lack of control over the boundaries of the optimal weights led to the emergence of the AR DEA model, which constrains the weights of particular inputs/outputs relative to others (Thompson, Singleton, Thrall & Smith, 1986). Several approaches have been developed in the literature to restrict the weights of DEA models (see Allen, Athanassopoulos, Dyson and Thanassoulis (1997) and Pedraja-Chaparro et al. (1997)). Weight restrictions in AR models have two dimensions. The first dimension relates to the weights to be constrained, i.e., raw or virtual weight restrictions. The second dimension concerns the limits placed on the weights, which can be either absolute or relative. A raw weight restriction limits the weight itself within the primal multiplier model, whereas a virtual weight restriction limits the product of weight and input, i.e., v_i · x_ij. Relative restrictions affect the ratio between two weights, as opposed to absolute restrictions, which affect only one weight. Our analysis focuses on relative weight restrictions following Pedraja-Chaparro et al. (1997). Since we deal with different input elasticities, we apply virtual weight restrictions. To determine whether the quality of the AR model is affected by the weight restrictions, we consider relative virtual weight constraints by setting k = 2, 3, 4 in our computational study (see Section 3.3).
Both the CCR and BCC DEA models are radial: inputs are proportionally reduced and outputs are proportionally expanded. This assumption can be restrictive. For example, when labor, capital, and material are employed as inputs, some of them may not change proportionally and may be substituted. A further shortcoming of radial models is that they do not consider slacks when reporting efficiency scores, although substantial non-radial slacks often remain. These limitations led to the development of non-radial models. The SBM DEA model is a non-radial model that deals directly with the slacks when reporting efficiency scores (Tone, 2001). The non-oriented SBM DEA model under the VRS setting is a non-linear model that can be reformulated as a linear counterpart using the Charnes-Cooper transformation (Charnes & Cooper, 1962).

Fig. 2. Effect of x_i^CRS (a), ω (b), and ν (c) on the form of the production function.

Fig. 3. Boxplots of B-Values and B-Ranks obtained from the models.
u_r, v_i ≥ 0, ∀ r, i (C.8)

Table 1
Defined characteristics for generating scenarios.

Table 2
An example scenario for the two-input single-output case.

Table 3
Results of the two-input single-output instance.

Table 4
Statistical values of performance indicators calculated for each model under the VRS setting.