Copula approach to select input/output variables for DEA

Article history: Received: 5 April 2016; Accepted: 15 August 2016; Available online: 25 October 2016

Determination of the input/output variables is an important issue in Data Envelopment Analysis (DEA). Researchers often rely on expert opinion in defining these variables. The purpose of this paper is to propose a new approach for determining the input/output variables, particularly when there is no a priori information about variable selection. The proposed technique is based on a theoretical tool called the “copula”. Copula functions are used to model the dependency structure among variables. We also use the local dependence function, which analyzes the point dependency of variables under a copula, to define the input/output variables. To illustrate the usefulness of the proposed approach, we conduct two applications, one with simulated and one with real data, and compare the resulting DEA efficiencies. Our results show that the new approach classifies the variables almost exactly and yields efficiency scores close to the reference results.


Introduction
Data Envelopment Analysis (DEA) is a data-oriented, non-parametric method introduced by Charnes et al. [1] to evaluate the relative efficiency of organizational units, called decision making units (DMUs), according to selected input and output variables [3]. Manufacturing units, departments of schools or hospitals, a set of firms or even practising individuals are examples of DMUs [7]. Nowadays, institutions producing services or products are obliged to perform effectively because of intense competition and limited resources. Although the efficiency of the DMUs depends crucially on the correct selection of input/output variables, there are no guidelines or systematic procedures for classifying the variables. Many researchers make this determination subjectively or by using expert opinions. Some of them adopt variables directly from the selections of others, according to the calculated efficiency [8]. The issue of variable selection has rarely been studied in the literature. Considering the studies over the last 15 years, Pastor et al. [9] defined a measure, namely the ECM (efficiency contribution measure), to evaluate the input or output variables. In 2003, Jenkins and Anderson [10] proposed a systematic statistical approach to reduce the number of already defined input and output variables. Ruggiero [11] addressed the subject of input selection using regression analysis. Edirisinghe and Zhang [12] proposed a Generalized DEA approach to determine the type of variable by maximizing the correlation between a DEA-based score of financial strength and stock market performance. In 2009, Morita and Avkiran [3] proposed a selection method utilizing a diagonal layout design. Finally, Madhanagopal and Chandrasekaran [13] developed a genetic algorithm approach for the selection of input/output variables.
The approach we propose in this study takes into consideration the distribution of the data and the point dependency between candidate variables, which distinguishes it from the others. Copulas and the local dependence function were used to achieve the correct discrimination of input/output variables. We assume that there is no information or expert opinion available to decide the input/output variables. The performance of the proposed method was assessed by means of a simulation, followed by a real data application. In this way, we were able to compare the efficiencies obtained with the new approach and those obtained with a selection based directly on expert opinion. The main goal of this paper is to propose a new theory-based approach to determine the input/output variables. The paper is organized as follows: Section 2 gives a brief summary of the DEA models. In Section 3 we define copulas (especially the Farlie-Gumbel-Morgenstern "FGM" copula) and the local dependence function. In this section an algorithm is also developed to determine the input/output variables. In Section 4, a simulation and a real data application are conducted and the results are evaluated. The last section gives the conclusions.

Data envelopment analysis (DEA)
Data envelopment analysis (DEA) is a linear programming based technique for calculating the relative efficiencies of a set of decision making units (DMUs) that have similar inputs and similar outputs. It is also a non-parametric method, as it does not require any assumption about the functional form. The technique aims to measure how efficiently a DMU uses the available resources to generate a set of outputs [1]. DMUs can include manufacturing units and departments of large organizations such as universities, schools, bank branches, hospitals, power plants, police stations, tax offices, prisons, etc. The basic frontier model was developed by Charnes et al. [1]; it is known as the CCR model, now widely called the CRS (constant returns to scale) model, and was extended by Banker et al. [14] to include variable returns to scale (VRS). The basic DEA models are therefore the CCR (CRS) model and the BCC model, the latter referred to as VRS. DEA models have two orientations: input-oriented and output-oriented. With input-oriented DEA, the linear programming model is set up to determine how much the input use of a firm could contract, if used efficiently, while achieving the same output level. In contrast, with output-oriented DEA, the linear programme is set up to determine a firm's potential output, given its inputs, if it operated as efficiently as firms along the best-practice frontier [15]. In short, the input-oriented model contracts the inputs as far as possible while holding the outputs fixed, and the output-oriented model expands the outputs as far as possible while holding the inputs fixed [16]. The original fractional CRS model, Eq. (1), evaluates the relative efficiencies of n DMUs, j = 1,…,n, each with m inputs and s outputs denoted by x1j, x2j, …, xmj and y1j, y2j, …, ysj, respectively [1].
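This is done by maximizing the ratio of the weighted sum of outputs to the weighted sum of inputs for the DMU under evaluation (indexed by 0), where $u_r$ and $v_i$ are the output and input weights:

$$\max_{u,v}\; h_0 = \frac{\sum_{r=1}^{s} u_r\, y_{r0}}{\sum_{i=1}^{m} v_i\, x_{i0}} \quad \text{subject to} \quad \frac{\sum_{r=1}^{s} u_r\, y_{rj}}{\sum_{i=1}^{m} v_i\, x_{ij}} \le 1,\;\; j = 1,\dots,n; \qquad u_r \ge 0,\; v_i \ge 0. \tag{1}$$

The CRS models (dual and primal) with input orientation are still the most widely known and used DEA models, despite the numerous modified models that have appeared [17].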
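As a minimal sketch of the ratio idea, consider the special case of exactly one input and one output. There the fractional CRS model reduces to dividing each DMU's output/input ratio by the best ratio observed among all DMUs. The function name `crs_efficiency_single` and the toy numbers are illustrative assumptions, not taken from the paper; the general multi-input, multi-output case needs a linear programming solver instead.

```python
def crs_efficiency_single(inputs, outputs):
    """CRS efficiency scores for DMUs with ONE input and ONE output.

    Hypothetical helper: in this special case the fractional model is
    solved by (y_j / x_j) / max_k (y_k / x_k), so no LP solver is needed.
    """
    if not inputs or len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must be non-empty and of equal length")
    if any(x <= 0 for x in inputs):
        raise ValueError("inputs must be strictly positive")
    # Productivity ratio of every DMU.
    ratios = [y / x for x, y in zip(inputs, outputs)]
    best = max(ratios)
    if best <= 0:
        raise ValueError("at least one DMU must have a positive output")
    # Efficiency relative to the best-practice DMU (score 1.0 = efficient).
    return [r / best for r in ratios]


# Toy data (made up for illustration): three DMUs.
scores = crs_efficiency_single([2.0, 4.0, 5.0], [2.0, 2.0, 4.0])
```

A DMU with score 1.0 lies on the frontier; scores below 1.0 indicate by how much the input could be contracted proportionally at the same output level.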

Copula, local dependence function and selection procedure
In general, a copula is a function, defined on the unit square I² = [0,1]², that re-expresses the joint distribution function of dependent random variables in terms of their marginal distribution functions. In recent years, copulas have been used in many fields, such as statistics, economics, finance and risk management, for measuring and modeling dependence, including serial dependence in time series [18]; see also [4].
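For orientation, the standard way of writing this link between a joint distribution and its margins (Sklar's theorem) is sketched below. The symbols $F_{X,Y}$, $F$, $G$ and $C$ are notation assumed here for the joint distribution function, the two marginal distribution functions and the copula; they are not taken from the source.

```latex
F_{X,Y}(x, y) = C\bigl(F(x),\, G(y)\bigr), \qquad C : [0,1]^2 \to [0,1],
\\[4pt]
C(u, 0) = C(0, v) = 0, \qquad C(u, 1) = u, \qquad C(1, v) = v .
```

When the margins are continuous, the copula $C$ in this representation is unique.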

Local dependence function (LDF)
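Let X and Y be random variables with marginal distribution functions $F_X(x)$, $F_Y(y)$ and marginal probability density functions $f_X(x)$, $f_Y(y)$, respectively. The following function is obtained from the expression of the Pearson correlation coefficient by replacing the expected values E(X) and E(Y) with the conditional expected values E(X|Y=y) and E(Y|X=x) [6]:

$$H(x,y) = \frac{E\bigl\{[X - E(X \mid Y=y)]\,[Y - E(Y \mid X=x)]\bigr\}}{\sqrt{E\bigl\{[X - E(X \mid Y=y)]^2\bigr\}}\;\sqrt{E\bigl\{[Y - E(Y \mid X=x)]^2\bigr\}}} \tag{5}$$

H(x,y) can be referred to as a local dependence function, which characterizes the dependence between X and Y at the point (x,y). After the simple transformation $\xi_X(y) = EX - E(X \mid Y=y)$ and $\xi_Y(x) = EY - E(Y \mid X=x)$, Eq. (5) can be written as

$$H(x,y) = \frac{\rho + \varphi_X(y)\,\varphi_Y(x)}{\sqrt{1 + \varphi_X^2(y)}\;\sqrt{1 + \varphi_Y^2(x)}}, \qquad \varphi_X(y) = \frac{\xi_X(y)}{\sigma_X}, \quad \varphi_Y(x) = \frac{\xi_Y(x)}{\sigma_Y},$$

where $\rho$ is the Pearson correlation coefficient of X and Y, and $\sigma_X$, $\sigma_Y$ are their standard deviations. The local dependence function has, among others, the following properties: $|H(x,y)| \le 1$, and if X and Y are independent, then $H(x,y) = 0$ for all (x,y) [6].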
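The local dependence function can be evaluated numerically once the ingredients of the Pearson-type ratio are known. The sketch below assumes the caller supplies the correlation `rho`, the shifts `xi_x = E(X) - E(X|Y=y)` and `xi_y = E(Y) - E(Y|X=x)`, and the standard deviations; the function name and argument names are illustrative, not the authors' notation.

```python
import math


def local_dependence(rho, xi_x, xi_y, sd_x, sd_y):
    """Local dependence at a point (x, y), from moments (hypothetical helper).

    rho   : Pearson correlation of X and Y
    xi_x  : E(X) - E(X | Y = y)
    xi_y  : E(Y) - E(Y | X = x)
    sd_x, sd_y : standard deviations of X and Y (must be positive)
    """
    if sd_x <= 0 or sd_y <= 0:
        raise ValueError("standard deviations must be positive")
    # Standardize the shifts of the conditional means.
    phi_x = xi_x / sd_x
    phi_y = xi_y / sd_y
    # Expanding E[(X - a)(Y - b)] with constants a, b gives Cov + xi_x * xi_y,
    # and E[(X - a)^2] = Var(X) + xi_x^2; dividing by sd_x * sd_y yields:
    return (rho + phi_x * phi_y) / math.sqrt((1 + phi_x ** 2) * (1 + phi_y ** 2))
```

When both conditional means equal the unconditional means, the value collapses to the ordinary correlation coefficient, which is a quick sanity check for any implementation.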

FGM copula
The Farlie-Gumbel-Morgenstern (FGM) distribution was introduced by Morgenstern in 1956 with Cauchy marginals. This class was examined by Gumbel for exponential marginals and further generalized by Farlie in 1960 [19].
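Let (X,Y) be an absolutely continuous random vector. The general distribution function is defined as

$$F_{X,Y}(x,y) = F(x)\,G(y)\,\bigl[1 + \alpha\, A(F(x))\, B(G(y))\bigr] \tag{6}$$

where A(x) → 0 and B(y) → 0 as x, y → 1, and A(x) and B(y) satisfy certain regularity conditions ensuring that Eq. (6) is a distribution function with absolutely continuous marginals F(x) and G(y) [20]. The FGM copula is positive quadrant dependent (PQD) for α ≥ 0, and for Uniform(0,1) marginals its local dependence function is as follows:

$$H(x,y) = \frac{\alpha + \alpha^2 (1-2x)(1-2y)}{\sqrt{3 + \alpha^2 (1-2x)^2}\;\sqrt{3 + \alpha^2 (1-2y)^2}}$$

where α ∈ [−1, 1] is the association parameter [5].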
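A small sketch of the FGM local dependence function for Uniform(0,1) marginals follows. It assumes the closed form obtained by inserting the FGM conditional means, E(U|V=v) = 1/2 − α(1−2v)/6, and the correlation α/3 into the Pearson-type ratio; the function name `fgm_ldf` is hypothetical.

```python
import math


def fgm_ldf(u, v, alpha):
    """Local dependence of the FGM copula at (u, v), uniform margins assumed.

    Hypothetical helper implementing
    (alpha + alpha^2 (1-2u)(1-2v)) / sqrt((3 + alpha^2 (1-2u)^2)(3 + alpha^2 (1-2v)^2)).
    """
    if not -1.0 <= alpha <= 1.0:
        raise ValueError("the FGM association parameter must lie in [-1, 1]")
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("u and v must lie in [0, 1]")
    a = alpha * (1.0 - 2.0 * u)
    b = alpha * (1.0 - 2.0 * v)
    return (alpha + a * b) / math.sqrt((3.0 + a * a) * (3.0 + b * b))
```

At the centre of the unit square the value equals α/3, the overall correlation of the FGM copula, and for α = 0 (independence) the function is zero everywhere.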

Selection procedure
In this part of the study, we present a new algorithm to determine the input/output variables. The selection process follows the steps below.
Step-1: Determine the distribution of the variables and construct the appropriate copula function (e.g., if the distribution is uniform, the FGM copula is suitable).
Step-2: Construct the LDF of the chosen copula according to the method described in Section 3.1.
Step-3: Determine the type of quadrant dependency of the copula (see Section 3.3); it is used to decide which group is selected as input/output.
Step-4: Calculate the LDF values for all pairs of variables. The two variables with Max|H(Xi,Xj)| are selected as reference variables (xr1 and xr2) and assigned to two separate groups (Group-1 and Group-2). Each remaining variable xi is then allocated to Group-1 or Group-2 according to its LDF values with the reference variables.
Step-5: Decide in advance whether the study is input or output oriented.
Step-6: Determine the signs of the LDFs calculated between each of the remaining variables and the reference variables for the two groups constructed in Step-4.

a. Suppose that the study is input oriented and the copula is PQD.
i. If one of the groups has only positive LDFs and the other has only negative LDFs, then choose the variables in the group with positive LDFs as input variables.
ii. If all LDFs are positive in each group, then select the variables in the group that has Max H(Xi,Xj) as input variables. If all LDFs are negative in each group, then select the variables in the group that has Min|H(Xi,Xj)| as input variables.
iii. If the groups contain LDFs of different signs, then select the variables in the group that has Min|H(Xi,Xj)| as input variables.
b. Suppose that the study is input oriented and the copula is NQD.
i. If one of the groups has only positive LDFs and the other has only negative LDFs, then choose the variables in the group with negative LDFs as input variables.
ii. If all LDFs are positive in each group, then select the variables in the group that has Min H(Xi,Xj) as input variables. If all LDFs are negative in each group, then select the variables in the group that has Max|H(Xi,Xj)| as input variables.
iii. If the groups contain LDFs of different signs, then select the variables in the group that has Min|H(Xi,Xj)| as input variables.
c. Suppose that the study is output oriented and the copula is PQD.
i. If one of the groups has only positive LDFs and the other has only negative LDFs, then choose the variables in the group with positive LDFs as output variables.
ii. If all LDFs are positive in each group, then select the variables in the group that has Max H(Xi,Xj) as output variables. If all LDFs are negative in each group, then select the variables in the group that has Min|H(Xi,Xj)| as output variables.
iii. If the groups contain LDFs of different signs, then select the variables in the group that has Min|H(Xi,Xj)| as output variables.
d. Suppose that the study is output oriented and the copula is NQD.
i. If one of the groups has only positive LDFs and the other has only negative LDFs, then choose the variables in the group with negative LDFs as output variables.
ii. If all LDFs are positive in each group, then select the variables in the group that has Min H(Xi,Xj) as output variables. If all LDFs are negative in each group, then select the variables in the group that has Max|H(Xi,Xj)| as output variables.
iii. If the groups contain LDFs of different signs, then select the variables in the group that has Min|H(Xi,Xj)| as output variables.
Step-7: Run the DEA model and calculate the efficiencies.
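The reference-pair choice of Step-4 and the sign rules of Step-6 can be sketched as follows. This is one possible reading of the rules, not the authors' implementation: the function names, the dictionary of pairwise LDF values and the representation of each group as a plain list of LDF values are assumptions, and the case of a group without any remaining variable is not covered by the description and is rejected here.

```python
def reference_pair(ldf):
    """Step-4: pair of variables with the largest |LDF|.

    `ldf` is a hypothetical dict {(name_i, name_j): LDF value}.
    """
    return max(ldf, key=lambda pair: abs(ldf[pair]))


def select_group(group1, group2, dependence, orientation):
    """Step-6: return (chosen group number, role of its variables).

    group1, group2 : lists of LDF values of the two groups (assumed layout)
    dependence     : "PQD" or "NQD"
    orientation    : "input" or "output"
    """
    if dependence not in ("PQD", "NQD"):
        raise ValueError("dependence must be 'PQD' or 'NQD'")
    if orientation not in ("input", "output"):
        raise ValueError("orientation must be 'input' or 'output'")
    if not group1 or not group2:
        raise ValueError("both groups need at least one LDF value")

    pqd = dependence == "PQD"
    tagged = [(h, 1) for h in group1] + [(h, 2) for h in group2]

    def only_pos(g):
        return all(h > 0 for h in g)

    def only_neg(g):
        return all(h < 0 for h in g)

    if only_pos(group1) and only_neg(group2):      # rule i
        chosen = 1 if pqd else 2
    elif only_neg(group1) and only_pos(group2):    # rule i (mirrored)
        chosen = 2 if pqd else 1
    elif only_pos(group1) and only_pos(group2):    # rule ii, all positive
        pick = max if pqd else min
        chosen = pick(tagged, key=lambda t: t[0])[1]
    elif only_neg(group1) and only_neg(group2):    # rule ii, all negative
        pick = min if pqd else max
        chosen = pick(tagged, key=lambda t: abs(t[0]))[1]
    else:                                          # rule iii, mixed signs
        chosen = min(tagged, key=lambda t: abs(t[0]))[1]
    # The chosen group becomes inputs (input oriented) or outputs (output oriented).
    return chosen, orientation
```

The variables of the group that is not chosen take the opposite role, after which the DEA model of Step-7 can be run.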

Application
In this part of the study, a simulation and a real data example are presented; DEA was performed using the software package DEAP 2.1. The data for the simulation study were generated from the Uniform(0,1) distribution, and the FGM copula was selected to represent them. We used Yoluk's [2] data set for the real data example.

Simulation study
In the simulation part of the study, we use five variables and 20 decision making units. Table 1 shows the minimum, maximum and absolute values of the LDFs.
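One possible way to reproduce the shape of such a simulation is sketched below. Several parts are assumptions rather than the authors' procedure: the seed, the moment estimate of the FGM parameter (three times the sample correlation, clipped to [−1, 1], which relies on the FGM correlation being α/3 for uniform marginals), and the choice to evaluate the LDF at the observed pair of values of each DMU. All function names are hypothetical.

```python
import math
import random


def fgm_ldf(u, v, alpha):
    """FGM local dependence at (u, v) for Uniform(0,1) margins (assumed form)."""
    a = alpha * (1.0 - 2.0 * u)
    b = alpha * (1.0 - 2.0 * v)
    return (alpha + a * b) / math.sqrt((3.0 + a * a) * (3.0 + b * b))


def pearson(xs, ys):
    """Sample Pearson correlation of two equally long, non-constant samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def estimate_alpha(xs, ys):
    """Assumed moment estimate: alpha = 3 * correlation, clipped to [-1, 1]."""
    return max(-1.0, min(1.0, 3.0 * pearson(xs, ys)))


def ldf_summary(xs, ys):
    """Minimum and maximum LDF over the observed points of one variable pair."""
    alpha = estimate_alpha(xs, ys)
    values = [fgm_ldf(x, y, alpha) for x, y in zip(xs, ys)]
    return min(values), max(values)


rng = random.Random(0)          # arbitrary seed, for repeatability only
n_vars, n_dmus = 5, 20          # sizes stated in the simulation study
data = [[rng.random() for _ in range(n_dmus)] for _ in range(n_vars)]

# One (min, max) entry per pair of variables: 10 pairs for five variables.
summary = {
    (i, j): ldf_summary(data[i], data[j])
    for i in range(n_vars)
    for j in range(i + 1, n_vars)
}
```

The pair with the largest absolute value in such a summary would then serve as the reference pair of the selection procedure.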

Real data application
In this part of the study we use Yoluk's [2] hospital data given in Table 3. The efficiency analysis was performed as input oriented; the first three variables were taken as input variables and the last four as output variables. We assume that we have no information about the variables and are not able to obtain an expert opinion. First, we tested the goodness of fit of all variables to the Uniform distribution and found no departures from the hypothesized distribution. In the second stage, the LDF values for the FGM copula were computed on these variables standardized to Uniform(0,1); the results are given in Table 4. The variables were classified with these values according to the proposed algorithm; variables in Group 1 should be selected as input variables. The efficiency scores of our approach and Yoluk's results are given in Table 5. As seen from the table, the copula approach has classified the variables almost exactly and has determined the input/output variables correctly.
O. Alpay, E. Akturk Hayat / Vol. 7, No.1, pp.28-34 (2017) © IJOCTA
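The goodness-of-fit check and the standardization mentioned in the real data application can be sketched as follows. The source does not name the test or the standardization used, so both choices here are assumptions: a Kolmogorov-Smirnov statistic against Uniform(0,1) and a min-max rescaling to the unit interval. Function names are hypothetical, and the resulting statistic would still have to be compared with a tabulated critical value for the sample size at hand.

```python
def min_max_scale(values):
    """Rescale a sample linearly to [0, 1] (assumed standardization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant sample cannot be rescaled")
    return [(v - lo) / (hi - lo) for v in values]


def ks_uniform_statistic(sample):
    """Kolmogorov-Smirnov distance between a sample and Uniform(0,1).

    Returns the largest gap between the empirical distribution function
    and the uniform distribution function F(x) = x on [0, 1].
    """
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    if xs[0] < 0.0 or xs[-1] > 1.0:
        raise ValueError("values must lie in [0, 1]; rescale first")
    # Gap just after each jump of the empirical distribution function...
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    # ...and just before each jump.
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)
```

A small statistic indicates that the rescaled variable is compatible with the uniform distribution, which is the precondition stated for using the FGM copula.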

Conclusion
Data Envelopment Analysis is one of the most frequently used methods for evaluating the efficiencies of DMUs. Although the determination of input/output variables is one of the most important problems in DEA, there have been limited attempts in the literature to solve it. Obtaining an expert opinion, which is the most common practice, may be subjective and can also be an expensive way to identify the variables as input/output. Methods for determining the types of variables should be more objective and cost-effective than the current ones. In this study, we proposed a method that has not previously been available in the literature to solve this problem. We took into account the copula function, which expresses the relation between dependent random variables, and the local dependence function, which measures the point dependency of variables. Our basic assumption is that the distribution of the variables is known in advance or can be determined. In the simulation study, we showed how the algorithm separates a set of synthetic variables into input and output variables. In the real data application, we selected a set of variables that had already been defined as inputs and outputs in a previous analysis of the same data in the literature. With the new procedure, the suggested input/output variables were almost consistent with those of the earlier study [2]. The efficiencies were also the same for the efficient DMUs, while the inefficient DMU received a lower efficiency score, as seen in Table 5. The results showed that this method has several advantages. If there is no prior information, the input/output variables can be determined objectively. There is no need for expert opinion, so it is a cost-effective method. However, the method depends on the variables having known distributions, which can sometimes be a disadvantage.
Although copula theory includes the Archimedean copula family, we preferred a basic copula function (FGM) in this study, because Archimedean copulas use a generator function [21] and there is only one study [22] on the local dependence function for Archimedean copulas, whose functional structure is not suitable for our algorithm. In conclusion, this is a new, theoretically grounded approach for identifying the input/output variables; it can be improved in future studies and adapted to data with different distribution functions.