A NEW STEPWISE METHOD FOR SELECTION OF INPUT AND OUTPUT VARIABLES IN DATA ENVELOPMENT ANALYSIS

SUBRAMANYAM, DONTHI, KUMAR, AMALANATHAN, ZALKI

Data envelopment analysis (DEA) is a widely accepted optimization technique used to measure the relative efficiency of organizational units where multiple inputs and outputs are present. The significance of DEA results depends on the variables selected for DEA modelling, and one of the main challenges is to identify the significant input and output variables. In this study, we propose an enhanced stepwise method that identifies the significant and insignificant input and output variables while reducing the number of iterations required by the stepwise method. The statistical significance of the input and output variables is evaluated using two statistical methods: the least significant difference (LSD) and Welch's statistic. The proposed method is applied to the Indian banking sector, and the results show that it identifies the significant and insignificant input and output variables with the least loss of information.


INTRODUCTION
Data envelopment analysis is an optimization technique used to measure the relative efficiency of organizational units, called decision-making units (DMUs), when multiple input and output variables are present. The idea of technical and allocative efficiency originated with Farrell (1957), and the fundamental models were subsequently developed by Charnes et al. (1978, 1984). These models construct a frontier from the available input and output variables. The DMUs on the frontier are called efficient (best practice) and have an efficiency score of 1; the others are termed inefficient and have efficiency scores between 0 and 1.
The major advantage of DEA is that it is data-driven and the relative weights of the variables need not be known a priori. The fundamental DEA models have been applied in fields such as production management, banking and agriculture to evaluate the relative efficiency of branches and organizational units [3,7,10,17,18].
DEA modelling depends on the researcher's perspective because there is no standard procedure for selecting input and output variables. The DEA formulation itself does not address how to identify the relevant input and output variables; it assumes that they are known a priori.
Since DEA relies heavily on the selected input and output variables, the efficiency scores of DMUs vary widely from one researcher to another. Including more input and output variables increases the dimensionality of the production possibility set, which in turn weakens the discriminatory power of the DEA models. A general guideline in DEA is that the total number of input and output variables should be less than one third of the number of DMUs [6,13,16].
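The guideline above amounts to a simple count check. The sketch below is illustrative only; the function name is hypothetical, and the figures (42 banks, 3 inputs, 5 outputs) are those used in the empirical part of this study.

```python
def satisfies_dea_rule_of_thumb(n_dmus, n_inputs, n_outputs):
    """Return True when inputs + outputs is strictly below one third of the DMUs."""
    # Compare 3 * (inputs + outputs) with the DMU count to avoid fractional division.
    return 3 * (n_inputs + n_outputs) < n_dmus


# 42 banks with 3 inputs and 5 outputs: 3 * 8 = 24, which is below 42.
print(satisfies_dea_rule_of_thumb(42, 3, 5))
```

Under this guideline the full model with 8 variables is already admissible for 42 DMUs, so variable selection here serves discrimination and parsimony rather than feasibility.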
Most researchers apply DEA models under the assumption that the input and output variables are known a priori. When many input and output variables are present during DEA modelling, some of them may not have a significant impact on the efficiency scores of DMUs. Such variables play no role in DEA modelling but may decrease the discriminatory power of DEA models. To avoid this situation, some researchers have proposed variable selection methods based on statistical approaches. The rationale for these procedures is to identify the insignificant variables and thereby improve the discriminating power of DEA. The usefulness of any DEA model depends on the variables selected for the efficiency evaluation.

REVIEW OF LITERATURE
In DEA modelling, the input and output variables are usually highly correlated because of their interrelationships, so removing variables using simple correlation and regression may not be possible. Some studies have discussed the importance of correlation and regression methods and of Principal Component Analysis. In this approach, variables that are highly correlated with existing model variables are regarded as redundant and are omitted from further analysis [9,13,16].

Stepwise Method: Backward Elimination Method:
The basic CCR model, explained in Section 3, is used to run the proposed stepwise algorithm.
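For reference, the input-oriented CCR model in multiplier form is commonly written as below for a DMU under evaluation, indexed o. The notation is an assumption made for this sketch and may differ from that of Section 3: x_{ij} and y_{rj} denote the i-th input and r-th output of DMU j, v_i and u_r are the corresponding weights, n is the number of DMUs, and I and O are the numbers of inputs and outputs.

```latex
\begin{aligned}
\max_{u,\,v}\quad & \theta_o = \sum_{r=1}^{O} u_r\, y_{ro} \\
\text{subject to}\quad & \sum_{i=1}^{I} v_i\, x_{io} = 1, \\
& \sum_{r=1}^{O} u_r\, y_{rj} - \sum_{i=1}^{I} v_i\, x_{ij} \le 0, \qquad j = 1, \dots, n, \\
& u_r \ge 0, \quad v_i \ge 0 .
\end{aligned}
```

Solving this linear programme once per DMU yields the efficiency scores between 0 and 1 that the stepwise procedure compares across the full and reduced models.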
Any statistical method rests on some basic assumptions. We made the following assumptions in developing the proposed procedure:

Assumptions:
1. The data for all input/output variables are available and are always greater than zero.
2. There must always be at least one input and one output variable in the data exploration.
3. Only one input/output variable is removed from the data exploration at a time.
4. An input/output variable eliminated from the data exploration is not included again.
5. The efficiency scores follow a normal distribution.
Under the assumptions stated above, the following stepwise procedure is proposed to select the significant variables and to remove the insignificant variables from the data exploration.
Step1: Run the full model with all available input/output variables and store the efficiency scores in a set 'OTE'.
Step2: Drop the input and output variables one at a time (excluding any variables already fixed) and run the reduced models. Store the efficiency scores of each reduced model in a set 'Ek', where k = 1, 2, ..., K and K = I + O is the total number of input and output variables.
Step3: Use Fisher's protected least significant difference (LSD) method to test the significance of the difference between the means at the 10% level of significance. The LSD is computed as $\mathrm{LSD}_{(0.10)} = t_{0.10}\sqrt{2s^{2}/n}$, where n is the number of observations and $s^{2}$ is the mean square error.
Step4: Retain every input/output variable that is significant at the 10% level of significance. No further significance test is required for such variables.
Step5: Calculate the percentage of average change and of variability change between the full and reduced models. Here, M and SD represent the mean and standard deviation, respectively.
Step6: The statistically insignificant variable with the least loss of information (i.e., the least average and variability change) is removed from the data exploration.
Step7: Repeat Step1 to Step6 until all variables in the data exploration are statistically significant.
The final choice of variables in DEA modelling remains purely at the researcher's discretion.
Step8: Under the assumption of normality, use Welch's t-statistic to test the significance of the difference between the average scores of the full and reduced models.
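Steps 3 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the critical value `t_crit` is supplied by the caller (for example from a t-table), the mean square error is taken as the average within-group variance of equally sized score sets, the percentage changes are computed relative to the full model, and "least loss of information" is read as the smallest sum of the two percentage changes. The efficiency scores are made-up numbers, not results from the study.

```python
from math import sqrt
from statistics import mean, stdev, variance


def lsd_threshold(mse, n, t_crit):
    """Least significant difference for two groups of n observations each."""
    return t_crit * sqrt(2.0 * mse / n)


def percent_change(full, reduced):
    """Assumed Step 5 measures: % change in mean (M) and in standard deviation (SD)."""
    avg = 100.0 * abs(mean(full) - mean(reduced)) / mean(full)
    var = 100.0 * abs(stdev(full) - stdev(reduced)) / stdev(full)
    return avg, var


def screen_variables(full, reduced_by_var, t_crit):
    """Compare the full model with each reduced model (one variable dropped)."""
    groups = [full] + list(reduced_by_var.values())
    # With equal group sizes, the ANOVA error mean square is the mean of the variances.
    mse = mean([variance(g) for g in groups])
    lsd = lsd_threshold(mse, len(full), t_crit)
    report = {}
    for name, scores in reduced_by_var.items():
        avg, var = percent_change(full, scores)
        report[name] = {
            # A variable is significant if dropping it shifts the mean by more than the LSD.
            "significant": abs(mean(full) - mean(scores)) > lsd,
            "avg_change": avg,
            "sd_change": var,
        }
    return report


def variable_to_drop(report):
    """Step 6: the insignificant variable with the smallest combined change, or None."""
    candidates = {k: v for k, v in report.items() if not v["significant"]}
    if not candidates:
        return None
    return min(candidates, key=lambda k: candidates[k]["avg_change"] + candidates[k]["sd_change"])


# Hypothetical scores: dropping "A" changes nothing, dropping "B" lowers every score.
full = [1.0, 0.8, 0.6, 0.9]
reduced = {"A": [1.0, 0.8, 0.6, 0.9], "B": [0.5, 0.3, 0.2, 0.4]}
# 1.833 is the two-sided 10% critical value of t with 9 degrees of freedom (12 scores - 3 groups).
report = screen_variables(full, reduced, t_crit=1.833)
print(variable_to_drop(report))
```

In a full run, the chosen variable would be removed, the DEA models re-solved, and the screening repeated until `variable_to_drop` returns `None`.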

INDIAN BANKING SYSTEM
The Indian banking system is one of the stronger and more stable banking systems compared with those of other countries, and the sector plays a major role in the growth of the Indian economy. The Reserve Bank of India (RBI) is the monitoring authority for all banks in India and regulates their business according to the needs of the Indian economy. As a result of globalization, a growing number of private and foreign sector banks have started operating in India [8,14,15]. Banks in India are broadly classified by ownership into three categories: public, private, and foreign sector banks. Given the intense competition in the banking business, gauging the efficiency of a commercial bank is important for investors, policymakers and the general public who want to know whether banks are operating efficiently.
DEA models have been applied by a number of researchers to evaluate the efficiency of banks and bank branches. Most of these studies evaluated efficiency on the assumption that the input and output variables are known a priori [8,14,17]. There is no general agreement on DEA modelling because a large number of input and output variables are available in the banking business [9,15,16]. To identify a parsimonious model, a number of researchers have proposed methods for reducing the input and output variables. The data for this study were collected from the RBI Bulletins for the financial year 2018-19 for all the public and private sector banks operating in India.

INPUT AND OUTPUT VARIABLES
The study adopts the production approach. The input variables are (i) the total number of employees of each bank, (ii) the fixed assets of each bank and (iii) the total expenditure of each bank. The output variables are (i) deposits, (ii) investments, (iii) advances, (iv) interest income and (v) other income.

EMPIRICAL ANALYSIS
The present study was conducted using data on 42 Indian commercial banks, comprising 20 public and 22 private sector banks, for the financial year 2018-19. The correlation matrix of the selected input and output variables is presented in Table 1. The table reveals a high correlation among the input and output variables in DEA.
In iteration 2, the stepwise method started by dropping one variable at a time. Only one variable, 'Employees', appears to be statistically significant, and this variable was fixed for further analysis. Among the insignificant variables, 'advances' has less impact than all the other variables. Therefore, the output variable 'advances' was dropped from the data exploration. The next iteration started with 2 input and 4 output variables. At this stage, the CCR model started with the two fixed input variables, namely the number of employees and total expenditure, and the four output variables were tested for significance. All output variables are statistically significant except 'Investments'. Its average change and variability change are also smaller than those of the other variables. Therefore, the output variable 'Investments' was dropped from the data exploration.

STATISTICAL SIGNIFICANCE
To test whether there is a statistically significant difference between the full and reduced models, Welch's statistic was applied with the null hypothesis $H_0: \mu_1 = \mu_2$, where $\mu_1$ and $\mu_2$ denote the mean efficiency scores of the full and reduced models. The test showed no statistically significant difference (p > 0.10). The final model retains the input variables number of employees and total expenditure and the output variables deposits, interest income and other income. All of these input and output variables are statistically significant at the 10% level of significance.
A 10% level of significance is a reasonable choice here because the input and output variables involved in the data envelopment analysis exploration are highly correlated.
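For reference, Welch's statistic and its approximate (Welch-Satterthwaite) degrees of freedom can be computed from two sets of efficiency scores as in the sketch below. The function name `welch_t` and the sample scores are hypothetical and are not taken from the study. Obtaining the p-value additionally requires the t-distribution, which is left to a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of the two sample means (unequal variances allowed).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical efficiency scores for a full and a reduced model.
full_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
reduced_scores = [0.2, 0.4, 0.6, 0.8, 1.0]
t_stat, dof = welch_t(full_scores, reduced_scores)
print(round(t_stat, 4), round(dof, 4))
```

The null hypothesis of equal means is not rejected at the 10% level when the absolute value of the statistic falls below the corresponding two-sided critical value of t for the computed degrees of freedom.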

SUMMARY AND CONCLUSIONS
DEA efficiency scores are more accurate when only the relevant, significant input and output variables are used, and the changes in the average and variability of the scores help to identify the insignificant variables to remove from the data exploration. The proposed stepwise method was applied to 42 commercial banks with 3 input variables and 5 output variables. During the data exploration, the variables fixed assets, advances and investments turned out to be insignificant and were eliminated. We observed no statistically significant difference between the means of the full and reduced models. Statistically, therefore, both models provide almost the same information on the efficiencies of the banks.