Using a DEA–AutoML Approach to Track SDG Achievements

Abstract: Each country needs to monitor progress on its Sustainable Development Goals (SDGs) to develop strategies that meet the expectations of the United Nations. Data envelopment analysis (DEA) can help identify best practices for SDGs by setting goals to compete against. Automated machine learning (AutoML) simplifies machine learning, so researchers need less time and manpower to predict future situations. This work introduces a method that integrates DEA and AutoML to assess and predict performance in SDGs. Two experiments with different data properties, in terms of interval and correlation, demonstrate the approach. Three prediction targets are set to measure performance in the regression, classification, and multi-target regression algorithms. A back-propagation neural network (BPNN) is used to validate the outputs of the AutoML. The results show that AutoML can outperform BPNN for regression and classification prediction problems. Data with a low standard deviation (SD) result in poor prediction performance for the BPNN but do not have a significant impact on AutoML. Highly correlated data result in higher accuracy but do not significantly affect the R-squared values between the actual and predicted values. This integrative approach can accurately predict the projected outputs, which can be used as national goals to transform an inefficient country into an efficient country.


Introduction
Sustainable Development Goals (SDGs) are a global plan for sustainable development that aims to eradicate poverty while preserving the environment and the quality of life of all living things around the world, without leaving anyone behind. According to the 2019 Sustainable Development Report, no country is currently on track to achieve all 17 goals [1]. Therefore, each country needs to monitor its progress in order to develop and promote strategies that meet the expectations of the SDGs [2]. Benchmarking, by setting goals to compete with, can help identify best practices and then apply them reliably to improve a country's performance towards reaching its SDGs. Data envelopment analysis (DEA) is a benchmarking technique in which the optimal combinations of effort (input) and performance (output) result from a linear optimization problem. DEA uses the data from each decision-making unit (DMU) involved in the efficiency assessment to establish an efficient frontier and then determines the position of each DMU with respect to the frontier. Every country must set future goals in order to achieve its SDGs; however, DEA lacks predictive capability [3][4][5][6]. Some studies have therefore tried to integrate DEA with other methods, such as machine learning, to predict future situations.
Machine learning is a field of artificial intelligence that applies statistical techniques to give computers the ability to learn from data without being explicitly programmed. It is easy to use
1. implementing a stratification DEA to evaluate the efficiency score (ES) and efficiency tier (ET), and to examine the projected output (PO) that will turn an inefficient DMU into an efficient DMU;
2. applying AutoML to predict the three outcomes of the DEA, using classification algorithms to determine the ET and regression algorithms to predict the ES and PO;
3. using a back-propagation neural network (BPNN) to produce the same set of results in order to validate the AutoML outputs by comparing the rates of precision and accuracy along with the number of DMUs within an acceptable percentage error (PRED);
4. employing DEA with the predicted PO from the AutoML and BPNN to evaluate the ES and compare them in order to validate the integrative approach.

Stratification DEA
DEA is an efficiency analysis technique that is widely used in industry to compare the efficiency of organizational units. Charnes et al. [17] developed a basic DEA model in 1978, called the CCR model after the initials of its developers, which assumes constant returns to scale (CRS). Banker et al. [18] introduced the BCC model in 1984, which assumes variable returns to scale (VRS). Although DEA users can consider multiple inputs and outputs, only similar DMUs are considered in a benchmark, and all DMUs must have the same inputs and outputs to ensure that applying DEA produces a meaningful result. These factors are often not directly comparable; therefore, the inputs and outputs are multiplied by meaningful weights. A feature of DEA, compared to other efficiency analysis techniques, is that the weights of the inputs and outputs are determined within the model without user intervention. This work used the BCC output-oriented model to measure efficiency because it can handle DMUs of different sizes.
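For reference, the output-oriented BCC model for a DMU o among n DMUs, with m inputs and s outputs, is commonly written in envelopment form as below. The notation (x for inputs, y for outputs, λ for intensity weights, φ for the output expansion factor) is the conventional one from the DEA literature and is an assumption here, since this text does not state its own formulation.

```latex
\begin{aligned}
\max_{\varphi,\,\lambda}\quad & \varphi \\
\text{s.t.}\quad
 & \sum_{j=1}^{n} \lambda_j x_{ij} \le x_{io}, && i = 1,\dots,m, \\
 & \sum_{j=1}^{n} \lambda_j y_{rj} \ge \varphi\, y_{ro}, && r = 1,\dots,s, \\
 & \sum_{j=1}^{n} \lambda_j = 1, \\
 & \lambda_j \ge 0, && j = 1,\dots,n.
\end{aligned}
```

Under this formulation the efficiency score of DMU o is usually reported as 1/φ*, so a DMU on the frontier has φ* = 1 and a score of 100%. The convexity constraint on the λ weights is what imposes variable returns to scale; dropping it gives the output-oriented CCR model.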
Barr et al. [19] proposed the idea of peeling an onion to layer and classify DMUs using a stratification DEA. Stratification is an arrangement in superimposed layers and classes. It is a process of dividing the general population into homogeneous subgroups before further processing, and of separating data from different sources to analyze patterns. Stratification can bring statistical advantages if the population can be divided into meaningful groups called strata [20,21]. Meaningful means that each stratum is relatively homogeneous with regard to one or more properties that influence the property of ultimate interest, and that the strata differ from one another as clearly as possible. When subpopulations within a general population vary, it may be advantageous to select data within each subpopulation or stratum. The strata must be mutually exclusive, i.e., each element of the population is assigned to exactly one stratum; furthermore, no element of the general population can be left out. The strata are chosen such that they are essentially homogeneous in terms of the characteristics relevant for their selection and differ from one another in terms of the expected values, in order to differentiate these characteristics as much as possible. This stratification technique has been integrated into DEA as a classification tool in many works [22][23][24][25][26][27]. This work therefore used this technique in combination with DEA to improve the analysis of the results.
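The onion-peeling idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in this work: `toy_efficiency` is a hypothetical stand-in that scores single-input, single-output units by their output/input ratio (which matches a CCR score in that special case, not the BCC output-oriented model), and `min_dmus` is a hypothetical stopping threshold.

```python
from typing import Callable, Dict, Tuple

DMU = Tuple[float, float]  # (input, output): toy single-input, single-output unit


def toy_efficiency(units: Dict[str, DMU]) -> Dict[str, float]:
    """Stand-in for a DEA run (assumption): ratio output/input relative to the best ratio."""
    ratios = {name: y / x for name, (x, y) in units.items()}
    best = max(ratios.values())
    return {name: r / best for name, r in ratios.items()}


def stratify(
    units: Dict[str, DMU],
    efficiency: Callable[[Dict[str, DMU]], Dict[str, float]] = toy_efficiency,
    min_dmus: int = 2,
    tol: float = 1e-9,
) -> Dict[str, int]:
    """Peel off efficient units round by round and record each unit's tier."""
    remaining = dict(units)
    tiers: Dict[str, int] = {}
    tier = 1
    # Re-run the efficiency analysis while enough units remain to evaluate.
    while len(remaining) > min_dmus:
        scores = efficiency(remaining)
        efficient = [n for n, s in scores.items() if s >= 1.0 - tol]
        for name in efficient:
            tiers[name] = tier       # efficient in this round
            del remaining[name]      # removed before the next round
        tier += 1
    # Units never found efficient share the tier after the last round.
    for name in remaining:
        tiers[name] = tier
    return tiers


units = {"A": (1, 4), "B": (2, 6), "C": (2, 4), "D": (4, 4), "E": (5, 3)}
tiers = stratify(units)
```

Passing a real DEA solver as the `efficiency` argument would keep the peeling loop unchanged; only the scoring function differs.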

Automated Machine Learning
Machine learning relies on a range of algorithms, each of which can be tuned more precisely through so-called hyperparameters. An important task of a data scientist is to find the right algorithm for the respective problem and to configure it correctly. In fact, the area of responsibility is considerably larger. In a typical machine learning application, users must apply appropriate methods for data preprocessing, feature extraction, and feature selection in order to make datasets usable for machine learning. Following these preprocessing steps, users then need to perform algorithm selection and hyperparameter optimization in order to maximize the predictive performance of the machine learning model. Since many of these steps are time-consuming and require human expertise, AutoML was designed as an artificial intelligence-based solution to meet this challenge. AutoML simplifies and accelerates the machine learning workflow and enables data scientists with limited programming skills to create machine learning systems. As a result, creating machine learning systems is no longer reserved for expert data scientists and programmers. Even with little expertise, a high-quality machine learning model can be trained and deployed using powerful AutoML tools. The field of activity of data scientists will not disappear; rather, their focus will shift to more specialized or sophisticated analysis techniques. However, in a complex machine learning process with difficult problem-solving tasks, not all steps can be automated [12,28,29]. Tasks such as gaining a basic understanding of the problem or complex data engineering often still need to be performed by people.
As with traditional manual machine learning, AutoML can be used in many areas to create, train, optimize, and deploy machine learning models to solve problems. Classification is a typical task for models built with AutoML. For example, AutoML has been used in the health sector [28,30,31], the corporate sector [32], the environmental sector [33,34], the energy sector [35,36], and others [37][38][39]. Many AutoML tools and solutions are available today to help data scientists. While some of them are intended for local use, others are offered through cloud-based platforms. Commercial software solutions such as Google Cloud AutoML and Azure Automated ML, which focus on business applications, provide various AutoML services through a cloud-based platform.
There are also open source tools that use standard machine learning algorithms, such as Auto-WEKA [40], auto-sklearn [41], and TPOT [42], and tools that apply deep learning algorithms, such as Mcfly [43], which focuses on multivariate time series classification, and Auto-Keras [44], which focuses on image classification. Olson et al. [45] introduced the Tree-based Pipeline Optimization Tool (TPOT) to automate pipeline design. TPOT optimizes models using genetic programming, an optimization technique inspired by biological evolution that works by selecting the individuals of a population best suited to a constraint, crossing them over, and adding random variations. The individuals of the population in this case are pipelines represented as trees, and only the best-performing pipelines are retained as the population evolves. During evolution, the population of trees is changed by modifying existing nodes, adding new nodes, or deleting nodes. This tool is one of the most popular open source AutoML systems because it can automatically build machine learning pipelines with satisfactory accuracy [13]. Balaji and Allen [46] compared three open source AutoML solutions and found that auto-sklearn is best for classification but worst for regression, while TPOT is best for regression and second best for classification problems. In contrast to auto-sklearn, TPOT offers white-box operation because the Python code of the best model is generated in the form of a scikit-learn pipeline. TPOT was chosen as the AutoML solution in this work because it can handle both regression and classification problems well and because the structure of the pipeline it generates is described by a general tree-based pipeline.
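The selection, crossover, and mutation loop described above can be illustrated with a deliberately simplified evolutionary search. This sketch is not TPOT and does not use its API: individuals are flat lists of hypothetical step names rather than trees of scikit-learn operators, and `fitness` is a made-up score standing in for cross-validated model performance.

```python
import random

# Hypothetical pipeline step names, for illustration only.
STEPS = ["scale", "pca", "select_k", "poly", "tree", "ridge"]


def fitness(pipeline):
    """Made-up score standing in for cross-validated performance."""
    score = len(set(pipeline))            # reward distinct steps
    if pipeline[-1] in ("tree", "ridge"):
        score += 3                        # reward ending in an estimator
    return score


def crossover(a, b, rng):
    """Join the head of one parent to the tail of the other."""
    cut = rng.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]


def mutate(pipeline, rng):
    """Replace one randomly chosen step with a random step."""
    child = list(pipeline)
    child[rng.randrange(len(child))] = rng.choice(STEPS)
    return child


def evolve(generations=10, pop_size=8, length=3, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice(STEPS) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]    # selection: keep the fittest half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    pop.sort(key=fitness, reverse=True)
    history.append(fitness(pop[0]))
    return pop[0], history


best, history = evolve()
```

Because the fittest half of each generation is carried over unchanged, the best score recorded in `history` never decreases from one generation to the next.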

Integration Between DEA and Machine Learning
Zhou et al. [47] showed that the trend towards combining DEA with other analytical methods, such as a neural network (NN), is increasing significantly and that future research should focus more on such combinations in order to improve the accuracy of the analyses and the explanation of the results. The integration between DEA and NN was proposed in a bank branch performance evaluation study to overcome the drawbacks and promote the benefits of both methods [48]. The approach was used not only for performance evaluation and prediction, but also for optimization purposes. As a frontier method, DEA also calculates the PO, which turns an inefficient DMU into an efficient DMU. An NN predicts efficient frontiers instead of an ES, and then the PO is predicted. The efficient frontier indicates the number of efficient DMUs, which accordingly have an ES of 1 or 100%. We adapted this concept by integrating DEA and AutoML instead of NN to perform these tasks and fill the previously mentioned gap. The previous integrative frameworks can be summarized in six categories: first evaluate the ET using a stratification DEA to preprocess the NN learning datasets and then predict groups using each stratified learning dataset with an NN [56].
This work combined the fourth and fifth frameworks to assess and predict the ES, ET, and PO. DEA was used as a first step to evaluate the ES, in order to understand the current state of the DMUs, and to identify the PO, which sets the goal for improving the performance of an inefficient DMU. Original inputs and outputs were used in this step. The second step followed the concept proposed by Barr et al. [19] and the stratification technique proposed by Seiford and Zhu [20] to determine the ET of each country based on its performance. ET1 was assigned to the countries that achieved 100% efficiency in the first round. These countries were then removed from the analysis and the DEA model was re-executed. This process was repeated and stopped when DEA's rule-of-thumb limit was reached. The rule of thumb ensures that the DEA model can produce a result with good discrimination between efficient and inefficient DMUs: the number of DMUs should be more than twice the total number of input and output variables [57,58]. For example, if the total number of input and output variables is 20, the number of DMUs in the final round of evaluation should be more than 40. The remaining inefficient DMUs are automatically assigned to tier n + 1, where n is the number of the last round. We then employed BPNN and AutoML to predict three targets, namely the ES, ET, and PO, which were the main results of the DEA experiment. Original inputs and outputs were used, along with the ES and ET resulting from DEA, to predict the ES and ET. To predict the PO, which is the efficient frontier projection, only the original inputs and the PO of the DEA experiment were used as the learning dataset. We used the root mean square error (RMSE), PRED, and the coefficient of determination (R-squared) to assess the performance of the regression algorithm, and the precision and accuracy rates to evaluate the performance of the classification algorithm.
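The three regression measures named above can be written down compactly. The following is a minimal sketch under stated assumptions: the PRED threshold of 25% is a common convention chosen here for illustration (this text does not state the acceptable percentage error), PRED is returned as a share of cases rather than a count, and R-squared is computed as the coefficient of determination, 1 - SS_res / SS_tot.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def pred(actual, predicted, threshold=0.25):
    """Share of cases whose relative error is within `threshold` (assumed 25%)."""
    hits = sum(
        1 for a, p in zip(actual, predicted) if abs(a - p) <= threshold * abs(a)
    )
    return hits / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up efficiency scores, for illustration only.
actual = [0.80, 0.90, 1.00, 0.60]
predicted = [0.82, 0.85, 1.00, 0.66]

print(round(rmse(actual, predicted), 4))
print(pred(actual, predicted), pred(actual, predicted, threshold=0.05))
print(round(r_squared(actual, predicted), 4))
```

Multiplying the PRED share by the number of DMUs recovers the count of DMUs within the acceptable error.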
Finally, DEA was employed to assess the ES using the original inputs and the PO predicted by the BPNN and AutoML. Normally, the ES calculated from the projection frontier should be 100%, since the projection frontier is the optimal value for inputs and outputs. In this step, the multi-target regression algorithm was validated. We also applied the distance formula based on the Pythagorean theorem to find the distance between the efficiency points of a DMU and the maximum efficiency, which is 100% in each SDG. This allowed us to compare the performance of all DMUs from an overall perspective.
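The distance calculation can be illustrated as follows. This is a minimal sketch: the function name and the sample scores are hypothetical, and it assumes one efficiency score per SDG expressed in percent.

```python
import math


def distance_to_full_efficiency(scores, target=100.0):
    """Euclidean distance between a DMU's per-SDG scores and the all-100% point."""
    return math.sqrt(sum((target - s) ** 2 for s in scores))


# Made-up efficiency scores (%) for SDG7, SDG8, and SDG9.
fully_efficient = distance_to_full_efficiency([100.0, 100.0, 100.0])
lagging = distance_to_full_efficiency([97.0, 96.0, 88.0])
```

A DMU that is efficient in every SDG has a distance of zero, so smaller distances indicate better overall performance across the goals.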

Data Description of the Sustainability Study in the BRI Region
The Belt and Road Initiative (BRI), originally called One Belt One Road, was presented by the Chinese government in fall 2013 [59]. The entire project now affects more than 60% of the world's population and around 35% of the global economy, and accounts for around half of the global energy consumption and carbon emissions [60]. It is therefore important to focus on the progress of sustainable development in this region. The empirical study in this work addresses the area of prosperity, which consists of SDGs 7-11 [61,62]; however, it focuses only on SDG7, SDG8, and SDG9, which relate to the energy, economic, and infrastructure developments [63] that correlate with the strategy of the BRI [64]. Lim et al. [65] also stated that SDG7, SDG8, and SDG9 are important to support overall growth, especially in economic terms.
The promotion of sustainable growth in the green economy and the creation of decent jobs while respecting human rights and global borders are important for all developing, emerging, and industrialized countries. SDG7 is a guarantee for affordable, reliable, sustainable, and modern access to energy. SDG8 includes sub-goals for economic growth, productivity growth, and decent job creation. It therefore calls for a global improvement in resource efficiency in consumption and production and strives to decouple economic growth from environmental destruction. Investing in scientific and technological research on sustainable infrastructure can stimulate economic growth, create jobs, and promote prosperity. SDG9 is about building resilient infrastructure and promoting sustainable industrialization and innovation. Research on energy efficiency and economic growth, as well as investments in energy infrastructure and clean energy technologies, should be encouraged.
This empirical study includes a performance assessment of the BRI countries in SDG7, SDG8, and SDG9 to find the best practices for improving performance development in the fields of energy, economics, and infrastructure. Based on the data available from 2011 to 2017, 80 countries in the BRI region were included in this sustainability study. The inputs and outputs of each SDG were extracted from the denominators and numerators of the indicators. Lamy and Millet [66] expressed energy efficiency as ratios depending on the measured aspect, and the authors also extracted the inputs and outputs from the denominators and numerators of the ratios. All indicators are defined by the United Nations and Table 1 shows the description and variables extracted from the indicators in SDG7, SDG8, and SDG9. Based on the available data, we were able to measure some indicators in Targets 7.1-7.3, 8.1-8.2, 8.4-8.5, 8.10, 9.1-9.2, 9.4, and 9.c using the variables shown in Table 2. Energy efficiency is one of the main targets of SDG7, which is to double the global improvement in energy efficiency, ensure access to energy and affordability, and increase the share of renewable energies. The energy intensity of Indicator 7.3.1 is used to measure energy inefficiency; however, the aim of Target 7.3 is to improve energy efficiency. Energy intensity is defined as the energy supplied to the economy per unit of value of economic output. A lower ratio indicates that less energy is used to create an output unit. We therefore reversed it to calculate it as GDP per energy consumption for the evaluation of energy efficiency. The unemployment rate of Indicator 8.5.2 and the carbon emissions of Indicator 9.4.1 are undesirable outputs, and there are four methods to deal with undesirable outputs in DEA [67]. We used the multiplicative inverse transform to process these. 
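The two transformations described in this paragraph are simple reciprocals, sketched below. The function names and sample values are hypothetical; the positivity check is an added safeguard, since a reciprocal is undefined at zero.

```python
def energy_efficiency(gdp, energy_supply):
    """GDP per unit of energy: the reciprocal of energy intensity (energy / GDP)."""
    return gdp / energy_supply


def inverse_transform(values):
    """Multiplicative inverse (1 / x) applied to an undesirable output."""
    if any(v <= 0 for v in values):
        raise ValueError("the inverse transform needs strictly positive values")
    return [1.0 / v for v in values]


# Made-up figures: 500 units of energy supplied to produce 100 units of GDP.
intensity = 500.0 / 100.0                    # energy per unit of GDP
efficiency = energy_efficiency(100.0, 500.0)  # GDP per unit of energy

# Made-up unemployment counts: larger values become smaller after the transform.
transformed = inverse_transform([4.0, 8.0])
```

After the transform, a country with fewer unemployed people or lower emissions receives the larger value, so the variable can be treated like a desirable output in the DEA model.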
In summary, there were two inputs and four outputs for SDG7, four inputs and outputs for SDG8, and three inputs and seven outputs for SDG9. The correlation matrix in Table 3, which measured the R-squared, Spearman's rank correlation coefficient (ρ), and statistical significance of all variables, shows that all variables were significantly correlated. In addition, access to electricity, the adult population, and manufacturing employment had the highest mean R-squared, and air freight, renewable energy consumption, and commercial bank branches had the lowest mean R-squared. Access to electricity, domestic material consumption, and manufacturing employment had the highest mean ρ, and the consumption of renewable energy, air freight, and passengers had the lowest mean ρ.
Table 3 note: ⁺⁺ Correlation is significant at the 0.01 level (2-tailed).
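The two correlation measures reported in Table 3 can be illustrated with a small sketch. It assumes that the R-squared in the correlation matrix is the squared Pearson correlation between two variables, which this text does not state explicitly, and it computes Spearman's ρ as the Pearson correlation of average ranks. The sample data are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r2 = pearson(x, y) ** 2
rho = spearman(x, y)
```

For a monotonic but non-linear relationship such as this one, ρ reaches 1 while the squared Pearson correlation stays below 1, which is why the two measures can rank variables differently.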

Data Description of the COVID-19 Pandemic Study
A UN report highlighted the need to learn from the current global health crisis called COVID-19 and to implement the SDGs more consistently and faster than before [68]. COVID-19 is currently the subject of intense research; however, there is limited literature using DEA in studies. Shirouyehzad et al. [69] evaluated the effectiveness of COVID-19 management of the outbreak response using DEA. The authors used population and International Health Regulations Core Capacity Scores as inputs and confirmed cases as undesirable outputs to measure the effectiveness of contagion control, and used confirmed cases, recovered cases, and deaths as the inputs, outputs, and undesirable outputs, respectively, to evaluate the effectiveness of the medical treatment. Ghasemi et al. [70] applied DEA to measure government performance during the COVID-19 outbreak. In the first step, the authors rated the effectiveness of preventing the spread of the coronavirus using population and population density as inputs and confirmed cases as the undesirable output. In this study, an index called the stringency index was used to analyze the prevention of spread. The authors rated the effectiveness of preventing coronavirus deaths in the second step, using the first step score, population, and population aged 65 and over as the inputs, and deaths as the undesirable output. The stringency index is one of the policy indices in the Oxford COVID-19 Government Response Tracker (OxCGRT) analysis, which shows how the infection situation is developing in different countries in relation to mitigation and containment measures [71]. In total, the researchers examined 17 indicators and aggregated the indicators under four parameters, including the overall government response index, the stringency index, the containment and health index, and the economic support index.
Good health is the goal and result of SDG3, which comprises a total of 28 indicators. In addition, nine indicators measure mortality rates and five indicators from Target 3.3 measure new and active cases caused by diseases. Half of the world's population does not yet have access to basic health services [1]; therefore, government support in the health sector is included in SDG3. Since this empirical study is an assessment of management performance during the COVID-19 pandemic, we adjusted all variables to focus only on the COVID-19 pandemic. In summary, there were three inputs: active cases, which represented the baseline weekly workload; the containment and health index, which represented the baseline severity of the lockdown restrictions and closings in each country; and the newly confirmed cases. There was one desirable output, active cases in the next three weeks, and two undesirable outputs, the newly recovered cases and new deaths in the next three weeks. The results of this part illustrate the effectiveness of the medical treatment for COVID-19.
SDG11 also targets safe and sustainable communities. Target 11.5 points to the number of people affected by disasters; additionally, Target 11.b focuses on policies and plans to manage and reduce disaster risk. These targets highlight issues related to the public and potential disasters. COVID-19 is a new infectious disease of the 21st century that affects a wide range of people. To control the spread of COVID-19, many governments are applying social distancing and local lockdown restrictions. These restrictions affect people's behavior in activities and travel [72,73]. Hadjidemetriou et al. [74] stressed that the reduction in human mobility had a significant impact on the confirmed COVID-19 deaths. Badr et al. [75] also found that the decline in mobility patterns correlated strongly with the slowdown in the growth rates of COVID-19 cases. Zhou et al. [76] mentioned that the extent of the impairment of mobility could help delay the peak number of transmission cases. Managing community mobility is therefore a key strategy for governments to deal with COVID-19. The Google COVID-19 Community Mobility Report shows the change in the mobility of people to categorized locations compared to baseline days. In this study, three places were categorized, namely, workplaces, residential places, and public places. Daily visits to public places combined visits to four categorized locations, namely, retail and recreation, grocery and pharmacy, transit stations, and parks. Besides the Google Mobility Report, Google Trends has also proven informative during COVID-19. Pier et al. [77] evaluated Google searches for otolaryngology-related terms in the United States and found a significant change during COVID-19. Kurian et al. [78] also measured the correlation between COVID-19 cases and Google Trends in the United States and concluded that they were highly correlated. In addition, Google Trends could be used to monitor a new outbreak area.
Husnayain et al. [79] analyzed Google search terms related to COVID-19, hand-washing, and face masks in Taiwan and indicated that Google Trends could be used to monitor public nervousness about COVID-19. Effenberger et al. [80] reported that the public interest represented by Google Trends correlated with the number of new COVID-19 cases. In summary, there were four inputs, namely, days from the first confirmed case, new tests, trending inquiries about "covid", which reflected the awareness of people in each country, and the OxCGRT's overall government response index, which served as the basis for the severity of government policies in each country. There were three desirable outputs reflecting the results of the restrictions (daily visits to workplaces, residential places, and public places) and one undesirable output, newly confirmed cases, representing the unwanted outcome of all government responses. This part shows a performance evaluation of the preparation for coping with the current infectious disease COVID-19.
The incubation period of COVID-19, which is the time between infection and the appearance of the first symptoms, averages five to six days; however, it can take up to two weeks for a symptom to appear [81]. The newly confirmed case dataset was therefore created with a delay of one week after the week evaluated. After the incubation period, fever, muscle pain, and a dry cough may occur. The disease often manifests itself in a general, severe feeling of illness and back pain [82]. The WHO-China Joint Mission on COVID-19 reported that the median time from first symptoms to recovery in mild cases is approximately two weeks [83]. This information was used to create a lag time for the newly recovered case dataset; therefore, the newly recovered case dataset was two weeks after the newly confirmed case dataset. Wang et al. [84] developed an algorithm to predict the days from hospitalization to death, and the result was 13 days. This result was used to create a delay time for the new death dataset; therefore, the new death dataset was also two weeks after the newly confirmed case dataset. In summary, if the week evaluated was week t, the newly confirmed case dataset was collected from week t + 1, and the newly recovered case and new death datasets were collected from week t + 3.
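The lag structure summarized above can be expressed as a small helper that pairs each evaluated week t with observations from weeks t + 1 and t + 3. This is a minimal sketch: the record layout, the field names, and the numbers are hypothetical and only illustrate the alignment.

```python
from typing import Dict, List

Week = Dict[str, int]  # hypothetical weekly record for one country


def build_lagged_rows(weekly: Dict[int, Week], eval_weeks: List[int]) -> List[dict]:
    """Align each evaluated week t with its lagged outcome weeks."""
    rows = []
    for t in eval_weeks:
        if t + 1 not in weekly or t + 3 not in weekly:
            continue  # skip weeks whose lagged observations are unavailable
        rows.append({
            "week": t,
            "new_confirmed": weekly[t + 1]["confirmed"],  # one-week lag
            "new_recovered": weekly[t + 3]["recovered"],  # three-week lag
            "new_deaths": weekly[t + 3]["deaths"],        # three-week lag
        })
    return rows


# Made-up weekly counts for weeks 16 to 24, for illustration only.
weekly = {
    w: {"confirmed": 10 * w, "recovered": 5 * w, "deaths": w}
    for w in range(16, 25)
}
rows = build_lagged_rows(weekly, list(range(16, 22)))
```

With evaluated weeks 16 to 21, the outcome data must extend to week 24, which is why the sample dictionary covers nine weeks.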
Based on data availability, 61 countries were assessed from 12 April to 23 May, or from Weeks 16 to 21, 2020, using all of the variables listed in Table 4. Only Uganda was death-free during the study period. Although Taiwan was the fourth country to experience COVID-19 in this study, it had the lowest active cases at Week 24. Taiwan also had the lowest cumulative number of recovered cases, newly confirmed cases, and deaths during the investigation period when the country without deaths was excluded. Belarus had the lowest scores on the overall government response index and the containment and health index, followed by Taiwan. The United States had the highest active cases at Week 24. It also had the highest cumulative number of recovered cases, newly confirmed cases, and deaths during the period studied. In contrast, Costa Rica had the lowest cumulative COVID-19 tests in this study. Although El Salvador had the highest score on the overall government response index, it had the lowest trending inquiries about "covid", which was a factor that reflected people's interest. Mobility to public places changed the most in Bolivia, while Panama showed the largest change in mobility to workplaces and ranked second after Singapore in mobility to residential places across all weeks. The correlation matrix in Table 5 shows that the variables in the COVID-19 pandemic study were less correlated than those in the BRI sustainability study. Only the newly confirmed case variable correlated significantly in rank with all variables. In addition, the newly confirmed cases, as well as active cases in week t + 3 and week t, had the highest mean in both R-squared and ρ. Trending inquiries about "covid", days from the first confirmed case, and daily visits to workplaces had the lowest mean value in both R-squared and ρ.
Table 5 note: ⁺⁺ Correlation is significant at the 0.01 level (2-tailed); ⁺ Correlation is significant at the 0.05 level (2-tailed); ⁻ Correlation is not significant.

The BRI Sustainability Study
This section presents a study that compares and assesses sustainability progress in relation to SDG7, SDG8, and SDG9 in order to support policymakers in developing strategies to achieve the targets. The SDG7 performance assessment considered two inputs, namely energy consumption and the national population, and four outputs, namely access to clean fuels, access to electricity, renewable energy consumption, and GDP. For SDG8, we defined total workforce, adult population, and total employment as the inputs, GDP and domestic material consumption as the desirable outputs, and the number of unemployed as the undesirable output. National population, total employment, and gross value added were inputs; the number of air freight and passengers, employment, value added in manufacturing, internet users, and mobile phone subscriptions were the desirable outputs; and carbon emissions were an undesirable output in the performance assessment of SDG9. Annual national population was the variable that served as input for all SDGs, as people are the key factor for sustainability. Employment also played an important role in SDG8 and SDG9, showing that employment is essential for economic development. SDG7 and SDG8 also shared an important output, GDP, as these SDGs mainly focus on the area of prosperity.

Result of the DEA Experiment
We measured the effectiveness of SDG7, SDG8, and SDG9 from 2011 to 2017 in 80 BRI countries, and the results are shown in Table A1 in Appendix A. Seven countries performed efficiently in all periods studied: Austria, China, Italy, Luxembourg, Malta, Russia, and Saudi Arabia. These countries were among the 30 countries with the highest GDP per capita in 2017. Six of them were also among the 30 countries with the highest value added per employment in manufacturing and the highest energy consumption per capita. These results are consistent with the selection of inputs and outputs, as the national population, employment, and GDP were the main drivers for SDG7, SDG8, and SDG9.
In the first step of the DEA experiment, we measured the ES of the first evaluation round. Compared across years, the mean ES was relatively stable for all the SDGs, as shown in Table 6a. SDG8 had the lowest mean and the lowest minimum, but consistently the highest standard deviation (SD) in ES: the mean ES was around 55% and the minimum ES was less than 15% each year. SDG8 mainly focuses on economic growth and labor productivity to alleviate poverty; therefore, progress in economic development across the BRI region appears to be limited and highly uneven. For instance, according to the United Nations in 2017, Luxembourg was among the three countries with the highest GDP per capita in the world, while Mozambique was among the five with the lowest. SDG7 had the highest mean and the highest minimum, but the lowest SD in ES: the mean ES was 97% and the minimum ES was around 50%. This means that most of the countries performed well at providing people with clean and affordable energy and reducing their energy intensity. Although the mean ES in SDG9 was lower than in SDG7, it was still relatively high and satisfactory, with a mean ES of 94% and a minimum ES of 36%. This shows that the BRI region considered sustainable industrialization, innovation, and infrastructure equally. In summary, the mean ES and the minimum ES in SDG7 and SDG8 showed increasing trends while the SD showed a decreasing trend, meaning that the performance of countries in the BRI improved and converged. Although the mean ES in SDG9 showed an uptrend, the minimum showed a downtrend and the SD showed an unfavorable uptrend, indicating that inequality in SDG9 was increasing.
Table 6. DEA experiment summary in the sustainability study of the BRI: (a) yearly efficiency score; (b) number of decision-making units (DMUs) in each efficient tier.
We analyzed the sustainability performance of each country and, in a second step, grouped the countries based on their performance. In SDG7, SDG8, and SDG9, the total numbers of input and output variables were six, eight, and ten, so the number of DMUs to analyze in the final round should be greater than 12, 16, and 20, respectively, according to the DEA rule of thumb. As shown in Table 6b, all periods studied produced the same number of ETs. The results of this ET assessment were comparable to those of the ES assessment because they shared the same characteristics. The number of DMUs in each ET was similar from year to year. SDG8 had the largest number of ETs, with six tiers; furthermore, the number of DMUs in each ET from ET 1-5 was quite comparable. Only ET5 contained half of the units compared to the other tiers. The trends in ET1, ET3, and ET5 increased, but ET4 and ET6 had downtrends while ET2 was stable. This matches the ES results, in which SDG8 had the highest variance in ES and the ES trended upward. SDG7 and SDG9 had the same number of ETs, namely three tiers. ET1 in SDG7 made up almost 70% of the total DMUs, ET2 had about a quarter, and ET3 covered less than 10%. Although the trends in ET1 and ET3 increased, ET2 showed a downward trend. The large number of DMUs in ET1 made the mean ES high, and the narrow distribution in ET2 and ET3 resulted in a low SD in SDG7. Around 60% of the DMUs in SDG9 were in ET1, less than 40% in ET2, and almost 5% in ET3. The proportions of DMUs in ET1 and ET2 were not as different as in SDG7, and the proportion in ET3 was very low. This clearly showed that there were two classified groups in SDG9. While ET1 was in an uptrend, ET2 and ET3 were in downtrends. The number of DMUs in ET1 of SDG9 was lower than in SDG7, so the mean ES of SDG9 was also lower than that of SDG7. The higher number of DMUs in ET2 and ET3 of SDG9 resulted in a higher SD than that of SDG7.
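The tier grouping described above can be sketched as a peeling loop: score the remaining DMUs, assign the efficient ones to the current tier, remove them, and repeat. The sketch below is an illustration under stated assumptions. It applies the rule of thumb as a stopping condition (no further DEA round once the remaining DMUs no longer exceed twice the number of variables, with the leftovers forming the last tier), and it uses a single-input, single-output ratio score as a stand-in for the full DEA model. The names `min_dmus`, `ratio_scores`, and `stratify`, and the toy units, are hypothetical.

```python
def min_dmus(n_inputs, n_outputs):
    """Rule-of-thumb threshold: a DEA round needs more than 2 * (inputs + outputs) DMUs."""
    return 2 * (n_inputs + n_outputs)


def ratio_scores(units):
    """Stand-in scorer for one input and one output: output/input ratio
    scaled by the best ratio among the units being evaluated."""
    best = max(y / x for x, y in units.values())
    return {name: (y / x) / best for name, (x, y) in units.items()}


def stratify(units, score_fn, threshold, tol=1e-9):
    """Assign each unit to an efficient tier by repeatedly peeling the frontier."""
    remaining = dict(units)
    tiers = {}
    tier = 1
    while remaining:
        if len(remaining) <= threshold:
            # Too few units for another round: the leftovers form the last tier.
            for name in remaining:
                tiers[name] = tier
            break
        scores = score_fn(remaining)
        for name, score in scores.items():
            if score >= 1 - tol:
                tiers[name] = tier
                del remaining[name]
        tier += 1
    return tiers


# Toy units (made up): name -> (input, output).
toy_units = {
    "a": (1, 1), "b": (2, 2), "c": (5, 4), "d": (2, 1),
    "e": (4, 2), "f": (4, 1), "g": (5, 1),
}
print(stratify(toy_units, ratio_scores, min_dmus(1, 1)))
```

Because `stratify` only needs a scoring function, the ratio scorer can be swapped for a full DEA solver without changing the loop.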

Result of the AutoML Experiment
This section shows an experiment predicting the ES, ET, and PO in 2017. We used the collected data and the DEA results from 2011 to 2016 to create the learning dataset for training and testing in the AutoML approach. We also employed a BPNN to predict those targets using the same learning dataset and compared its performance against AutoML. Clementine software was used to run the traditional BPNN because it is widely recognized and used in many academic studies [85][86][87][88][89]. The ES and ET in 2017 were not very different from previous years, so the DEA results could serve as a suitable learning dataset for BPNN and AutoML to obtain accurate results. Table 7a shows the comparison of the regression evaluation in the ES prediction. The mean ES values predicted by BPNN and AutoML were quite similar; however, the SD of the AutoML results was higher than that of the BPNN. The maximum ES is normally 100%, but the maximum ES values of the BPNN in SDG7 and SDG9 were lower. These two SDGs characteristically had a low SD in ES, which could affect the algorithms. The minimum ES of BPNN in SDG7 and SDG9 clearly differed from the actual values, but both BPNN and AutoML were able to predict close to the actual minimum ES in SDG8. The RMSE was measured to compare the accuracy of the BPNN and AutoML, and AutoML outperformed BPNN in all predictions. The RMSE of BPNN was the lowest in SDG7 and lower in SDG9 than in SDG8. Although AutoML's RMSE was also lowest in SDG7, it was higher in SDG9 than in SDG8. Beyond RMSE, the numbers of DMUs with a percentage error less than 20% (PRED(0.20)), 10% (PRED(0.10)), and 5% (PRED(0.05)) were also higher for AutoML than for BPNN in all SDGs. The PRED values in SDG7 and SDG9 were quite similar for both methods, with the exception of PRED(0.05) in SDG9. The PRED and R-squared values of BPNN in SDG8, which had the highest SD in ES, were much lower than those of AutoML.
AutoML's predicted ES correlated better with the actual score than BPNN's because all of the AutoML's R-squared values were much higher than the BPNN's in all SDGs. We can say that AutoML predicted the ES better than the popular BPNN technique in this BRI sustainability study.
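The regression metrics used in this comparison can be written out in a few lines. The sketch below is a plain-Python illustration with some assumptions: PRED(level) is returned as the share of DMUs whose relative error is strictly below the level (multiply by the number of DMUs for a count), and R-squared is taken as the squared Pearson correlation between actual and predicted values. The function names and sample numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pred(actual, predicted, level):
    """Share of cases whose relative error |a - p| / |a| is below `level`.
    Assumes no actual value is zero."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) / abs(a) < level)
    return hits / len(actual)


def r_squared(actual, predicted):
    """Squared Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)


# Made-up efficiency scores for four DMUs.
actual_es = [1.0, 0.8, 0.5, 0.4]
predicted_es = [0.97, 0.8, 0.58, 0.3]
print(rmse(actual_es, predicted_es))
print([pred(actual_es, predicted_es, k) for k in (0.20, 0.10, 0.05)])
print(r_squared(actual_es, predicted_es))
```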
We then predicted ET using the inputs, outputs, and ET of the DEA experiment as the learning dataset. The precision and accuracy of the outputs predicted by the AutoML classification algorithms were also better than those of the BPNN, as shown in Table 7b. BPNN classified all DMUs in SDG7, which had a very low SD, as ET1, whereas AutoML was able to identify all ETs and classified the DMUs based on their performance. Although the BPNN achieved an accuracy greater than 50% in SDG7, AutoML achieved an accuracy greater than 90%. SDG8 contained six tiers, and AutoML again recognized them all and achieved an accuracy of over 85%. In contrast, BPNN could identify only four tiers, resulting in an accuracy of around 30%. BPNN classified more than half of the DMUs as ET3, but only 13 of those DMUs matched. In SDG9, BPNN could detect only ET1 and ET2, and only 59% of the DMUs were classified correctly. Although SDG9 had a higher SD than SDG7 in ES, AutoML still performed well, with an accuracy of 85%. These results show that AutoML could also beat BPNN in this classification problem.
Table 7. Performance comparison of prediction methods in the sustainability study of the BRI: (a) regression evaluation metrics in the efficiency score prediction; (b) precision and accuracy in the efficient tier classification; (c) regression evaluation metrics in the frontier projection prediction.
We predicted the PO using BPNN and AutoML to study the efficient frontier projection that turns an inefficient DMU into an efficient DMU. Only the original inputs and PO of the DEA experiment were used as the learning dataset. Table 7c shows the evaluation metrics of both methods for comparing the prediction performance. All of AutoML's RMSE and PRED values were better than the BPNN's, with the exception of the manufacturing value added in SDG9. These results were consistent with the previous finding that AutoML's regression algorithm could outperform the BPNN.
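For the tier classification comparison, accuracy and per-tier precision can be computed as in the sketch below. It is a plain-Python illustration with hypothetical names and made-up labels; the toy case mimics a classifier that assigns every DMU to ET1, where precision for tiers that are never predicted is undefined and reported here as `None`.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Share of DMUs whose predicted tier equals the actual tier."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def precision_by_tier(actual, predicted):
    """For each tier: correct predictions of that tier / all predictions of that tier."""
    predicted_counts = Counter(predicted)
    correct_counts = Counter(p for a, p in zip(actual, predicted) if a == p)
    tiers = sorted(set(actual) | set(predicted))
    return {
        t: correct_counts[t] / predicted_counts[t] if predicted_counts[t] else None
        for t in tiers
    }


# Made-up tiers for six DMUs; the classifier collapses everything into tier 1.
actual_tiers = [1, 1, 1, 2, 2, 3]
collapsed = [1, 1, 1, 1, 1, 1]
print(accuracy(actual_tiers, collapsed))
print(precision_by_tier(actual_tiers, collapsed))
```

A collapsed classifier can still reach a moderate accuracy when one tier dominates, which is why the number of tiers actually detected is reported alongside accuracy.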
Next, we used these predicted POs from the BPNN and AutoML along with the original inputs to measure performance in 2017 using the DEA application. This measurement compares performance in multi-target prediction, since a machine learning model typically supports single-target prediction and ignores the relationships between target variables. This study aims to optimize all targets together rather than a single target, as all variables are related to each other. However, the selected TPOT library lacked a multiple-output regression feature, so AutoML predicted each PO as a single target and ignored the relationships with the other outputs when fitting the predictive model. Figure 2 compares the ES and the distance to the efficient frontier of all DMUs between BPNN and AutoML. In this experiment, the BPNN's ES (diamond markers) was mostly closer to 100% than that of AutoML (line). The mean ES across all SDGs was 0.998 for BPNN and 0.988 for AutoML. In addition, the mean minimum ES was 0.991 for BPNN and 0.872 for AutoML. Although the BPNN could generate more efficient DMUs in each SDG, AutoML could generate more DMUs that were efficient in all SDGs at once. While all DMUs of SDG8 in BPNN achieved an ES of 100%, only 25 DMUs reached the efficient frontier with zero distance in all SDGs. There were 32 DMUs from AutoML that reached zero distance from the efficient frontier; however, the mean distance in AutoML was still higher than that in the BPNN. Because the original TPOT library was unable to fit regressions with multiple outputs, AutoML may produce more accurate individual predictions but a poorer mean ES and distance.
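The single-target workaround described here amounts to fitting one independent model per projected output. The sketch below illustrates that pattern in plain Python; `OneFeatureOLS` is a deliberately tiny stand-in for whatever single-target model is used, and both class names and the data are hypothetical. scikit-learn's `MultiOutputRegressor` wraps an estimator in the same one-model-per-target way.

```python
class OneFeatureOLS:
    """Least-squares line with one feature; a stand-in for any single-target model."""

    def fit(self, x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, x):
        return [self.intercept + self.slope * xi for xi in x]


class PerTargetRegressor:
    """Fits one independent model per output column."""

    def __init__(self, make_model):
        self.make_model = make_model
        self.models = []

    def fit(self, x, Y):
        # Y is a list of rows; zip(*Y) yields one column per target.
        self.models = [self.make_model().fit(x, list(col)) for col in zip(*Y)]
        return self

    def predict(self, x):
        per_target = [model.predict(x) for model in self.models]
        return [list(row) for row in zip(*per_target)]


# Made-up data: one input feature, two projected outputs.
x_train = [1, 2, 3, 4]
Y_train = [[2, 10], [4, 20], [6, 30], [8, 40]]
model = PerTargetRegressor(OneFeatureOLS).fit(x_train, Y_train)
print(model.predict([5]))
```

Each column is fitted in isolation, so nothing ties the predicted outputs of one DMU together. That is the limitation that re-evaluating the predictions with DEA is meant to expose.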

The COVID-19 Pandemic Study
This section provides a study on the integrative approach to addressing the COVID-19 pandemic in the context of the SDGs to understand the situation and develop guidelines for management and risk mitigation. To assess the effectiveness of medical treatment in SDG3, three inputs, one desirable output, and two undesirable outputs were applied to the DEA. Performance in preparing to deal with COVID-19, which was defined as a segment in SDG11, was assessed based on four inputs, three desirable outputs, and one undesirable output. The newly confirmed case variable was used in both SDGs, and only this variable correlated significantly in rank with all variables. Some variables were measured at weeks t + 1 and t + 3, when the week evaluated was week t, to account for incubation, convalescence, and the time from hospitalization to death. Table A2 in Appendix A shows the results of the DEA in this study. The results are the ES in medical treatment associated with SDG3 in Target 3.3 and in preparedness for COVID-19 related to SDG11 in Targets 11.5 and 11.b, as well as the distance between the assessed DMU and the DMU with the best performance. Based on the data available, three countries (Belarus, Taiwan, and Uganda) performed best overall in the fight against COVID-19 over all periods studied. Belarus, Taiwan, Uganda, and the United States showed the best effectiveness of medical treatments, while Bolivia, Belarus, Costa Rica, Latvia, Morocco, Panama, El Salvador, Taiwan, Uganda, and Uruguay were best prepared for COVID-19. Taiwan had the highest ratio of newly recovered cases to newly confirmed cases during the period studied. It also had the lowest ratio of newly confirmed cases to new COVID-19 tests, which reflects prevention, and it reduced the number of active cases well. Uganda, the only country with no fatalities, also performed well in terms of the ratio of newly confirmed cases to new COVID-19 tests.
Belarus did well on the ratios of deaths to active cases and deaths to newly confirmed cases. Morocco and Uruguay did well in reducing the number of active cases due to their good prevention practices. Table 8a shows that the mean ES in both SDG3 and SDG11 varied from week to week. SDG3 produced a smaller mean and a smaller minimum, but a higher SD in ES. SDG3 in this study focuses only on the medical management of COVID-19 to measure recovery, deaths, and remaining active cases. The DMUs ranged from the country with the highest confirmed and recovered cases and deaths in the world to countries with no deaths; therefore, the medical treatment situation was different in each country. The results in SDG3 show that many DMUs could handle this situation well, but more DMUs deteriorated at Week 20, which had the lowest mean ES of the period. The maximum of the minimum ES in SDG3 was below 15%, which was quite low compared to SDG11, at almost 40%. SDG11 aims to assess the effectiveness of the preparations that most countries had made regarding social distancing and local lockdown restrictions; therefore, the SD was lower than that of SDG3. The mean ES for all periods was nearly 80%, except for Week 20, when it was less than 75%. This means that most countries were fairly well prepared for COVID-19. In summary, the mean ES and SD of SDG3 were fairly stable over the periods studied, while the minimum ES showed a small downward trend. This means that medical treatment performance did not change significantly over the period studied and the worst countries did not improve. The mean ES of SDG11 showed a slight downward trend, the minimum ES showed an upward trend, and the SD showed a downward trend. This indicates that more countries were poorly prepared for COVID-19, but the worst countries tried to improve and narrow the gap with more efficient countries.
Based on the DEA rule of thumb, with six and eight total input and output variables in SDG3 and SDG11, the final round of DEA should have more than 12 and 16 DMUs for analysis, respectively. We used the stratification DEA to classify the ETs, and the results varied over time, as shown in Table 8b. In addition, they were largely consistent with the results of the ES assessments. SDG3 had more ETs than SDG11 because the variance in SDG3 was higher than in SDG11. ET1 had the largest number of DMUs in Weeks 18, 19, and 21, while ET2 had the largest number in Weeks 16 and 17. Week 20 was completely different from the others, as all ETs had similar numbers of DMUs; moreover, this week had the lowest mean ES. This lowest ES could come from the largest number of DMUs in ET5. Week 21 was also special, as ET1 was more than twice the size of the other ETs, resulting in the highest mean ES. In addition, ET1, ET4, and ET5 showed upward trends, while ET2 and ET3 showed downward trends. This indicates that most countries performed well in terms of medical treatment, but the gap between countries with high and low efficiency had widened. There were four ETs in SDG11; additionally, the number of DMUs in ET4 was the lowest each week. Results varied slightly from Weeks 16 to 20, with ET2 being the largest and ET3 larger than ET1. In contrast, the numbers of DMUs in ET1, ET2, and ET3 at Week 21 were comparable. The trends in ET1 and ET3 were positive, while ET2 and ET4 were declining. This shows that most countries performed well in preparing for COVID-19 and that performance improved.
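The week offsets used in this study (some variables taken at week t + 1 or t + 3 for an evaluation at week t) can be handled with a small alignment step before running the DEA. The sketch below is a plain-Python illustration; the helper name `align_lagged`, the choice of active cases as the lagged variable, and all numbers are hypothetical.

```python
def align_lagged(series, eval_weeks, lag):
    """Map each evaluated week t to the value observed at week t + lag.
    Weeks without an observation at t + lag are skipped."""
    return {t: series[t + lag] for t in eval_weeks if (t + lag) in series}


# Made-up weekly active-case counts for one country, Weeks 16 to 24.
active_cases = {week: 100 + week for week in range(16, 25)}

eval_weeks = range(16, 22)  # Weeks 16 to 21 are the evaluated weeks.
active_t3 = align_lagged(active_cases, eval_weeks, lag=3)

# One record per evaluated week, pairing the week-t value with the week t + 3 value.
records = [
    {"week": t, "active_t": active_cases[t], "active_t_plus_3": active_t3[t]}
    for t in eval_weeks
]
print(records[-1])
```

With a lag of three weeks, the last evaluated week (Week 21) draws on data from Week 24, so the data collection window has to extend beyond the evaluation window.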

Result of the AutoML Experiment
This part shows an experiment to predict the ES, ET, and PO at Week 21 using the collected data and DEA results at Weeks 16 to 20 as the learning dataset. When comparing the regression scores in ES prediction between the BPNN and AutoML, the final conclusion was the same as in the BRI sustainability study, as shown in Table 9a. The mean ES predicted by the BPNN was very different from that of AutoML in SDG3, but was quite similar in SDG11. In contrast, the predicted ES of both methods in the BRI study was quite similar. Only the maximum ES predicted by the BPNN in SDG3 was 100%, but AutoML in the BRI study was able to achieve 100% in all SDGs. The minimum ES predicted by BPNN in SDG3 was completely different from the actual values, but the minimum ES predicted by BPNN and AutoML in SDG11 was close to the actual values. Not only was the RMSE of AutoML in SDG3 and SDG11 lower than that of BPNN, but all of AutoML's PREDs were also better than those of the BPNN, as in the BRI study. Nonetheless, AutoML's PREDs were lower in this study than in the BRI study. PREDs and R-squared in SDG11 were better than in SDG3, which could be due to the lower SD. The ES predicted by AutoML correlated better with the actual score than that predicted by the BPNN, and the R-squared of AutoML in this study was similar to that in the BRI study. In summary, AutoML performed better than BPNN in this COVID-19 prediction experiment. The ET of each SDG was predicted by the BPNN and AutoML using the inputs, outputs, and ET of the DEA experiment as a learning dataset to compare classification performance, as shown in Table 9b. Both the BPNN and AutoML approaches were able to capture all ETs; however, AutoML was more accurate than BPNN. Therefore, AutoML could also perform better than BPNN on this COVID-19 classification problem.
We then predicted the PO of SDG3 and SDG11 using only the original inputs and PO from the DEA experiment as the learning dataset. The evaluation metrics in Table 9c show the comparison of the BPNN and AutoML approaches. As with the results in the BRI study, all of AutoML's RMSE, PRED, and R-squared scores were better than the BPNN's, with the exception of PRED(0.05) on daily visits to workplaces and R-squared on daily visits to public places in SDG11. Finally, we used this predicted PO along with the original inputs to measure Week 21's performance using the DEA application and to compare performance on regression with multiple outputs. Figure 3 illustrates the comparison of the ESs and the distance to the efficient frontier of all DMUs between BPNN and AutoML. The ESs assessed by the BPNN were closer to 100% than those of AutoML in all SDGs. The mean ES of the BPNN was 0.973 in SDG3 and 0.983 in SDG11, while the mean ES of AutoML was 0.816 in SDG3 and 0.980 in SDG11. The minimum ES of BPNN was 0.899 in SDG3 and 0.912 in SDG11, while the minimum ES of AutoML was 0.272 in SDG3 and 0.778 in SDG11. These values were lower than the results in the BRI study. There were 31 efficient DMUs in SDG3 and 22 efficient DMUs in SDG11 from BPNN. In contrast, AutoML produced 19 efficient DMUs in SDG3 and 32 in SDG11. There were 12 DMUs from BPNN and 11 DMUs from AutoML that reached zero distance from the efficient frontier; however, the mean distance in BPNN was considerably higher than that in AutoML. As previously described, the data in the BRI study were better correlated than the data in this COVID-19 study; therefore, the results of the BRI study showed a higher ES and a higher ratio of efficient DMUs to total DMUs. Due to the limitation of TPOT, BPNN performed slightly better than AutoML in this regression problem with multiple outputs. Moreover, the integrative approach can be used to measure current performance in preparing for and dealing with COVID-19. Each country can then understand its own capabilities and choose to follow best practices to fight the COVID-19 pandemic. The projected results can be used as targets for action.
Table 9. Performance comparison of prediction methods in the COVID-19 pandemic study: (a) regression evaluation metrics in the efficiency score prediction; (b) precision and accuracy in the efficient tier classification; (c) regression evaluation metrics in the frontier projection prediction.

Conclusions
Since the DEA does not have predictive capabilities, it is integrated with an AutoML application to predict the future performance and projected outputs, thus developing such a predictive ability. This article introduces the integrative method to assess and predict the performance of a country's SDGs. There are two empirical studies with different data properties to demonstrate the integrative approach. Three prediction targets were used to measure the performance of AutoML in the regression, classification, and multi-target regression algorithms. A BPNN was used to validate the AutoML outputs. There were three main outcomes, namely, a performance measurement analysis, performance evaluation of the predictive methods, and a goal analysis.
Most countries in the BRI region performed reasonably well on SDG7, but progress on SDG8 was quite limited. The BRI region considered the components of SDG9 equally, but inequality in SDG9 increased as the countries were clearly divided into two groups. Most countries performed well in treating and preparing for COVID-19; additionally, performance did not change significantly over the period studied.
In particular, policymakers should improve access to clean fuels and electricity to improve performance in SDG7. Domestic material consumption and employment in manufacturing are the key outputs leading to high performance in SDG8 and SDG9. To improve the performance of medical treatment for COVID-19 in SDG3, recovered cases and deaths are the primary goals that everyone should focus on. Daily visits to workplaces and residential areas are two main criteria by which the performance of preparing to deal with COVID-19 in SDG11 should be measured. These factors should be used as national goals in support of the achievement of the SDGs.
A low SD results in poor regression and classification prediction performance for the BPNN, but does not significantly affect AutoML. AutoML can outperform a BPNN on regression and classification prediction problems; however, a BPNN can handle the problem of multi-target optimization better than AutoML due to the limitation of TPOT in the multi-output regression. Long-term data with a high correlation can provide a more accurate and precise prediction result than short-term data with a lower correlation; however, the R-squared values of both data properties were quite similar.
The integrative approach can accurately predict the projected outputs that can be used as national goals to transform an inefficient country into an efficient country to meet the expectations in the SDGs.
The use of this method to reduce the population size and variance in the population analysis can be expanded in future work.