Assessment of the Development of the European OECD Countries with the Application of Linear Ordering and Ensemble Clustering of Symbolic Data

Abstract The research background of the paper is the development of a country, which can be measured in various ways. Simple indicators such as GDP, as well as complex indicators such as the HDI (Human Development Index), can be used to measure country development. However, countries are usually divided into groups by setting arbitrary threshold levels of the final measure. Moreover, composite (complex) indices have known problems and errors. The main purpose of the paper is to assess the development of selected European OECD countries with the application of linear ordering and ensemble clustering of symbolic data, and to compare ensemble clustering with a single model. The research methodology covers linear ordering, with multidimensional scaling applied to visualise the results, and ensemble clustering for symbolic data. The results are compared according to the adjusted Rand and silhouette indices. The obtained results show that ensemble clustering for symbolic data can be a useful tool in country development analysis and allows reaching better results than a single model. The novelty of the proposed approach lies in using cluster analysis to obtain clusters of countries with similar values of variables (indicators of development) and in applying multidimensional scaling for symbolic data to visualise the linear ordering results.


Introduction
Recent papers show that there is a deep need for a more comprehensive way to measure the development of a country. A. Sen sees development as a concept that goes far beyond the accumulation of wealth, measured by gross national product or similar indicators. In his opinion, development should also enhance people's lives (Sen, 1999, p. 14). The measurement of development should therefore take into account different areas of people's lives: social, ecological, political and economic.
Many papers deal with the problem of development and the comparison of development; see for example S. Vachon and Z. Mao (2008), S. Voigt (2009) and Dasgupta et al. (1999).
As regards country development combined with cluster analysis and symbolic data analysis, only two papers by M. Pełka (2017 and 2018) present an analysis of innovation, which is an important element of sustainable development in the European Union, using different ensemble clustering approaches.
A paper by D.B. Alonso et al. (2016) analysed how the global crisis (2008-2012) affected macroeconomic and social factors in the EU member states. The research findings show the sensitivity and vulnerability of European countries during the crisis and could help policy makers identify effective measures for strengthening the protective capacity of their states in the event of a future economic and social crisis. C.H. Ketels and O. Memedovic (2008) present how clusters can be leveraged for economic policy and what role different stakeholders play in this process. However, the application part of their paper focuses on applying the concept to resource-rich, oil-dependent economies. K. Liapis et al. (2013) analyse the clusters of similarities among EU member states before and during the recent financial and debt crisis.
Using the Euclidean distance and average linkage between groups, two clusters were obtained.
The first group consists of two subgroups. The first subgroup includes the Netherlands, the UK, Luxembourg and Germany, which are characterised by developed financial sectors and balanced fiscal policy.
The second subgroup is itself split into two parts: one consists of Finland and Sweden and the other consists of Austria and Denmark. Finally, Belgium and Ireland, countries faced with several financial or debt problems, are connected with the second group.
The second largest group consists of Southern European countries, such as Italy, Spain, France, Greece and Portugal, which are characterised by high deficit and high government debt, low gross wage revenues and low bank assets to GDP, low or medium total taxation performance, and decreases or deficits of current accounts and balance of payments. B. Mercan and D. Goktas (2011) focused their paper on innovation ecosystems.
The development of a country can be measured in many ways. The simplest way is to use well-known wealth indicators such as gross domestic product (GDP), gross national product (GNP) or gross national income (GNI). Usually the GNI is the standard way of measuring the level of wealth in a country (Baker, 2011, p. 6). As the total value of GNI, GDP or GNP can be misleading because it does not take into account the population of the country, these indicators are measured per capita, that is, the total wealth of a particular country is divided by its population.
Many efforts have been made to build a composite index that would capture all aspects of development. Major work was done by the United Nations Development Programme (UNDP), which proposed the Human Development Index (HDI). The HDI goes far beyond the measurement of pure GDP, GNI or GNP and takes into account many different aspects of development (see for example Aziz et al., 2015; McGillivray, 1991; Stanton, 2007; Sen, 1994).
Many other development indices, besides the HDI, have been proposed for different purposes, such as Quality of Life, the Inequality-adjusted Human Development Index, the OECD Better Life Index, the Gender-related Development Index and the Bhutan GNH Index (see for example Magee, Scerri, James, 2012; Durand, 2015; Dijkstra, Hanmer, 2000; Bates, 2009). However, even the composite indices have some drawbacks (see Sagar, Najam, 1998; McGillivray, 1991).
For example, A.D. Sagar and A. Najam (1998) show that the Human Development Index fails to include any ecological considerations. According to them, over the years the Human Development Reports (HDRs) seem to have become stagnant, repeating the same rhetoric without necessarily increasing the HDI's utility.
According to M. McGillivray (1991) the HDI assesses intercountry development levels on the basis of three so-called deprivation indicators: life expectancy, adult literacy and the logarithm of purchasing power adjusted per capita GDP. Using a simple statistical analysis, his paper questions both the composition of the HDI and its usefulness as a new index of development.
His paper concludes that the HDI is both flawed in its composition and, like a number of its predecessors, fails to provide insights into intercountry development level comparisons which pre-existing indicators, including GNP per capita, alone cannot.
What is more, the HDI covers long-term changes (e.g. in GDP per capita) and may not respond to recent short-term changes. Many composite indices do not cover wide divergence within countries. A well-known problem is that a higher national income may not mean higher welfare, because welfare depends on how the income is spent. Some countries with high real GNI per capita have high levels of inequality (e.g. Saudi Arabia or Russia).
Thus, there is still a need for development of new indices, development measurement and cross-country development comparisons. The main aim of the paper is to present a symbolic ensemble clustering of the selected European OECD countries considering their development as well as the linear ordering results for this type of data using multidimensional scaling.
The obtained results show that ensemble clustering for symbolic data and linear ordering for this type of data can be a useful tool for a development analysis.

Symbolic data analysis
In classical data analysis objects are usually described by a vector of quantitative or qualitative measurements, where each column represents a single variable (a number or a category). However, this kind of data representation is too restrictive to represent more complex data. To take into account the uncertainty and/or variability of the data, variables must be allowed to assume sets of categories or intervals, possibly with frequencies or weights.
Such data have been studied mainly in Symbolic Data Analysis. It provides suitable methods and algorithms to deal with aggregated or complex data that are described by multi-valued variables, where cells of the data table can contain sets of categories, intervals or weight (probability) distributions (see for example Bock, Diday, 2012; Billard, Diday, 2006; Billard, Diday, 2008; Noirhomme-Fraiture, Brito, 2011). Table 1 presents examples of the main types of symbolic variables and their realisations (see Bock, Diday, 2000, p. 2).
In this paper the first and third quartiles of the original data values are used in the empirical part, and contemporary data aggregation was used (data about countries was aggregated over time).
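The aggregation described above can be illustrated with a minimal sketch, assuming each country has several yearly observations of one indicator and that the interval is formed as [Q1, Q3]. The country names and values are hypothetical, and the quartile convention (the "inclusive" method of Python's statistics module) is an assumption, since the paper does not state which convention was used.

```python
from statistics import quantiles


def to_interval(values):
    """Aggregate repeated observations into an interval [Q1, Q3]."""
    # quantiles(..., n=4) returns the three quartile cut points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return (q1, q3)


# Hypothetical yearly values of one indicator for two countries
yearly = {
    "Country A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Country B": [10.0, 10.0, 12.0, 14.0, 20.0],
}

# One interval-valued cell of the symbolic data table per country
intervals = {country: to_interval(v) for country, v in yearly.items()}
```

Each country then becomes one symbolic object whose cell for this indicator is an interval rather than a single number.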

Linear ordering and ensemble clustering for symbolic data
The first concepts of the pattern and anti-pattern of development and of the measurement of development were proposed by Professor Z. Hellwig (see Walesiak, 2017a, p. 2). A two-step procedure that allows visualising the results of linear ordering was presented by M. Walesiak (see Walesiak, 2016; 2017b, p. 11). This procedure can also be applied to symbolic objects and it involves the following steps:
1. Choice of a complex phenomenon that cannot be measured directly. This phenomenon is considered among a set of objects A.
3. Selection of variables and collecting data and the construction of a symbolic data table.
4. Identification of preferential variables: stimulants, destimulants and nominants. A variable is a stimulant when for every two of its observations [...] (Gatnar, Walesiak, 2011).
The iterative procedure called smacof was used in this paper (Borg, Groenen, 2005, pp. 204-205). The symbolic-numeric approach allows representing symbolic objects as points.
When considering clustering methods for symbolic data we can distinguish the following groups of methods:
a) adaptations of classical clustering methods and clustering methods designed strictly for symbolic data (see for example Verde, 2004; Bock, Diday, 2000; Billard, Diday, 2006; Diday, Noirhomme-Fraiture, 2008);
b) density-based clustering for symbolic data, an adaptation of the well-known DBSCAN algorithm for symbolic data (see Pełka, 2018);
c) conceptual clustering methods for symbolic data, e.g. pyramids or the adaptation of COBWEB (see Pełka, 2015; Brito, 2002; Brito, 1995).
In this paper the co-clustering (co-association) matrix will be used. The algorithm that uses a co-association matrix can be described as follows (Fred, Jain, 2005, p. 848):
a) obtain different base partitions (models). This can be done in many ways, e.g. by using the same clustering algorithm with different initial parameters (number of clusters, normalisation method, distance measure, etc.), using subsets of objects, using subsets of variables, or using different clustering algorithms. In the paper different clustering techniques will be used (SClust, DClust, DBSCAN for symbolic data, spectral clustering for symbolic data, single and complete link clustering), and these methods will also be used with different initial parameters (distance measure, normalisation, number of clusters varying from 2 to 20);
b) use the obtained partitions to build a co-clustering (co-association) matrix. The elements of this matrix are defined as follows:
$$C(i, j) = \frac{n_{ij}}{N},$$
where: i, j are object (pattern) numbers, n_{ij} is the number of times objects i and j were clustered together among the N partitions, and N is the total number of partitions;
c) use the obtained co-association matrix as the data matrix for some classical clustering method, like k-means, pam, etc.;
d) choose the best partition, e.g. by using cluster quality indices. In the paper the well-known silhouette index will be used (see Kaufman, Rousseeuw, 2009 for further details).
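Step b) can be sketched in a few lines, assuming each base partition is given as a list of cluster labels, one label per object. The function name and the three example partitions are hypothetical and serve only to show how the ratio n_ij / N is computed.

```python
def co_association(partitions):
    """Return the matrix C with C[i][j] = n_ij / N.

    `partitions` is a list of N label lists; labels only need to be
    comparable within a single partition.
    """
    n_partitions = len(partitions)
    n_objects = len(partitions[0])
    counts = [[0] * n_objects for _ in range(n_objects)]
    for labels in partitions:
        for i in range(n_objects):
            for j in range(n_objects):
                if labels[i] == labels[j]:
                    counts[i][j] += 1  # objects i and j share a cluster
    return [[c / n_partitions for c in row] for row in counts]


# Three hypothetical base partitions of four objects
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
C = co_association(partitions)
```

The resulting matrix is symmetric with ones on the diagonal, and it can be passed as a data matrix to a classical clustering method as in step c).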
3. Adaptations of the well-known bagging procedure for clustering (see Hornik, 2005; Leisch, 1999; Dudoit, Fridlyand, 2003). In this paper F. Leisch's adaptation of bagging for clustering will be used (see Leisch, 1999):
a) M subsets are drawn from the initial data set with replacement; in this paper 20 subsets with 20 objects each will be used;
b) the subsets are clustered; in the case of classical data cluster centres are obtained, whereas in the case of symbolic data medoids will be used;
c) the medoids (cluster centres) are used as the data matrix for some clustering algorithm, e.g. k-means, ward, complete, DIANA, etc., and final clusters are obtained;
d) for the final clusters new medoids are calculated;
e) all objects are assigned to the nearest medoid.
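Steps a) to e) can be sketched as follows. This is a simplified illustration rather than the procedure used in the paper: the dissimilarity between interval-valued objects, the small k-medoids routine (used here both for the base clusterings and as a stand-in for the final clustering algorithm in step c) and the example data are all assumptions made for the sake of a self-contained example.

```python
import random


def dist(x, y):
    # Assumed dissimilarity between interval-valued objects: sum over variables
    # of the absolute differences of the lower and upper bounds.
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(x, y))


def k_medoids(objects, k, rng, n_iter=20):
    """Tiny k-medoids with farthest-first initialisation; returns k medoids."""
    medoids = [rng.choice(objects)]
    while len(medoids) < k:
        medoids.append(max(objects, key=lambda o: min(dist(o, m) for m in medoids)))
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for obj in objects:
            nearest = min(range(k), key=lambda c: dist(obj, medoids[c]))
            clusters[nearest].append(obj)
        # New medoid = member with the smallest total distance to its cluster
        updated = [
            min(members, key=lambda m: sum(dist(m, o) for o in members))
            if members else medoids[c]
            for c, members in enumerate(clusters)
        ]
        if updated == medoids:
            break
        medoids = updated
    return medoids


def bagged_clustering(objects, k_base, k_final, n_subsets=20, subset_size=20, seed=0):
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_subsets):
        subset = [rng.choice(objects) for _ in range(subset_size)]  # step a
        pooled.extend(k_medoids(subset, k_base, rng))               # step b
    final_medoids = k_medoids(pooled, k_final, rng)                 # steps c, d
    return [min(range(k_final), key=lambda c: dist(o, final_medoids[c]))
            for o in objects]                                       # step e


# Hypothetical objects described by one interval-valued variable each
objects = [((0, 1),), ((1, 2),), ((0, 2),), ((1, 3),),
           ((100, 101),), ((101, 103),), ((99, 102),), ((100, 104),)]
labels = bagged_clustering(objects, k_base=2, k_final=2)
```

With two well-separated groups, as in this toy data, the first four and the last four objects receive different labels; on real data the result depends on the chosen dissimilarity and on the clustering algorithm used in step c).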

Results of the empirical study
The empirical study uses statistical data obtained from the World Bank ([...]). Variables v15, v17 and v18 are destimulants. To find the optimal multidimensional scaling (for the symbolic-numeric approach) the mdsOpt package for R was used (Walesiak, Dudek, 2018).
The best results were obtained with standardisation as the normalisation method and the Ichino-Yaguchi distance measure. Figure 1 presents the results of the multidimensional scaling of 32 objects (30 OECD countries, the pattern object and the anti-pattern object). Objects 31 (pattern object) and 32 (anti-pattern object) were connected with a straight line to obtain the so-called axis of a set. Four isoquants were added by dividing the axis into four equal parts.
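For reference, the Ichino-Yaguchi dissimilarity for interval-valued variables can be sketched as below. The component for one variable is the length of the hull of the two intervals minus the length of their overlap, plus a correction term weighted by gamma; the components are then combined with a Minkowski-type sum. The values gamma = 0.5 and q = 2 are illustrative assumptions and not necessarily the settings used in the study, and no normalisation of the variables is included here.

```python
def ichino_yaguchi_component(a, b, gamma=0.5):
    """Ichino-Yaguchi dissimilarity between intervals a=(lo, hi) and b=(lo, hi)."""
    join = max(a[1], b[1]) - min(a[0], b[0])            # length of the hull
    meet = max(min(a[1], b[1]) - max(a[0], b[0]), 0.0)  # length of the overlap
    return join - meet + gamma * (2 * meet - (a[1] - a[0]) - (b[1] - b[0]))


def ichino_yaguchi(x, y, gamma=0.5, q=2):
    """Aggregate the per-variable components with a Minkowski sum of order q."""
    total = sum(ichino_yaguchi_component(a, b, gamma) ** q for a, b in zip(x, y))
    return total ** (1 / q)


# Two hypothetical objects described by two interval-valued variables
x = ((0.0, 2.0), (0.0, 1.0))
y = ((1.0, 4.0), (3.0, 4.0))
d_xy = ichino_yaguchi(x, y)
```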
The distance of each country from the pattern object was calculated in accordance with formula (1). The OECD countries were then ordered by increasing values of this measure. The results are presented in Table 2.
Figure 1. Results of the multidimensional scaling in two-dimensional space for 32 objects: 30 OECD countries, the pattern object (31) and the anti-pattern object (32)
Source: own elaboration using R software.
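The ordering step can be illustrated with a small sketch. It does not reproduce formula (1); as an assumption, it uses the Euclidean distance from the pattern object in the two-dimensional configuration, divided by the length of the axis of the set, and it treats the isoquants as circles centred on the pattern object with radii at quarter points of the axis. All coordinates and object names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def order_by_pattern_distance(coords, pattern, anti_pattern):
    """Rank objects by their distance from the pattern, scaled by the axis length."""
    axis_length = dist(coords[pattern], coords[anti_pattern])
    scores = {
        name: dist(point, coords[pattern]) / axis_length
        for name, point in coords.items()
        if name not in (pattern, anti_pattern)
    }
    return sorted(scores.items(), key=lambda item: item[1])


def isoquant_band(score, n_bands=4):
    """Band 1 is closest to the pattern; scores beyond the axis fall in the last band."""
    return min(int(score * n_bands), n_bands - 1) + 1


# Hypothetical two-dimensional configuration
coords = {
    "pattern": (0, 0), "anti-pattern": (8, 0),
    "A": (0, 1), "B": (3, 4), "C": (0, 7), "D": (3, 0),
}
ranking = order_by_pattern_distance(coords, "pattern", "anti-pattern")
bands = [isoquant_band(score) for _, score in ranking]
```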

Conclusions
As a result of the research, an analysis of development for 30 OECD countries was carried out. The linear ordering (see the results presented in Table 2) and cluster analysis were conducted for 30 OECD countries using a symbolic-numeric approach for linear ordering visualisation, and single and ensemble clustering for symbolic interval-valued data.
The classifications made by international organisations (e.g. UNDP) are based on a composite development index, where clusters are obtained by selecting arbitrary threshold values of the index. This paper has instead used clustering methodology, which yielded two clusters.
Cluster 1 contains the following countries: Denmark, Iceland, Norway, Sweden, France, Germany, Italy, the United Kingdom, Finland and the pattern object. These countries are the most developed ones. They have the narrowest spans (ranges) of the symbolic interval-valued variables for all variables, which means that the objects from this cluster are the most similar to each other. People living in these countries have very good access to clean fuels and technologies for cooking; both the youngest and the poorest have accounts in financial institutions. They have a high adjusted net national income (measured in constant 2010 USD or as GDP per capita), and their adjusted net savings are also high. The countries from this cluster have a low adolescent fertility rate and a high age dependency ratio (percent of the working-age population).
People from these countries usually have a choice to use alternative and nuclear energy and they care about annual freshwater withdrawals caused by agriculture, households and industry.
Unfortunately, the birth rate is usually lower than in other countries. Total central debt and CO2 emissions are an issue in these countries. The costs of business start-up procedures are usually lower. Unemployment among people with both higher and basic education is quite low.
Cluster 2 contains the following countries: Austria, Belgium, Bulgaria, Croatia, the Czech Republic, Estonia, Cyprus, Greece, Hungary, Ireland, Latvia, Lithuania, Luxembourg, the Netherlands, Poland, Portugal, Romania, the Slovak Republic, Slovenia, Spain, Switzerland and the anti-pattern object. This cluster contains all other countries with high and mid-high development (according to the HDI). The countries from this cluster are the least similar ones.
People from these countries have good or very good access to clean fuels and technologies for cooking. Usually the youngest have accounts in financial institutions, whereas the poorest usually do not. Income varies a lot across the countries from this cluster. Net savings are not as high as in cluster 1. The countries from this cluster also have quite a low fertility rate and age dependency ratio. People from these countries do not always have a choice to use nuclear energy, and how much they care about freshwater withdrawals is not always clear.
Sometimes the governments of these countries have problems with CO2 emissions and CO2 limits. The costs of business start-up procedures are higher than in cluster 1. Unemployment among people with higher education is lower than among those with basic education.