REGIONAL POPULATION EXPENDITURE FOR FOODSTUFFS IN THE RUSSIAN FEDERATION: COMPONENTIAL AND CLUSTER ANALYSES

The article addresses the problem of conducting component and cluster analyses of population expenditure on food, one of the most important components of the standard of living. The purpose of the analysis is to form clusters of regions of the Russian Federation that differ in the structure of household expenditure on foodstuffs. Expenditure on foodstuffs is expressed in absolute units and analyzed together with integral indicators of the standard of living. Two data mining methods, component analysis and cluster analysis, are used as the research methods. A data mining procedure based on the interconnected use of component and cluster analyses is proposed. The procedure takes into account the interrelation between the results obtained by the different methods, as well as the possibility of returning to the previous method and repeating the analysis in order to refine the composition of the clusters step by step. The analysis reveals a small number of clusters of wealthy regions, characterized by high and medium levels of expenditure on foodstuffs, and a considerably larger number of clusters of less wealthy and poor regions, characterized by a low level of expenditure on foodstuffs. It is shown that growth in the standard of living, measured by gross regional product per capita, is accompanied by growth of the Gini coefficient, which indicates greater inequality of income distribution, and by a reduction in expenditure on low-value foodstuffs. The results of the analysis can be applied to the development of a decision-making support system intended for analyzing scenarios of macroeconomic regulation in the field of income policy, with the purpose of increasing the population's standard of living.
The analysis of population expenditure on foodstuffs made it possible to reveal the cluster structure of the regions of the Russian Federation, to describe it in terms of generalized indicators, and to formulate the specific characteristics of the regional clusters and important management decisions.


Introduction
Modern conditions in the development of the Russian economy are characterized by significant problems connected with the growth and quality of living standards, which in the first place is due to an increase in the gross domestic product (GDP).
According to the forecast of social and economic development of the Russian Federation for 2013-2015 published by the Ministry of Economic Development, the promotion of private domestic demand is one of the main ways to stimulate GDP growth.
The scenario of consumer-oriented growth, which is necessary during a recession, must be transformed into a scenario of investment-oriented growth if there is to be any positive tendency in GDP dynamics [1][2][3][4].
According to the forecast scenario suggested by the Ministry of Economic Development, an increase in consumer demand should be provided initially by achieving strong growth in household consumption (about 5.7% per year), which in turn is planned to be achieved by speeding up the growth of real income.
Therefore, it is relevant to research the dynamics of the management of household expenditure and income in a multisector macroeconomic system. Within this framework, software is being developed for an intelligent decision-making support and simulation modeling system for the formation of household income and expenditure in a multisector macroeconomic system [5,6].
The basic components of the intelligent decision-making support and simulation modeling system are a module that simulates the dynamics of the population's income and expenses within the reproduction process of a multisector macroeconomic system, and a data mining module. A characteristic feature of the module for data mining and decision-making support in the management of the macroeconomic system is that training samples are formed from the results of simulation experiments with a dynamic model of the multisector macroeconomic system.
Statistical data was used only for the setup of the dynamic model data. The formation of household clusters was made on the basis of experimental data analysis of the dynamics received as a result of conducting simulation experiments.
However, it would be reasonable at the stage of cluster formation to use not only experimental data, but also statistical data of the household sector finance (particularly the consumer expenditures), and only then move on to the final formation of household clusters.
Therefore, it is relevant to develop a data mining system for the analysis of household consumer expenses based on statistical data. This will give an opportunity to form household clusters and find out their specific features taking into account the main direction of population expenses.

Methods
The article addresses data mining problems related to only one part of consumer expenses, namely purchases of food, which is one of the most important components of the standard of living.
The research uses two data mining methods, component analysis and cluster analysis. They help to visualize the data in a space of lower dimension and to reveal the cluster structure of the regions, so that distinctive characteristics of the resulting clusters can be formulated for further decision-making in the sphere of population income management.
The sample was prepared from the official statistical database on the structure of expenses on food purchases for the first quarter of 2012, published on the site of the Federal State Statistics Service of the Russian Federation. The research subject is household expenses on food purchases in different regions of the Russian Federation.
The purpose of the conducted intelligent analysis is to form regional clusters of the Russian Federation whose household expenses on food purchases differ in their structure, presented in absolute units and adjusted for the generalized index of the population's standard of living.
The analysis covers 83 subjects of the Russian Federation (regions), so the number of rows in the sample is n = 83. Each item (region) is characterized by thirteen features, so the number of columns in the sample is m = 13. The analyzed (original) features fall into two groups.
The first group includes features that characterize expenses on purchasing different food products, presented in absolute (monetary) units and corresponding to ten types of consumer food expenditure: expenses on bakery and grains; meat; fish and seafood; dairy, cheese and eggs; butter and oil; fruits and vegetables; sugar (jam, honey, chocolate and sweets); non-alcoholic drinks; and expenses for other food.
The second group includes the following three features: expenses on final consumption (average monthly consumption per household member, in rubles), per capita gross regional product (GRP, in rubles) and the Gini coefficient. These summarized characteristics describe the population's standard of living as a whole, considering the inequality of income distribution in regions with different standards of living.
The research procedure based on data mining techniques includes three stages, each of which comprises a number of steps (Fig. 1). The first stage is designed for conducting a component analysis. At the second stage, a cluster analysis of household expenses on food purchases in different regions of the Russian Federation is carried out [7,8]. The third stage is intended for reconciling the results obtained by the two methods, and for returning to the previous stages and applying the data mining techniques again, with a possible adjustment of the data. The procedure can be performed with the help of different software, for example, Statgraphics or Statistica [9,10].
The component analysis (stage 1) has the following particular features. At stage 1.1, it is reasonable to form the group of features for the analysis not only by expert estimation, but also by using component analysis as a method of preliminary data processing at the data preparation stage [7,11,12].
In steps 1.2 and 1.3, the basic components are formed automatically. These components are the coordinate axes of a new space, each defined mathematically as a linear combination of the initial features. The diagram of dispersion (scatter plot) is a projection of the set of items (regions) onto this new coordinate space of lower dimension. Analyzing the layout of the regions in a two-dimensional space with 2D diagrams of dispersion is preferable. If the regions have to be laid out in a three- or four-dimensional space, it is advisable to use several 2D diagrams with different combinations of axes. All further steps of the component analysis are performed automatically with the participation of a systems analyst.
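The formation of basic components as linear combinations of the initial features, and the projection of regions onto the first two of them, can be illustrated with a minimal sketch. It is not the Statgraphics procedure used in the study: the data are hypothetical (six regions, three features), standardizing the features is an assumption made here because the features have different units, and power iteration with deflation is used only because it needs no external libraries.

```python
import math

def standardize(rows):
    """Center every column and scale it to unit sample variance."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(m)]
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]

def correlation_matrix(z):
    """Correlation matrix computed from standardized rows."""
    n, m = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(m)]
            for a in range(m)]

def leading_components(c, k, iters=1000):
    """First k (eigenvalue, weight vector) pairs by power iteration with deflation."""
    m = len(c)
    c = [row[:] for row in c]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [1.0 + 0.1 * j for j in range(m)]  # arbitrary non-zero start vector
        for _ in range(iters):
            w = [sum(c[a][b] * v[b] for b in range(m)) for a in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[a] * c[a][b] * v[b] for a in range(m) for b in range(m))
        pairs.append((lam, v))
        # Deflation: subtract the component just found before searching for the next one
        c = [[c[a][b] - lam * v[a] * v[b] for b in range(m)] for a in range(m)]
    return pairs

# Hypothetical sample: 6 regions x 3 features (meat, sugar, per capita GRP)
regions = [
    [2100.0, 520.0, 180.0],
    [2500.0, 480.0, 240.0],
    [3900.0, 390.0, 610.0],
    [1800.0, 560.0, 150.0],
    [3200.0, 450.0, 380.0],
    [4500.0, 300.0, 900.0],
]
z = standardize(regions)
pairs = leading_components(correlation_matrix(z), k=2)

# Coordinates of every region on the plane of the first two components
scores = [[sum(x * w for x, w in zip(row, v)) for _, v in pairs] for row in z]
# Share of the total dispersion described by each component
shares = [lam / len(regions[0]) for lam, _ in pairs]
```

Each row of `scores` is one point of a 2D diagram of dispersion, and each weight vector in `pairs` plays the role of a column of weight coefficients.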
The number of the analyzed basic components determined in step 1.4 can vary from 1 to k where k < m. As a rule, two or three basic components are analyzed if visual analysis of space features is intended.
In this research, the first two basic components are considered: the first basic component BC 1 and the second basic component BC 2.
Steps 1.5-1.7 are designed to name the first basic component, which is treated as a new integral, latent characteristic of the data.
The basic components are named on the basis of the coefficient of information value ki, the value of which must approach 0.75 from the right (that is, from above) [7,9]. The calculation of ki assumes that the features with the most discriminant power are chosen for analysis.
On the basis of the calculation of ki, the set of features that is sufficient for describing the first basic component is defined. A preliminary segregation of regional household clusters is based on the differences in the values of the first basic component BC 1 only.
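The selection of features by the coefficient of information value can be sketched as follows. The text does not give the formula for ki, so the sketch assumes one common definition: the sum of squared weight coefficients of the selected features divided by the sum of squared weight coefficients of all features of the component. The function name and the weights are hypothetical and are not taken from Table 2.

```python
def informative_features(weights, threshold=0.75):
    """Pick the fewest top-weighted features whose squared weights reach
    `threshold` of the component's total (assumed definition of ki)."""
    total = sum(w * w for w in weights.values())
    ranked = sorted(weights, key=lambda name: weights[name] ** 2, reverse=True)
    chosen, acc = [], 0.0
    for name in ranked:
        chosen.append(name)
        acc += weights[name] ** 2
        if acc / total >= threshold:
            break
    return chosen, acc / total

# Hypothetical weight coefficients of one basic component
bc1_weights = {
    "meat": 0.40, "dairy": 0.38, "bakery": 0.35, "fruit_veg": 0.33,
    "final_consumption": 0.36, "sugar": 0.10, "gini": 0.05, "grp": 0.12,
}
chosen, ki = informative_features(bc1_weights)
print(chosen)  # ['meat', 'dairy', 'final_consumption', 'bakery']
```

Under this assumption, the chosen features are the ones used to name the component, and ki stays just above the 0.75 threshold.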
Steps 1.8-1.9 are implemented for the analysis of the second basic component BC 2 and are analogous to the previous ones. The results of the division into clusters according to the first and the second basic components are overlaid in step 1.10.
In step 1.11, regularities are identified, that is, characteristic features inherent in the chosen household clusters that help to compose their profiles. The clusters are combined into groups with high, medium and low levels of regional welfare.
Stage 2 is designed for conducting a cluster analysis of household food purchase expenses in the regions of the Russian Federation [7,13,15].
In step 2.1, the initial settings for the cluster analysis are made: a similarity measure and a type of clustering algorithm are chosen. It is also necessary to define the number of clusters for plotting dendrograms. The formation of clusters, the calculation of centroid coordinates and the plotting of dendrograms are performed automatically (step 2.2).
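Steps 2.1 and 2.2 can be illustrated with a small agglomerative clustering sketch. The text does not state which similarity measure or clustering algorithm was chosen, so Euclidean distance and average linkage are assumptions made here, and the points are hypothetical. The recorded merges, with their distances, are the information a dendrogram displays.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.
    Returns the clusters (lists of point indices) and the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average pairwise distance between the members of two clusters
            d = sum(euclidean(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

def centroid(points, members):
    """Coordinate-wise mean of the points that belong to one cluster."""
    m = len(points[0])
    return [sum(points[i][j] for i in members) / len(members) for j in range(m)]

# Hypothetical regions described by two features
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (3.0, 3.1), (3.2, 2.9), (6.0, 0.2)]
clusters, merges = average_linkage(points, n_clusters=3)
centroids = [centroid(points, c) for c in clusters]
```

Changing `euclidean` or the linkage rule corresponds to choosing a different similarity measure or algorithm type in step 2.1.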
Step 2.3 is a useful addition: the clusters singled out by the cluster analysis are placed on the plane of the basic components built from the results of the component analysis. The result of step 2.3 is the second 2D diagram of dispersion. Its purpose is a further comparison of the clusters produced by the component analysis and by the cluster analysis.
The third stage involves the comparison of clustering results, received by the component and cluster analysis. If the result of comparing the clusters and their profiles (step 3.1) is successful, the procedure is completed with generating final conclusions on the clustering of regional households and their profiles (step 3.5).
Otherwise, the cluster analysis can be repeated (step 3.2) with updated clustering parameters (metrics, clustering algorithms). If varying the cluster analysis parameters is not sufficient, the component analysis is repeated. It can be performed with "soft" corrections, changing the number and composition of clusters without recalculating the weighting coefficients (step 3.4).
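The comparison in step 3.1 can be supported by a cross-tabulation of the two partitions. The text does not name a formal criterion of success, so the Rand index used below is an assumption added for illustration, and the labels are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cross_tab(labels_a, labels_b):
    """Count regions for every pair (component-analysis cluster, cluster-analysis cluster)."""
    return Counter(zip(labels_a, labels_b))

def rand_index(labels_a, labels_b):
    """Share of region pairs that both partitions treat alike:
    together in both partitions or apart in both."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical cluster numbers of six regions from the two methods
by_components = [1, 1, 2, 2, 3, 4]
by_clustering = [1, 1, 1, 2, 3, 3]

table = cross_tab(by_components, by_clustering)
agreement = rand_index(by_components, by_clustering)  # 11 of 15 pairs agree
```

In `table`, a cluster-analysis cluster that absorbs two component-analysis clusters shows up as two non-zero entries with the same second label, here `(3, 3)` and `(4, 3)`.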

Results
The results of the process of data mining performed in relation to the food expenses structure are presented below.
Based on the results of the first stage, three basic components were obtained using Statgraphics software; in addition, a summary characteristic of all the components, including their values, was obtained (Table 1).
The presented figures show that the first and the second components describe 68.8% of the initial data dispersion. The third basic component adds 7.5%, so the total share of explained dispersion is about 76%, which is enough for the spatial distribution analysis.
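The arithmetic behind the total of about 76% can be checked directly. Only the two values quoted in the text are used; the separate shares of BC 1 and BC 2 are given in Table 1 and are not reproduced here. The function name is hypothetical.

```python
def cumulative_share(shares):
    """Running total of dispersion shares, in percent, rounded to one decimal."""
    total, out = 0.0, []
    for s in shares:
        total += s
        out.append(round(total, 1))
    return out

# 68.8 % for BC 1 and BC 2 together, plus 7.5 % added by BC 3
reported = [68.8, 7.5]
cumulative = cumulative_share(reported)
print(cumulative)  # [68.8, 76.3]
```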
The weight coefficients of all features for the three basic components are given in Table 2. On the basis of the weight coefficients, the information coefficient was calculated for all three basic components. For every component, the set of features that determines its meaning was defined.
It is indicated that the first basic component BC 1 largely characterizes the expenses on purchasing the main food products (meat, dairy, bakery, fruits, and vegetables), taking into account the total expenditures for final consumption. In the range of larger values of the basic component BC 1, there are regions that are characterized by significant expenses in absolute values on purchasing foods that ...
The second basic component BC 2 characterizes the aggregate expenditures for food products containing carbohydrates (sugar), taking into account per capita GRP (gross regional product) and income inequality (reflected in the Gini index).
It is necessary to highlight the characteristic feature of the regions located in the extreme areas (high or low values of basic component 2). According to the received data of the weight coefficients of the most considerable features, it was found that the increasing standard of living was accompanied by an increase in the degree of income inequality and decreased expenses on purchasing foods containing carbohydrates.
The regions located in the range of lower values of the basic component BC 2 (Moscow, for example) have a higher standard of living, but at the same time greater inequality in income distribution and lower expenses on food containing carbohydrates. The regions with higher values of the basic component BC 2 (for example, the Chechen Republic) have a small per capita GRP (low standard of living), a lower degree of income inequality (low Gini coefficient) and a higher level of expenses on food containing carbohydrates.
The third basic component is characterized by the per capita GRP value, Gini coefficient and expenses for foods containing fat. The peculiarity of the described features interaction is that with the increase in per capita GRP the income inequality also rises while expenses for foods containing fat decrease.
According to the results of the component analysis, there are nine regional clusters. The 5th, 7th, 8th and 9th clusters include regions with a high level of expenses for valuable foods and a low level of expenses for foods containing carbohydrates. Regions characterized by a medium level of expenses for valuable foods correspond to the 3rd, 4th and 6th clusters. Regions with a low or medium level of expenses for valuable foods are in the 2nd cluster.
The structural characteristics of the distribution of clusters in the new 2D space are as follows. The right side of the 2D diagram above shows clusters with a small number of regions with a high level of both food expenditure and per capita GRP. Moving to the left, the number of regions (objects) in the clusters increases, while food expenditure and per capita GRP decrease. On the left side, the diagram presents a cluster containing a large number of regions with a low standard of living.
Figure 3 shows the two-level dependence of a region's cluster number on the set of analyzed characteristics of food expenses prior to the component analysis. Figure 4 reveals the three-level treelike dependence of a region's cluster number on the set of analyzed characteristics of food expenses after both the objects in the space of characteristics and the characteristics themselves have been structured. As a result, these characteristics are integrated into new integral features, the basic components.
The cluster profiles formed on the basis of the zones singled out in the space of the basic components (Fig. 5) are presented in Table 3. Rules for clustering are formulated on the basis of the data in Table 2 and the set of features defining the basic components.
Here is an example of the rule that defines the first cluster: "If expenditure on dairy, soft drinks, bakery and grains, meat, fruits, vegetables and other foods is low and the final consumption expenditure is low, and expenditure to purchase sugar (jam, honey, sweets) is medium or high, Gini coefficient is low, and per capita GRP is low or medium, then this object corresponds to cluster 1."
Stage 2 of the proposed procedure, in which a cluster analysis is performed, resulted in the final decision to divide all regions into seven clusters. This decision was made by comparing the dispersion diagram from the component analysis with the dendrogram from the cluster analysis. To visualize the final clustering of the regions, the second 2D diagram of dispersion was plotted in the space of the basic components, based on the cluster analysis dendrogram (Fig. 6). It should be noted that the segregation of clusters mostly corresponds to the division of subjects according to the basic component 1, as it has the most discriminant power.
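The example rule for the first cluster quoted above can be written as a predicate over feature levels. This is a minimal sketch: it assumes every feature has already been reduced to one of the levels "low", "medium" or "high" (the numeric thresholds are not given in the text), and the dictionary keys and function name are hypothetical.

```python
# Features whose expenditure level must be "low" for cluster 1
LOW_EXPENDITURE_FEATURES = (
    "dairy", "soft_drinks", "bakery_and_grains", "meat",
    "fruits", "vegetables", "other_foods",
)

def is_cluster_1(levels):
    """Apply the quoted rule to a dict that maps feature names to levels."""
    return (
        all(levels[f] == "low" for f in LOW_EXPENDITURE_FEATURES)
        and levels["final_consumption"] == "low"
        and levels["sugar"] in ("medium", "high")
        and levels["gini"] == "low"
        and levels["grp_per_capita"] in ("low", "medium")
    )

# Hypothetical region that satisfies every condition of the rule
region = {f: "low" for f in LOW_EXPENDITURE_FEATURES}
region.update(final_consumption="low", sugar="high", gini="low",
              grp_per_capita="medium")
print(is_cluster_1(region))  # True
```

Rules for the other clusters would follow the same pattern with their own combinations of levels.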
Cluster 7, derived from the cluster analysis results, includes two regions that used to be two independent clusters in component analysis. Since they are close to each other geographically, these two regions fused into a single cluster during cluster analysis. They are the Yamal-Nenets Autonomous Area and Kamchatka Territory.
The fifth cluster, derived from the cluster analysis results, includes regions that have a large value of the basic component 1. These regions correspond to clusters 5, 6, 7 and the part of cluster 4, which were obtained by applying component analysis. The fifth cluster includes ten regions: Murmansk region, St. Petersburg, Primorye Territory, Khabarovsk Territory, Republic of Sakha, Sakhalin region, Moscow, Nenets Autonomous Area, Magadan region, and Chukchi Autonomous Area.
The third cluster, derived by applying cluster analysis, includes the remaining parts of the second cluster and part of the sixth cluster; it corresponds to a medium value of the basic component 1. The third cluster includes ten regions: Moscow region, the Republic of Komi, Perm region, Samara region, Sverdlovsk region, Tyumen region, Trans-Baikal region, Krasnoyarsk region, Irkutsk region, Khanty-Mansi Autonomous Area.
The second cluster includes some parts of the second and the fourth clusters, which were derived from the component analysis results. The second cluster includes twenty regions: Bryansk region, Ivanovo region, Kaluga region, Kostroma region, Tver region, Tula region, Yaroslavl region, the Republic of Karelia, Arkhangelsk region, Volgograd region and others.
The first cluster partly includes clusters 1, 2 and 3, which were formed as a result of component analysis. It includes twenty-four regions: Belgorod region, Vladimir region, Voronezh region, Kursk ...
The fourth and sixth clusters formed by applying cluster analysis are characterized by a small value of the basic component 2 and mostly include the first cluster and the second cluster that is close to the first, which were derived as a result of component analysis. The sixth cluster includes eight regions: the Republic of Adygei, the Republic of Dagestan, the Republic of Ingushetia, the Republic of Kabardino-Balkaria, the Republic of Karachai-Cherkess, the Republic of Ossetia-Alania, the Chechen Republic, Stavropol region. The fourth cluster includes nine regions: Tambov region, the Republic of Kalmykia, the Republic of Mari-El, the Republic of Mordovia, the Republic of Udmurtia, the Republic of Buryatia, the Republic of Tuva and others.

Discussion
The proposed data mining procedure for revealing the structure of food expenses is based on the interrelated and cyclic application of component and cluster analysis to the population's expenditure on food. The distinctive feature of the procedure is, firstly, that it takes into account the interrelation of the results obtained by different methods and, secondly, that it allows a return to the previous method in order to apply it again and successively update the composition of the household clusters. The procedure can be applied repeatedly to a number of training samples obtained by structuring the items of food expenditure. It helps to form household clusters and provides integral characteristics by reducing the dimension of the initial space.
According to the results of component analysis, 9 regional clusters have been derived that differ in the level of expenditure on valuable foods, on foods containing carbohydrates and in final consumption expenditure.
Cluster analysis updates and corrects the number of clusters and their borders that were singled out at the stage of preliminary analysis using the method of basic components. The classification rules formulated on the basis of the centroid coordinates are consistent with the component analysis results.
The analysis of regional household food expenditure singled out a small number of clusters of successful and wealthy regions, characterized by high and medium levels of household expenditure on the basic foods, and a larger number of clusters of poorer and problem regions, characterized by a low level of household expenses on main food purchases.
It is shown that growth in the standard of living (characterized by per capita GRP) is accompanied by an increase in the inequality of income distribution and a decrease in expenses on low-value foods. In regions with a high standard of living, valuable foods account for a significant part of the expenditure structure; in regions with a low standard of living this part decreases and gives way to foods of little value.
The outcomes of the conducted analysis can be applied to develop a decision-making support system that would formulate recommendations for governmental regulation of food supply in order to increase the population's standard of living.