Empirical Statistical Analysis and Cluster Studies on Socio-Economic Status (SES) Dataset

Socio-economic status (SES) levels and conditions are highly influential variables in the study of a particular section of society or of society as a whole. Social factors such as caste, religion, marital status, and education level give a good assessment of a person's goals and of the way they pursue those goals. In general, the economic status of a family depends on social factors such as the size of the family, the number of educated members and their education levels, and how friendly the environment within the family is. Combining SES with machine learning (ML), especially cluster analysis, helps to identify the important features or dimensions of an SES dataset, to rank those dimensions, and to perform dimensionality reduction. In this research, we collected 1742 samples (household information), following socio-economic ratios and area-wise (rural and urban) ratios, using well-designed questionnaires between 2018 and 2019 in Rajamahandravaram, East Godavari District, AP, India. We conduct statistical analysis and cluster analysis to identify the important factors of SES levels and to analyse the associated problems. In the cluster analysis, we apply k-means, hierarchical clustering (HC), and hierarchical clustering with principal component analysis (PCA). The projection results of HC and PCA-HC give good assessments of the SES class values.


Introduction
The word 'socio' originates from the word 'social' and refers to people and the way (level) they fit into the community or residential area in which they live. It reflects how well they are educated, whether they hold a job, and so on. 'Economic' refers to the financial situation of individuals within society and includes earning capability, home ownership, resources, and so on. As such, SES is a combined sociological and economic measure of an individual's work experience and of a person's or family's economic and social position relative to other people, based on occupation, education, income, and other factors. A family's SES is measured using the household income, the education of the household members, the number of earning members, other sources of income, and the assets of the household. SES is mainly divided into three levels: low, middle, and high. Some researchers are

Literature Survey
In this section, we review papers on SES from reputed journals and present the work of several authors. Conger et al. (2009) [6] reviewed SES in relation to individual development and family processes. In this review, they discussed the interactionist model for family life and SES, which incorporates the social selection and social causation positions. Mainly, they focused on the following concepts: 1. New Millennium Economic Climate, 2. Social class and SES measurements, 3 [7] researched age-related health issues related to SES, considering three psychosocial factors: socioeconomic disadvantage, maltreatment, and social isolation. For this, they studied 1037 members from New Zealand. They found that, at the age of 32, study members suffered from three age-related health issues: high inflammation levels, metabolic risks, and major depression. Lampos et al. (2016) [8] studied SES classification using social media users' language and behaviour. They first formulate a 3-way classification task, where users are classified as having an upper, middle, or lower socioeconomic status. A nonlinear, generative learning approach using a composite Gaussian Process kernel gives significantly better classification accuracy (75%) than a competitive linear alternative. By turning this task into a binary classification (upper versus medium and lower class), the proposed classifier reaches an accuracy of 82%. Every day, the sentiments and opinions of people around the world are expressed as short messages on microblogging platforms. Ghazouani et al. (2019) [9] reviewed and researched the classification of Twitter users according to SES levels.
In this work, they mainly focused on topics such as "Tweet Gathering and Analysis, Classification Methods of SES, Data Corpuses and Resources, Inference Methods for SES Evaluation on Twitter, Regression Methods, Demographic Attributes, Temporal Information, Social Network Relations, Spatial Information, SES Features and Indicators, and Evaluation of Socioeconomic Status".
SES is an important social and economic indicator that is of wide concern. Evaluating individual SES can help the relevant organizations make a variety of policy decisions. Traditional approaches suffer from the very high cost of collecting large-scale SES-related survey data. With the prevalence of smartphones, mobile phone data has become a novel data source for predicting individual SES at low cost. However, predicting individual SES from mobile phone data also poses some new challenges, including sparse individual records, scarce explicit relationships, and limited labelled samples, which were not addressed in earlier work restricted to regional or household-oriented SES prediction [10]. The literature on the social determinants of health regularly treats SES as a critical factor in explaining women's height, an established marker of human welfare at the population level, using traditional regression. However, this literature lacks a systematic identification of the predictive power of SES as well as of the possible non-linear relationships between the measures of SES (material wealth, education, and occupation) in predicting variation in women's height [11]. Zhao et al. (2020) [12] evaluated the predictive power of SES. In this research, they used demographic health surveys from 66 low- and middle-income countries, comprising 1,273,644 individual samples of women collected from 1994 to 2016. They assessed this data with 7 ML algorithms and concluded that the neural network (NN) was the best performer. The NN achieved a variance or MSE value of 31.52 for SES prediction.

Proposal Model
Figure 1 shows the proposed model of the socio-economic system with the statistical and cluster study models. We collected information from each house in the rural and urban areas of the Rajahmundry constituency, East Godavari district, A.P., India. We gathered all the information with a well-designed questionnaire and stored the necessary information in secondary storage. After that, we extracted the information, with features and classes, into a data set in *.csv format. Some of the information is plotted on the Rajamahandravaram (Rajahmundry) map using longitude and latitude values. The target class attribute contains four classes: rich, upper-middle class, middle class, and poor. For this investigation, we extract feature attributes from the household information, namely personal-data, Socio-status, Economical-status, Living-status, Health-wealthy status, and so on. We use 1742 household records with 49 attributes, build the *.csv file, store it in secondary storage, and plot the data points on the Rajamahandravaram map with full details for convenient visualization.
IOP Publishing, doi:10.1088/1757-899X/1085/1/012030
We conduct the statistical analysis of the SES data set with statistical measures such as mean, median, max, and min, and generate reports for analysts. Cluster analysis is important for analysing the data set in order to identify the important factors related to the target class values, to reduce dimensionality, and to rank the attributes. For the cluster studies, we choose k-means and HC, together with PCA. Before that, we preprocess the data set using min-max normalization for better analysis.
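The min-max preprocessing step mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study; the function name `min_max_scale` and the sample income values are hypothetical.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical annual incomes, chosen for illustration only.
incomes = [27000, 120000, 450000, 8000000]
scaled = min_max_scale(incomes)
```

After scaling, attributes with very different ranges (for example income and family size) contribute on a comparable scale to the Euclidean distances used by the clustering methods.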

Data Set Description
Rajahmundry, renamed Rajamahandravaram, is one of the major constituencies of East Godavari district in Andhra Pradesh, India. We gathered information about each house in the rural and urban areas of this constituency. In total, we collected 1742 samples, following socio-economic ratios and area-wise ratios, with well-designed questionnaires between 2018 and 2019. Some of the data is plotted on the Rajamahandravaram map using longitude and latitude values. Figure 2 shows the location details; detailed information about a plotted house is displayed by clicking on its point or on the 'more details' button. For this experiment, we used 48 feature attributes and one class attribute, the status (rich, poor, middle, and upper-middle classes), as described in Table 1. The figure shows the Rajamahandravaram map containing the collected household data points, plotted as location points (blue bubbles) using their latitudes and longitudes. We built this map and linked it to our data set through the latitude and longitude values. Moreover, the site provides statistical analysis results for the data set, individual records, bar charts and pie charts of the attributes, facilities for searching and uploading data records, projections of predefined queries, and so on. The local site is available at the IP address 13.233.164.180.

k-Means algorithm
K-means is a partitioning clustering technique that relocates objects by moving them from one cluster to another, starting from an initial partitioning [14] [15]. The aim of the cluster analysis is to partition n observations into K clusters in which every observation belongs to the cluster with the nearest mean. It is one of the simplest unsupervised ML algorithms for solving clustering problems. K-means is an iterative and evolutionary method that takes its name from its method of operation [16] [17]. The k-means algorithm proceeds as follows:
Step 1: The model chooses K data points as the initial cluster centres using mean values.
Step 2: Each data point is assigned to the nearest cluster, using the Euclidean distance between the data point and each cluster centre.
Step 3: Each cluster centre is recomputed as the average of the data points assigned to that cluster.
Step 4: Steps 2 and 3 are repeated until the clusters converge.
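The four steps can be sketched in plain Python as shown below. This is an illustrative sketch rather than the implementation used in the experiments: the initial centres are drawn at random from the data points (an assumption, since the initialisation is not specified in detail), and the function name `kmeans`, its parameters, and the sample points are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k data points as the initial centres (random choice assumed).
    centres = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to the centre with the smallest
        # Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Step 3: recompute each centre as the mean of its assigned points.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep the old centre if a cluster received no points.
                new_centres.append(centres[c])
        # Step 4: stop once the centres no longer change.
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres


# Hypothetical two-dimensional points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = kmeans(points, k=2)
```

For the SES data, each point would be one min-max scaled household record and k would be 4, matching the number of clusters used in the experiments.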

Hierarchical Clustering
Agglomerative HC has been the dominant approach to constructing embedded (nested) groups of clusters. The main aim is to focus on practical models and methods that are both efficient and effective [18]. It is often helpful to distinguish between a method, which involves a minimization criterion and the target structure of a 2-way tree representing the partial order on subsets of the power set, and an implementation, which relates to the details of the algorithm used. As with many other multivariate techniques, the objects to be classified have numerical measurements on a set of variables or attributes.
A geometric framework of this sort is not the only one that can be used to formulate clustering algorithms. Suitable alternative forms of storage of a rectangular array of values are not inconsistent with viewing the problem in geometric terms [19][20].
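As an illustration of the agglomerative idea, the sketch below merges the two closest clusters repeatedly, using complete linkage (the largest pairwise Euclidean distance between members). It is a naive standard-library sketch for small inputs, not the tool used in the study; the function name `complete_linkage` and the sample points are hypothetical. The rows can be observations or, for attribute-wise clustering, the columns of the data set.

```python
import math


def complete_linkage(points, n_clusters):
    """Agglomerative clustering with complete linkage.

    Returns the final clusters (lists of point indices) and the merge
    history as (cluster_a, cluster_b, height) tuples.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical points: two tight pairs and one isolated point.
sample = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = complete_linkage(sample, n_clusters=3)
```

The height recorded for each merge is what a dendrogram displays on its vertical axis: pairs that merge at a low height are close to each other.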

Principal Component Analysis (PCA) Algorithm
PCA is an orthogonal linear transformation that maps the data to a new coordinate system such that the greatest variance by any projection of the data lies on the first coordinate (the first principal component (PC)), the second greatest variance lies on the second coordinate (the second PC), and so on. PCA is used to reduce the dimensions of the data without much loss of information. It is frequently used in AI.
1. Get the mean vector. In this calculation, the mean value is subtracted from every data point value and the result is divided by the standard deviation (SD). The equation below shows the resulting standardized value.
$$Z = \frac{\text{Factor Value} - \text{Mean Value}}{\text{Standard Deviation (SD)}}$$
In this way, every variable value is converted to a normalized value (on the same scale).
2. Construct the covariance matrix from the samples of the mean-centred data matrix.
The aim of this step is to understand how the variables of the data set vary from the mean with respect to each other. Sometimes the variables are highly correlated, so that they contain redundant information [21]. Therefore, in order to identify these correlations, we compute the covariance matrix. The covariance matrix is an n × n symmetric matrix, where n is the number of dimensions (variables).
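The standardization and covariance steps described in this section can be sketched as follows, using only the Python standard library. The sketch covers only these two steps and does not compute the principal components themselves; the function names and the sample columns are hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption.

```python
import statistics


def standardize(column):
    """Z = (value - mean) / SD for every value in one variable's column."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # sample SD; assumes a non-constant column
    return [(x - mean) / sd for x in column]


def covariance_matrix(columns):
    """Return the n x n sample covariance matrix of n variables."""
    n_obs = len(columns[0])
    means = [statistics.fmean(c) for c in columns]
    centred = [[x - m for x in c] for c, m in zip(columns, means)]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n_obs - 1) for cj in centred]
        for ci in centred
    ]


# Hypothetical data: three variables observed on four households.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # moves together with the first variable
    [4.0, 3.0, 2.0, 1.0],  # moves against the first variable
]
z = [standardize(c) for c in columns]
cov = covariance_matrix(z)
```

Because the columns are standardized first, every diagonal entry of `cov` is 1 and each off-diagonal entry equals the correlation between the two variables, which makes redundant variables easy to spot.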

Experimental Setup and Result Analysis
In this section, we discuss the empirical statistical analysis of the data set attributes. Furthermore, we apply the unsupervised ML models k-means and HC to the SES dataset and analyse their capabilities.

Statistical Analysis
We collected the data from the rural and urban areas of the Rajahmundry constituency, East Godavari District, A.P., India. The samples were collected according to the ratios of socio-economic status. There are 946 rural samples and 796 urban samples (1742 in total). According to the statistical analysis of the household dataset, houses contain on average 4 to 5 members (mean 4.381, SD 1.467). Some houses have only one member (min value 1), and some contain 16 (max value). Each house contains at least one male person (min value 1) and at most 8 male persons, with an average of 2 to 3 male persons per house. The number of female persons has a min value of 0 and a max value of 8, with mean and SD values of 1.975 and 0.776 respectively, which means every house contains on average one to two females. The statistics also show some good conditions: there are very few child workers, an average of 2 to 3 young people in every house, and an average of 1 to 2 workers in each house. Another positive finding is that the numbers of diseased and disabled persons are very low, with mean values of 0.066 and 0.024 respectively. Table 2 shows the detailed statistical analysis of each attribute of the data set.
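Summary measures of this kind (min, max, mean, median, SD) can be computed per attribute with a few lines of standard-library Python. The sketch below is only an illustration; the function name `describe` and the family sizes are hypothetical and are not values from the SES dataset.

```python
import statistics


def describe(values):
    """Return the basic descriptive statistics of one numeric attribute."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical family sizes for eight households.
family_size = [1, 3, 4, 4, 5, 5, 6, 16]
summary = describe(family_size)
```

Applying `describe` to each column of the data set would yield one row per attribute, in the style of a descriptive statistics table.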

Figure 3: Counting SES Levels in Rajamahandravaram area
An important aspect of economic status is that it depends fully on the annual income of each house and on its sources, such as public and private employment, assets, work, and so on. According to the statistics, the minimum annual income is 27,000/- and the maximum is 80,00,000/-. The main income sources are private employment, government employment, or pension schemes. The detailed analysis is shown in Table 1. Educational and health resources are also available within reach of every house. The figure shows the SES level samples (rich, middle class, upper-middle class, and poor) according to the ratios of Rajamahandravaram. The middle class and upper-middle class occupy the first and second positions in the area, and the poor and rich classes occupy the third and fourth positions. Figure 3 shows the counts of the SES levels, from low to high, according to these ratios.
Figure 4 shows the correlation heat map of the attributes of the Rajamahandravaram SES dataset. The correlation values are indicated by colour, according to the colour scale shown in the figure, and range between -1 and 1. A dark red cell indicates that the two attributes are neutrally correlated with each other, and dark blue indicates zero correlation. The attribute 'electrical' is only marginally correlated with all other attributes. The relationships between attributes are vital for evaluating the fitness of the data set, since one variable may be fully dependent on other variables or only weakly associated with them. Sometimes the relations are more complex, in that one unknown variable depends on two or more other variables. Correlation values are useful for modelling and analysing the data in a better way. Correlation is defined as the statistical relationship between two attributes or variables, and its value can be negative (-ve) or positive (+ve). A +ve correlation means that the two attributes or variables move in the same direction. A -ve correlation means that they move in opposite directions. A neutral or zero correlation means that the two variables are unrelated. In some situations, two or more variables in the dataset are very tightly related, which is called multicollinearity, and this strongly affects the performance of some algorithms such as linear regression. In this situation, we can remove highly correlated attributes from the experimental dataset to improve model performance.
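The idea of measuring pairwise correlation and dropping one attribute from each highly correlated pair can be sketched as follows. This is a minimal standard-library illustration using the Pearson coefficient; the 0.9 threshold, the function names, and the sample attribute values are assumptions made for the example and are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def drop_highly_correlated(table, threshold=0.9):
    """Keep an attribute only if it is not strongly correlated
    (|r| >= threshold) with an attribute that was already kept."""
    kept = []
    for name in table:
        if all(abs(pearson(table[name], table[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical attribute columns for five households.
table = {
    "annual_income": [1, 2, 3, 4, 5],
    "income_private": [2, 4, 6, 8, 11],  # nearly a multiple of annual_income
    "family_size": [4, 2, 5, 3, 4],
}
kept = drop_highly_correlated(table)
```

In this example `income_private` is dropped because it carries almost the same information as `annual_income`, while `family_size` is kept because it is only weakly correlated with it.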

k-Means Clustering
In the k-means cluster analysis, we set the number of clusters to 4: cluster 0, cluster 1, cluster 2, and cluster 3. The total data set size is 1742 (number of household records). Cluster 0 contains 485 instances, cluster 1 contains 480, cluster 2 is the largest with 547 instances, and cluster 3 contains 230 instances. Table 2 shows the centroids of each cluster produced by the k-means unsupervised ML algorithm. Here we describe the centroids of some of the attributes for each cluster and for the full data set. The full data set is associated with the target class value Middle-class, and clusters 0 and 3 are related to Upper-middle. Cluster 1 is related to the class value poor and cluster 2 is related to middle. The detailed cluster centroids are shown in Table 3. Figure 5 shows the relation between cluster number and target classes. In this plot, the X-axis specifies the SES status and the Y-axis indicates the cluster number. The clusters are drawn in different colours and describe the related SES status. Cluster 0 (blue) contains mostly instances from the upper-middle class and very few instances from the middle and rich classes. Cluster 1 (red) is made up of mostly poor and fewer middle-class instances. Cluster 2 (green) is made up mostly of middle-class elements and very few poor elements. Cluster 3 is a combination of middle and upper-middle class elements.
Figure 6 shows the column-wise HC with complete linkage. Here, the closeness relations between the attributes of the SES data set are analysed pairwise. The Euclidean distance is the measure used to construct the HC. The complete HC network is built from 7 clusters, each shown in its own colour in Figure 6. Cluster 1 is built from pairs of attributes such as occupation and Major Work with height 0.4, married People and income status with height 3.93, and female and below 18 with height 0.334. In cluster 1, the most highly correlated attributes are female and below-18-age.

Figure 6: Column- or attribute-wise Hierarchical Cluster Analysis of SES Dataset
Cluster 2 is built from the pair of attributes male and above-18-age with height 0.33, and this pair is connected to social-status at a height of 0.71. Here, the attributes are connected in a chain with increasing height values. In cluster 2, the most highly correlated attributes are male and above-18-age. Cluster 3 is built from 12 essential attributes. In this cluster, the most highly correlated attributes are Diseased-people and income-from-Govt with height 0.109, and the next attributes in the chain are car, others, and so on. Cluster 4 and cluster 5 each consist of a single attribute, gold and nearest-university respectively. Cluster 6 is built from two attributes, income-private and annual-income, with height 0.88. Cluster 7 has 3 attributes: nearest-hospital and family-size, joined at height 0.64, are connected to the nearest-college attribute at height 0.912. According to the analysis, cluster 6 contains more highly correlated attributes than the other clusters, and cluster 7 contains uncorrelated or weakly correlated attributes, whose distance is much higher than that of the attributes in other clusters. The overall cluster analysis uses complete linkage, a maximum depth of 8, a height ratio of 14%, and distances calculated with the Euclidean distance formula.
In the plot against High-Qualification, the X-axis represents the annual income (AI x 1e+06), the Y-axis represents High-Qualification, with a range of 0 to 6, and the target class data points are shown as coloured bubbles (red: poor, green: rich, blue: middle, orange: upper-middle). According to the analysis, income in relation to higher education affects SES levels: the number of poor-class data points decreases in inverse proportion to the level of higher education. Most of the data points at the higher education levels 3, 4, 5, and 6 belong to the middle, upper-middle, and rich classes only. The highest degree, a Ph.D. (level 6), is held only by members of the rich and upper-middle classes. In this analysis, we observe that poverty is related to education.
The orange-coloured data points specify the upper-middle class. Most of the rich data points (green dots) have a high fraction of Annual-Income, land in cents, having a bike, Gold, and Building models. These components are related to the HC target class values rich, poor, middle, and upper-middle. The projections without PCA are shown in Figure 9. These five dimensions are the top five ranked in the experiment with respect to the SES classes, with an accuracy of 99.13%.
Figure 10 shows the HC cluster projections with PCA for the attributes Annual-Income, land in cents, having a bike, Gold, and Building models. The blue elements indicate the middle class, the red elements the poor, the green elements the rich, and the orange elements the upper-middle class. Most of the rich data points (green dots) have a high fraction of Annual-Income, land in cents, having a bike, Gold, and Building models. These components are related to the HC target class values rich, poor, middle, and upper-middle. The projections with PCA-HC are shown in Figure 10. These five dimensions are the top five ranked in the experiment with respect to the SES classes, with an accuracy of 99.93%.

Conclusion
Statistical analysis and machine learning are essential nowadays for analysing socio-economic status problems. In this research work, we collected the Rajamahandravaram SES dataset, analysed it using statistical models, and identified the important features of SES. We also conducted cluster analysis with k-means, HC, and HC-PCA on the SES dataset and obtained good visualization results from this experiment.