CLUSTERING ANALYSIS FOR RESIDENTIAL AREAS BASED ON NEIGHBORHOOD AMENITIES

Poor execution of urban planning is closely tied to the problem of housing, and the use of urban land in cities can be improved. The housing problem has become acute because of the tremendous increase in urban population and the unplanned growth of cities. Mumbai, with a population of 20,411,000, is the target of our analysis. Affordable housing in Mumbai has become a formidable challenge and is one of the most complex problems in the city; about 42% of Mumbai's housing comprises slums. With this in mind, our aim is to support house-buying decisions by recommending localities with basic amenities. We hope to streamline the process of evaluating residential areas and to highlight areas with housing potential. We use K-Means Clustering to group the neighborhoods of Mumbai based on the availability of 31 amenities in each neighborhood. The list of neighborhoods is taken from Wikipedia, and the Foursquare API provides the list of amenities in each neighborhood. We then evaluate the model using the silhouette score and use folium to plot the clusters on a map of Mumbai.

Objectives:
To cluster the neighborhoods of Mumbai according to their prospect of growth, and to visualize the clusters on a map of Mumbai to streamline housing-related decision-making.

Research Background:
In this section we discuss the variables used for clustering and the method by which we procured the data.
Understanding the Variables: We use the Foursquare API to retrieve all of the amenities available in the neighborhoods, and 31 amenities have been selected for analysis. The measure of each amenity is the number of instances of that amenity divided by the total number of amenities, which indicates its level of availability. We then use feature selection to trim the amenities, based on a study by Strutt & Parker (a real estate agency).
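The availability measure described above can be sketched in a few lines. This is only an illustration: the category names are hypothetical, and it assumes the share is computed per neighborhood from the list of venue categories returned for it.

```python
from collections import Counter


def amenity_shares(venue_categories):
    """Return each amenity's share of all amenities found in one neighborhood."""
    counts = Counter(venue_categories)
    total = sum(counts.values())
    if total == 0:
        return {}  # no venues returned for this neighborhood
    return {amenity: n / total for amenity, n in counts.items()}


# Hypothetical venue categories for a single neighborhood.
venues = ["Restaurant", "Restaurant", "Park", "Shopping Store"]
shares = amenity_shares(venues)
print(shares)
```

Because every share is a fraction of the same total, the shares of one neighborhood sum to 1, which makes neighborhoods with very different venue counts comparable.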

Research Methodology:
In this section, we describe the procedure followed to cluster residential areas by neighborhood. We also cover feature selection after assessing the quality of the variables, and we discuss K-Means Clustering and the metric used to evaluate our model.

Data Preparation and Feature Selection:
The dataset contains neighborhood names with their latitude and longitude coordinates, along with the value of each amenity (instances of the amenity divided by total amenities). We then use a ranking system, inspired by a study from the Strutt & Parker real estate agency, to prioritize the variables. The ranking is based on the effect of each parameter on the cost of a residential building. According to the findings, Airport and Scenic Outlook make a neighborhood prone to development. For other factors, the order of precedence is as follows:
1) Shopping Stores
2) Transport Hub
3) Restaurants
4) Green Spaces
5) Sports and Recreation
We settle on 8 categories to create a prioritized scoring system for all neighborhoods. We multiply the values in each column by a weight from 1 to 8 according to its priority: the variables that affect prices the most are multiplied by 8, the next by 7, and so on.
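A minimal sketch of the prioritized scoring follows. Several things here are assumptions rather than details from the study: the exact weight given to each category, the omission of the eighth category (which the text does not name), and the idea that a neighborhood's Lifestyle Score is the sum of its weighted amenity values.

```python
# Hypothetical weights: the text fixes only the order of precedence,
# not which weight each category receives.
PRIORITY_WEIGHTS = {
    "Airport": 8,
    "Scenic Outlook": 7,
    "Shopping Stores": 6,
    "Transport Hub": 5,
    "Restaurants": 4,
    "Green Spaces": 3,
    "Sports and Recreation": 2,
}


def lifestyle_score(shares, weights=PRIORITY_WEIGHTS):
    """Sum of weight * share; categories without a weight contribute nothing."""
    return sum(weights.get(category, 0) * share
               for category, share in shares.items())


# Hypothetical amenity values (amenity count / total amenities).
example = {"Shopping Stores": 0.25, "Restaurants": 0.25, "Green Spaces": 0.125}
score = lifestyle_score(example)
print(score)  # 6*0.25 + 4*0.25 + 3*0.125 = 2.875
```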

Model Development:
We develop a model using the unsupervised machine learning algorithm K-Means Clustering. Unsupervised algorithms, unlike supervised learning algorithms, are given unlabeled data and find patterns and features on their own to make sense of it. Clustering analysis can be defined as the partitioning of data points into subgroups (clusters) such that the data points within a subgroup are similar to one another, while data points in different subgroups are dissimilar. Similarity can be measured by either correlation or Euclidean distance between the data points.
K-Means is an iterative algorithm that divides data points into k distinct, non-overlapping clusters, making the data points within a cluster as similar as possible while keeping different clusters as far apart as possible. It works to minimize the sum of squared distances between each data point and the centroid of its cluster; the less variation within a cluster, the more homogeneous it is.

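K-Means follows the Expectation-Maximization approach. The objective function minimized to obtain the best model is:
Equation 1: Expectation-Maximization objective
$$J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \, \lVert x_i - \mu_k \rVert^2$$
Here, $w_{ik} = 1$ if $x_i$ belongs to cluster $k$ and $0$ otherwise, $x_i$ is the $i$-th of $m$ data points, and $\mu_k$ is the centroid of cluster $k$.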
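Here, the E-step assigns each data point to a cluster: J is minimized with respect to $w_{ik}$ while the centroids are held fixed, so each point is assigned to its nearest centroid. The M-step then minimizes J with respect to $\mu_k$ while the assignments are held fixed, which sets each centroid to the mean of the points assigned to it. The two steps repeat until the assignments stop changing. We set k = 5 for our experiment and use the Lifestyle Score as the discerning factor for cluster assignment.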
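The alternation between the two steps can be sketched in plain Python for a single feature such as the Lifestyle Score. This is an illustrative sketch, not the implementation used in the study: the scores are made up, the deterministic initialization is an assumption, and in practice a library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalar values; returns (labels, centroids)."""
    ordered = sorted(values)
    # Assumed initialization: spread starting centroids over the sorted values.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = None
    for _ in range(iterations):
        # E-step: assign each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: (x - centroids[j]) ** 2)
                      for x in values]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # M-step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids


def objective(values, labels, centroids):
    """J: sum of squared distances from each point to its cluster centroid."""
    return sum((x - centroids[lab]) ** 2 for x, lab in zip(values, labels))


# Hypothetical Lifestyle Scores, clustered with k = 5 as in the text.
scores = [0.1, 0.3, 1.0, 1.2, 2.0, 2.1, 2.6, 2.8, 3.6, 3.7]
labels, centroids = kmeans_1d(scores, k=5)
print(labels, centroids, objective(scores, labels, centroids))
```

Each E-step and each M-step can only lower J or leave it unchanged, which is why the loop settles on a fixed assignment.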

Cluster 1:
Cluster 1 groups regions with a poor Lifestyle Score. Range: 0.857 - 1.800.

Cluster 2:
Cluster 2 groups regions with an average Lifestyle Score. Range: 1.880 - 2.359.
Figure 6: Cluster 2 Lifestyle Score.

Cluster 3:
Cluster 3 groups regions with the worst Lifestyle Score. Range: 0.000 - 0.500.

Cluster 4:
Cluster 4 groups regions with a good Lifestyle Score. Range: 2.405 - 3.048.

Cluster 5:
Cluster 5 groups regions with the best Lifestyle Score. Range: 3.500 - 3.818.

We then proceed to evaluate our model.

Model Evaluation:
We use the silhouette coefficient to evaluate our clustering model; it measures how coherent the data points within each cluster are. The silhouette coefficient ranges from -1 to +1. A value close to +1 means the clusters are well separated and clearly distinguished. A value close to zero means the clusters overlap and the distance between them is not significant. A value close to -1 means data points have been assigned to the wrong clusters.

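Equation 2: Silhouette Score
$$\text{Silhouette Score} = \frac{b - a}{\max(a, b)}$$
Here, $a$ is the average intra-cluster distance (the mean distance between a data point and the other points in its own cluster) and $b$ is the average inter-cluster distance (the mean distance between that data point and the points of the nearest other cluster). The score of the model is the mean of this value over all data points.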
For our model the Silhouette coefficient is 0.58354.
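To make the formula concrete, the sketch below computes the silhouette score for scalar values in plain Python. The data are made up and do not reproduce the 0.58354 reported above; giving single-point clusters a score of 0 is a common convention assumed here. A library routine such as scikit-learn's `silhouette_score` would normally be used instead.

```python
def silhouette_score_1d(values, labels):
    """Mean of (b - a) / max(a, b) over all points, for scalar values."""
    clusters = {}
    for x, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    per_point = []
    for x, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            per_point.append(0.0)  # assumed convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(sum(abs(x - y) for y in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        per_point.append((b - a) / denom if denom > 0 else 0.0)
    return sum(per_point) / len(per_point)


values = [0.0, 1.0, 10.0, 11.0]
good = silhouette_score_1d(values, [0, 0, 1, 1])  # well-separated clusters
bad = silhouette_score_1d(values, [0, 1, 0, 1])   # points in the wrong clusters
print(good, bad)
```

The first labeling scores close to +1 and the second is negative, matching the interpretation of the coefficient given above.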

Color Index:
Cluster 1 (poor): medium sea green
Cluster 2 (average): teal
Cluster 3 (the worst): yellow
Cluster 4 (good): dark magenta
Cluster 5 (the best): crimson

From these visualizations, we can see that Central Mumbai has some of the best neighborhoods based on availability of amenities, followed by South Mumbai, North Mumbai and Upper-Central Mumbai.
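A map like the one described could be drawn with folium roughly as follows. This is a sketch under assumptions: the function name, the row format, the marker styling and the approximate map center for Mumbai are all illustrative, and the colors are the CSS names matching the color index.

```python
# Color index from the text, as CSS color names (cluster number -> color).
CLUSTER_COLORS = {
    1: "mediumseagreen",
    2: "teal",
    3: "yellow",
    4: "darkmagenta",
    5: "crimson",
}


def build_cluster_map(rows, center=(19.0760, 72.8777), zoom_start=11):
    """Plot (name, latitude, longitude, cluster) rows as colored circles.

    `center` is an approximate location for Mumbai (an assumption).
    """
    import folium  # imported here so the color index is usable without folium

    cluster_map = folium.Map(location=list(center), zoom_start=zoom_start)
    for name, lat, lon, cluster in rows:
        color = CLUSTER_COLORS[cluster]
        folium.CircleMarker(
            location=[lat, lon],
            radius=6,
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.7,
            popup=f"{name} (Cluster {cluster})",
        ).add_to(cluster_map)
    return cluster_map
```

With folium installed, calling `build_cluster_map(rows).save("mumbai_clusters.html")` on a list of such rows would write an interactive map that can be opened in a browser.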

Conclusion:
The neighborhoods shown in dark magenta or crimson show great prospects for growth and have a multitude of amenities in their vicinity. These neighborhoods should be the top picks for anyone looking to invest in property or buy a house. We have thus clustered the 136 neighborhoods of Mumbai by the availability of amenities and shown how they can be told apart.