µ -shapes: Delineating urban neighborhoods using volunteered geographic information

: Urban neighborhoods are a unique form of geography in that their boundaries rely on a social deﬁnition rather than a well-deﬁned physical or administrative boundary. Currently, geographic gazetteers capture little more than then the centroid of a neighborhood, limiting potential applications of the data. In this paper, we present µ -shapes, an algorithm that employs fuzzy-set theory to model neighborhood boundaries suitable for populating gazetteers using volunteered geographic information (VGI). The algorithm is evaluated using a reference dataset and VGI from the Map Kibera Project. A confusion matrix comparison between the reference dataset and µ -shape’s output demonstrated high sensitivity and accuracy. Analysis of variance indicated that the algorithm was able to distinguish between boundary and interior blocks. This suggests that, given the existing state of GIS technology, the µ -shapes algorithm can enable neighborhood-related queries that incorporate spatial uncertainty, e.g., ﬁnd all restaurants within the core of a neighborhood.


Introduction
Operationally, neighborhood boundaries can be defined by the degree of consensus between individuals.Ethnographic studies of neighborhood boundaries demonstrate that individuals will demarcate their neighborhoods by using the physical and institutional characteristics of the neighborhood, its class, race, and ethnic composition, perceived criminal threats from within and outside the neighborhood, and symbolic neighborhood identities [2].Guo and Bhat [17] offer a functional definition of neighborhoods as discreet, non-overlapping communities generally defined by their physical geographies, natural advantages, and transportation systems.Urban neighborhoods have also been delineated by their patterns of land use such as types of residential or commercial buildings [9,15].
Neighborhood delineation is a multidisciplinary task with two primary goals.The first is to define a neighborhood as a unit of observation.Neighborhoods are studied in housing market research [3], crime analysis [13,29], and health research [25,26,31].These studies are more concerned with understanding the effects a community and its built environment have on social processes rather than defining urban spaces with a common name.The second goal (and the primary aim of this paper) is more concerned with identifying the spatial extent of neighborhoods to populate gazetteers [21,34,35].This has utility for queries in spatial databases.For example, a query to find all the restaurants within a particular neighborhood would rely on an accurate gazetteer containing the neighborhood boundaries.
The purpose of this research is to demonstrate a way to automate the delineation of urban neighborhood boundaries suitable for populating a gazetteer.A suitable method should draw on local data to delineate neighborhoods, allowing for an accurate representation of perceived boundaries.Additionally, the method should adhere to the logic that individuals use to demarcate their neighborhood boundaries.That is, the method should consider geographic context (e.g., major road boundaries or rivers) as a means to demarcate boundaries.Finally, the result of the method should account for spatial uncertainty in delineating the boundaries.Specifically, we apply a fuzzy-set model to address geographic vagueness as discussed in Varzi [32].
The following presents an expansion of the betashapes algorithm described in Erle [10].While the original betashapes does not take into account context or uncertainty, it does provide a scalable solution to delineate neighborhood boundaries that adheres to the logic individuals use to demarcate their neighborhoods.We use the term µ-shapes to describe our algorithm because it incorporates fuzzy set theory to represent spatial uncertainty [36].
Fuzzy sets generalize traditional sets by generalizing the binary membership response, usually to a functional response with continuous range [0, 1] [36].Fuzzy sets and the accompanying fuzzy logic have been used in a broad range of residential and industrial (e.g., railroad systems) applications [33].Fuzzy geographic applications were conceived early on [23], with the consequent fuzzy modeling of land-use classification [22,8], and fuzzy representation of continuous spatial phenomena [14,37].
This paper presents an expansion of the betashapes algorithm, incorporating fuzzy logic for the boundary demarcation process to quantify spatial uncertainty.In Section 2, we list the commonly used approaches to delineate urban neighborhoods.In Section 3, we present our method of delineating neighborhood boundaries.Section 4 presents an accuracy assessment for our method.Section 5 concludes with a discussion of future work.

Related work
A number of researchers [2,6,24,30] have suggested using the aggregated cognitive maps of individuals to define neighborhood boundaries.In a study of neighborhoods in a city in Ohio, Coulton et al. [6] asked residents to draw their neighborhood boundaries onto a printed map.The researchers aggregated the core areas where 70% of respondents had drawn boundaries containing the same area [6].Their research demonstrated that residents consistently drew their neighborhood boundaries along physical barriers or areas that were www.josis.orgconsidered to have different socio-demographic attributes.While this work demonstrates a way to map the perceived boundaries of urban neighborhoods, the time to recruit and train local experts is prohibitive in relation to setting provisions to collect volunteered geographic information.
Other solutions have been explored to derive the spatial footprints of vague geographic regions (e.g., neighborhoods) using aggregated volunteered geographic information [7,16,18].These solutions rely on point-to-area methods to estimate the spatial footprint of imprecise geographic regions.One key advantage of point-to-area methods is that they are able to leverage volunteered geographic information (VGI) from popular social media platforms and thus provide a scalable means to populate gazetteers.
Kernel density estimates (KDE) are commonly used to delineate vague regions from point datasets [16,20].This method assigns a certainty to every point using a kernel function.This has the advantage of simplicity but lacks support for geographic context.Additionally, KDE allow for areas to be disjoint and may not create a realistic representation of neighborhoods which are compact and spatially contiguous.
Alani et al. [1] use Voronoi diagrams to estimate the areal extent of large vernacular regions of Scotland.This method requires the use of negative space.That is, it is necessary to have a set of points that are positively in the region as well as a set that are definitely outside of the region.A key limitation to this method is that the borders are unlikely to adhere to standard boundaries, since Voronoi cells are driven by the point distribution of the VGI.
[27] used α-shapes to determine the spatial extent of the South of France from point coordinates.This approach cannot deal with context (e.g., physical boundaries) or areas with varying density.Density is especially problematic when using geographic information derived from social media posts.For example, popular tourist destinations tend to be over represented in Flickr datasets whereas residential areas are underrepresented in coverage [18].This makes it difficult to determine the extent of primarily residential neighborhoods.
Wilske [34] uses geotagged social media documents (Flickr photos) annotated with place names to approximate the boundaries of neighborhoods in New York City.The author uses fuzzy sets to represent the spatial footprint of each neighborhood.The model represents a vague region through a pair of concentric regions with determinate boundaries [5].This model assumes that a vague region has a core area that can be unambiguously assigned to the region, the lower approximation, surrounded by an area where membership within the region is uncertain, the upper approximation.Wilske [34] uses the spatial median derived from all geo-tagged documents to define the lower approximation of the model and defines the upper approximation as the convex hull (of the same set of geotagged documents).
The method described in Wilske [34] provides a scalable means to populate gazetteers using the spatial footprints of neighborhood boundaries.This method has two key limitations.First, the convex hull drawn around the individual VGI points is arbitrary and is unlikely to adhere to the perceived boundaries of individual residents.Second, the distance from the spatial median is not an informative measure for vague boundaries.The certainty that an area belongs to a neighborhood is not the same in all directions from the spatial median and is determined by other factors (e.g., physical barriers).
Erle [10] provides a possible point-to-area solution to generate neighborhood boundaries that considers the physical geography of a city.His method uses line datasets for major physical barriers (e.g., roadways, rivers, railroads) to partition the area of a city into meaningful geographic primitives (blocks).Each block is then assigned to a neighborhood based on its proximity to a set of geographic coordinates tagged with a place name.A key limitation to this approach is that it does not account for spatial uncertainty when determining neighborhood boundaries.In the next section, we present a modification to the betashapes algorithm using an index to represent spatial uncertainty.

Methods
Given a spatial extent S and a set of associated features (e.g., roads, rivers, railways), we can partition S by the features' boundaries, into the set of blocks B. (While geographic feature data may not be sufficient to partition, e.g., a rural spatial extent or a spatial extent with few associated geographic layers, we assume the feature set of an urban environment, particularly one associated with VGI, is rich enough to partition S. Spatial extents with feature sets failing this assumption our beyond the scope of this work.)A µ-shape is a contiguous region R ⊆ B, such that, each b ∈ B comprising R has one or more fuzzy metrics reflecting b's degree of membership to R. Intuitively, we think of each µ-shape as a fuzzy neighborhood.Given S and a set of polylines representing urban boundaries and natural barriers, point set P of VGI, and a user-defined number, k, of nearest neighbors (by Euclidean distance from the centroid of a block), the µ-shapes algorithm (Algorithms 1-3) generates a set of µ-shapes, each a contiguous collection of blocks, with boundaries that adhere to the physical structure of a city, and two fuzzy metrics.The first metric, µ VGI (b), is the fraction of VGI from block b's k nearest neighbors reporting membership to R. The second metric, µ ADJ (b), is the fraction of blocks adjacent to b, reporting the same parent neighborhood R as b.

Algorithm 1 µ-shapes
Input: A set of points P over a spatial extent S, each with a VGI neighborhood name; a set of polylines, representing urban boundaries and natural barriers; and a natural number k, where 1 ≤ k ≤ |P |.Output: A set of blocks B that partition S; for each block in B, the assigned parent neighborhood's name and µ VGI and µ ADJ values 1.Given S and a set of polylines, representing urban boundaries and natural barriers, generate the corresponding set of blocks B, where B is a partition of S. In this work, each block is assigned to a unique neighborhood, or in rare cases, unassigned.We could have defined the µ-shape to assign each block fuzzy metric scores for each neighborhood name, but this would have created a scaling challenge.Thus, we opted for low computational cost over complete fuzzy membership.Alternative definitions of www.josis.org

End.
µ-shape could be formulated that enrich neighborhood membership with limited increase in computational cost.
µ VGI and µ ADJ address complementary aspects of neighborhood membership.The former metric is a measure of social consensus and therefore dependent on variations in the population that contributes VGI, as well as confounders, such as mobile device dead zones.The latter metric is a measure of physical adjacency and therefore subject to boundary effects and the geometric nature of the urban and natural boundaries.In the section that follows, we evaluate the µ-shapes algorithm and the performance of the two fuzzy metrics, via a case study.4. End.

Experiment and results
We present the performance of our algorithm against one technique for delineating neighborhood boundaries.Namely, we evaluate how well our algorithm compares to hand drawn boundaries by community experts.A full comparison of our algorithm to other methods is beyond the scope of this project and is left for future work.
The test dataset was derived from the Map Kibera Project (http: mapkibera.org),a resident led effort to map the neighborhoods and community amenities of the Kibera region of Nairobi, Kenya.The dataset consists of a shapefile of VGI, GPS points with an assigned toponym for one of fifteen neighborhoods.Additionally, the dataset contains a set of polygons for the neighborhood boundaries.The line dataset used to represent major physical barriers (and thus potential neighborhood boundaries) was derived from road, river, and railway data courtesy of OpenStreetMap, an open source repository of transportation data.
The Map Kibera Project is a community and NGO effort to collect significant geographic features of Kibera.The group splits into teams to collect points of interest (e.g., schools, clinics) throughout the community using GPS receivers.Each point has the neighborhood that the point resides in as part of its metadata.The neighborhood assignment to individual points was determined through consensus with the residents involved in the project.We use the point locations and the metadata as the input to test our algorithm.The neighborhood boundaries were hand-drawn over a satellite image of Kibera by a group of residents and volunteers.We use the resident drawn boundaries as the basis of comparison for our algorithm.The Kibera dataset was selected because the neighborhood boundaries were community derived rather than defined administratively or by a cartographic expert.Moreover, the VGI point data is well distributed and each labelled point falls into its respective neighborhood.Figure 1 shows the neighborhood boundaries overlaid with their respective neighborhood points.
Figure 2 shows the areal extent of Kibera overlaid with a dataset of aggregated line barriers (roadways, rivers, and railways).The blocks that are used for the evaluation were derived from the polyline data and the vector shapefile for the boundary of Kibera.The Kibera boundary data was split into pieces by the overlain polyline data.In the evaluation, the blocks are the spaces between the polylines.This was done in ArcGIS using the "Cut Polygons" tool [11].The size and shape of the blocks are determined by, and sensitive to the availability of, the polyline data.
Referring to the algorithm, each block is assigned to a neighborhood based on the mode nominal value of the nearest k VGI points.After each block is assigned to a neighborhood, the fuzzy metrics µ VGI and µ ADJ are computed, indicating the degree of membership of a block to the neighborhood.Figure 3 presents the neighborhood assignment.Each neighborhood derived from the µ-shapes algorithm is overlaid with its original neighborhood.Block misassignment can occur if either the original boundaries are not derived from a line barrier (e.g., a park) or if the point coverage for nearby primitives is sparse.
A key limitation of the original betashapes algorithm [10] is that it does not provide a measurement for spatial uncertainty.It is important to express and be able to visualize spatial uncertainty because of the inherent vagueness of neighborhood boundaries.Thus, the use of fuzzy sets in this paper.Figure 3 shows the neighborhoods derived from the µ-shapes algorithm overlaid with their respective blocks.The red boundaries illustrate the boundaries of the reference neighborhoods.For comparison purposes with the reference dataset, which has a traditional set notion of membership, we merge the blocks by their assigned neighborhood, regardless of the fuzzy metric scores.In practice, a minimum fuzzy metric threshold may improve accuracy of neighborhood assignment.We opted to compare block assignment to neighborhoods against the reference dataset, using the minimum membership assignment.

www.josis.org
To compare the algorithm's neighborhood assignment against the reference neighborhoods, a confusion matrix was calculated.The confusion matrix was derived by comparing the reference neighborhoods against the µ-shape neighborhoods (Figure 5).From the confusion matrix, a number of classification metrics can be derived.We computed the sensitivity, the accuracy, and the kappa coefficient for each neighborhood, and the overall accuracy and kappa coefficient for the algorithm's entire output.A kappa coefficient is a measure of inter-rater agreement which determines the extent to which the agreement of a classification is due to "true" agreement rather than mere "chance agreement" [4].
Tables 1 and 2 present the neighborhood and overall results, respectively.The overall accuracy of the neighborhood classification is approximately 90%, with a kappa coefficient of 91% better than random labelling.The average sensitivity is approximately 90% and all but four classes have a sensitivity greater than 85%.The lowest sensitivity (Kambi Muru) can be explained by the proximity measurement used to classify neighborhood blocks: while a VGI point labelled Kambi Muru overlaps a misclassified block, the mode for the nearest k points was attributed to the Kisumu Ndogo neighborhood.This demonstrates that the µ-shapes algorithm is sensitive to the selection of k and the VGI's distribution.
We evaluated µ ADJ and µ VGI 's ability to differentiate between a neighborhood's boundary blocks D (blocks that are adjacent to or overlap two or more neighborhoods), exterior blocks E (blocks that are not in D and on the boundary of the spatial extent S), and interior blocks I (blocks that are not in D or E), relative to the neighborhoods.To compute D, E, and I, we used a combination of spatial queries in ArcGIS [11], followed by visual inspection to finalize assignment to D, E, or I.This assignment was made only when there was agreement between the reference and computed neighborhoods and when a block could be assigned to a unique block category.From the 235 Kibera blocks, we found 99 boundary blocks D, 41 exterior blocks E, 71 interior blocks I, and 24 blocks for which assignment was not possible.We applied one-way analysis of variance test (ANOVA) to the values of µ ADJ and µ VGI by block group (D, E, and I), followed by multiple comparisons analysis (Table 3).For each fuzzy metric, for each block group pair, the table summarizes the difference of means, 95% confidence intervals of the lower and upper limits of the difference of means, and the p-value.If the difference of means interval does not contain 0 and p ≤ 0.05, then the fuzzy metric has successfully differentiated between the block groups.We see that µ ADJ and µ VGI were able to distinguish between block groups D and E, and D and I, but not E and I. Since E and I can be distinguished by virtue of their relationship to www.josis.org the boundary of the spatial extent S, the results imply that we can distinguish between any two block groups.
Since the ANOVA results are similar for µ ADJ and µ VGI , it is not clear that both metrics are needed to score block membership to a neighborhood.To assess the degree of independence of the two fuzzy metrics for this case study, we computed the linear correlation coefficient between µ ADJ and µ VGI , yielding R 2 = 0.38.The low correlation indicates a degree of independence between the metrics, suggesting that further investigation of µ ADJ and µ VGI is required.

Discussion and future work
The µ-shapes algorithm couples VGI data, specifically OpenStreetMap data and geotagged social media posts, with existing natural and infrastructural boundary data to address automated neighborhood delineation.Despite the data dependencies, in light of the recent surge in VGI, we expect the µ-shapes algorithm to be broadly applicable.
The confusion matrix comparison between the reference dataset and algorithm's output demonstrated high sensitivity and accuracy for this case study.ANOVA indicated that µ ADJ and µ VGI were able to distinguish between boundary (D) and interior (I) blocks, but neither metric could differentiate between exterior (E) and interior (I) blocks.However, a simple spatial query can distinguish between blocks in E and I based on whether they lie on the boundary of the spatial extent S or not.Thus, given the inputs required by this research and a GIS to perform simple spatial queries, the µ-shapes algorithm proves effective in delineating neighborhoods, and in distinguishing between different parts of a neighborhood.
The µ-shape was defined for scalability, which is why each block is assigned to a unique neighborhood.For large geographic regions, it would be impractical to assess the degree of block membership to each neighborhood in the feature space.Having said that, alternative µ-shape definitions that score fuzzy block membership to multiple neighborhoods is conceivable, and may be desirable in certain instances.
There are a number of practical limitations to this algorithm.First, if a neighborhood lacks VGI representation, nearby neighborhoods will expand without limitation.Second, it requires supporting vector data to construct the blocks.Third, the algorithm performs poorly when there are too few points to initialize a mode or when the point distribution is biased.Finally, the output is sensitive to the selection of k (nearest neighbors).
It is worth noting that the Kibera dataset provides a near perfect dataset to test the µ-shapes algorithm.Other datasets, such as geotagged photos, will inevitably require some pre-processing step to filter outliers and mistagged points.The Kibera neighborhood boundaries were also mapped directly to hard physical barriers (i.e., roadways).The www.josis.orgµ-shapes algorithm can accommodate other physical barriers, such as sharp changes in elevation or large parks, or more abstract boundaries, such as areas of different socio-economic status, only if these boundaries/barriers have been previously delineated.
Previous work has addressed topological classification of a region based on a pointset of data, and could be applied to handle misassignment arising from boundary, or other topological features [12,19,28].This could support automated handling of each topological subcase rather than the present system that requires a human to form a judgement about the block's group membership, when the fuzzy metrics have low scores.
In future work, we will evaluate a broader scope of membership metrics, including the application of consensus over a set of metrics to arrive at a fuzzy determination of block membership to a neighborhood.The algorithm will also be generalized to accommodate sparse datasets, to weigh the importance of the k nearest neighbors according to their distance from the block, to let k vary according to the distribution of VGI, and to address regions where block computation is challenged by limited boundary information.

2 .
For each block b in B: a. Determine the k nearest neighbors, the points closest to the centroid of b, P b ; b.BlockNbhd(b, P b ): compute b's parent neighborhood name and µ VGI (b); c.BlockScore(b, A b ): compute µ ADJ (b).

Algorithm 2
BlockNbhd(b, P b ) Input: Block b and the k nearest points from the centroid of b, P b Output: The name of b's parent neighborhood and the VGI membership value of b, µ VGI (b) 1.If name has a mode m among the points in P b : a. Set block name to m, b.name = m; b.Compute µ VGI (b) = |m|/k, the fraction of points in P b with name b.name.2. Else: a.If |P b | == 2: i. Block b is not assigned to a neighborhood; ii.End.b.Else: i. Remove the furthest point from P b , P b ← P b − "furthest point"; ii.BlockNbhd(b, P b ); iii.End.c. End.

Algorithm 3
BlockScore(b, A b ) Input: Assigned block b and its adjacent blocks A b Output: The adjacency membership of b, µ ADJ (b) 1. Initialize Adj Count = 0. 2. For each adjacent block a in A b : a.If a.name == b.name: i. Adj Count++; b.End.

3 .
µ ADJ (b) = Adj Count/|A b |, the fraction of blocks in A b with the same name as b.

Figure 1 :
Figure 1: Kibera reference neighborhoods and the corresponding VGI point dataset, where each VGI point falls in its parent neighborhood.

Figure 4
Figure4presents the mapped µ ADJ and µ VGI results.The bottom panel presents neighborhood assignment using the adjacency fuzzy metric, µ ADJ , while the top panel presents assignment when using the VGI fuzzy metric, µ VGI .Here, we used Jenks natural breaks to stratify the symbology in each figure.Darker areas on the map indicate a higher membership value as determined by µ ADJ or µ VGI .For comparison purposes with the reference dataset, which has a traditional set notion of membership, we merge the blocks by their assigned neighborhood, regardless of the fuzzy metric scores.In practice, a minimum fuzzy metric threshold may improve accuracy of neighborhood assignment.We opted to compare block assignment to neighborhoods against the reference dataset, using the minimum membership assignment.To compare the algorithm's neighborhood assignment against the reference neighborhoods, a confusion matrix was calculated.The confusion matrix was derived by comparing the reference neighborhoods against the µ-shape neighborhoods (Figure5).From the confusion matrix, a number of classification metrics can be derived.We computed the sensitivity, the accuracy, and the kappa coefficient for each neighborhood, and the overall accuracy and kappa coefficient for the algorithm's entire output.A kappa coefficient is a measure of inter-rater agreement which determines the extent to which the agreement of a classification is due to "true" agreement rather than mere "chance agreement"[4].Tables1 and 2present the neighborhood and overall results, respectively.The overall accuracy of the neighborhood classification is approximately 90%, with a kappa coefficient of 91% better than random labelling.The average sensitivity is approximately 90% and all but four classes have a sensitivity greater than 85%.The lowest sensitivity (Kambi Muru)

Figure 3 :
Figure 3: The reference Kibera neighborhoods (red polylines) and the algorithm's block assignment to neighborhoods (colored by neighborhood).

Figure 4 :
Figure 4: Kibera neighborhoods and the blocks comprising them, scored according to the fuzzy metrics µ ADJ (bottom) and µ VGI (top).

Figure 5 :
Figure 5: The reference Kibera neighborhoods (top) and the neighborhood assignments, as computed by the µ-shapes algorithm (bottom).

Table 1 :
Neighborhood-level assessment of the µ-shapes algorithm's classification.

Table 2 :
Overall assessment of the µ-shapes algorithm's classification.Science and Education through an interagency agreement between the U.S. Department of Energy and NGA.

Table 3 :
The results of balanced one-way ANOVA with post-hoc multiple comparisons analysis provides evidence that both µ ADJ and µ VGI differentiate between block groups D and E, and D and I, but not E and I.