The utility of Zip4 codes in spatial epidemiological analysis

Jayakrishnan Ajayakumar; Andrew Curtis; Jacqueline Curtis

doi:10.1371/journal.pone.0285552

Abstract

There are many public health situations within the United States that require fine geographical scale data to effectively inform response and intervention strategies. However, a condition for accessing and analyzing such data, especially when multiple institutions are involved, is being able to preserve a degree of spatial privacy and confidentiality. Hospitals and state health departments, which are generally the custodians of these fine-scale health data, are sometimes understandably hesitant to collaborate with each other due to these concerns. This paper looks at the utility and pitfalls of using Zip4 codes, a data layer often included as it is believed to be “safe”, as a source for sharing fine-scale spatial health data that enables privacy preservation while maintaining a suitable precision for spatial analysis. While the Zip4 is widely supplied, researchers seldom utilize it, nor are its spatial characteristics well known to data guardians. To address this gap, we use the context of a near-real time spatial response to an emerging health threat to show how the Zip4 aggregation preserves an underlying spatial structure, making it a potentially suitable dataset for analysis. Our results suggest that, depending on the density of urbanization, Zip4 centroids are within 150 meters of the real location almost 99% of the time. Spatial analysis experiments performed on these Zip4 data suggest a far more insightful geographic output than if using more commonly used aggregation units such as street lines and census block groups. However, this improvement in analytical output comes at a spatial privacy cost, as Zip4 centroids have a higher potential of compromising spatial anonymity than other aggregations, with 73% of addresses having a spatial k-anonymity value less than 5. We conclude that while the Zip4 offers an exciting opportunity to share data between organizations, researchers and analysts need to be made aware of the potential for serious confidentiality violations.

Citation: Ajayakumar J, Curtis A, Curtis J (2023) The utility of Zip4 codes in spatial epidemiological analysis. PLoS ONE 18(5): e0285552. https://doi.org/10.1371/journal.pone.0285552

Editor: Ayesha Maqbool, National University of Sciences and Technology NUST, PAKISTAN

Received: October 10, 2022; Accepted: April 25, 2023; Published: May 31, 2023

Copyright: © 2023 Ajayakumar et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: https://figshare.com/s/91f8c142299dd6fa1b99.

Funding: Research reported in this publication was supported by University Hospitals and by the Ohio Department of Higher Education Third Frontier Research Incentive under an award entitled “Geographic Monitoring for Early Disease Detection (GeoMEDD): An Actionable Warning System for Opiate Overdoses in Ohio.”

Competing interests: The authors have declared that no competing interests exist.

Introduction

Disease surveillance data or health outcome data, such as that collected by hospitals or made available to local health departments, can provide the opportunity for near-real time spatial analysis leading to on-the-ground improvements in healthcare [1]. These data might be used to target opioid overdose or gun injury clusters [2], as well as emerging infectious disease threats such as the recent Covid-19 outbreak [3]. Precise on-the-ground insights can support operational response, whether through near-real time spatial syndromic surveillance [4], cross-sectional cluster detection [5], or risk mapping [6]. For maximum intervention insight, for example identifying which congregate buildings are becoming “hot” with disease, it is vital to use granular data, such as a patient’s residential address or the location of an overdose. This is not an unusual situation, as it mirrors typical emergency service response using 911 call-for-service data.

However, in a region served by multiple health systems, while such granular data exist within the Electronic Medical Record (EMR) or Emergency Department (ED) visit records, there is traditionally a barrier between institutions sharing their data because of spatial confidentiality and business concerns [7]. This concern is understandable, as a plethora of works in spatial confidentiality research suggest that the use and sharing of personal geospatial information puts data contributors (research participants or even patients) at risk of being identified [8–11]. This situation becomes even more untenable as other data attributes, or “quasi identifiers”, are added to the geographic location [12]. And yet not finding an acceptable spatial analytical solution to data sharing at a granular geographic scale can inhibit an effective health response.

Geoscientists have tried to tackle spatial confidentiality challenges by developing spatial privacy protection policies or by manipulating the underlying geospatial data to preserve locational privacy [13]. While spatial privacy policies can help to alleviate privacy and confidentiality concerns, stringent policies such as no data sharing between research groups or health institutions can act as a major barrier for research collaborations and research reproducibility, and might also lead to repetitive data collection [14]. The constraints associated with spatial privacy policies led to the development of geomasking/obfuscation techniques, which use mathematical transformations of geospatial data to protect locational privacy. While several variations exist, such as donut masking [15], Voronoi masking [16], and adaptive areal elimination [17], the complexities associated with implementing these types of advanced methodologies have led health institutions to primarily mask their data using spatial aggregation [18], which is the easiest to understand and implement.

While spatial aggregation to a postal delivery area (zip code) [19], census enumeration unit, or even a county can be easily achieved using software commonly available in a hospital system, the associated loss in analytical power can have considerable operational consequences [7]. Simply put, the greater the aggregation, the lower the risk of re-engineering, but also the lower the insight in terms of detecting spatially specific risk through clustering [19, 20]. This tension is well known among geoscientists, and previous studies have tried to quantify the error involved and the anonymity preserved while using aggregation based on commonly available spatial units such as county, census enumeration unit, postal code, and census tract. However, by comparison, little to no investigation has occurred regarding the use of one particular aggregation, the Zip4, which is an addition to the five-digit ZIP code of four additional numbers representing a geographic segment such as a city block, office building, an individual high-volume receiver of mail, or other distinct mail units. While the Zip4 geocode has been used as a source of spatial data in research [21–23], little is known about its operational strengths and weaknesses. This might be due to the lack of readily available polygon geometries that represent Zip4 codes. Indeed, while its geographic expression may vary, it is typically linear, which is not a commonly perceived form of spatial aggregation. This has led health institutions to extract only postal code data (the five-digit zip code) from the Zip4 and use it for aggregation. In this paper, we assess the suitability of Zip4 codes as a source for transferring confidential locational data when viewed simultaneously through the lens of accuracy and confidentiality, through a series of comparative experiments with other commonly used sources of spatial aggregation, and couch the results in terms of societal implications.

The next section of this paper provides the necessary background regarding the utility of geospatial data in health research, the spatial confidentiality risk associated with health data, and the development of techniques to preserve spatial confidentiality while preserving the usability of the data. After that, we provide information about the data and experimental setup, followed by the results and discussion. The final section of the paper discusses the conclusions from this study.

Background

While the power of analysing geographic health data is widely understood by spatial scientists [24–26], that same appreciation has only relatively recently expanded (and continues to grow) within clinical and public health organizations. For example, a hospital system might develop spatially guided precision medicine strategies [27] using their own internal data. Typical tasks would include mapping patient locations [28] or finding distances between patients and nearest facilities or services [24]. At a more elevated level, recent developments in spatial data storage systems such as spatial databases [29], and novel spatial analysis techniques such as spatial and spatio-temporal clustering [30], have enabled researchers and health practitioners to utilize health data for developing real-time spatial support systems [1]. These spatial support systems can be particularly useful in balancing surge across multiple hospitals based on where events happen, such as gunshots in a building complex. From an infectious disease perspective, monitoring the spread of cases across space and time can help inform clinicians and health authorities about potential outbreaks and mobilize preventive strategies [31–33] or predict where upcoming surge could stress resources. For example, in northeast Ohio a spatial syndromic surveillance system [4, 34, 35] was used to identify emerging Covid-19 disease clusters at a granularity that includes care homes and congregate living buildings. The results were used to guide hospital intercept teams to limit further disease spread both within and between densely packed living areas. Even though there has been an appreciation among health organizations of the value of this type of fine-scale spatial insight, the issue of spatial privacy and confidentiality has always been an impediment to data sharing and collaboration between health organizations.

The spatial confidentiality risk associated with using geographically granular data, and how inappropriate mapping might lead to an unacceptable re-engineering of location, are well known [36–39]. Suggested solutions have included the use of anonymity, spatial privacy policies (such as HIPAA), and obfuscation/geomasking [40]. Of these, geomasking is the most commonly applied by health organizations, and the most likely to be implemented in the high-pressure, short time frame of a regional response to an emerging disease threat. More specifically, geomasking can be categorized into affine [7, 41], random perturbation [15], or spatial aggregation [17]. Affine geomasks utilize geometrical translations, rotation, or a combination of both for relocating non-aggregated point data. Random perturbation displaces the original point either deterministically or stochastically. Implementations include grid masking [42, 43], “flipping methodology” [44], donut masking [15, 45], Voronoi masking [46], verified neighbor masking [47], and adaptive areal elimination masking [17]. Due to the complexities associated with implementing advanced geomasking methods, health organizations have largely adopted geomasking based on spatial aggregation. While easier to implement, geomasking based on spatial aggregation leads to a loss in analytical accuracy, especially at coarse scales [41, 48, 49].

Research on the effects geographic masking has on different spatial-analytic methods is essential to determine whether the masking or aggregation strategy that is being applied strikes an appropriate balance between the protection of confidentiality and the ability to derive relevant spatial relationships [13]. Kwan and associates [41], in their study examining the effects of two different masking techniques (random direction with a fixed radius and random placement within a circle), showed that there was a consistent tradeoff between the amount of spatial manipulation and the accuracy of the analytical results. Another study using artificial clusters generated from a SaTScan of point locations masked using bimodal Gaussian displacement suggests that there was a gradual decrease in cluster detection sensitivity and specificity with an increase in the average displacement distance [50]. Similarly, another study on household travel surveys where donut masking was used as the geomasking method revealed that there was a gradual reduction in analytic accuracy with increasing displacement between points [51]. Lu et al. [52], in their work on measuring the impact of aggregation-based masking on spatial analytic methods such as Nearest Neighbor Index and Moran’s I, found that there were only minor effects associated with geographic masking for displacements of up to 250 m. Recently, Wang et al. [53] conducted an extensive comparative analysis of eight different geographic masking methods, including aggregation, to tease out the efficiency of geomasking in privacy protection while maintaining sufficient analytical accuracy. Their results suggest that geomasking methods such as point aggregation introduce error as the masking radius increases.

While the research on geomasking using spatial aggregation has focused on quantifying the error introduced with readily available aggregation units such as zip code or Census tracts, there is a dearth of research on using other aggregation units such as Zip4 codes or street segments. To address this gap, we assess the usability of Zip4 codes as a spatial unit for aggregation-based geomasking through a set of comparative experiments.

Data

Zip4 address data

Acquiring Zip4 address data.

Currently there are no official data sources available for obtaining Zip4 address data. Even though the United States Postal Service (USPS) provides access to a reference file that can be used to assign a Zip4 to a physical address [54], the process of obtaining the file and further parsing the Zip4 addresses is not trivial. Apart from the USPS, many private vendors such as Melissa (https://www.melissa.com/v2/lookups/zip4/zip4/) provide Zip4 API lookups and Zip4 address data extracts, often at a considerable cost. Large address sources such as OpenAddresses (https://openaddresses.io/) provide residential addresses for places, which can also be used for mining Zip4 records during the geocoding process. For this study, we collected a large data extract of 1.6 million Zip4 codes within the state of Ohio from a private vendor at no cost. The data extract was available as a large tab-separated text file, which was then converted into a data table after preprocessing. Each row of the data table has a unique Zip4 code as well as additional attributes such as the lower and upper street addresses associated with the Zip4 code. Table 1 shows the structure of a Zip4 record along with its various attributes. The lower address for the first row is “4459 E Main St Cleveland, 44106” and the higher address is “4477 E Main St Cleveland, 44106”, while the lower and higher addresses for the second row are both “7453 Sadie Rd N Akron, 43001”. Generally, Zip4 codes that have the same lower and higher address tend to be congregate living facilities such as apartments, nursing homes, care homes, etc.

Table 1. Zip4 record with attributes.

https://doi.org/10.1371/journal.pone.0285552.t001

Geocoding Zip4 data.

While some Zip4 data may be mapped as a point if enough addresses are found in a single building, more typically the output is a polyline that requires additional manipulation if it is to be used for spatial analysis. To do this, the lower and higher addresses associated with the Zip4 code were geocoded using ArcGIS, whose output categories include “Point Address” (house and building locations), “Street Address”, “Street Name” (missing house number), “Locality” (place-name representing a populated place), “Postal” (five-digit Zip code), and “Admin” (state name). Here we only use the Zip4 records that can be geocoded to either the “Point Address” or “Street Address” level to maximize precision. The lower and higher addresses are geocoded separately and the resulting shapefiles are merged using the Zip4 code as the common key. In order to efficiently retrieve and perform geometrical operations on the addresses, a spatial database (PostgreSQL, with the PostGIS extension) was developed. Finally, each Zip4 code, along with the line segment joining the coordinates of the lower and higher address, was stored as a geometry object in the spatial database. A total of 1,389,482 Zip4 records were geocoded at either ‘Street Address’ or ‘Point Address’. In this paper, to mirror the geographic area analyzed by the authors as part of a multi-hospital response to Covid-19, only Zip4 records for Cuyahoga County and Portage County were included. These were also selected because Cuyahoga County is more urban in nature, with its major urban center being Cleveland, while Portage County has smaller, less dense settlements and an overall “rural” nature. Using the described approach, there were 137,732 Zip4 records geocoded for Cuyahoga County and 16,639 records for Portage County.
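The merge of separately geocoded endpoints into one segment per Zip4 code can be illustrated with a minimal Python sketch. It is not the PostGIS workflow used in the study: the class name `Zip4Segment`, the helper `merge_endpoints`, and the example Zip4 codes are hypothetical, and coordinates are assumed to be in a projected system with metre units.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Zip4Segment:
    """A Zip4 code with the geocoded coordinates of its lower and higher address.

    Assumption: coordinates are (x, y) tuples in a projected CRS, in metres.
    """
    zip4: str
    low: tuple
    high: tuple

    def length(self) -> float:
        # Straight-line length of the segment joining the two endpoints.
        return math.dist(self.low, self.high)

    def centroid(self) -> tuple:
        # For a two-point segment the centroid is simply the midpoint.
        return ((self.low[0] + self.high[0]) / 2.0,
                (self.low[1] + self.high[1]) / 2.0)

    def is_single_point(self) -> bool:
        # Both endpoints geocode to the same coordinates.
        return self.low == self.high


def merge_endpoints(low_points: dict, high_points: dict) -> list:
    """Join separately geocoded endpoints on the Zip4 code (inner join)."""
    shared = sorted(low_points.keys() & high_points.keys())
    return [Zip4Segment(z, low_points[z], high_points[z]) for z in shared]
```

Zip4 codes for which only one endpoint was geocoded are dropped by the inner join, which mirrors the restriction to records matched at the “Point Address” or “Street Address” level.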

Voter data

In order to understand the potential of using Zip4 centroids for analytical purposes, Ohio voter data (from now on OV) was utilized as a proxy for typical patient data. The OV dataset is freely available to the public through the Ohiosos (Ohio Secretary of State) website (https://www.ohiosos.gov/). This dataset contains 8,070,402 records with details such as the political affiliation and residential address of voters. ArcGIS was used to geocode 7,932,791 (98%) of these addresses to either the Point or Street level. All the geocoded addresses were added to the spatial database to enable faster spatial queries and other geometric manipulations. As with the Zip4 address data, only voter address records within Cuyahoga County (n = 773,334) and Portage County (n = 83,778) were used for the analysis.

Data for comparative spatial analysis

In order to compare different spatial aggregations, the OV dataset was spatially aggregated by street segment (from now on SS), Census Block Group (from now on CBG), and Zip code polygons. OpenStreetMap (OSM) [55], a collaborative project to create free editable maps, provides various free geodata including street segment data. To download street segment data for Cuyahoga County and Portage County we utilized OSMnx [56], which is an open-source tool built in Python for querying OSM APIs. The street segment data extracted using OSMnx were added to the spatial database as line segment objects and their centroids were used in the analysis. The OV dataset was aggregated to CBG and Zip spatial units through a point-in-polygon operation. For SS aggregation, each voter record was assigned to its nearest SS centroid. For Zip4 aggregation, the reverse geocoding option available in ArcGIS, which converts a coordinate to a physical address and generates the Zip4 code associated with that location, was utilized. Each record in the OV dataset was assigned a Zip4 code through this reverse geocoding process, and the associated geographical coordinates (Zip4 centroid) were retrieved from the spatial database and assigned to the record. Finally, a spatial table was generated, with each row containing a voter’s original address in coordinates (real), Zip4 code, Zip4 centroid, SS centroid, CBG centroid, and Zip centroid (Table 2).
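The two assignment rules described above, point-in-polygon for CBG and Zip units and nearest centroid for street segments, can be sketched in plain Python as follows. This is an illustrative stand-in for the spatial database queries, assuming projected coordinates and simple polygons without holes; the function names are hypothetical, and the brute-force nearest search would need a spatial index at county scale.

```python
import math


def point_in_polygon(pt, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices (no holes assumed)."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from pt cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def nearest_centroid(pt, centroids):
    """Return the key of the centroid closest to pt; centroids maps id -> (x, y)."""
    return min(centroids, key=lambda cid: math.dist(pt, centroids[cid]))
```

Points lying exactly on a polygon boundary are not handled specially in this sketch.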

Table 2. An example of voter data with real, Zip4, Street Segment, Census Block Group, and Zip centroid details.

https://doi.org/10.1371/journal.pone.0285552.t002

Data for clustering analysis

In order to replicate the type of analysis performed during the Covid-19 pandemic response, a synthetic dataset was created from the Cuyahoga County OV records. Case and control data required for the spatial clustering methods were sampled from the dataset based on the Covid-19 test result distribution for Cuyahoga County. To achieve this, the boundary polygon for Cuyahoga County was tessellated into a grid of 5 km cells, and the total number of negative and positive Covid-19 test results that fell into each cell was calculated using a point-in-polygon operation. For the spatial clustering analysis, 1,000 positive cases and 3,969 negative cases were sampled from the dataset using the test result distribution.
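One way such grid-guided sampling could be implemented is sketched below. The exact sampling rule is not specified in the text, so this is an assumption: the caller supplies the number of records to draw in each 5 km cell (for example, proportional to the observed test counts), and records are drawn without replacement within each cell. The names `grid_cell` and `sample_by_cell` are hypothetical.

```python
import random
from collections import defaultdict


def grid_cell(x, y, cell_size=5000.0, origin=(0.0, 0.0)):
    """Index (col, row) of the square cell containing the projected point (x, y)."""
    return (int((x - origin[0]) // cell_size),
            int((y - origin[1]) // cell_size))


def sample_by_cell(points, draws_per_cell, cell_size=5000.0, seed=0):
    """Draw up to draws_per_cell[cell] points, without replacement, in each cell.

    Assumption: draws_per_cell already encodes the target distribution.
    """
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for p in points:
        by_cell[grid_cell(p[0], p[1], cell_size)].append(p)
    sample = []
    for cell, n_draws in sorted(draws_per_cell.items()):
        pool = by_cell.get(cell, [])
        sample.extend(rng.sample(pool, min(n_draws, len(pool))))
    return sample
```

Running the function once with per-cell positive counts and once with per-cell negative counts would yield case and control sets that follow the same coarse spatial distribution as the test results.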

Experimental setup

Zip4 length distribution

Typically, Zip4 codes are represented as polylines with an upper and lower address representing the two ends. To understand the distribution of these lengths, a histogram-based approach was followed. The length of each polyline was calculated using the built-in length function in PostGIS and then the results were aggregated in intervals of 100 meters. Further, the percentage of records that fall into each bin was calculated. While a histogram provides a general trend of the length distribution for Zip4 codes, it does not provide any spatial insights. To understand the spatial distribution of lengths of Zip4 codes, a grid-based approach was followed (Fig 1). The cell size for the grid was 500 meters and the bandwidth (radius) was 2000 meters. Each cell was assigned a value based on the average length of Zip4 polylines that fell within the 2000 m radius. Both these experiments were conducted using the Zip4 data from Cuyahoga and Portage counties.
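The fixed-width binning with percentages can be sketched in a few lines of Python. The function name `binned_percentages` is hypothetical, and the lengths are assumed to have been computed already (in the study, by PostGIS).

```python
from collections import Counter


def binned_percentages(values, bin_width):
    """Map each bin's lower edge to the percentage of values falling in that bin."""
    counts = Counter(int(v // bin_width) for v in values)
    total = len(values)
    return {b * bin_width: 100.0 * c / total for b, c in sorted(counts.items())}
```

For example, `binned_percentages([10, 50, 120, 250], 100)` places half of the values in the 0 to 100 m bin and a quarter in each of the next two bins. The same helper applies to any other fixed bin width.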

Fig 1. Spatial distribution of length of Zip4 segments for (a) Cuyahoga County, and (b) Portage County.

Zip Code boundaries are also displayed.

https://doi.org/10.1371/journal.pone.0285552.g001

Distance to real location

In order to establish how much potential analytical error occurs when using the Zip4 centroids, the distance between the centroid and the address location was calculated. These calculated distances were summarized as a histogram with a bin size of 50 meters for Cuyahoga and Portage counties. Further, the total counts in each bin were normalized using the total number of records.
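If the real and centroid locations are held as latitude and longitude rather than projected coordinates, the displacement can be approximated with the haversine formula. The sketch below is an assumption about one possible implementation, not the database function used in the study; it treats the Earth as a sphere of radius 6,371 km.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))
```

At the displacement scales of interest here (tens to hundreds of metres), the spherical approximation error is small relative to the 50 m bin width.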

Spatial analysis

To test the impact of using Zip4 centroids as compared to “real” data, a series of experiments were conducted on the OV dataset for Cuyahoga County. As there were a large number of addresses for Cuyahoga County (n = 773,334), smaller subsets of randomly permuted address points were used for the experiments; 100 datasets each containing 10,000 randomly selected address points were generated and used for further analysis.
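Drawing the repeated random subsets is straightforward; a minimal sketch follows. The function name and the fixed seed are assumptions added for reproducibility, and each subset is drawn without replacement and independently of the others.

```python
import random


def random_subsets(records, n_subsets=100, size=10_000, seed=42):
    """Return n_subsets lists, each holding `size` records sampled without replacement."""
    rng = random.Random(seed)
    return [rng.sample(records, size) for _ in range(n_subsets)]
```

Because every record carries its real, Zip4, SS, CBG, and Zip coordinates, the same subset can be analyzed under all five location categories.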

Average nearest neighbor analysis (ANN).

The average nearest neighbor analysis helps to determine whether a point pattern is clustered or not [57]. A point pattern is deemed clustered if the average of the nearest neighbor distances is smaller than the average expected for a hypothetical random distribution. For this experiment, ANN was run on each of the 100 datasets for real, Zip4, SS, CBG, and Zip centroids.
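The ANN ratio can be sketched as the observed mean nearest neighbor distance divided by the expected value under complete spatial randomness, 0.5 / sqrt(n / A), where n is the number of points and A the study area. The brute-force version below is illustrative only (quadratic in n) and assumes projected coordinates; the function name is hypothetical.

```python
import math


def ann_ratio(points, area):
    """Observed / expected mean nearest-neighbor distance; < 1 suggests clustering."""
    n = len(points)
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    observed = sum(nearest) / n
    expected = 0.5 / math.sqrt(n / area)  # expectation for a random pattern
    return observed / expected
```

With aggregated data, many records share the same centroid, so their nearest neighbor distance is zero and the ratio is pulled toward zero; this is one reason coarser aggregations can appear more clustered than the real locations.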

Ripley’s K.

While ANN analysis identifies clustering at a global level, Ripley’s K function [58] helps to determine whether clustering occurs at various local distance bands. Ripley’s K function was applied to the datasets in each category and the observed and expected counts at various distance bands were recorded.
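A naive estimator of Ripley’s K at distance d is K(d) = A / (n(n − 1)) × (number of ordered point pairs no farther apart than d), compared against the expectation πd² for a random pattern. The sketch below omits edge correction, which production tools typically apply, and the function names are hypothetical.

```python
import math


def ripley_k(points, area, d):
    """Naive Ripley's K at distance d (no edge correction; O(n^2) pair scan)."""
    n = len(points)
    pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= d
    )
    return area * pairs / (n * (n - 1))


def csr_expected_k(d):
    """Expected K under complete spatial randomness."""
    return math.pi * d * d
```

Evaluating both functions over a sequence of distance bands gives observed and expected curves; observed values above the expected curve at a given distance indicate clustering at that scale.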

Kernel Density Estimate.

Kernel Density Estimation (KDE) [59] calculates the density of features around a point of interest. The area of interest (in this case Cuyahoga County) is covered with a grid surface (here with 500 meter cells) and the density for each grid cell is calculated by identifying all the points that fall within a bandwidth (here 1000 meters). KDE was run on the 100 datasets for the five categories (Fig 2), and the raster surface generated was compared to the raster for the original dataset by calculating a cell-to-cell correlation coefficient.
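The cell-to-cell comparison can be sketched as follows. For brevity the density surface uses a uniform kernel (a plain count of points within the bandwidth of each cell center), which is a simplification of the kernel used by GIS software, and the comparison uses the Pearson correlation coefficient, which is an assumption about the coefficient intended. Function names are hypothetical.

```python
import math


def density_grid(points, xmin, ymin, ncols, nrows, cell=500.0, bandwidth=1000.0):
    """Row-major list of point counts within `bandwidth` of each cell center."""
    grid = []
    for r in range(nrows):
        for c in range(ncols):
            center = (xmin + (c + 0.5) * cell, ymin + (r + 0.5) * cell)
            grid.append(sum(1 for p in points if math.dist(p, center) <= bandwidth))
    return grid


def pearson(a, b):
    """Pearson correlation between two equally sized rasters (flattened)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

Both surfaces must be built on the same grid (same origin, cell size, and extent) for the cell-wise correlation to be meaningful.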

Fig 2. Kernel Density Estimate (KDE) surfaces generated for (a) Real, (b) Zip4, (c) Street Segment, (d) Zip, and (e) Census Block Group.

https://doi.org/10.1371/journal.pone.0285552.g002

Cluster detection.

An important part of syndromic surveillance is spatial cluster detection [4, 32, 60]. During the Covid-19 pandemic, identifying emergent clusters was a key approach used by hospital systems in Ohio to respond to outbreaks [4]. The type of spatial clustering developed for Ohio is extremely sensitive to geocoded accuracy and precision. To understand how Zip4 centroid aggregation affects spatial clustering and detection, two different spatial clustering methods were used: SaTScan [61] and GeoMEDD [4]. SaTScan is a popular spatial clustering methodology used by epidemiologists and health geographers around the world to detect spatial clusters of infectious and chronic disease [62], as well as disease vectors and risk factors [63]. For cluster detection, SaTScan uses the spatial scan technique, which creates a theoretically limitless number of circles in a geographic area [61]. The circles vary in size and each circle is evaluated as a possible cluster by being compared with the area outside the circle. The circle with the highest likelihood of being a cluster is assigned a P value using Monte Carlo simulation. GeoMEDD [4] is a recently developed spatial clustering methodology based on DBSCAN [64] that groups proximate points based on both spatial and temporal distance. GeoMEDD clusters used to identify emerging threats are typically classified into sentinel (a minimum of two members within a distance of 100 m), micro (a minimum of five members within a distance of 500 m), and neighborhood clusters (a minimum of 10 members within a distance of 1000 m) [4] for 3 and 7 day lookback periods. Unlike SaTScan, which generates only circular clusters, GeoMEDD generates clusters of any shape based on the location of cluster members being connected by a convex hull.

To understand how Zip4 based aggregation changes spatial cluster detection, a comparative analysis was conducted. Five different locational datasets including the geographical coordinates of the real dataset, and the centroids for Zip4, SS, CBG, and Zip were created. SaTScan was configured to detect purely spatial clusters based on the Bernoulli model (case-control). For the same case and control data, SaTScan runs were made for the five different locational datasets (Fig 3). Each run generated cluster outputs as ESRI shapefiles (Polygon), Keyhole Markup Language (KML) files, and a summary text file. For GeoMEDD, the neighbourhood clustering (10 member and 1000 meter) was run for the five locational datasets (Fig 4). The cluster member identifiers were stored as JavaScript Object Notation (JSON) files while the cluster geometries (Polygons) were stored as ESRI shapefiles.
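For readers unfamiliar with the Bernoulli scan statistic, the quantity evaluated for each candidate circle is a log-likelihood ratio comparing the case proportion inside the circle with that outside. The sketch below follows the standard published form of that statistic and is not SaTScan code; the function name and argument names are hypothetical.

```python
import math


def _xlogy(x, y):
    # Convention 0 * log(0) = 0, needed when a window holds only cases or only controls.
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_llr(c, n, C, N):
    """Log-likelihood ratio for a window with c cases among n points,
    given C cases among N points (cases plus controls) overall.

    Returns 0.0 unless the case proportion inside exceeds that outside.
    """
    if n == 0 or n == N or c / n <= (C - c) / (N - n):
        return 0.0
    out_n, out_c = N - n, C - c
    inside = _xlogy(c, c / n) + _xlogy(n - c, (n - c) / n)
    outside = (_xlogy(out_c, out_c / out_n)
               + _xlogy(out_n - out_c, (out_n - out_c) / out_n))
    null = _xlogy(C, C / N) + _xlogy(N - C, (N - C) / N)
    return inside + outside - null
```

With the sample sizes used here (C = 1,000 cases and N = 4,969 cases plus controls), a window whose points shift by a few tens of metres under aggregation can gain or lose members, which changes c and n and therefore the ranking of candidate circles.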

Fig 3. SaTScan clusters for (a) Real, (b) Zip4, (c) Street Segment, (d) Census Block Group, and (e) Zip.

https://doi.org/10.1371/journal.pone.0285552.g003

Fig 4. GeoMEDD clusters for (a) Real, (b) Zip4, (c) Street Segment, (d) Census Block Group, and (e) Zip.

https://doi.org/10.1371/journal.pone.0285552.g004

To quantify the similarity between the datasets, the individual cluster member details were used. The cluster members for all the categories were compared against the real category using the Jaccard similarity coefficient. The Jaccard similarity coefficient for two sets is defined as the ratio between the size of the intersection and the size of the union of the sets. The Jaccard similarity coefficient of each real cluster to the corresponding aggregated cluster was calculated and the average was assigned as the similarity score. A similarity score of one indicates perfect similarity while a value of zero indicates complete dissimilarity. Apart from this similarity score, the total number of real clusters that were never detected, as well as the total number of spurious clusters generated as an artefact of the aggregation, were also noted. The missing and additional clusters were identified by applying a spatial intersection test between each of the real clusters and the aggregated clusters.
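The similarity score can be sketched as below. How a real cluster is paired with its “corresponding” aggregated cluster is not spelled out in the text, so the sketch assumes the aggregated cluster with the highest Jaccard coefficient is the match, and a real cluster with no overlap contributes zero. Function names are hypothetical.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two collections of member identifiers."""
    a, b = set(a), set(b)
    union = a | b
    # Convention (an assumption): two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def similarity_score(real_clusters, aggregated_clusters):
    """Mean best-match Jaccard coefficient over all real clusters."""
    scores = [
        max((jaccard(real, agg) for agg in aggregated_clusters), default=0.0)
        for real in real_clusters
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, real clusters `{1, 2, 3}` and `{7, 8}` compared against a single aggregated cluster `{2, 3, 4}` score 0.5 and 0.0 respectively, giving a similarity score of 0.25.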

Privacy

Re-engineering risk with Zip4 aggregation.

To address the re-engineering risk of using a Zip4-based aggregation, the percentage of points falling within a distance of 5 meters (as a minimum tolerance) of the assigned Zip4 centroid was calculated. For a comparative analysis, the same procedure was repeated for the SS, CBG, and Zip categories.
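This tolerance check reduces to counting the records whose real location lies within 5 m of its assigned centroid. A minimal sketch, assuming projected coordinates in metres and a hypothetical function name:

```python
import math


def percent_within(pairs, tolerance=5.0):
    """Percentage of (real, centroid) coordinate pairs no farther apart than tolerance."""
    hits = sum(1 for real, centroid in pairs if math.dist(real, centroid) <= tolerance)
    return 100.0 * hits / len(pairs)
```

A high percentage means that the centroid effectively reveals the real address for a large share of records.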

Spatial K-anonymity.

Spatial k-anonymity [65] is a widely adopted method to evaluate the degree of geoprivacy achieved with any type of geomasking technique. It is a variant of k-anonymity [66], which quantitatively assesses the probability of identifying an individual record from a group of other records. Spatial k-anonymity estimates the probability of identifying an individual location from a set of locations after geomasking. The k value in k-anonymity is determined by calculating the total number of other (k-1) addresses that fall within the region (circular) between the real and the new anonymized location [14]. An example of calculating k-anonymity for a single location is shown in Fig 5. A higher k value indicates increased spatial privacy preservation, as the probability of re-identification (given as 1/k) decreases with an increase in k.

Fig 5. Spatial K-anonymity calculations.

The red point is the original location and the blue point is the geomasked location. The green stars are other potential locations.

https://doi.org/10.1371/journal.pone.0285552.g005

In order to calculate k-anonymity when using a Zip4 centroid, we used the Cuyahoga County address data obtained from the Cuyahoga County Open Data portal (https://data-cuyahoga.opendata.arcgis.com/). The dataset contains 490,453 address locations in Cuyahoga County with their respective geographical coordinates. The k-anonymity value for each location is calculated as per the methodology described in Fig 5 and the k-anonymity value for the dataset is calculated by taking the average k value across the entire dataset. For a comparative analysis, k-anonymization was also implemented on the SS and CBG datasets.
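The calculation can be sketched as follows under one reading of the circular region: a circle centered on the masked (centroid) location whose radius is the distance to the real location. That interpretation, the brute-force search, and the function names are assumptions; coordinates are taken to be projected, in metres.

```python
import math


def spatial_k(real, masked, addresses):
    """k = 1 + number of other addresses inside the circle centered on `masked`
    with radius equal to the real-to-masked distance (assumed region)."""
    radius = math.dist(real, masked)
    others = sum(
        1 for a in addresses
        if a != real and math.dist(a, masked) <= radius
    )
    return others + 1


def summarize_k(k_values, threshold=5):
    """Mean k and percentage of locations with k below the threshold."""
    mean_k = sum(k_values) / len(k_values)
    share_below = 100.0 * sum(1 for k in k_values if k < threshold) / len(k_values)
    return mean_k, share_below
```

A location with k = 1 has no other candidate address within the circle, so its re-identification probability 1/k is 1.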