Abstract
Privacy concerns regarding the sharing of spatially referenced household data have induced researchers and survey agencies to “scramble” geographic information by adding random spatial errors to true location coordinates. In this paper, we prove mathematically that the addition of random noise leads to a systematic overestimation of distances between households and access points of interest. We illustrate this average distance bias as well as the attenuation bias generated by random spatial errors using data on household and health facility location from a Health and Demographic Surveillance Site in rural South Africa. Given the large overall biases observed, we argue that the use of scrambled spatial data for policy making or empirical work is generally not advisable, and that alternative methods of protecting data confidentiality should be used to ensure the usability of spatial data for quantitative analysis.
Notes
1. See http://www.measuredhs.com/What-We-Do/GPS-Data-Collection.cfm for details.
2. See for instance formulas 8.111#3 and 8.112#2 of: I.S. Gradshteyn and I.M. Ryzhik, Table of Integrals, Series, and Products (tr. and ed. Alan Jeffrey), New York: Academic Press 1980. The power series in γ that we obtain is probably known too, but easier to derive than to locate in the literature.
Appendix: Proof of proposition 1
It is easy to show that in the degenerate case h = 0 the average distance 〈|h + v|〉 must exceed |h|. We shall show that this behavior 〈|h + v|〉 > |h| is typical even when |h| is comparable with, or considerably greater than, the typical size |v| of the noise vector.
The result 〈|h + v|〉 > |h| is a direct consequence of the triangle inequality (TI). Recall that the TI asserts that for any vectors x, y, we have
\[ |x| + |y| \ge |x + y| \]
because the right-hand side is the distance from the origin to x + y along a straight line, and the left-hand side is the distance from the origin to x + y via x. Hence, equality |x| + |y| = |x + y| occurs if and only if x is contained in the closed line segment joining the origin to x + y, that is, if and only if one of x and y is a non-negative multiple of the other.
To prove that 〈|h + v|〉 > |h|, first note that 〈|h + v|〉 = 〈|h − v|〉 because v and −v have the same distribution. But then,
\[ 2\langle |h + v| \rangle = \langle |h + v| + |h - v| \rangle \ge \langle |(h + v) + (h - v)| \rangle = \langle |2h| \rangle = |2h| = 2|h|, \]
using the TI with x = h + v and y = h − v in the second step and the fact that |2h| is constant in the next-to-last step. Dividing both sides of the resulting inequality by 2, we deduce 〈|h + v|〉 ≥ |h| as claimed. Moreover, we can only have 〈|h + v|〉 = |h| when h + v and h − v satisfy the equality condition in the TI for every v, which is to say when every v lies on the closed line segment joining −h to h.
To evaluate \( \langle |h + v|^{2} \rangle \), we begin in the same way,
\[ 2\langle |h + v|^{2} \rangle = \langle |h + v|^{2} + |h - v|^{2} \rangle, \]
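and now apply the parallelogram identity
\[ |x + y|^{2} + |x - y|^{2} = 2|x|^{2} + 2|y|^{2} \]
(which can be obtained by writing each of the terms \( |x \pm y|^{2} \) on the left-hand side as an inner product (x ± y, x ± y) = (x, x) ± 2(x, y) + (y, y) and noting that the cross-terms ±2(x, y) sum to zero). Taking x = h and y = v, we then obtain
\[ 2\langle |h + v|^{2} \rangle = \langle 2|h|^{2} + 2|v|^{2} \rangle = 2|h|^{2} + 2\langle |v|^{2} \rangle. \]
Dividing both sides by 2, we recover the identity \( \langle |h + v|^{2} \rangle = |h|^{2} + \langle |v|^{2} \rangle \) claimed earlier.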
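Now recall that any real-valued random variable X satisfies \( \langle X^{2} \rangle \ge \langle X \rangle^{2} \) (the difference is the variance \( \langle (X - \langle X \rangle)^{2} \rangle \), which is clearly non-negative). Applying this to X = |h + v| with h ≠ 0, we find
\[ \langle |h + v| \rangle^{2} \le \langle |h + v|^{2} \rangle = |h|^{2} + \langle |v|^{2} \rangle \le \left( |h| + \frac{\langle |v|^{2} \rangle}{2|h|} \right)^{2}, \]
with strict inequality in the last step unless \( \langle |v|^{2} \rangle = 0 \). Hence, \( \langle |h + v| \rangle \le |h| + \langle |v|^{2} \rangle / (2|h|) \) as claimed.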
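The two bounds and the mean-square identity derived above can be checked informally by simulation. The following is a minimal sketch, not part of the original proof: it assumes two-dimensional Gaussian noise (chosen here only because it is symmetric under v → −v), and the function name `simulate_displacement`, the offset `h`, and the noise scale `sigma` are illustrative choices.

```python
import math
import random

def simulate_displacement(h, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimates of <|h+v|>, <|h+v|^2> and <|v|^2>.

    Assumption for illustration only: v has independent N(0, sigma^2)
    components, which makes v and -v identically distributed.
    """
    rng = random.Random(seed)
    hx, hy = h
    sum_dist = sum_sq = sum_noise_sq = 0.0
    for _ in range(n_samples):
        vx, vy = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
        d_sq = (hx + vx) ** 2 + (hy + vy) ** 2
        sum_dist += math.sqrt(d_sq)
        sum_sq += d_sq
        sum_noise_sq += vx * vx + vy * vy
    return sum_dist / n_samples, sum_sq / n_samples, sum_noise_sq / n_samples

h = (1.0, 0.0)                      # hypothetical true offset
h_norm = math.hypot(*h)
mean_dist, mean_sq, noise_var = simulate_displacement(h, sigma=0.5)

lower = h_norm                                  # |h|
upper = h_norm + noise_var / (2.0 * h_norm)     # |h| + <|v|^2> / (2|h|)
print(f"lower={lower:.4f}  mean distance={mean_dist:.4f}  upper={upper:.4f}")
print(f"<|h+v|^2>={mean_sq:.4f}  |h|^2 + <|v|^2>={h_norm**2 + noise_var:.4f}")
```

The estimated mean distance should fall strictly between the two bounds, and the two mean-square quantities should agree up to sampling error.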
So far, our analysis did not depend on the choice of distribution for v (beyond the symmetry of v and −v used above) or even on the dimension of the space. In practice, h and v are drawn from a two-dimensional space, though one may also consider one-dimensional problems as a simplified model (such as a community limited to a street or a long and narrow valley). We consider three possibilities:
1. A one-dimensional space with v drawn uniformly from the interval [−γ, +γ] for some γ > 0, so h is replaced by a random number drawn uniformly from the interval [h − γ, h + γ] of length 2γ centered at h.
2. A two-dimensional space with v drawn uniformly from the radius-γ circle |v| = γ about the origin, so h is replaced by a random point at distance exactly γ from h (a random point on the circle of radius γ about h). Even if this distribution is not used in practice, it is needed for the analysis of the next case.
3. A two-dimensional space with v drawn uniformly from the radius-γ disk |v| ≤ γ about the origin, so h is replaced by a random point at distance at most γ from h (a random point in the disk of radius γ about h). In this case, our analysis requires that γ ≤ |h|, but this assumption will usually be satisfied in practice.
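The three noise models listed above can be sampled as follows. This is a sketch under stated assumptions rather than the procedure of any particular survey agency: the function names are hypothetical, and the disk sampler uses the standard square-root transform of a uniform variate so that points are uniform in area.

```python
import math
import random

def noise_interval(gamma, rng):
    """Scenario 1: scalar noise uniform on [-gamma, +gamma]."""
    return rng.uniform(-gamma, gamma)

def noise_circle(gamma, rng):
    """Scenario 2: planar noise at distance exactly gamma, uniform angle."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return gamma * math.cos(theta), gamma * math.sin(theta)

def noise_disk(gamma, rng):
    """Scenario 3: planar noise uniform over the disk of radius gamma."""
    # Taking the square root of a uniform variate makes the density
    # uniform in area rather than concentrated near the centre.
    rho = gamma * math.sqrt(rng.random())
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return rho * math.cos(theta), rho * math.sin(theta)

rng = random.Random(42)
gamma = 2.0                                   # illustrative noise radius
interval_samples = [noise_interval(gamma, rng) for _ in range(10_000)]
circle_samples = [noise_circle(gamma, rng) for _ in range(10_000)]
disk_samples = [noise_disk(gamma, rng) for _ in range(100_000)]

# Sample estimate of <|v|^2> for the disk; the analysis gives gamma^2 / 2.
mean_sq_disk = sum(x * x + y * y for x, y in disk_samples) / len(disk_samples)
print(f"disk <|v|^2> estimate: {mean_sq_disk:.4f}")
```

A scrambled location is then obtained by adding the sampled noise to the true coordinate (or coordinate pair).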
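In the one-dimensional case, the variance \( \langle |v|^{2} \rangle \) is given by the elementary integral
\[ \langle |v|^{2} \rangle = \frac{1}{2\gamma} \int_{-\gamma}^{\gamma} x^{2} \, dx = \frac{\gamma^{2}}{3}, \]
so the added noise increases the mean squared distance by \( \gamma^{2}/3 \), from \( |h|^{2} \) to \( \langle |h + v|^{2} \rangle = |h|^{2} + \gamma^{2}/3 \). The expected distance 〈|h + v|〉 remains |h| as long as γ < |h|, since then |h + v| + |h − v| = 2|h| always. Once γ exceeds |h|, we distinguish two possibilities. In the first, |h| still exceeds the noise magnitude |v|. This happens with probability |h|/γ, and the average value of |h + v| in this case is still |h|. The other possibility is that |v| ≥ |h|, and then averaging |h + v| with |h − v| yields |v|. Since here |v| ranges uniformly from |h| to γ, its average value is (γ + |h|)/2. Combining the |v| < |h| and |v| ≥ |h| averages, weighted by their respective probabilities, we obtain
\[ \langle |h + v| \rangle = \frac{|h|}{\gamma}\,|h| + \left( 1 - \frac{|h|}{\gamma} \right) \frac{\gamma + |h|}{2} = \frac{|h|^{2} + \gamma^{2}}{2\gamma} = |h| + \frac{(\gamma - |h|)^{2}}{2\gamma}. \]
Thus, when γ > |h|, replacing h by h + v increases the expected distance by \( \frac{(\gamma - |h|)^{2}}{2\gamma} \).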
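The one-dimensional result can be checked numerically. The sketch below (function names are illustrative, not from the paper) compares the closed form with a midpoint-rule average of |h + v| over the interval [−γ, +γ].

```python
def expected_distance_1d(h, gamma):
    """Closed form for <|h + v|> with v uniform on [-gamma, +gamma]."""
    h = abs(h)
    if gamma <= h:
        return h                                  # no bias while gamma <= |h|
    return h + (gamma - h) ** 2 / (2.0 * gamma)   # bias once gamma > |h|

def midpoint_average_1d(h, gamma, n=200_000):
    """Midpoint-rule average of |h + v| over v in [-gamma, +gamma]."""
    step = 2.0 * gamma / n
    total = 0.0
    for i in range(n):
        v = -gamma + (i + 0.5) * step
        total += abs(h + v)
    return total / n

# Illustrative (h, gamma) pairs covering gamma < |h|, gamma > |h| and h = 0.
cases = [(5.0, 2.0), (2.0, 5.0), (0.0, 3.0)]
for h, gamma in cases:
    print(h, gamma, expected_distance_1d(h, gamma), midpoint_average_1d(h, gamma))
```

For h = 0 the formula reduces to γ/2, the mean absolute value of the noise itself.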
In the second scenario, |v| = γ always, so \( \langle |v|^{2} \rangle = \gamma^{2} \) and \( \langle |h + v|^{2} \rangle = |h|^{2} + \gamma^{2} \). To compute 〈|h + v|〉, let θ ∈ [0, 2π) be the oriented angle from h to v. Then, θ is uniformly distributed in [0, 2π) and \( |h + v| = \sqrt{|h|^{2} + 2\gamma |h| \cos\theta + \gamma^{2}} \) by the Law of Cosines [or by expanding the inner product \( |h + v|^{2} = (h + v, h + v) = (h, h) + 2(h, v) + (v, v) \)]. Thus,
\[ \langle |h + v| \rangle = \frac{1}{2\pi} \int_{0}^{2\pi} \sqrt{|h|^{2} + 2\gamma |h| \cos\theta + \gamma^{2}} \, d\theta. \]
This integral is no longer elementary, except in the special cases γ = 0, h = 0, or γ = |h|. [If γ = 0 or h = 0 then 〈|h + v|〉 = |h| or γ, respectively; if γ = |h|, then the identity \( 2 + 2\cos\theta = 4\cos^{2}(\theta/2) \) simplifies the integral to \( \int_{0}^{2\pi} 2|h| \, |\cos(\theta/2)| \, d\theta = 8|h| \), whence 〈|h + v|〉 = (4/π)|h|]. Assume, then, that 0, γ, and |h| are distinct. Then, we may assume γ < |h| because our formula for 〈|h + v|〉 does not change if we switch γ with |h|. Our integral can then be evaluated in terms of a complete elliptic integral of the second kind (see Note 2):
\[ \int_{0}^{2\pi} \sqrt{|h|^{2} + 2\gamma |h| \cos\theta + \gamma^{2}} \, d\theta = 4\left( |h| + \gamma \right) \mathbf{E}'\!\left( \frac{\bigl|\, |h| - \gamma \,\bigr|}{|h| + \gamma} \right), \]
where \( \mathbf{E}'(k) = \mathbf{E}\bigl(\sqrt{1 - k^{2}}\bigr) \) denotes the complete elliptic integral of the second kind at the complementary modulus.
It would take substantial work to recover the behavior of 〈|h + v|〉 from this rather exotic formula. We thus work directly with the integral, expanding it as a power series in γ that converges in the interval |γ| < |h|.
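The elliptic-integral expression can be compared with direct quadrature of the angular integral. The sketch below rests on two assumptions that should be read as such: E′(k) is taken to mean E(√(1 − k²)) with k the modulus (the convention of the cited table), and E is evaluated with the classical arithmetic-geometric mean iteration instead of a library routine. All function names are illustrative.

```python
import math

def ellip_e(k, iterations=20):
    """Complete elliptic integral of the second kind E(k), 0 <= k < 1.

    Uses the arithmetic-geometric mean (AGM) iteration:
    E = K * (1 - sum_n 2**(n-1) * c_n**2), with K = pi / (2 * AGM limit).
    """
    a, b, c = 1.0, math.sqrt(1.0 - k * k), k
    total, weight = 0.0, 0.5              # weight holds 2**(n - 1)
    for _ in range(iterations):
        total += weight * c * c
        a, b, c = (a + b) / 2.0, math.sqrt(a * b), (a - b) / 2.0
        weight *= 2.0
    return math.pi / (2.0 * a) * (1.0 - total)

def circle_mean_closed_form(h, gamma):
    """(2/pi)(|h| + gamma) E'(k'), k' = ||h| - gamma| / (|h| + gamma).

    Assumes gamma != |h| and both positive, as in the text.
    """
    h = abs(h)
    k_prime = abs(h - gamma) / (h + gamma)
    return 2.0 / math.pi * (h + gamma) * ellip_e(math.sqrt(1.0 - k_prime ** 2))

def circle_mean_quadrature(h, gamma, n=20_000):
    """Midpoint-rule average of |h + v| over the circle |v| = gamma."""
    h = abs(h)
    step = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        sq = h * h + 2.0 * gamma * h * math.cos((i + 0.5) * step) + gamma * gamma
        total += math.sqrt(max(sq, 0.0))  # guard against tiny negative rounding
    return total / n

print(circle_mean_closed_form(2.0, 0.5), circle_mean_quadrature(2.0, 0.5))
print(circle_mean_quadrature(1.0, 1.0), 4.0 / math.pi)  # special case gamma = |h|
```

Swapping the two arguments leaves both functions unchanged, which mirrors the symmetry between γ and |h| noted in the text.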
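It will be convenient to regard h and v as complex numbers in the usual way. Then, \( h + v = h(1 + re^{i\theta}) \), where r = γ/|h| < 1. We then have
\[ |h + v| = |h| \, \bigl| 1 + re^{i\theta} \bigr| = |h| \sqrt{(1 + re^{i\theta}) \, \overline{(1 + re^{i\theta})}}, \]
and since the complex conjugate of \( 1 + re^{i\theta} \) is \( 1 + re^{-i\theta} \), this gives
\[ |h + v| = |h| \sqrt{1 + re^{i\theta}} \, \sqrt{1 + re^{-i\theta}}. \]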
We expand each of the factors \( \sqrt{1 + re^{\pm i\theta}} \) using the binomial series
\[ \sqrt{1 + z} = \sum_{m=0}^{\infty} a_{m} z^{m} = a_{0} + a_{1} z + a_{2} z^{2} + a_{3} z^{3} + \cdots, \]
valid and absolutely convergent for all complex z such that |z| ≤ 1, where
\[ a_{0} = 1, \quad a_{1} = \frac{1}{2}, \quad a_{2} = -\frac{1}{8}, \quad a_{3} = \frac{1}{16}, \]
and in general the coefficient \( a_{m} \) is \( \left( \frac{1}{2} \right) \left( -\frac{1}{2} \right) \left( -\frac{3}{2} \right) \left( -\frac{5}{2} \right) \cdots \left( -m + \frac{3}{2} \right) / m! \). Thus, \( \sqrt{1 + re^{i\theta}} \sqrt{1 + re^{-i\theta}} \) is the sum of the terms \( a_{m} a_{n} r^{m+n} e^{i(m-n)\theta} \) over all pairs (m, n) of whole numbers. The integral of such a term over 0 ≤ θ ≤ 2π is \( 2\pi a_{n}^{2} r^{2n} \) if m = n and zero otherwise. Summing over m, n, we find that \( 4\left( |h| + \gamma \right) \mathbf{E}'\!\left( \frac{\bigl|\, |h| - \gamma \,\bigr|}{|h| + \gamma} \right) \) is the sum of the terms \( 2\pi |h| a_{n}^{2} r^{2n} \) over n = 0, 1, 2, 3, …, and thus that
\[ \langle |h + v| \rangle = |h| \sum_{n=0}^{\infty} a_{n}^{2} r^{2n} = |h| \left( 1 + \frac{r^{2}}{4} + \frac{r^{4}}{64} + \frac{r^{6}}{256} + \cdots \right). \]
We note in passing that the special case γ = |h| (that is, r = 1) yields the amusing formula
\[ \sum_{n=0}^{\infty} a_{n}^{2} = 1 + \frac{1}{4} + \frac{1}{64} + \frac{1}{256} + \cdots = \frac{4}{\pi}. \]
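The series for the circle case lends itself to a quick numerical check. In the sketch below (illustrative names, pure standard library), the coefficients are generated with the recurrence a_{n+1} = a_n (1/2 − n)/(n + 1), which follows from the product formula for a_m, and the partial sums are compared with a midpoint-rule evaluation of the angular average and with 4/π at r = 1.

```python
import math

def squared_binomial_coefficients(n_terms):
    """Return [a_0^2, a_1^2, ...] where sqrt(1 + z) = sum a_n z^n."""
    coeffs, a = [], 1.0
    for n in range(n_terms):
        coeffs.append(a * a)
        a *= (0.5 - n) / (n + 1)          # a_{n+1} = a_n (1/2 - n) / (n + 1)
    return coeffs

def circle_series(h, gamma, n_terms=200):
    """|h| * sum a_n^2 r^(2n) with r = gamma/|h| (requires gamma < |h|)."""
    h = abs(h)
    r2 = (gamma / h) ** 2
    total, power = 0.0, 1.0
    for c in squared_binomial_coefficients(n_terms):
        total += c * power
        power *= r2
    return h * total

def circle_quadrature(h, gamma, n=2_000):
    """Midpoint-rule average of |h + v| over the circle |v| = gamma."""
    h = abs(h)
    step = 2.0 * math.pi / n
    return sum(
        math.sqrt(h * h + 2.0 * gamma * h * math.cos((i + 0.5) * step) + gamma * gamma)
        for i in range(n)
    ) / n

print(squared_binomial_coefficients(4))
print(circle_series(2.0, 1.0), circle_quadrature(2.0, 1.0))
print(sum(squared_binomial_coefficients(5_000)), 4.0 / math.pi)
```

At r = 1 the terms decay only like 1/n³, so several thousand terms are used there, whereas for r well below 1 a handful of terms already suffices.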
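In the more general and final scenario 3, v is drawn uniformly from the radius-γ disk |v| ≤ γ about the origin. We integrate over this disk using polar coordinates (ρ, θ), again using for θ the oriented angle from h to v. We then find
\[ \langle |v|^{2} \rangle = \frac{1}{\pi\gamma^{2}} \int_{0}^{\gamma} \int_{0}^{2\pi} \rho^{2} \, \rho \, d\theta \, d\rho = \frac{\gamma^{2}}{2}, \]
and thus \( \left\langle \left| h + v \right|^{2} \right\rangle = \left| h \right|^{2} + \frac{1}{2}\gamma^{2} \). For the average distance, we write
\[ \langle |h + v| \rangle = \frac{1}{\pi\gamma^{2}} \int_{0}^{\gamma} \left( \int_{0}^{2\pi} \sqrt{|h|^{2} + 2\rho |h| \cos\theta + \rho^{2}} \, d\theta \right) \rho \, d\rho. \]
Again the integral is not elementary. As long as γ < |h|, we have ρ < |h| for all ρ in [0, γ], so we can use our power series for the integral over θ and integrate each term \( 2\pi |h| a_{n}^{2} (\rho/|h|)^{2n} \) (with n = 0, 1, 2, 3, …), obtaining
\[ \frac{1}{\pi\gamma^{2}} \int_{0}^{\gamma} 2\pi |h| a_{n}^{2} \left( \frac{\rho}{|h|} \right)^{2n} \rho \, d\rho = \frac{|h| \, a_{n}^{2} \, r^{2n}}{n + 1}, \]
where r = γ/|h| as before. Therefore, in this case, we obtain the power series expansion
\[ \langle |h + v| \rangle = |h| \sum_{n=0}^{\infty} \frac{a_{n}^{2}}{n + 1} \, r^{2n} = |h| \left( 1 + \frac{r^{2}}{8} + \frac{r^{4}}{192} + \frac{r^{6}}{1024} + \cdots \right). \]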
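The disk expansion can be checked in the same spirit. The sketch below (illustrative names, standard library only) sums the series with the extra factor 1/(n + 1) and compares it with a midpoint-rule integration over the disk in polar coordinates; it assumes γ < |h|, as the derivation does.

```python
import math

def disk_series(h, gamma, n_terms=200):
    """|h| * sum a_n^2 r^(2n) / (n + 1), with r = gamma/|h| < 1."""
    h = abs(h)
    r2 = (gamma / h) ** 2
    total, a, power = 0.0, 1.0, 1.0
    for n in range(n_terms):
        total += a * a * power / (n + 1)
        a *= (0.5 - n) / (n + 1)          # next binomial coefficient of sqrt(1 + z)
        power *= r2
    return h * total

def disk_quadrature(h, gamma, n_rho=400, n_theta=200):
    """Midpoint rule for the area-weighted mean of |h + v| over |v| <= gamma."""
    h = abs(h)
    d_rho = gamma / n_rho
    d_theta = 2.0 * math.pi / n_theta
    total = 0.0
    for i in range(n_rho):
        rho = (i + 0.5) * d_rho
        ring = 0.0
        for j in range(n_theta):
            cos_t = math.cos((j + 0.5) * d_theta)
            ring += math.sqrt(h * h + 2.0 * rho * h * cos_t + rho * rho)
        total += ring * rho               # rho is the polar area element
    return total * d_rho * d_theta / (math.pi * gamma ** 2)

h, gamma = 2.0, 1.0                       # illustrative values with gamma < |h|
series_value = disk_series(h, gamma)
quadrature_value = disk_quadrature(h, gamma)
print(series_value, quadrature_value)
```

With r = 1/2 the leading correction r²/8 already accounts for nearly all of the bias, which illustrates how quickly the expansion converges when the noise radius is well below the true distance.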
Cite this article
Elkies, N., Fink, G. & Bärnighausen, T. “Scrambling” geo-referenced data to protect privacy induces bias in distance estimation. Popul Environ 37, 83–98 (2015). https://doi.org/10.1007/s11111-014-0225-0
DOI: https://doi.org/10.1007/s11111-014-0225-0