“Scrambling” geo-referenced data to protect privacy induces bias in distance estimation

  • Original Paper
  • Population and Environment

Abstract

Privacy concerns regarding the sharing of spatially referenced household data have induced researchers and survey agencies to “scramble” geographic information by adding random spatial errors to true location coordinates. In this paper, we prove mathematically that the addition of random noise leads to a systematic overestimation of distances between households and access points of interest. We illustrate this average distance bias as well as the attenuation bias generated by random spatial errors using data on household and health facility location from a Health and Demographic Surveillance Site in rural South Africa. Given the large overall biases observed, we argue that the use of scrambled spatial data for policy making or empirical work is generally not advisable, and that alternative methods of protecting data confidentiality should be used to ensure the usability of spatial data for quantitative analysis.

Notes

  1. See http://www.measuredhs.com/What-We-Do/GPS-Data-Collection.cfm for details.

  2. See for instance formulas 8.111#3 and 8.112#2 of: I.S. Gradshteyn and I.M. Ryzhik, Table of Integrals, Series, and Products (tr. and ed. Alan Jeffrey), New York: Academic Press 1980. The power series in γ that we obtain is probably known too, but easier to derive than to locate in the literature.

Author information

Corresponding author

Correspondence to Günther Fink.

Appendix: Proof of proposition 1

It is easy to show that in the degenerate case h = 0 the average distance 〈|h + v|〉 must exceed |h|. We shall show that this behavior 〈|h + v|〉 > |h| is typical even when |h| is comparable with, or considerably greater than, the typical size |v| of the noise vector.

The result 〈|h + v|〉 > |h| is a direct consequence of the triangle inequality (TI). Recall that the TI asserts that for any vectors x, y, we have

$$ \left| x \right| + \left| y \right| \ge \left| {x + y} \right|, $$

because the right-hand side is the distance from the origin to x + y along a straight line, and the left-hand side is the distance from the origin to x + y via x. Hence, equality |x| + |y| = |x + y| occurs if and only if x is contained in the closed line segment joining the origin to x + y, that is, if and only if one of x and y is a non-negative multiple of the other.

To prove that 〈|h + v|〉 > |h|, first note that 〈|h + v|〉 = 〈|h − v|〉 because v and −v have the same distribution. But then,

$$ \begin{gathered} 2\left\langle {\left| {h + v} \right|} \right\rangle = \left\langle {\left| {h + v} \right|} \right\rangle + \left\langle {\left| {h - v} \right|} \right\rangle \hfill \\ \quad \quad \quad \;\;\, = \left\langle {\left| {h + v} \right|} + {\left| {h - v} \right|} \right\rangle \hfill \\ \quad \quad \quad \;\;\, \geq \left\langle {\left| {\left( {h + v} \right) + \left( {h - v} \right)} \right|} \right\rangle \hfill \\ \quad \quad \quad \;\;\, = \left\langle {\left| {2h} \right|} \right\rangle \hfill \\ \quad \quad \quad \;\;\, = \left| {2h} \right| \hfill \\ \quad \quad \quad \;\;\, = 2\left| h \right|, \hfill \\ \end{gathered} $$

using TI with x = h + v and y = h − v for the inequality and the fact that |2h| is constant in the next-to-last step. Dividing both sides of the resulting inequality by 2, we deduce 〈|h + v|〉 ≥ |h| as claimed. Moreover, we can only have 〈|h + v|〉 = |h| when h + v and h − v satisfy the equality condition in the TI for every v, which is to say when every v is on the closed line segment joining −h to h.

To evaluate 〈|h + v|²〉, we begin in the same way:

$$ 2\left\langle {\left| {h + v} \right|^{2} } \right\rangle = \left\langle {\left| {h + v} \right|^{2} } \right\rangle + \left\langle {\left| {h - v} \right|^{2} } \right\rangle = \left\langle {\left| {h + v} \right|^{2} + \left| {h - v} \right|^{2} } \right\rangle , $$

and now apply the parallelogram identity

$$ \left| {x + y} \right|^{2} + \left| {x - y} \right|^{2} = 2\left| x \right|^{2} + 2\left| y \right|^{2} $$

(which can be obtained by writing each of the terms |x ± y|² on the left-hand side as an inner product (x ± y, x ± y) = (x, x) ± 2(x, y) + (y, y) and noting that the cross-terms ±2(x, y) sum to zero). Taking x = h and y = v, we then obtain

$$ 2\left\langle {\left| {h + v} \right|^{2} } \right\rangle = \left\langle {2\left| h \right|^{2} + 2\left| v \right|^{2} } \right\rangle = \left\langle {2\left| h \right|^{2} }\right\rangle + \left\langle {2\left| v \right|^{2} } \right\rangle = 2\left| h \right|^{2} + 2\left\langle {\left| v \right|^{2} } \right\rangle . $$

Dividing both sides by 2, we recover the identity 〈|h + v|²〉 = |h|² + 〈|v|²〉 claimed earlier.

Now recall that any real-valued random variable X satisfies 〈X²〉 ≥ 〈X〉² (the difference is the variance 〈(X − 〈X〉)²〉, which is clearly non-negative). Applying this to X = |h + v|, we find

$$ \left\langle {\left| {h + v} \right|} \right\rangle^{2} \le \left\langle {\left| {h + v} \right|^{2} } \right\rangle = \left| h \right|^{2} + \left\langle {\left| v \right|^{2} } \right\rangle \le \left( {\left| h \right| + \frac{{\left\langle {\left| v \right|^{2} } \right\rangle }}{2\left| h \right|}} \right)^{2} , $$

with strict inequality unless 〈|v|²〉 = 0. Hence, 〈|h + v|〉 ≤ |h| + 〈|v|²〉/(2|h|) as claimed.
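The two bounds and the second-moment identity can be checked by simulation. The following is a minimal sketch, not part of the original analysis: it assumes, purely for illustration, that the noise is drawn uniformly from a disk of radius `gamma`, that the point of interest sits at the origin, and that the true location is `(h, 0)`. The function name, sample size, and seed are all hypothetical choices.

```python
import math
import random


def scrambled_moments(h, gamma, n=200_000, seed=0):
    """Monte Carlo estimates of <|h+v|>, <|h+v|^2> and <|v|^2>.

    Assumption (illustrative only): v is uniform in the disk |v| <= gamma
    and the true location is (h, 0) with the access point at the origin.
    """
    rng = random.Random(seed)
    sum_dist = sum_dist_sq = sum_v_sq = 0.0
    for _ in range(n):
        rho = gamma * math.sqrt(rng.random())   # sqrt makes the draw uniform over area
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x = h + rho * math.cos(theta)
        y = rho * math.sin(theta)
        d_sq = x * x + y * y
        sum_dist += math.sqrt(d_sq)
        sum_dist_sq += d_sq
        sum_v_sq += rho * rho
    return sum_dist / n, sum_dist_sq / n, sum_v_sq / n


h, gamma = 2.0, 1.0
mean_dist, mean_sq, mean_v_sq = scrambled_moments(h, gamma)
print("lower bound |h|            :", h)
print("estimated <|h+v|>          :", mean_dist)
print("upper bound |h|+<|v|^2>/2|h|:", h + mean_v_sq / (2 * h))
print("<|h+v|^2> vs |h|^2+<|v|^2> :", mean_sq, h * h + mean_v_sq)
```

Under these assumptions the estimated mean distance should fall strictly between the two bounds, and the two second-moment quantities should agree up to sampling error.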

So far, our analysis did not depend on the choice of distribution of v or even on the dimension of the space. In practice, h and v are drawn from a two-dimensional space, though one may also consider one-dimensional problems as a simplified model (such as a community limited to a street or a long and narrow valley). We consider three possibilities:

  1. A one-dimensional space with v drawn uniformly from the interval [−γ, +γ] for some γ > 0, so h is replaced by a random number drawn uniformly from the interval [h − γ, h + γ] of length 2γ centered at h.

  2. A two-dimensional space with v drawn uniformly from the radius-γ circle |v| = γ about the origin, so h is replaced by a random point at distance exactly γ from h (a random point on the circle of radius γ about h). Even if this distribution is not used in practice, it is needed for the analysis of the next case.

  3. A two-dimensional space with v drawn uniformly from the radius-γ disk |v| ≤ γ about the origin, so h is replaced by a random point at distance at most γ from h (a random point in the disk of radius γ about h). In this case, our analysis requires that γ ≤ |h|, but this assumption will usually be satisfied in practice.
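The three noise distributions listed above can be sampled as in the following sketch. The function names and the example coordinates are hypothetical and chosen only for illustration; the one point worth noting is that a uniform draw over the disk needs the square root of a uniform variate for the radius, since area grows with the square of the radius.

```python
import math
import random


def noise_interval(gamma, rng):
    """Scenario 1: v uniform on the interval [-gamma, +gamma]."""
    return rng.uniform(-gamma, gamma)


def noise_circle(gamma, rng):
    """Scenario 2: v uniform on the circle |v| = gamma."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return gamma * math.cos(theta), gamma * math.sin(theta)


def noise_disk(gamma, rng):
    """Scenario 3: v uniform in the disk |v| <= gamma."""
    rho = gamma * math.sqrt(rng.random())   # sqrt gives a uniform density per unit area
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return rho * math.cos(theta), rho * math.sin(theta)


rng = random.Random(42)
household = (3.0, 4.0)                      # hypothetical true location
dx, dy = noise_disk(1.0, rng)
scrambled = (household[0] + dx, household[1] + dy)
print("scrambled location:", scrambled)
```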

In the one-dimensional case, the variance 〈|v|²〉 is given by the elementary integral

$$ \frac{1}{2\gamma }\int_{ - \gamma }^{ + \gamma } {v^{2} dv = \frac{1}{2\gamma }\left[ {\frac{{v^{3} }}{3}} \right]_{v = - \gamma }^{\gamma } } = \frac{{\gamma^{2} }}{3}, $$

so the added noise increases the mean squared distance 〈|h + v|²〉 from |h|² to |h|² + \( \gamma^{2}/3\). The expected distance 〈|h + v|〉 remains |h| as long as γ < |h|, since then |h + v| + |h − v| = 2|h| always. Once γ exceeds |h|, we distinguish two possibilities. In the first, |h| still exceeds the noise magnitude |v|. This happens with probability |h|/γ, and the average value of |h + v| in this case is still |h|. The other possibility is that |v| ≥ |h|, and then averaging |h + v| with |h − v| yields |v|. Since here |v| ranges uniformly from |h| to γ, its average value is (γ + |h|)/2. Combining the |v| < |h| and |v| ≥ |h| averages, weighted by their respective probabilities, we obtain

$$ \frac{\left| h \right|}{\gamma}\left| h \right| + \left( {1 - \frac{\left| h \right|}{ \gamma}} \right)\frac{\gamma + \left| h \right|}{2} = \frac{{\gamma^{2} + \left| h \right|^{2} }}{2\gamma } = \left| h \right| + \frac{{(\gamma - \left| h \right|)^{2} }}{2\gamma}. $$

Thus, when γ > |h|, replacing h by h + v increases the expected distance by \( \frac{{(\gamma - \left| h \right|)^{2} }}{2\gamma } \).
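The one-dimensional result can be verified numerically. The sketch below is illustrative only: it averages |h + v| over the interval [−γ, +γ] with a simple midpoint rule and compares the outcome with the closed form (γ² + |h|²)/(2γ) for γ > |h|, and with |h| otherwise. The function names and the grid size are hypothetical.

```python
def mean_abs_1d(h, gamma, n=100_000):
    """Midpoint-rule average of |h + v| for v uniform on [-gamma, +gamma]."""
    step = 2.0 * gamma / n
    total = 0.0
    for k in range(n):
        v = -gamma + (k + 0.5) * step
        total += abs(h + v)
    return total / n


def closed_form_1d(h, gamma):
    """Expected |h + v| according to the one-dimensional derivation."""
    h = abs(h)
    if gamma <= h:
        return h                                  # no bias while gamma <= |h|
    return (gamma ** 2 + h ** 2) / (2.0 * gamma)  # equals |h| + (gamma - |h|)^2 / (2 gamma)


for h, gamma in [(2.0, 1.0), (1.0, 3.0)]:
    print(h, gamma, mean_abs_1d(h, gamma), closed_form_1d(h, gamma))
```

In this one-dimensional model the average distance is unbiased until the scrambling radius exceeds the true distance, which is a useful contrast with the two-dimensional cases that follow.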

In the second scenario, |v| = γ always, so 〈|v|²〉 = γ² and 〈|h + v|²〉 = |h|² + γ². To compute 〈|h + v|〉, let θ ∈ [0, 2π) be the oriented angle from h to v. Then, θ is uniformly distributed in [0, 2π) and \( \left| {h + v} \right| = \sqrt {\left| h \right|^{2} + 2\gamma \left| h \right|\cos \theta + \gamma^{2}} \) by the Law of Cosines [or by expanding the inner product |h + v|² = (h + v, h + v) = (h, h) + 2(h, v) + (v, v)]. Thus,

$$ \left\langle {\left| {h + v} \right|} \right\rangle= \frac{1}{2\pi }\int\limits_{0}^{2\pi } {\sqrt {\left| h \right|^{2} + 2\gamma \left| h \right|\cos \theta + \gamma^{2}} \;d\theta .}$$

This integral is no longer elementary, except in the special cases where γ = 0, h = 0, or γ = |h|. [If γ = 0 or h = 0, then 〈|h + v|〉 = |h| or γ, respectively; if γ = |h|, then the identity 2 + 2 cos θ = 4 cos²(θ/2) simplifies the integral to \( \int\limits_{0}^{2\pi } {2\left| h \right| \left|\cos \left( {\theta /2} \right)\right|} d\theta = 8\left| h \right| \), whence 〈|h + v|〉 = (4/π)|h|.] Assume, then, that 0, γ, and |h| are distinct. Then, we may assume γ < |h| because our formula for 〈|h + v|〉 does not change if we switch γ with |h|. Then, our integral can be evaluated in terms of a complete elliptic integral of the second kind (see Note 2):

$$ \int\limits_{0}^{2\pi } {\sqrt {\left| h \right|^{2} + 2\gamma \left| h \right|\cos \theta + \gamma^{2} } \;d\theta = 4\left( {\left| h \right| + \gamma } \right){\bf{E}}^\prime\left( {\frac{\left| h \right| - \gamma }{\left| h \right| + \gamma }} \right).} $$

It would take substantial work to recover the behavior of 〈|h + v|〉 from this rather exotic formula. We thus work directly with the integral, expanding it as a power series in γ that converges in the interval |γ| < |h|.

It will be convenient to regard h and v as complex numbers in the usual way. Then, \( h + v = h ({{1 + re^{i\theta } }})\), where r = γ/|h| < 1. We then have

$$ \left| {h + v} \right| = \left| h \right|\left| {1 + re^{i\theta } } \right| = \left| h \right|\left( {\left( {1 + re^{i\theta } } \right)\overline{\left({1 + re^{i\theta}} \right)}} \right)^{\frac{1}{2}} , $$

and since the complex conjugate of \( 1 + re^{i\theta } \) is \( 1 + re^{ - i\theta } \), this gives

$$ \left| {h + v} \right| = \left| h \right|\left( {\left( {1 + re^{i\theta } } \right)\left( {1 + re^{ - i\theta } } \right)} \right)^{\frac{1}{2}} = \left| h \right|\sqrt {1 + re^{i\theta } } \sqrt {1 + re^{ - i\theta }}. $$

We expand each of the factors \( \sqrt {1 + re^{ \pm i\theta } } \) using the binomial series

$$ \sqrt {1 + z} = \left( {1 + z}\right)^{\frac{1}{2}} = a_{0} + a_{1} z + a_{2} z^{2} + a_{3} z^{3} + a_{4} z^{4} + \ldots , $$

valid and absolutely convergent for all complex z such that |z| ≤ 1, where

$$ a_{0} = 1,\; a_{1} = \frac{1}{2},\; a_{2} = - \frac{1}{8},\; a_{3} = \frac{1}{16},\; a_{4} = - \frac{5}{128},\; a_{5} = \frac{7}{256}, $$

and in general the coefficient \( a_{m} \) is \( \left( \frac{1}{2} \right) \left( { - \frac{1}{2}} \right) \left( { - \frac{3}{2}} \right) \left( { - \frac{5}{2}} \right) \cdots \left( { - m + \frac{3}{2}} \right) / m! \). Thus, \( \sqrt {1 + re^{i\theta } } \sqrt {1 + re^{ - i\theta } } \) is the sum of the terms \( a_{m} a_{n} r^{m + n} e^{i(m - n)\theta } \) over all pairs (m, n) of whole numbers. The integral of such a term over 0 ≤ θ ≤ 2π is \( 2\pi a_{n}^{2} r^{2n} \) if m = n and zero otherwise. Summing over m, n, we find that \( 4\left( {\left| h \right| + \gamma } \right){\bf{E}}^\prime\left( {\frac{\left| h \right| - \gamma }{\left| h \right| + \gamma }} \right) \) is the sum of the terms \( 2\pi \left| h \right| a_{n}^{2} r^{2n} \) over n = 0, 1, 2, 3,…, and thus that

$$ \left\langle {\left| {h + v} \right|} \right\rangle = \left| h \right|\left( {a_{0}^{2} + a_{1}^{2} r^{2} + a_{2}^{2} r^{4} + a_{3}^{2} r^{6} + \cdots } \right) $$
$$ = \left| h \right| + \frac{1}{4}\frac{{\gamma^{2} }}{\left| h \right|} + \frac{1}{64}\frac{{\gamma^{4} }}{{\left| h \right|^{3} }} + \frac{1}{256}\frac{{\gamma^{6} }}{{\left| h \right|^{5} }} + \frac{25}{16384}\frac{{\gamma^{8} }}{{\left| h \right|^{7} }} + \cdots $$
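The binomial coefficients and the resulting series can be cross-checked numerically. The sketch below is an illustration under stated assumptions, not the authors' computation: it builds the coefficients exactly with `fractions.Fraction`, sums the series for r = γ/|h| < 1, and compares it with a midpoint-rule evaluation of the angular average. All function names, the truncation at 40 terms, and the grid size are hypothetical choices, and h is treated as a positive real distance.

```python
import math
from fractions import Fraction


def sqrt_series_coeff(m):
    """Exact coefficient a_m of z**m in the binomial series of sqrt(1 + z)."""
    coeff = Fraction(1)
    for k in range(m):
        coeff *= (Fraction(1, 2) - k) / (k + 1)   # a_{k+1} = a_k * (1/2 - k) / (k + 1)
    return coeff


def mean_distance_circle_series(h, gamma, terms=40):
    """Partial sum of |h| * sum_n a_n**2 * r**(2n); requires gamma < |h|."""
    r = gamma / abs(h)
    partial = sum(float(sqrt_series_coeff(n)) ** 2 * r ** (2 * n) for n in range(terms))
    return abs(h) * partial


def mean_distance_circle_numeric(h, gamma, n=20_000):
    """Midpoint-rule average over theta of sqrt(h^2 + 2*gamma*h*cos(theta) + gamma^2)."""
    total = 0.0
    for k in range(n):
        theta = 2.0 * math.pi * (k + 0.5) / n
        total += math.sqrt(h * h + 2.0 * gamma * h * math.cos(theta) + gamma * gamma)
    return total / n


print([str(sqrt_series_coeff(m)) for m in range(6)])
# → ['1', '1/2', '-1/8', '1/16', '-5/128', '7/256']

h, gamma = 2.0, 1.0
print("series :", mean_distance_circle_series(h, gamma))
print("numeric:", mean_distance_circle_numeric(h, gamma))
print("leading-order approximation |h| + gamma^2/(4|h|):", h + gamma ** 2 / (4 * h))
```

For moderate r the leading-order term γ²/(4|h|) already accounts for nearly all of the bias, which is why the higher-order terms matter only when the scrambling radius is close to the true distance.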

We note in passing that the special case γ = |h| (that is, r = 1) yields the amusing formula

$$ \frac{4}{\pi } = \sum\limits_{n = 0}^{\infty } a_{n}^{2} = 1 + \left( \frac{1}{2} \right)^{2} + \left( \frac{1}{8} \right)^{2} + \left( \frac{1}{16} \right)^{2} + \left( \frac{5}{128} \right)^{2} + \left( \frac{7}{256} \right)^{2} + \cdots $$
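A quick numerical look at this identity, offered as a sketch only: the partial sums of the squared coefficients approach 4/π from below, with the coefficients generated by the recurrence a_{n+1} = a_n (1/2 − n)/(n + 1). The function name and the choice of truncation points are hypothetical.

```python
import math


def partial_sum_squared_coeffs(terms):
    """Sum of a_n**2 for n = 0 .. terms-1, with a_n the sqrt(1+z) series coefficients."""
    a = 1.0          # a_0
    total = 0.0
    for n in range(terms):
        total += a * a
        a *= (0.5 - n) / (n + 1)   # step from a_n to a_{n+1}
    return total


target = 4.0 / math.pi
for terms in (6, 100, 10_000):
    s = partial_sum_squared_coeffs(terms)
    print(terms, s, target - s)
```

The gap shrinks roughly like the inverse square of the number of terms, so convergence is steady but not fast.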

In the more general and final scenario 3, v is drawn uniformly from the radius-γ disk |v| ≤ γ about the origin. We integrate over this disk using polar coordinates, again using for θ the oriented angle from h to v. We then find

$$ \left\langle {\left| v \right|^{2} } \right\rangle = \frac{1}{{\pi \gamma^{2} }}\int_{\rho = 0}^{\gamma } {\rho \int_{\theta = 0}^{2\pi } {\rho^{2} d\theta d\rho = \frac{2\pi }{{\pi \gamma^{2} }}\int_{\rho = 0}^{\gamma } {\rho^{3} d\rho = \frac{2}{{\gamma^{2} }}\left[ {\frac{{\rho^{4} }}{4}} \right]_{\rho = 0}^{\gamma } } } = \frac{{\gamma^{2} }}{2},} $$

and thus \( \left\langle {\left| {h + v} \right|^{2} } \right\rangle = \left| h \right|^{2} + \frac{1}{2}\gamma^{2} \). For the average distance, we write

$$ \left\langle {\left| {h + v} \right|} \right\rangle = \frac{1}{{\pi \gamma^{2} }}\int_{\rho = 0}^{\gamma } {\rho \int_{\theta = 0}^{2\pi } {\sqrt {\left| h \right|^{2} + 2\rho \left| h \right|\cos \theta + \rho^{2} }} \;d \theta \; d\rho .} $$

Again the integral is not elementary. As long as γ < |h|, we have ρ < |h| for all ρ in [0, γ], so we can use our power series for the integral over θ and integrate each term \( 2\pi \left| h \right| a_{n}^{2} (\rho /\left| h \right|)^{2n} \) (with n = 0, 1, 2, 3,…), obtaining

$$ \frac{1}{{\pi \gamma^{2} }}2\pi \left| h \right|^{1 - 2n} a_{n}^{2} \int_{\rho = 0}^{\gamma } {\rho^{2n + 1} d\rho = \frac{{2\left| h \right|^{1 - 2n} a_{n}^{2} }}{{\gamma^{2} }}\left[ {\frac{{\rho^{2n + 2} }}{2n + 2}} \right]_{\rho = 0}^{\gamma } } = \left| h \right|\frac{{a_{n}^{2} }}{n + 1}r^{2n} , $$

where r = γ/|h| as before. Therefore, in this case, we obtain the power series expansion

$$ \left\langle {\left| {h + v} \right|} \right\rangle = \left| h \right|\left( {a_{0}^{2} + \frac{{a_{1}^{2} }}{2}r^{2} + \frac{{a_{2}^{2} }}{3}r^{4} + \frac{{a_{3}^{2} }}{4}r^{6} + \cdots } \right) $$
$$ = \left| h \right| + \left( {\frac{1}{8}\frac{{\gamma^{2} }}{\left| h \right|} + \frac{1}{192}\frac{{\gamma^{4} }}{{\left| h \right|^{3} }} + \frac{1}{1024}\frac{{\gamma^{6} }}{{\left| h \right|^{5} }} + \frac{5}{16384}\frac{{\gamma^{8} }}{{\left| h \right|^{7} }} + \cdots } \right). $$
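The disk expansion can be checked against direct numerical integration in polar coordinates. The following sketch makes several illustrative assumptions: h is a positive real distance, γ < |h|, the series is truncated at 40 terms, and the double integral uses a plain midpoint rule on a 400 × 400 grid. Function names are hypothetical.

```python
import math


def mean_distance_disk_series(h, gamma, terms=40):
    """Partial sum of |h| * sum_n a_n**2 / (n + 1) * r**(2n), with r = gamma/|h| < 1."""
    r = gamma / abs(h)
    a = 1.0          # a_0
    total = 0.0
    for n in range(terms):
        total += a * a / (n + 1) * r ** (2 * n)
        a *= (0.5 - n) / (n + 1)   # step from a_n to a_{n+1}
    return abs(h) * total


def mean_distance_disk_numeric(h, gamma, n_rho=400, n_theta=400):
    """Midpoint-rule average of |h + v| over the disk |v| <= gamma."""
    total = 0.0
    for i in range(n_rho):
        rho = gamma * (i + 0.5) / n_rho
        ring = 0.0
        for k in range(n_theta):
            theta = 2.0 * math.pi * (k + 0.5) / n_theta
            ring += math.sqrt(h * h + 2.0 * rho * h * math.cos(theta) + rho * rho)
        total += rho * ring / n_theta          # rho is the polar area weight
    # (1 / (pi gamma^2)) * 2 pi * (gamma / n_rho) * total = 2 * total / (gamma * n_rho)
    return 2.0 * total / (gamma * n_rho)


h, gamma = 2.0, 1.0
print("series :", mean_distance_disk_series(h, gamma))
print("numeric:", mean_distance_disk_numeric(h, gamma))
print("leading-order approximation |h| + gamma^2/(8|h|):", h + gamma ** 2 / (8 * h))
```

Comparing this with the circle case shows the expected halving of the leading bias term, from γ²/(4|h|) to γ²/(8|h|), because points inside the disk are on average closer to the true location than points on its boundary.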

Cite this article

Elkies, N., Fink, G. & Bärnighausen, T. “Scrambling” geo-referenced data to protect privacy induces bias in distance estimation. Popul Environ 37, 83–98 (2015). https://doi.org/10.1007/s11111-014-0225-0

  • DOI: https://doi.org/10.1007/s11111-014-0225-0
