Elsevier

Applied Geography

Volume 49, May 2014, Pages 45-53
Accidental, open and everywhere: Emerging data sources for the understanding of cities☆☆

https://doi.org/10.1016/j.apgeog.2013.09.012

Highlights

  • This paper reviews the emergence of three groups of data sources for the analysis of cities.

  • These are data collected from mobile sensors carried by individuals, data extracted from internet sites and government data released in an open format.

  • The paper also assesses some of the opportunities and challenges they pose for the understanding of cities, particularly in the context of the Regional Science and urban research agenda.

  • Existing projects and initiatives that conform to each class are featured as illustrative examples of these new potential sources of knowledge.

Abstract

In this paper, I review the recent emergence of three groups of data sources and assess some of the opportunities and challenges they pose for the understanding of cities, particularly in the context of the Regional Science and urban research agenda. These are data collected from mobile sensors carried by individuals, data derived from businesses moving their activity online and government data released in an open format. Although very different from each other, they are all becoming available as a side-effect: they were created for other purposes, but their popularity, pervasiveness and ease of access are turning them into interesting alternatives for researchers. Existing projects and initiatives that conform to each class are featured as illustrative examples of these new potential sources of knowledge.

Introduction

These are exciting times to be an urban scientist. Not only is the world as a whole becoming more and more urbanized, now that the historical threshold of more people living in cities than in rural areas has already been surpassed (UN Department of Economic and Social Affairs, 2008), but our ability to look into the inner workings of urban systems is growing at an even faster rate (Batty, 2012). An increasing number of aspects of human life can be traced through diverse digital footprints and, when aggregated, can reveal emerging patterns. Many economic transactions that used to be done offline have now moved onto the web, and their archiving has created, as a “side-effect”, incredible amounts of data that reflect many aspects of human behavior. Democratic governments have not been completely foreign to technological change either. Many local, regional, national and supra-national public institutions are moving parts of their infrastructure into cyberspace and responding to pressure from activists who demand more transparency by releasing some of those data in open formats. None of these recent societal changes explicitly intended to redefine the “data landscape” available to urban researchers, but they have, making possible analyses at degrees of detail and scope unthinkable only a few years ago. The traditional creativity that applied researchers (geographers, economists, etc.) have developed to measure and quantify urban phenomena in contexts where data were scarce is being given a whole new field of action.

The amount and diversity of new data sources relating to cities that are becoming available grow exponentially, to the point that it may seem unrealistic to look at all of them as one entity. However, this paper argues that many of them share three key characteristics that make them particularly well suited to current urban research: their accidental nature, their open availability to researchers, and the ubiquity of their presence in everyday urban life. First, unlike a census or an economic survey, specifically created with research and policy analysis in mind, these sources were not originally intended for this end but for other purposes. Their potential usefulness for scientists thus comes accidentally, as a byproduct. Second, and partly related to the first point, all of these sources are available to researchers without the need to pay any fee or reach exclusive deals with the company or institution providing them. Finally, given the degree of pervasiveness reached by the technologies and services where they originate, new datasets relating to virtually any quantifiable aspect of human life are appearing. As in other fields (e.g. see Edelman, 2012; Einav and Levin, 2013, for recent reviews in the case of economics), the combination of these three factors creates a significant opportunity for urban and regional scientists to study new phenomena or to examine old questions with new insight. Very much in line with the views of Overman (2010) in relation to Geographic Information Systems (GIS), these data can in turn help: reduce location measurement error of observations (although they may introduce other biases, see Section Challenges); avoid the issue of discretizing continuous problems; fill gaps where traditional data are unlikely to exist; and design instrumentation strategies as a source of exogenous variation.

The main line of argument is that most of these data sources fall into one of three main groups, based on the basic actor and the nature of the process from which they originate. The first category comprises data collected in a bottom-up approach from mobile sensors carried by humans. At an intermediate level, we can identify databases employed by web companies to provide a (usually free) service through the internet. These are typically aggregated from several primary sources and derive from businesses which either move or base their activity on the internet. The last group is characterized by the top-down fashion in which it is collected, and it has to do with data released in an open format by public and government organizations at different geographical levels. This classification is not exclusive and may be combined with other ones as well as inter-mixed (e.g. open government data collected from mobile sensors, as in what has become known as “civic apps”). It is based on the intrinsic nature of the data origins and, although simple, it can be powerful to better interpret their attributes and, particularly, the type of processes or phenomena they may be reflecting. Ultimately, it is a good understanding of what the data can and cannot “tell” that makes it possible to incorporate them into meaningful studies.

Although potentially very advantageous, the use of these data is not free of challenges. Most of them derive from their accidental nature, from the fact that they were not originally intended for this use. In particular, the major flaw may relate to the quality of the data: depending on what it is that we are trying to measure, the degree of completeness and bias in the population samples can compromise results and lead to misleading conclusions. But those are not the only hurdles to be confronted. Because they were often not intended to be used in bulk, collection can be tricky and require some programming and database skills to access the sources. Once collected, the characteristics of the data may require methodologies and techniques not yet very familiar to the field. In some cases, as in what has come to be known as “big data”, the size and lack of structure of the datasets are such that applying traditional techniques may not be the preferred solution and other methods, such as machine learning (Bishop, 2006) or knowledge discovery from databases (KDD) techniques (Miller, 2010), as well as advanced visualizations (Batty & Cheshire, 2012), may prove more fruitful. Section Challenges will discuss these issues in more detail.

When dealing with such a broad topic, it is almost as useful to explicitly state what is not included as it is to describe what is covered. It is important to make clear that the main aim of this paper is none of the following. First, it does not intend to be an exhaustive survey of all the literature that has already taken advantage of these new kinds of data. Although not vast (yet), the amount of publications using any of these three sources is large and sparse enough that any attempt would be incomplete. Instead, I provide a few illustrative projects as examples of the advantages to be gained and the challenges to be faced. Second, this piece is not about every possible new source of data that is becoming available through the web or from public governments. The three categories in which the data sources featured are conceptualized are fairly broad and do include many of the new kinds of data appearing nowadays; however, there exist alternative ones that do not fit well under any of the three labels proposed in this work. Third, this paper will not deal with opportunities arising from the use of these data in contexts other than academic research in the fields of urban and regional science. This is not to say those are nonexistent or irrelevant; on the contrary, applications in other fields can be highly beneficial, both in private (e.g. geo-targeted marketing) and social (e.g. disaster management, social services efficiency) terms. However, the focus of this paper is on bringing these new advances to the attention of those two academic communities in the hope that it will ease their adoption for future research and, as such, it will be confined to that specific end.

This paper takes a practical approach by exposing the nature of these data sources in an accessible way. This is done purposely to reach as many potentially concerned regional and urban researchers as possible and stir their interest. For the advanced reader, a more explicit treatment of ontological and epistemological aspects of the use of this kind of data can be found in Warf and Sui (2010), Boyd and Crawford (2012) or Crampton et al. (2013). Equally important aspects, such as its political economy or issues underlying its production, are discussed in Leszczynski (2012) or in a recently compiled edition by Lisa Gitelman (2013). The rest of the text is structured as follows: Sections “Citizens as sensors”: collecting data from the bottom-up; Businesses moving online (and creating data in the process); and Open governments, open data describe the emergence and characteristics of the three different categories mentioned above, suggest how they can be helpful for researchers interested in urban issues and feature projects and initiatives led by different actors that serve as real illustrations; Section Challenges discusses some of the challenges that these new data sources pose when contrasted with the ones traditionally used by the social sciences; and Section Concluding remarks concludes with a few remarks and highlights.

Section snippets

“Citizens as sensors”: collecting data from the bottom-up

The invention of the internet and its ubiquitous presence nowadays, particularly reinforced with the emergence of mobile devices such as smartphones and tablets, has created a platform in which every aspect of life is liable to leave a digital trace. Not only obvious ones like internet behavior (browsing patterns) or economic activity

Businesses moving online (and creating data in the process)

Not only are individuals' lives moving online; companies are also hopping on the internet train. In certain sectors, the popularization of the web has created important challenges but also opportunities for the traditional business model. Some firms have embraced them and have significantly increased their productivity and efficiency. Although this technology has been inserted in many diverse ways at different stages of the production chain, its inclusion as an additional factor has always been

Open governments, open data

In contrast to the data in Section “Citizens as sensors”: collecting data from the bottom-up, the last family of sources is the reflection of a “top-down” process, in which public organizations release some of their internal data in open format. In effect, governmental organizations, from the national level down to local authorities, are making available increasing parts of the data they collect while developing their activities. This process is fueled mainly by four strategic drivers (Shadbolt,

Challenges

So far, this paper has stressed only the benefits of these new sources of data. The reader may at this point be tempted to wonder why, other than their novelty, they have not been used more intensively in urban research. One of the most obvious answers is that they, like many other types of data typically used, also have some drawbacks and imperfections that may prevent their use in some contexts. In order to obtain a full picture of the characteristics and nature of the data reviewed above,

Concluding remarks

This paper has reviewed the emergence of three new sources of data that may be useful for the regional and urban scientific communities. These are data coming from individuals carrying location-aware devices, from businesses moving (some of) their activity online and from governments releasing an increasing share of their data in open formats. For each source, a detailed characterization has been given as well as a real world case that serves as an example. A particular focus has been set on

References (62)

  • C.M. Bishop. Pattern recognition and machine learning (information science and statistics) (2006)
  • D. Boyd et al. Critical questions for big data. Information, Communication & Society (2012)
  • Cabinet Office. Open data white paper. Unleashing the potential (2012)
  • Z. Cheng et al. Exploring millions of footprints in location sharing services
  • J.W. Crampton et al. Beyond the geotag: situating big data and leveraging the potential of the geoweb. Cartography and Geographic Information Science (2013)
  • J. Cranshaw et al. The livehoods project: utilizing social media to understand the dynamics of a city
  • J. Cranshaw et al. Bridging the gap between physical location and online social networks
  • D. DeLyser et al. Crossing the qualitative-quantitative divide II: inventive approaches to big data, mobile methods, and rhythm analysis. Progress in Human Geography (2012)
  • D.T. Duncan et al. Validation of Walk Score® for estimating neighborhood walkability: an analysis of four US metropolitan areas. International Journal of Environmental Research and Public Health (2011)
  • B. Edelman. Using internet data for economic research. Journal of Economic Perspectives (2012)
  • L. Einav et al. The data revolution and economic analysis (2013)
  • Facebook, Inc. (2012, Accessed 05.09.12)....
  • M. Feldman et al. Innovative data sources for regional economic analysis
  • Front Seat. Walk Score methodology. White paper (2011)
  • FrontSeat (2012, Accessed 05.09.12)....
  • M. Goodchild. Citizens as sensors: the world of volunteered geography. GeoJournal (2007)
  • M. Graham et al. Augmented realities and uneven geographies: exploring the geolinguistic contours of the web. Environment and Planning A (2013)
  • J. Gray et al. The data journalism handbook (2012)
  • G. King. Ensuring the data-rich future of the social sciences. Science (2011)
  • R. Kitchin. The real-time city? Big data and smart urbanism (2013)
    This article belongs to New Urban Worlds: Application, Policy, & Change.

    ☆☆ This manuscript was prepared for the special session “Urban Futures 2050”, held in August at the 2012 ERSA meeting in Bratislava, Slovakia. The author would like to thank Julia Koschinsky, Ellen Schwaller and Emmanouil Tranos for their comments on a previous version of the paper. All remaining errors are the responsibility of the author.
