Delineating urban agglomerations across the world: a dataset for studying the spatial distribution of academic research at city level

In this data paper, we provide a global dataset of urban perimeters that has been built in order to study the evolving geography of scientific activity at city level across the world. The method developed for building these agglomerations combines the distribution of population density and the distance between geolocalized scientific publications (issued between 1999 and 2014), whose authors' addresses have been systematically geocoded. The location of scientific production is obtained by processing bibliographic data retrieved from the Web of Science Core Collection (Clarivate Analytics). In the first part of the article, we detail the geocoding stage of our methodology. Next, we discuss the importance of delineating homogeneous urban perimeters to study the world geography of science production and we detail the methodology used to build these perimeters (the clustering stage). After discussing the extent to which our work can be re-used and enriched, we propose to use this dataset in order to capture the worldwide distribution of Cybergeo publications at the city level between 2015 and 2017.

Marion Maisonobe, Laurent Jégou and Denis Eckert
Cybergeo: European Journal of Geography, Data papers

Introduction: Defining a set of comparable agglomerations at the world scale

1 Delineating urban areas at the world scale is a challenge because homogeneous datasets are lacking at this scale. This problem is well known in comparative urban research (Moriconi-Ebrard, 1991; Pumain et al., 2015). To build comparable international datasets at city level, geographers frequently aggregate small spatial units into bigger entities (agglomeration, functional urban region, etc.) that make more sense for data interpretation. At the European scale, important work has been done to delineate functional urban regions using data on home-work commutes (Guérois et al., 2014). Unfortunately, the information required to delineate functional urban areas is not available with homogeneous quality for all countries, and few datasets offer high resolution at the scale of the entire world, except for simplified land use or population density. To study the distribution of scientific activity in the world, we decided to develop a new methodology. With some adjustments, we believe this methodology could now serve various purposes. Indeed, our aim was to produce universal delimitation criteria, not divisions corresponding to a juxtaposition of national criteria (for example, using SMSAs for the United States, Aires Urbaines in France, etc.). To do so, we used both the spatial distribution of scientific affiliations indexed in a bibliographic database (localities from which scientific publications are authored) and the distribution of population density (fine-grained raster data). For the world's 500 most "publishing" localities, the delineations have been double-checked by members of our team and, where verification was needed, by specialists of the regions or countries concerned.

3 In this data paper, we give access to the cartographic information delineating agglomerations that account for more than 80% of world scientific production between 1999 and 2014. First, we explain the specificity of the distribution of scientific production and detail the source and the geocoding stage; then we detail the methods used to delineate a set of comparable agglomerations. After a short discussion, we show how these spatial delineations can be used to analyse the spatial distribution of Cybergeo's publications between 2015 and 2017.
The specificity of the spatial distribution of scientific activities and its geocoding

4 Science is done in different types of settings: universities, research centres, R&D departments of enterprises, hospitals, academies, observatories, NGOs, etc. The places of science are diverse and some of them were established centuries ago (Livingstone, 2003). As a result, the spatial logics explaining the distribution of scientific activities are diverse, and this distribution does not perfectly match that of the population (Maisonobe, 2015). For instance, university cities such as Oxford, Cambridge or College Station are mostly populated by students and professors. These locations are often nonmetropolitan. Further, when scientific venues are located in metropolitan areas, they can be found both in old city centers and in suburban areas. Sometimes, remoteness is justified by scientific purposes: for instance, astronomical observatories should be located at high elevation and marine observatories near the sea. In other cases, economic or urban planning rationales drive the decisions, such as the project of relocating Parisian universities to the Greater Paris suburbs. Thus, within a single metropolitan area, scientific production can be authored from a wide variety of postal addresses. In countries where the administrative fragmentation of the territory is very pronounced, such as France (almost 37 000 municipalities), metropolitan areas can encompass dozens of "publishing" localities.

5 To study this phenomenon in an international comparison, we used the Web of Science Core Collection, which is one of the most comprehensive bibliographic data sources. This database indexes more than 1 million scientific publications (articles, reviews and letters) annually. The base element of this source is the bibliographic record, which lists the authors, their institutions and the postal addresses of these institutions.
6 Within the framework of a partnership with the French statistical body OST-HCERES (Observatoire des Sciences et Techniques), we retrieved all the authors' addresses indexed from 1999 to 2014 and geocoded them. The bulk of the geocoding work was done by supervised automatic matching, using a twofold process. First, we used available geographical databases such as GeoNames 1 and Nominatim 2 (the gazetteer of the OpenStreetMap project) to assign geographical coordinates to authors' addresses. These digital resources allowed us to geocode approximately 80% of the total number of addresses, those with easily recognizable city-province-country triplets. The problem was to localize the remaining 20% of addresses, about 40,000 items.

7 We faced several geocoding challenges:
• Rather small localities not mentioned in open gazetteers;
• Confusion between street names, neighbourhoods and cities in the source data;
• Confusion between institutions' names and cities in the source data;
• General lack of province or sub-country level information;
• Homonymy between city names and province names within the same country.

8 The second phase of the geocoding process used automatic services such as the Google Maps API and Microsoft's Bing Maps API to further improve the geocoding. These web services have access to much larger toponym databases than free gazetteers and can automatically resolve ambiguities between several homonymous locations, helped by spatial hints such as the country or the province. A human operator supervised this procedure: the operator could input significant parameters such as the country or geographical zone of search and, above all, examine and correct the results if needed. To help work with these web services and embed their use in a more integrated and fluid geocoding procedure, we developed a range of web applications, from data correction to geocoding and the evaluation of results (Jégou, 2014). The next phase of the work is the spatial clustering of these "localities" in order to build scientific "agglomerations".
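The two-phase logic described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the in-memory `GAZETTEER`, its approximate coordinates, and the function names `match_triplet` and `geocode_batch` are all hypothetical stand-ins for a real gazetteer lookup. The sketch resolves unambiguous city-province-country triplets automatically and sets aside unknown or homonymous ones for a second, supervised pass.

```python
from typing import Optional

# Toy gazetteer keyed by (city, province, country) with (lat, lon) values.
# Entries and coordinates are illustrative and approximate; a real run
# would query a gazetteer such as GeoNames or Nominatim instead.
GAZETTEER = {
    ("toulouse", "", "france"): (43.6047, 1.4442),
    ("springfield", "il", "usa"): (39.7817, -89.6501),
    ("springfield", "ma", "usa"): (42.1015, -72.5898),
}


def normalise(part: str) -> str:
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(part.lower().split())


def match_triplet(city: str, province: str, country: str) -> Optional[tuple]:
    """Return coordinates only when exactly one gazetteer entry matches."""
    city, province, country = map(normalise, (city, province, country))
    candidates = [
        coords
        for (c, p, k), coords in GAZETTEER.items()
        # A missing province matches every province of the country,
        # which is how homonyms end up ambiguous.
        if c == city and k == country and (not province or p == province)
    ]
    return candidates[0] if len(candidates) == 1 else None


def geocode_batch(addresses):
    """Split addresses into automatically resolved ones and a pending list."""
    resolved, pending = {}, []
    for addr in addresses:
        coords = match_triplet(*addr)
        if coords is None:
            pending.append(addr)  # left for the supervised second phase
        else:
            resolved[addr] = coords
    return resolved, pending


resolved, pending = geocode_batch([
    ("Toulouse", "", "France"),
    ("Springfield", "IL", "USA"),
    ("Springfield", "", "USA"),  # homonym without province: ambiguous
])
```

In this toy run the address lacking a province stays in `pending`, which mirrors the kind of case the paper hands over to web services and a human operator.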
Construction of a set of urban agglomerations for international comparison

12 Given the worldwide comparative scope of our work, we consider the locality level inadequate for international comparison. The characteristics of the mailing address, originally designed for postal use, the geographical variability of postal reference systems and the great diversity of administrative divisions prevent any direct comparison between "scientific localities" (postal addresses from which publications are authored). Our team addressed this problem by building globally comparable geographical entities at an agglomerated level.

13 Surprisingly, most authors who deal with the spatialisation of scientific activity using publication data do not address this issue. Following the spatial turn in scientometric studies (Frenken et al., 2009), many articles in scientometrics present results at the level of geocoded addresses without clustering them into urban areas (e.g. Waltman et al., 2011; Pan et al., 2012; Masselot, 2016; Csomós, 2018). These articles do not consider the statistical heterogeneity of the geographical entities they compare.
14 Yet some exceptions can be mentioned:
• Matthiessen et al. (2002, 2010) address the issue but only focus on a few metropolitan areas ("world cities of scientific knowledge"), mainly in the US and Europe;
• Comin (2009) [...] to the one presented here, by also using population density data;
• Several scholars in regional economics use the European NUTS nomenclature (level 2 or 3).
15 In order to cluster scientific localities, we exploit global datasets that are fine-grained and of comparable quality for the whole world.
16 Different global datasets met these conditions at the time we began the study: on the one hand, land cover data, such as ESA Iona GlobCover (ESA, 2005) or Global UrbanExtent (Schneider et al., 2009); on the other, datasets focusing on population density. Comparing the urban areas obtained from land artificialization data with those obtained from population density data, we concluded that land artificialization was not the best criterion for our purpose (Eckert et al., 2013). This holds particularly true in continuously built-up coastal areas (especially in tourist destinations), which do not necessarily match continuous human occupation and are even less likely to harbour areas of scientific activity. By contrast, data on population density allowed us to delineate urban spots that better correspond to the limits of "local innovation systems" (Bathelt et al., 2004).
17 To delineate urban zones by taking into account the distribution of population density, we used a dataset produced by SEDAC 4 (the Socioeconomic Data and Applications Center of NASA) for the year 2005 5 (SEDAC, 2005).
18 Because population density varies so widely, it was impossible to define a single, universal threshold value that would differentiate urban areas, in particular in the most densely urbanised regions of the world. To mitigate this problem, we decided to reason in relative terms. The solution was to use an indicator that identifies changes in the gradient of the spatial distribution of population density. Several spatial analysis indicators can serve this purpose, among them the Local Indicators of Spatial Association (LISA), which include the local Moran's I. Applied to the population density distribution in a homogeneous way over the territory, this indicator delimits spatially significant areas called "density nuclei" (Anselin, 1995). The urban areas automatically delineated in this way were then combined with the spatial distribution of scientific localities.

19 For the densest urban spaces in terms of both population and scientific production, we checked the delimitation on a case-by-case basis, sometimes consulting local experts. For instance, we took into account the spatial distribution of nearby scientific localities (e.g. including a small locality near an existing agglomeration). We also sometimes considered the presence of key transport infrastructures connecting neighbouring urban zones (highways, bridges or ferries). The relevance of such meticulous work to define the boundaries of the most populated agglomerations is beyond doubt. In 2012, this procedure allowed us to delineate 376 agglomerations encompassing the 500 most publishing localities (ranked by decreasing total number of scientific publications in 2008). Some of these 500 localities were assigned to the same agglomeration, like Manchester and Liverpool. Thus, the number of manually delineated agglomerations in the final dataset is lower than 500.
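To make the local Moran's I mentioned in paragraph 18 more concrete, here is a minimal sketch on a toy density grid. It assumes rook contiguity (four neighbours) and row-standardised weights, and it omits the permutation-based significance test that a full LISA analysis would include; the paper does not state which weights or thresholds were used, so these choices and the function names `local_morans_i` and `density_nuclei` are assumptions made for illustration only.

```python
def local_morans_i(grid):
    """Local Moran's I per cell: I_i = (z_i / m2) * mean of neighbours' z_j.

    Assumes rook contiguity and row-standardised weights (an assumption,
    not a setting documented in the paper).
    """
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance of the grid
    if m2 == 0:
        raise ValueError("local Moran's I is undefined for a constant grid")
    scores = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                grid[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < rows and 0 <= cc < cols
            ]
            # Spatial lag: average deviation of the neighbouring cells.
            lag = sum(v - mean for v in neighbours) / len(neighbours)
            scores[r][c] = (grid[r][c] - mean) / m2 * lag
    return scores


def density_nuclei(grid):
    """Cells above the mean whose neighbours are also high (high-high cells)."""
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    scores = local_morans_i(grid)
    return {
        (r, c)
        for r, row in enumerate(grid)
        for c, v in enumerate(row)
        if v > mean and scores[r][c] > 0
    }


# Toy population-density raster with one dense corner.
toy = [
    [100, 90, 5, 5],
    [95, 85, 5, 5],
    [5, 5, 5, 5],
    [5, 5, 5, 5],
]
nuclei = density_nuclei(toy)
```

Because the score compares each cell with the overall mean and with its neighbours, the same code picks out relative density peaks in sparsely and densely populated regions alike, which is the reason given above for preferring it to a fixed threshold.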
20 For the publishing localities associated with smaller volumes of publications, a fully automated procedure was chosen. The localities located outside densely populated areas have been grouped together to form agglomerations when their distance to each other was less than 40 km. This criterion was applied in descending order: the centres from which the distance criterion was applied were selected from the most publishing to the least publishing. To apply this criterion, our team ran spatial queries using the PostGIS extension of the PostgreSQL database system.

21 For each science center, this query returned the list of localities fulfilling the following conditions:
• not being already agglomerated to another center;
• being less than 40 km from this center;
• being associated with fewer publications than this center.
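The three conditions above amount to a greedy grouping in descending order of output. The sketch below reproduces that logic in plain Python rather than in PostGIS; it is an illustration under stated assumptions, not the authors' query. The haversine great-circle distance stands in for the spatial distance test, the function name `cluster_localities` is hypothetical, the coordinates are approximate, and the publication counts are invented for the example.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cluster_localities(localities, max_km=40.0):
    """Map each locality to its centre.

    localities: dict name -> (lat, lon, publications).
    Centres are visited from the most to the least publishing locality.
    """
    order = sorted(localities, key=lambda n: localities[n][2], reverse=True)
    assigned = {}
    for centre in order:
        if centre in assigned:
            continue  # already agglomerated to a bigger centre
        assigned[centre] = centre
        c_lat, c_lon, c_pubs = localities[centre]
        for other in order:
            if other in assigned:
                continue  # condition 1: not already agglomerated
            lat, lon, pubs = localities[other]
            close = haversine_km((c_lat, c_lon), (lat, lon)) < max_km  # condition 2
            if close and pubs < c_pubs:  # condition 3: fewer publications
                assigned[other] = centre
    return assigned


# Approximate coordinates; publication counts are invented for illustration.
demo = {
    "Edinburgh": (55.9533, -3.1883, 30000),
    "Dundee": (56.4620, -2.9707, 9000),
    "St Andrews": (56.3398, -2.7967, 8000),
}
groups = cluster_localities(demo)
# groups == {'Edinburgh': 'Edinburgh', 'Dundee': 'Dundee', 'St Andrews': 'Dundee'}
```

With these inputs St Andrews joins Dundee while both stay separate from Edinburgh, since each lies more than 40 km from it. In a PostGIS setting the distance test would typically be expressed as a spatial predicate in the query itself.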
22 As a result, two types of spatial objects were obtained: polygons grouping at least three localities and lines grouping only two localities. For example, this automatic procedure enabled us to group St. Andrews and Dundee, two neighbouring places that are too far away from Edinburgh to be integrated into its urban agglomeration.
23 It is important to specify that our objective was to draw agglomerations that capture and cluster "publishing" localities. The course of the boundaries of the lines or polygons thus obtained has no significance per se. Their function is solely to gather the point information of the localities pertaining to a given agglomeration (Figure 1).
Figure 1. From "publishing" localities to "scientific agglomerations"
Design: Laurent Jégou

24 The most important justification for our approach is that it helps homogenize the spatial entities compared in our research. Before clustering, we could distinguish between different configurations (Figure 2). Although these configurations could be interesting to study per se, they are too often the result of the varying levels of administrative fragmentation of cities across the world. Sometimes, the spatial distribution is very simple: a single center of publication, stable over time (no new centers of production in the agglomeration during the period under study). Examples of this configuration, our type 1, are Beijing in China and Kiev in Ukraine. The simplicity of this pattern is easily explained by the administrative structure: one single huge municipality.

25 By contrast, when the administrative structure is more fragmented, we often observe type 3, where many smaller but significant scientific localities are adjacent to the main center (typically the Paris urban region). It is especially important to take this configuration into account when the main urban center accounts for a smaller part of the total scientific output of the agglomeration (typically Washington DC and its surroundings). Consequently, taking into account only this central locality instead of the whole multicenter agglomeration can lead to serious data misrepresentation.

Figure 2. Design: Laurent Jégou
Discussion and limits

26 Our dataset was built to be used only at the global level, for urban comparisons and in the domain of scientific production. It will require specific data verification if used for other purposes, especially at larger geographical scales or for smaller areas (i.e. a single state, a single region or a single metropolitan area).

27 The delineations produced and used so far are likely to evolve with the update of publication data as well as population density data. Nevertheless, we consider the delineations of the world agglomerations from which more than 80% of the publications between 1999 and 2014 were authored to be robust enough to be shared and reused, both for spatializing a corpus of scientific publications (next section) and for exploring other types of academic activities or relating scientific publication to other indicators (students, academic staff, funding…). Besides, these perimeters could be tested for their potential adequacy to the display of other human activities (like tourism, road traffic, pollution…). They would not be adequate as they stand, but their delineations might still be adapted to the specific distribution of these activities. More generally, we believe that the overall methodology could inspire geographers aiming to study several types of world distributions.
28 Given that publication activity is increasingly distributed at the world level and that a growing number of places contribute to this activity, the global share of these top 495 agglomerations is diminishing. Nevertheless, they still account for more than 80% of world production in 2013 (Table 2). As a whole, the dataset shared in this data paper encompasses 495 agglomerations, among which 376 were carefully drawn by hand by experts (their ID name begins with "AD" as in "drawn") and the 119 others are automatic clusters of nearby publishing localities (their ID name begins with "AA" as in "automatic").

30 As an example of the use of these perimeters, we propose a spatial bibliometric analysis of the Cybergeo journal.
Application to analyse the spatial distribution of Cybergeo publications (2015-2017)

31 To study the spatial distribution of the authorship of Cybergeo papers, we need a list of institutional addresses. To obtain it, we can use the Web of Science since this database includes authors' addresses. Cybergeo entered the Web of Science in 2015, so it is possible to extract the bibliographic records of this journal from 2015 to 2017. We can download all these 183 records (Record content: "full record") by using the option "save to other file formats" and by choosing "Tab-delimited" with "UTF-8" as the text encoding option (Figure 3).
33 As a result, we obtain a table of 68 columns, including columns specifying, for each record, the authors' names, the publication title, the publication issue, the publication year, the number of references and the authors' addresses. To retrieve the spatial information of this dataset, we decide to focus on only one address per publication, namely the address of the corresponding author. Among the 183 records, 16 do not have any associated address (8 editorial materials, 1 book review and 7 articles). All the 167 remaining publications can be geocoded. To do so, we start by selecting only the ending part of the addresses (city name, province name and country name), which makes it much easier for a geocoding tool to return relevant results. To geocode our list of 88 distinct location names (here "Paris, France" is considered to be distinct from "F-75005 Paris"), we use the "batch geocode tool" of the "Map Developers" website 6 . We submit all the distinct locations and obtain a list of geographical coordinates. We load these data into the GIS software QGIS, which allows us to derive a shapefile of all the publishing localities from which Cybergeo publications have been signed. Finally, we cluster the publishing localities (notably those from "Paris, France" and from "F-75005 Paris") into agglomerations by using our dataset of 495 agglomeration shapes.
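The address preparation step described above can be sketched as follows. This is a hedged illustration: it assumes the corresponding author's address sits in a column tagged `RP` in the tab-delimited export (to be checked against the actual file header), that several addresses in one cell are separated by semicolons, and that the last two comma-separated components carry the locality and the country. The function names `address_tail` and `distinct_locations` and the sample records are hypothetical.

```python
import csv
import io


def address_tail(address, keep=2):
    """Keep the last `keep` comma-separated components of a postal address."""
    parts = [p.strip() for p in address.strip().rstrip(".").split(",") if p.strip()]
    return ", ".join(parts[-keep:])


def distinct_locations(tsv_text, field="RP"):
    """Return distinct address endings, in order of first appearance.

    `field` is assumed to be the column holding the corresponding
    author's address; records with an empty cell are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    seen = []
    for row in reader:
        raw = (row.get(field) or "").strip()
        if not raw:
            continue  # record without any address
        first = raw.split(";")[0]  # assumption: keep the first address only
        tail = address_tail(first)
        if tail not in seen:
            seen.append(tail)
    return seen


# Invented sample records that mimic the layout of a tab-delimited export.
sample = (
    "TI\tPY\tRP\n"
    "Paper A\t2015\tDoe, J (reprint author), Univ X, Dept Geog, F-75005 Paris, France.\n"
    "Paper B\t2016\tRoe, R (reprint author), Univ Y, Paris, France.\n"
    "Editorial\t2017\t\n"
    "Paper C\t2017\tPoe, P (reprint author), Univ X, F-75005 Paris, France.\n"
)
locations = distinct_locations(sample)
# locations == ['F-75005 Paris, France', 'Paris, France']
```

The resulting list is what would be submitted to a batch geocoder. As in the paper, two spellings of the same city remain distinct at this stage and are only merged later by the agglomeration perimeters.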
34 Following this step, we found that 28 publications (among which 18 publications from France) are not located in one of the 495 most publishing agglomerations of the world. Nevertheless, all the publications that have been authored in dense urban areas are clustered thanks to our agglomeration dataset. The remaining publications are authored from localities that are not in dense urban areas and can therefore be counted as separate point agglomerations. The only remaining problem concerns "Avignon": three publications have been signed from Avignon, one specifying "F-84029 Avignon 1_ France" and the two others "Avignon_ France". As a result, the geocoding tool returns two different pairs of geographical coordinates, albeit very close to each other. To cluster these two points into one agglomeration, we construct a buffer around each isolated point and keep only one of the two buffers created around "Avignon". [...] (Figure 4). This distribution can also be explained by the important involvement of several French laboratories in the European field of theoretical and quantitative geography (Cuyala, 2013).
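The assignment and buffering steps of this application can be sketched in plain Python. The sketch assumes planar (projected) coordinates, tests polygon membership by ray casting, and treats two circular buffers of equal radius as overlapping when their centres are closer than twice that radius. The function names, the identifier "AD_demo", the square polygon and all coordinates are invented for illustration; the paper's workflow relies on QGIS and the shared shapefile instead.

```python
import math


def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_agglomerations(points, agglomerations, buffer_radius):
    """Label each point with an agglomeration ID or an isolated-point name.

    points: dict name -> (x, y); agglomerations: dict id -> polygon.
    """
    labels, isolated = {}, []
    for name, pt in points.items():
        hit = next((aid for aid, poly in agglomerations.items()
                    if point_in_polygon(pt, poly)), None)
        if hit is None:
            isolated.append(name)
        else:
            labels[name] = hit
    # Merge isolated points whose buffers overlap, keeping the first one met.
    representatives = []
    for name in isolated:
        match = next((rep for rep in representatives
                      if math.dist(points[name], points[rep]) < 2 * buffer_radius),
                     None)
        if match is None:
            representatives.append(name)
            labels[name] = name
        else:
            labels[name] = match
    return labels


# Invented planar coordinates (arbitrary units) and a hypothetical perimeter.
perimeters = {"AD_demo": [(0, 0), (10, 0), (10, 10), (0, 10)]}
localities = {
    "Paris, France": (5.0, 5.0),
    "F-75005 Paris": (5.2, 4.9),
    "Avignon_ France": (50.0, 50.0),
    "F-84029 Avignon 1_ France": (50.3, 50.1),
}
labels = assign_agglomerations(localities, perimeters, buffer_radius=1.0)
```

Here both Paris spellings receive the same perimeter ID, and the two Avignon points collapse into a single isolated agglomeration, as in the case discussed above.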
