DO OPEN GEODATA ACTUALLY HAVE THE QUALITY THEY DECLARE? THE CASE STUDY OF MILAN, ITALY

In recent years there has been a remarkable flourishing of spatial data products released with open licenses. Researchers and professionals are extensively exploiting open geodata for many applications, which, in turn, include decision-making results and other (derived) geospatial datasets among their outputs. Despite the traditional availability of metadata, a question arises about the actual quality of open geodata, as their declared quality is typically taken for granted without any systematic assessment. The present work investigates the case study of Milan Municipality (Northern Italy). A wide set of open geodata is available for this area, released by national, regional and local authoritative entities. A comprehensive cataloguing operation is first performed, which finds 1061 geospatial open datasets from Italian providers that differ widely in terms of license, format, scale, content, and release date. Among the many quality parameters for geospatial data, the work focuses on positional accuracy. An example of positional accuracy assessment is described for an openly-licensed orthophoto through comparison with the official, up-to-date, and large-scale vector cartography of Milan. The comparison is run according to the guidelines provided by ISO and shows that the positional accuracy declared by the orthophoto provider does not correspond to reality. Similar results are found from analyses on other datasets (not presented here). Implications are twofold: raising awareness of the risks of using open geodata by taking their quality for granted; and highlighting the need for open geodata providers to introduce or refine mechanisms for data quality control.


INTRODUCTION
The term "open data" has gained increasing popularity in recent years. The Open Definition defines open data as data that "can be freely used, modified, and shared by anyone for any purpose" (http://opendefinition.org). Similarly to the concept of open source (which is applied to software), the licenses for open data must ensure permission for use, modification, separation, redistribution, compilation, non-discrimination, propagation, application to any purpose, and no charge; some conditions may be required such as attribution, integrity, share-alike, notice, source, technical restriction prohibition, and non-aggression (http://opendefinition.org/od/2.1/en).
The doctrine of openness has had a strong impact on multiple contexts and disciplines. As an example, the need for clearer and more transparent information has pushed more and more public administrations to adopt open government principles. The Open Data Charter issued by the former G8 governments in 2013 is a plan for transparency and development which relies on the open release of high-value governance datasets on national geoportals (G8 leaders, 2013). Since 2013 the increase of open government datasets has been monitored annually by the Global Open Data Index (http://index.okfn.org), which provides the most comprehensive snapshot of the global state of open data. In its most recent edition (2015) the index recorded 122 countries worldwide with 156 open datasets out of a total of 1586 datasets released (9%). The country with the highest percentage (78%) of open data released is Taiwan, while Italy, which is of interest in this work because of the case study presented later, ranks 17th with 55% of its datasets being open.
At the European level a reference source of information is the European Data Portal (http://www.europeandataportal.eu), which harvests open government metadata from public data portals of European countries with the overall purpose of improving open data accessibility and increasing their value. In turn this should bring economic benefits and improve transparency. According to the estimations provided, the direct market size of open data for the 28+ EU Member States in 2016 is 55.3 bn EUR, with an expected increase of 36.9% to a value of 75.7 bn EUR in 2020 (http://www.europeandataportal.eu/en/content/usingdata/benefits-of-open-data). Additional benefits of the re-use of open data in Europe were quantified by Carrara et al. (2015) using indicators such as number of jobs created, cost savings, and efficiency gains. Therefore, the cumulative total market size of open data between 2016 and 2020 is expected to be around 1200 bn EUR. Estimates are based on the so-called open data maturity, which is attributed to each EU Member State by looking at the progress made so far in terms of open data. Countries are ranked in three groups: trend setters, followers and beginners. Italy belongs to the group of trend setters, which among others includes Spain, the first EU Member State with a national open data portal. It is expected that by 2020 almost all EU 28+ Member States will have a fully operating portal and thus will become trend setters.

Open geodata and their quality
A special subset of open governance data consists of geospatial open data, which from here on are referred to as open geodata. Open geodata are usually released by governmental bodies, e.g. the National Mapping Agencies (NMAs), as well as other types of institutions and in some cases also private companies. In this context it is also worth mentioning the OpenStreetMap (OSM) project, the largest and most complete open geospatial database of the world, which is created and updated daily by volunteers (http://wiki.openstreetmap.org/wiki/About_OpenStreetMap). Out of a total of 13 domains of open governance data, Carrara et al. (2015) recognized open geodata as the domain with the highest commercial value, as maps can be used in all the other domains to create visualisations. The geospatial sector was thus rated as the one where the impact of open data can be maximum. The Italian picture about the availability of open geodata was traced in a recent survey by Andreozzi et al. (2014). The percentage of Italian entities (regions, provinces and municipalities) releasing open geodata is still low. Sometimes geodata are published without any license; other times they are not openly licensed although available for download in interoperable formats.
The present work focuses on open geodata by investigating one of the most crucial aspects for their re-use and exploitation, i.e. quality. As a matter of fact the outputs of studies and analyses which make use of open geodata, be they decision-making actions, scientific results or other derived (geospatial) datasets, strongly depend on the quality of the geodata used. Quality is specified in the associated metadata and is typically taken for granted without further verification. Thus, the purpose of this study is to raise awareness of the importance of geodata quality. This topic is addressed by considering the case study of Milan Municipality (Northern Italy), where the quality of a number of open geodata is checked against the quality declared in their metadata. The checking is performed using reference guidelines provided by the International Organization for Standardization (ISO), in particular the ISO standard 19157:2013 "Geographic Information - Data Quality" (ISO, 2013a). This provides rules to assess a number of data quality parameters for geospatial data such as positional accuracy, completeness, logical consistency, thematic accuracy and temporal quality. This standard is also the reference for ensuring data quality in INSPIRE (Tóth et al., 2013).
The remainder of the paper is structured as follows. First, the results of a comprehensive cataloguing operation on the open geodata available for Milan Municipality are presented. Next, an example of quality evaluation is shown for a specific dataset (an orthophoto) and a specific quality parameter (positional accuracy). Results and their managerial implications about the exploitation and re-use of open geodata are finally discussed.

OPEN GEODATA IN MILAN MUNICIPALITY
As mentioned above this study is focused on the open geodata available for Milan Municipality, which is located in Lombardy Region (Northern Italy) and has an area of about 180 km². For this region an extensive collection of open geodata exists which are published by local, regional, national, European and extra-European institutions. Due to the vastness of these datasets and the extreme difficulty of finding and cataloguing them all, the following analysis is only based on open geodata released by Italian institutions. With this premise, a cataloguing operation executed in April 2016 records a total of 1061 openly available geospatial datasets. We consider a geospatial dataset as the minimum unit of geospatial content which is available either for download or as a service.
• Topographic data: roads, railways, buildings, Digital Terrain Models (DTMs), hydrographic data, etc.
• Environmental data: land use maps, geological maps, forest maps, maps of landslides and earthquake susceptibility, etc.
• Governance data: data related to population census, culture, tourism, education, services for citizens, etc.
• Airborne observations: LiDAR data, orthophotos, aerial and satellite imagery, products derived from SAR data, etc.
As shown in Figure 2a, the largest share of datasets (35.3%) falls in the environmental category, followed by the topographic category (28.9%) and the governance category (22.2%). Airborne and sensor observations correspond to smaller percentages of the available datasets. Depending on their format, open geodata for Milan Municipality can be classified as:
• Vector datasets: shapefiles, text file formats specifying coordinates such as CSV and JSON, etc.
About 86% of the available geodata are in vector formats, 13.7% are available as GeoWeb services, and less than 1% are in raster formats (see Figure 2b). Based on their provider, open geodata for Milan Municipality are available in a variety of scales classified into national, regional, and local. As shown in Figure 2c, the great majority of open geodata have a regional scale. The licenses under which the datasets are released also include CC-BY-NC-ND 2.0 IT (Creative Commons, 2016f) and IODL v2.0 (Formez PA, 2012).
Table 1 details the differences between these licenses in terms of permissions on re-use and exploitation. Figure 2d highlights that almost half of the datasets are available under the IODL v2.0 license. Among the remaining datasets a high percentage is released under the CC-BY-NC-SA 3.0 IT license. The third most used license is CC-BY-SA 3.0 IT. Some datasets (less than 1% of the total) have no license at all, while all the datasets from ARPA (9.7% of the total) are available under a custom open license specified by the provider. Finally, Figure 4 gives a better idea of the categories of datasets published by each Italian provider. The main source of geodata, i.e. the Lombardy Region Geoportal, is focused on topographic, environmental and governance information. Governance data also receive a significant contribution from the Lombardy Open Data portal. Open geodata from the National Geoportal of the Italian Environmental Ministry are mainly environmental data and airborne observations, while almost all the available sensor observations are provided by the Lombardy section of ARPA (see Figure 4).

METHODS FOR QUALITY ASSESSMENT
As mentioned above, the procedure for assessing the quality of the open geodata available for Milan Municipality is shown for one dataset and one quality parameter. In more detail, we focus on the evaluation of the positional accuracy of the orthophoto of year 2012 published on the National Geoportal of the Italian Environmental Ministry. This is the most up-to-date orthophoto available at the Italian national scale. It was produced by the Italian Agency for Agricultural Supplies (AGEA, Agenzia per le Erogazioni in Agricoltura, http://www.agea.gov.it), has a resolution of 50 cm and is available under a CC-BY-SA license as a WMS (server URL: http://wms.pcn.minambiente.it/ogc?map=/ms_ogc/WMS_v1.3/raster/ortofoto_colore_12.map). The Italian regulation governing the evaluation and acceptance of the produced orthophoto requires a planimetric accuracy of 4 m or better (Ministry of Public Administration and Innovation, 2011). In other words, the distance between the real position of a point and its position in the orthophoto should not exceed 4 m. In addition, AGEA declared that supplementary measures have been adopted to ensure an improvement of the planimetric accuracy to 3 m (AGEA, 2011). The assessment of the positional accuracy of the orthophoto in Milan is performed using as the reference dataset (ground truth) the building roof layer of the official vector cartography of Milan Municipality. This is the most up-to-date (2012), large-scale (1:1000) and accurate (20 cm) dataset available for the area of interest. The building roof layer is chosen because it provides plenty of points (the corners of building roofs) that are easily identifiable on the orthophoto. The quality evaluation is executed according to the reference guidelines provided by ISO (ISO, 2013a).

Sampling
ISO (2013a) recommends choosing the size of the sample used for quality assessment according to the size of the population, based on a hypergeometric distribution (ISO, 2013b) with a significance level equal to 95%. Once the sample size has been decided, the sample is extracted according to an area-guided approach, which, in contrast to a feature-guided approach, is based on spatial considerations (i.e. specific areas are chosen) instead of the non-spatial attributes of the features. In particular, a hexagonal grid is defined on the area of Milan Municipality.
Compared to a traditional rectangular or square grid, this type of grid provides the advantage of closely representing a circle while providing the same complete coverage of the study area (Hecht et al., 2013). A stratified random sampling is then used as the sampling procedure for each grid cell. Notably it provides greater precision than a non-stratified random sampling in the estimation of both mean and variance (ISO, 2013a).
Finally, for each point randomly sampled on each grid cell, the closest reference building is selected. On this building three roof corners are considered: the nearest to the point, the farthest from the point, and a corner approximately halfway between the two. The coordinates of these three points are then also extracted from the orthophoto through manual digitization. For some buildings of the reference dataset, the corresponding building on the orthophoto cannot be found, e.g. because it is covered by vegetation. In this regard ISO (2013a) indicates the maximum number of non-found objects such that the sample is still valid, according to the testing significance level known as Acceptance Quality Limit (AQL). Therefore, if, for a certain AQL, the number of buildings which cannot be found in the orthophoto exceeds the threshold provided by ISO (2013a), the sample is rejected and a new one must be taken.
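The sampling steps described in this subsection can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the procedure actually used in the study: all function names are hypothetical, hexagons are assumed to be flat-topped, random points are drawn by rejection sampling, a building is represented simply as a list of roof-corner coordinates, and the corner "approximately halfway" is taken to be the median corner when corners are ranked by distance from the sampled point.

```python
import math
import random


def hex_grid_centres(xmin, ymin, xmax, ymax, side):
    """Centres of flat-topped hexagons covering a bounding box (assumed layout)."""
    dx = 1.5 * side             # horizontal spacing between columns
    dy = math.sqrt(3) * side    # vertical spacing between rows
    centres, col, x = [], 0, xmin
    while x <= xmax + side:
        # Odd columns are shifted by half a row to interlock the cells.
        y = ymin + (dy / 2 if col % 2 else 0.0)
        while y <= ymax + dy / 2:
            centres.append((x, y))
            y += dy
        x += dx
        col += 1
    return centres


def point_in_hexagon(px, py, cx, cy, side):
    """True if (px, py) lies inside the flat-topped hexagon centred on (cx, cy)."""
    qx, qy = abs(px - cx), abs(py - cy)
    half_height = math.sqrt(3) / 2 * side
    if qx > side or qy > half_height:
        return False
    return math.sqrt(3) * qx + qy <= math.sqrt(3) * side


def random_point_in_hexagon(cx, cy, side, rng):
    """Uniform random point in a hexagon, by rejection from its bounding box."""
    half_height = math.sqrt(3) / 2 * side
    while True:
        px = cx + rng.uniform(-side, side)
        py = cy + rng.uniform(-half_height, half_height)
        if point_in_hexagon(px, py, cx, cy, side):
            return px, py


def nearest_building(point, buildings):
    """Building (list of corners) whose closest corner is nearest to the point."""
    return min(buildings, key=lambda b: min(math.dist(point, c) for c in b))


def pick_three_corners(point, corners):
    """Nearest corner, median-ranked corner (assumed 'halfway'), farthest corner."""
    ranked = sorted(corners, key=lambda c: math.dist(point, c))
    return ranked[0], ranked[len(ranked) // 2], ranked[-1]
```

With a grid built this way, drawing a fixed number of points per cell and passing each one to `nearest_building` and `pick_three_corners` would give the reference corners to be digitized on the orthophoto.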

Evaluation of positional accuracy
The coordinates of the selected building roof corners, extracted from the reference building dataset and the orthophoto, allow the computation of a number of measures of positional accuracy. First, for each building roof corner the planimetric error e is computed as the distance between its position on the reference dataset and its position on the orthophoto. The mean μ, the median Me, the standard deviation σ, the minimum min and the maximum max of the planimetric errors are then computed together with the number of points n for which the error e exceeds the declared accuracy value of 3 m. These indexes can also be computed on the eX and eY components of the error in the X and Y directions. Finally, assuming that errors are normally distributed, a confidence area is computed that represents the circular area around the sampled points where the reference points are located with a fixed probability. ISO (2013a) suggests the following basic measures: CE39.4, CE50, CE90, CE95 and CE99.8. They correspond to the radius of the circles where the reference point lies with a probability of 39.4%, 50%, 90%, 95%, and 99.8%, respectively.
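The measures listed above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, and the confidence radii are computed under the assumption of a circular normal error distribution, for which the radius containing a fraction p of the points is sqrt(-2 ln(1 - p)) times the circular standard deviation. The circular standard deviation is in turn assumed to be derived from the root mean square of the eX and eY components; the exact formulas prescribed by ISO (2013a) should be checked against the standard.

```python
import math
import statistics

# Probability attached to each circular error measure.
CE_LEVELS = {"CE39.4": 0.394, "CE50": 0.50, "CE90": 0.90,
             "CE95": 0.95, "CE99.8": 0.998}


def planimetric_errors(reference, measured):
    """Error components (eX, eY) and planimetric error e for paired points."""
    ex = [m[0] - r[0] for r, m in zip(reference, measured)]
    ey = [m[1] - r[1] for r, m in zip(reference, measured)]
    e = [math.hypot(dx, dy) for dx, dy in zip(ex, ey)]
    return ex, ey, e


def summary(errors, declared=3.0):
    """Mean, median, standard deviation, min, max and count above the declared accuracy."""
    return {
        "mean": statistics.mean(errors),
        "median": statistics.median(errors),
        "std": statistics.stdev(errors),
        "min": min(errors),
        "max": max(errors),
        "n_exceeding": sum(1 for v in errors if v > declared),
    }


def circular_errors(ex, ey):
    """Confidence radii, assuming a circular normal distribution (assumption)."""
    rms_x2 = sum(v * v for v in ex) / len(ex)   # mean square error in X
    rms_y2 = sum(v * v for v in ey) / len(ey)   # mean square error in Y
    sigma_c = math.sqrt((rms_x2 + rms_y2) / 2)  # circular standard deviation
    return {name: math.sqrt(-2 * math.log(1 - p)) * sigma_c
            for name, p in CE_LEVELS.items()}
```

For the absolute values of eX and eY, `summary` can be called on `[abs(v) for v in ex]` and `[abs(v) for v in ey]`, which makes it possible to compare the displacement in the two directions.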

RESULTS FOR QUALITY ASSESSMENT
As the number of buildings available in the reference dataset for Milan Municipality is approximately 89000, according to ISO (2013a) the sample size should be around 500. Thus, a grid composed of 250 hexagonal cells with side equal to 570 m is created (see Figure 5). Two points are randomly sampled on each cell, and the closest reference building to each point is selected. A total of 500 buildings form the sample. In order for this sample not to be rejected, considering an AQL equal to 5% there must be at most 34 buildings which cannot be found in the orthophoto (ISO, 2013a). After the rejection of some samples, a valid sample is extracted which satisfies this threshold.
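As a quick plausibility check of these figures, the following Python sketch computes the area covered by 250 regular hexagons of side 570 m and applies the acceptance rule on non-found buildings. The function names are hypothetical; the threshold of 34 is the one reported above, not recomputed from the ISO tables.

```python
import math


def hexagon_area(side):
    """Area of a regular hexagon with the given side length."""
    return 3 * math.sqrt(3) / 2 * side ** 2


def sample_is_acceptable(not_found, max_not_found=34):
    """Acceptance rule: the sample is valid if non-found buildings stay within the limit."""
    return not_found <= max_not_found


cells, side_m, points_per_cell = 250, 570.0, 2
grid_area_km2 = cells * hexagon_area(side_m) / 1e6
sample_size = cells * points_per_cell

print(f"grid area: {grid_area_km2:.1f} km^2, sample size: {sample_size}")
```

The resulting grid area is roughly 211 km², somewhat larger than the municipal area of about 180 km²; a plausible explanation is that the cells along the municipal boundary extend partly outside it.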
Figure 5. Hexagonal grid and reference building roofs for Milan Municipality

When available, three building roof corners are extracted on the orthophoto for each randomly sampled point. Sometimes only one or two corners can be extracted. This is mainly due to two reasons: the non-visibility of the corners on the orthophoto (e.g. due to vegetation or shadow) and the fact that some buildings in the reference dataset do not exist in the orthophoto. In turn this can be due to changes that occurred over time (although both the orthophoto and the building layer were released in 2012) and to real mistakes present in the official dataset. The number of homologous building corners extracted is about 1450.
The planimetric errors on each homologous point are computed, and the resulting indexes of positional accuracy (described in Subsection 3.2) are summarized in Table 2 and Table 3. The values of Table 2 and Table 3 clearly show that the actual positional accuracy of the orthophoto is much worse than the one declared by its provider (AGEA). The mean, median and standard deviation of the planimetric error e are all higher than the declared accuracy (3 m). This value is also exceeded for 758 points, i.e. more than half of the total number. The radii of the confidence circles are all very high as well. As an example, to be reasonably sure of finding a reference point around a point in the orthophoto, a circle with a radius of 14.53 m needs to be drawn. These considerations lead to the conclusion that the orthophoto has been poorly rectified in the production process. The poor quality of the product becomes more and more evident as the height of the buildings increases. Figure 6 provides a visual representation of the orthophoto distortions. Readers should consider that in a correctly referenced orthophoto the building facades should not be visible.
From Table 2 it is also clear that the orthophoto displacement is much worse in the Y direction than in the X direction. The values of μ, Me, σ, max and n in the Y direction are at least twice those in the X direction. The difference in displacement between the two directions is also visually shown in Figure 6.

DISCUSSION AND CONCLUSIONS
The release of openly licensed governance datasets from public bodies is a crucial step towards societal and economic growth. The trend is favourable at both national and international levels, where a number of regulations and initiatives such as the G8 Open Data Charter (G8 leaders, 2013)

A possible reason why, despite these results, this product has passed the accuracy compliance tests and has been considered suitable for release is that the accuracy checks were performed using a random sampling approach (the same described in this work) but at a national level. In fact, the orthophoto is available for the whole Italian territory and the sampled points checked may not have been extracted in the area of Milan. Therefore, the results of this study are only valid for Milan Municipality, and further tests on other Italian areas are needed to evaluate whether they can be generalized. From the methodological point of view, a small limitation is that the digitization error (i.e. the error committed when manually identifying the building roof corner on the orthophoto) was neglected. However, recent tests have shown that this error is at least one order of magnitude smaller than the accuracy target (3 m) and is therefore negligible. In the same way the positional uncertainty (20 cm) of the building roof dataset used as a reference is negligible.
Although the paper has presented only one example of quality assessment (i.e. on one dataset and for one quality parameter), similar results were found for Milan Municipality on other open datasets and other quality parameters. Therefore the main lesson learned from this study is the need for all open (geo)data users not to take data quality for granted, as it can strongly affect any output derived from data re-use (be it a decision, a service or other derived datasets). On the other hand, the work has shed light on the need (for the very same reason) for open (geo)data providers to introduce or refine the mechanisms for data quality control.

Other
Web portals offering open geodata for Milan Municipality exist, but the data provided are retrieved from the same portals listed above. Examples are the open data portal for Italian Public Administrations (http://www.dati.gov.it) and the Italian open data portal (http://www.datiopen.it). Figure 1 shows the distribution of open geodata for Milan Municipality according to the Italian institutions providing them. More than half of the available datasets (676 in total) are published on the Lombardy Region Geoportal. A further 134 datasets (12.6% of the total) are provided by the National Geoportal of the Italian Environmental Ministry. The Lombardy Open Data portal and ARPA release around 10% of the total datasets each, while only 3.1% are published by Milan Municipality.

Figure 1. Distribution of available open geodata for Milan Municipality according to their providers

A second classification of the available open geodata looks at their contents, i.e. the kind of information they represent. Data contents are grouped in the following five categories:

scale. These mainly correspond to the datasets provided by the Lombardy Region Geoportal and Lombardy Open Data portal. Another interesting classification of open geodata looks at their license. The available datasets for Milan Municipality have the following licenses: CC0 (Creative Commons, 2016a), CC Public Domain (Creative Commons, 2016b), CC-BY (Creative Commons, 2016c), CC-BY-SA 3.0 IT (Creative Commons, 2016d), CC-BY-NC-SA 3.0 IT (Creative Commons, 2016e),

Figure 2. Classification of the open geodata available for Milan Municipality according to: category (a), format (b), scale (c) and license (d)

A final classification of the open geodata available for Milan Municipality is made according to the year of publication. As shown in Figure 3, the oldest datasets were published in 1987. These few datasets were all published by the Lombardy Region Geoportal and were never updated since then. A significant increase can be outlined in the number of open geodata released over the last five years. This is mainly a consequence of both the participation of Italy in the G8 Open Data Charter (G8 leaders, 2013) and the implementation of an Italian law which promoted the development of services for the digital economy and digital culture (President of Italian Republic, 2012). The number of datasets released in 2016 is expected to further

Figure 3. Classification of the open geodata available for Milan Municipality according to the year of publication.

Figure 4. Proportion of open geodata available for Milan Municipality according to their category and provider.
of checking the positional accuracy of the orthophoto published on the National Geoportal of the Italian Environmental Ministry against the positional accuracy declared. The assessment was performed according to the guidelines provided by ISO (2013a) by using the official building roof layer of Milan Municipality as the reference dataset for comparison. Results have shown that the real accuracy of the orthophoto is much worse than the one declared (3 m) due to a very poor rectification operation.

Table 2. Statistics on the planimetric error e and its components eX and eY in the X and Y directions