Open government data portals in the European Union: A dataset from 2015 to 2017

Open government data (OGD) portals are official websites where governments can publish OGD in a controlled way. OGD portals foster discoverability, accountability, and reusability for stakeholders. This data article presents the data collected while monitoring the OGD portals of the 28 countries of the European Union. To create this dataset, several parameters and indicators were located in the official national open data portals and recorded over a period of three years. Data were obtained manually from existing public data sources and official OGD portals freely available on the Web. Clustering with the Density-based spatial clustering of applications with noise (DBSCAN) algorithm was applied to elaborate a dataset showcasing countries that are similar with respect to different parameters and indicators. The clusters were validated using the Davies–Bouldin index. The data presented in this article are related to the research article entitled "Open government data portals in the European Union: Considerations, development and expectations" [1].


Data
According to the Open Knowledge Foundation [2], open data refers to data that may be "… freely accessed, used, modified, and shared by anyone for any purpose". According to the Organisation for Economic Co-operation and Development, open government data (OGD) is "a philosophy - and increasingly a set of policies - that promotes transparency, accountability and value creation by making government data available to all" [3]. The geopolitical context is a crucial factor in the development of OGD, since the effectiveness of open government policies is influenced by cultural, geographical, or regulatory factors tied to each country [4,5]. Open data portals are "web-based interfaces designed to make it easier to find re-useable information" and are "an important element of most open data initiatives" [6].
Specifications Table

Subject: Computer Science (General)
Specific subject area: Open Data, Public administration management
Type of data: Table (Excel and CSV files)

The dataset presented in this paper was created with the objective of supporting the analysis of the development of OGD portals in the 28 countries of the European Union [1]. The dataset combines socioeconomic statistics about the countries with data about their OGD portals. OGD portals "suffer from the large number of diverse data structures that make the comparison and aggregate analysis of government data practically impossible" [7]. In addition, the lack of a single point of access to the OGD portals makes it difficult to locate and access the open data they provide. Therefore, in order to foster the comparability of data published across OGD portals, it is necessary to collect and store the data in a common format.
The dataset has been published in Mendeley Data. It comprises 16 files, as described in Table 1.
The file "Data aggregation.xlsx" contains the primary dataset, composed from 10 public socioeconomic data sources (see Table 2) and 4 indicators gathered from the OGD portals of the 28 countries of the European Union over a period of three years, at three points in time: July 2015, October 2016, and December 2017. The data are presented in the following spreadsheets:
- Variable description: definition of the variables, including the acronym of each variable, its definition, and the source of the data (see Table 2).
- 2015 July: data acquired and collected in July 2015 (see an example of this spreadsheet in Table 5).
- 2016 October: data acquired and collected in October 2016 (see an example of this spreadsheet in Table 5).
- 2017 December: data acquired and collected in December 2017 (see an example of this spreadsheet in Table 5).

The files "2015 July outX-score.csv", "2016 October outX-score.csv", and "2017 December outX-score.csv" log the clustering of the data in "Data aggregation.xlsx", where X ranges from 2 to 6, for all the combinations of X variables from the whole set of variables (see Table 6). These files are in the comma-separated values (CSV) format.
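The exhaustive enumeration of variable combinations behind the "outX-score" files can be sketched with Python's standard library. The variable acronyms below are hypothetical placeholders, not the actual names from the "Variable description" spreadsheet:

```python
from itertools import combinations

# Hypothetical acronyms for illustration only; the real dataset defines
# its own variables in the "Variable description" spreadsheet.
variables = ["AGE", "GDP", "POP", "HDI", "EGOV", "DATASETS"]

# For each size X from 2 to 6, enumerate every combination of X variables,
# mirroring the "outX-score" file naming scheme described above.
for x in range(2, 7):
    combos = list(combinations(variables, x))
    print(f"out{x}-score: {len(combos)} combinations of {x} variables")
```

With 6 illustrative variables this yields 15, 20, 15, 6, and 1 combinations for X = 2 through 6; the actual counts depend on the dataset's full set of variables.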

Experimental design, materials, and methods
Extensive online searches and the exploration of the websites of key organizations were used to identify the public data sources for the socioeconomic statistics about the countries in the file "Data aggregation.xlsx". A detailed description of the data collected from the public data sources, including the acronym, the type and scale, the description, and the source of each variable, is shown in Table 2. The exact URL of each source is included in the spreadsheet "Variable description" of the file "Data aggregation.xlsx".
In addition, the following four variables were compiled:
- AGE: the age of the OGD portal, measured in years since its publication.
- The number of datasets published on the OGD portal.
- The number of organizations or publishers announced on the OGD portal.
- The number of applications published on the OGD portal.
These data were collected manually by means of an online review of the official OGD portal of each country. Table 3 shows the URLs of the official OGD portals of the EU countries. The official OGD portal of Lithuania could not be located, despite thorough searches. For the sake of illustration, Table 4 shows the data collected in the case of Spain, whose official OGD portal was launched in October 2011. Table 5 shows an excerpt of the data collected from OGD portals and public data sources. The file "Data aggregation.xlsx" contains the dataset as created before pre-processing for July 2015, October 2016, and December 2017. Pre-processing includes tasks such as identifying and correcting possible errors (missing values and outliers) and normalizing all values for effective comparison. The structure of the spreadsheets available in the file "Data aggregation.xlsx" follows the schema illustrated in Table 2.
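The normalization scheme is not detailed in this excerpt. One common choice is min-max scaling to the [0, 1] range, sketched below; this is an assumption for illustration, not necessarily the scheme used by the article's pre-processing script:

```python
def min_max_normalize(values):
    """Scale a list of numbers to the [0, 1] range.

    A common normalization choice when variables have very different
    magnitudes; the article's own pre-processing may use another scheme.
    """
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Example: dataset counts of very different magnitudes become comparable.
datasets_per_portal = [120, 4500, 980, 33000]
print(min_max_normalize(datasets_per_portal))
```

After scaling, every variable contributes on the same [0, 1] footing, which matters for distance-based methods such as DBSCAN.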
As mentioned before, the files "2015 July outX-score.csv", "2016 October outX-score.csv", and "2017 December outX-score.csv" are generated by clustering the dataset in the file "Data aggregation.xlsx". The clustering applies the Density-based spatial clustering of applications with noise (DBSCAN) algorithm [8], which requires two input parameters: (a) epsilon, the maximum radius used to test the distance between data points; and (b) minpoints, the minimum number of points needed to create a cluster. The clustering is performed iteratively, with epsilon taking values from 0.5 to 3.0 in steps of 0.1 and minpoints taking integer values from 2 to 7. If the distance between two points is less than or equal to epsilon, then those two points will be in the same cluster. To validate the clusters obtained, the Davies–Bouldin index (DBI) [9] is applied, which measures the compactness and separation of the clusters. The selection process applies the following heuristics:
- Maximize the number of portals (countries) grouped in the clusters, to classify as many portals as possible.
- Maximize the number of clusters, to obtain a fine-grained classification of the portals.
- Minimize the DBI, to find natural partitions among the portals.
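The article runs DBSCAN via its own script; as an illustration of the algorithm and its two parameters, here is a minimal pure-Python sketch. The points and parameter values are invented for the example and do not come from the dataset:

```python
import math

def dbscan(points, epsilon, minpoints):
    """Minimal DBSCAN sketch: returns one cluster label per point,
    with -1 meaning noise. Illustrative only; not the article's code."""
    n = len(points)
    labels = [None] * n            # None = not yet visited
    cluster = -1

    def neighbours(i):
        # Indices within epsilon of point i (including i itself).
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= epsilon]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < minpoints:
            labels[i] = -1         # provisionally noise
            continue
        cluster += 1               # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours(j)) >= minpoints:
                queue.extend(neighbours(j))  # j is core: keep expanding
    return labels

# The article sweeps epsilon over 0.5..3.0 (step 0.1) and minpoints over 2..7;
# a single illustrative run:
pts = [(0, 0), (0.2, 0), (0.1, 0.1), (5, 5), (5.2, 5.1), (9, 9)]
print(dbscan(pts, epsilon=0.5, minpoints=2))   # → [0, 0, 0, 1, 1, -1]
```

The last point is farther than epsilon from every other point, so it is labelled noise (-1) rather than forced into a cluster.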
The structure of these files follows the schema in Table 6.

To create this dataset, a methodology consisting of 9 steps was carried out, which is visually summarized in Fig. 1:
3. Manual acquisition of data from public data sources.
4. Manual collection of data from official OGD portals.
5. Supervised preparation by an analyst of the data aggregated for the final dataset. A data cleaning process prevented the introduction of possible errors into the dataset. The main tasks performed during data cleaning were:
   - Transformation of thousands and decimal separators: some systems use the dot "." as the thousands separator and the comma "," as the decimal separator, whereas other systems do the opposite.
   - Transformation of data formats, e.g., numbers stored as text transformed into real numbers.
6. Automatic pre-processing by a PHP script. The components of the dataset have different dimensions and magnitudes; therefore, all components need to be normalized into dimensionless values for effective comparison.
7. Automatic clustering by a PHP script. An iterative process calculates all the combinations of k variables from the set of n variables, and the clustering divides the data into groups with similar values. The clustering applies the DBSCAN algorithm [8].
8. Validation of the clustering by the Davies–Bouldin index [9].
9. Creation of the clustering data files.
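The validation step compares within-cluster scatter to between-cluster separation: the Davies–Bouldin index is lower for compact, well-separated clusters. A minimal pure-Python sketch, assuming Euclidean distances and centroid-based scatter; the article's own computation may differ in detail:

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index of a clustering (lower is better).
    Noise points (label -1) are ignored. Illustrative sketch only."""
    clusters = sorted({l for l in labels if l != -1})
    centroids, scatters = {}, {}
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = tuple(sum(dim) / len(members) for dim in zip(*members))
        centroids[c] = centroid
        # Mean distance of the cluster's points to its centroid.
        scatters[c] = sum(math.dist(p, centroid) for p in members) / len(members)
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)

# Two tight, well-separated clusters score far lower than a mixed-up labelling.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(davies_bouldin(pts, [0, 0, 1, 1]))   # small: good partition
print(davies_bouldin(pts, [0, 1, 0, 1]))   # large: clusters overlap
```

Under the article's heuristics, among candidate (epsilon, minpoints) runs, the partitions with the lowest DBI are preferred.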