Dataset for corruption risk assessment in a public administration

This data article describes a dataset on corruption and potentially related variables, created by integrating eight different systems of the Brazilian federal government and the Federal District. We present real data on civil servants and military personnel. To comply with GDPR legislation, the attributes that could identify a person were removed, making the data anonymized.


Value of the Data
• This dataset contains data from eight different databases of the Brazilian federal government and the Federal District.
• This dataset benefits researchers working in the field of corruption risk assessment and in applied machine learning.
• The analysis of these data could help identify corruption risk factors and assist in planning oversight focused on the activities of greatest risk for Public Administration, such as cases with a high probability of occurrence and a high financial or social impact.

Data Description
The dataset provided in this paper offers valuable information on public administration and supports research on corruption. A few datasets regarding corruption are available: Al-Jundi [1] presents a survey dataset on determinants of administrative corruption, Peerthum et al. [2] present data related to corruption in Mauritius, and Oguntunde et al. [3] deal with selected crime data in Nigeria, including corruption.
Literature was consulted to determine attributes for administrative corruption. Other researchers can reuse the dataset, which can be easily downloaded from the Mendeley Data repository. The data in this article cover all civil servants of the Federal District Government (Brazilian Public Administration), include the reported cases of dismissal for corruption, and aggregate 26 attributes related to four domain areas, extracted from eight databases.
These four domains are related by the sources provided and are:
• Corruption Domain (C) aggregates data on illegal acts committed by civil servants or military personnel, or by companies that they own;
• Employment Domain (E) provides the servants' registration data from the Human Resources Management Systems, such as income and number of coordination roles;
• Political Domain (P) covers data related to political activities; and
• Business Domain (B) presents features of the companies owned by civil servants and military personnel.
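As a minimal sketch of how the domain classification could be used when exploring the data, the snippet below groups attribute names by their domain letter. It assumes that every attribute carries its domain letter as a prefix, as the variable "C.CorruptionTG" does; the other attribute names are hypothetical placeholders, not the actual column names of the dataset.

```python
from collections import defaultdict

# Domain letters as defined in the article.
DOMAINS = {"C": "Corruption", "E": "Employment", "P": "Political", "B": "Business"}


def group_by_domain(attribute_names):
    """Group attribute names by the domain letter before the first dot."""
    groups = defaultdict(list)
    for name in attribute_names:
        prefix = name.partition(".")[0]
        groups[DOMAINS.get(prefix, "Unknown")].append(name)
    return dict(groups)


# Only C.CorruptionTG appears in the article; the other names are hypothetical.
attributes = ["C.CorruptionTG", "E.Income", "P.Party", "B.CompanyCount"]
grouped = group_by_domain(attributes)
for domain, names in grouped.items():
    print(domain, names)
```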

The descriptive statistics
The dataset is composed of 27 attributes: some are integer or numeric (Table 2), others are categorical (Table 3), and a few are Boolean (Table 4).
All Boolean attributes (Table 4) belong to the Corruption Domain. Table 3 presents the categorical attributes from the Political and Business Domains, and Table 2 shows the main descriptive statistics for the integer or numeric attributes from the Employment and Business Domains.

Experimental Design, Materials and Methods
This section describes the data sources, the related literature used to compose the dataset features, the descriptive statistics, and the pre-processing (data enrichment and data cleansing).

Data sources
The dataset was composed from eight different sources of the Brazilian public administration. After consolidation, the attributes were classified into four domain areas for better understanding: Corruption (C), Employment (E), Political (P), and Business (B), which are related by sources.
The dataset was created through an ETL process that collected data from these different sources. These data represent civil servants, military personnel, and pensioners of the Federal District, a Brazilian Public Administration, totaling 303,036 records.
The Federal District is a legal entity of internal public law that is part of the political-administrative structure of Brazil. It is sui generis in nature because it is neither a state nor a municipality, but a special entity that accumulates the legislative powers reserved to the states and the municipalities, which gives it a hybrid nature of state and municipality. Fig. 1 illustrates the pipeline of the ETL (extract, transform, and load) process, in which the different data sources were integrated into a dataset aggregated by four domains and submitted to pre-processing (data enrichment and data cleansing).

Domains
These four domains (Fig. 1) are as follows. The Corruption Domain (C) aggregates data on illegal acts committed by civil servants or military personnel, or by companies that they own.
The Employment Domain (E) is composed of the integration of two payment databases: one with information on Public Security workers (policemen and firefighters) in the Integrated Human Resource Management System (SIAPE), and one on other civil servants (Education, Health, and other areas) in the Resource Management System (SIGRH).
The base integration of SIGRH and SIAPE was necessary because the same civil servant or military member of the Federal District could be included in both databases, since the Brazilian Federal Constitution allows the accumulation of certain public offices.
The Political Domain (P) has information from the TSE, the Brazilian Superior Electoral Court, and provides information about candidates such as level of education, party, and marital status.
The Business Domain (B) is composed of information from the Secretariat of Brazil's Federal Revenue (SRF/ME) about companies whose owners are civil servants and military personnel.
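The base integration of SIAPE and SIGRH described for the Employment Domain can be pictured with the sketch below, which merges two payroll extracts into one record per person. The key `person_id`, the field `income`, and the choice of summing income across sources are all assumptions made for illustration; the article does not document the actual join key or aggregation rule.

```python
def integrate_payrolls(siape_rows, sigrh_rows):
    """Merge two payroll extracts into one record per person.

    Assumes each row is a dict with a hypothetical anonymized key
    'person_id' and a numeric 'income'. Summing income across sources
    is an illustrative choice, not a rule documented by the article.
    """
    merged = {}
    for source, rows in (("SIAPE", siape_rows), ("SIGRH", sigrh_rows)):
        for row in rows:
            record = merged.setdefault(
                row["person_id"],
                {"person_id": row["person_id"], "income": 0.0, "sources": []},
            )
            record["income"] += row["income"]
            record["sources"].append(source)
    return list(merged.values())


# Hypothetical rows: person "a1" holds offices recorded in both systems.
siape = [{"person_id": "a1", "income": 5000.0}]
sigrh = [
    {"person_id": "a1", "income": 3000.0},
    {"person_id": "b2", "income": 4000.0},
]
integrated = integrate_payrolls(siape, sigrh)
```

Keeping the list of sources on each merged record preserves the information that a person accumulates offices, which would otherwise be lost after deduplication.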

Related literature
The attributes composing this dataset were chosen based on studies from the corruption literature, and all of them were identified and classified according to the previously defined domains (Table 2).
It is essential to present the concept of corruption adopted for this dataset, which is represented by the variable "C.CorruptionTG". It is described in Brazilian Law No. 8429/92, which defines corruption as an act of improbity that, under the influence of the position or not, causes illicit enrichment, causes damage to the public purse, or violates Public Administration principles [20], and is described in Table 1.
The data obtained from these sources (Fig. 1), provided by different public organizations, were aggregated in SAS Enterprise. Their attributes were then classified by the four domains defined.

Pre-processing (Data Enrichment and Data Cleansing)
Data preparation is the stage in which the data are processed and prepared in a way that reflects the understanding of the business, in this case corruption. Integrating different data sources can be a challenge because, in general, the data come from transactional systems, measurements, or real-world situations, and the resulting dataset must converge to support the understanding of the business.
Data cleaning and construction of attributes were carried out to generate treated and adequate data to enable the development of predictive models (Carvalho [7]). Data cleansing is the process of attempting to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data [21]. It aims to alleviate two critical problems of data acquisition processes: the existence of missing values and the existence of noisy values.
Missing values occur when no value is recorded for some attributes of some instances, when a dataset has no values for an attribute of interest, or when it only presents aggregated values for that attribute.
Possible solutions for missing values are to remove the observations with this characteristic, to fill in the values manually, or to fill them in automatically.
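Two of the options for missing values, removal and automatic filling, can be sketched as follows. This is only an illustration: the attribute name "E.Income" is hypothetical, missing values are assumed to be represented as `None`, and median imputation is one possible auto-fill rule rather than the one used to build the dataset.

```python
from statistics import median


def drop_missing(rows, field):
    """Remove observations whose value for `field` is missing (None)."""
    return [row for row in rows if row.get(field) is not None]


def fill_with_median(rows, field):
    """Auto-fill missing values of `field` with the median of the observed ones."""
    observed = [row[field] for row in rows if row.get(field) is not None]
    fill_value = median(observed)
    return [
        {**row, field: fill_value} if row.get(field) is None else row
        for row in rows
    ]


# "E.Income" is a hypothetical attribute name used only for illustration.
rows = [{"E.Income": 3000.0}, {"E.Income": None}, {"E.Income": 5000.0}]
dropped = drop_missing(rows, "E.Income")
filled = fill_with_median(rows, "E.Income")
```

Both functions return new lists and leave the input rows unchanged, so the two strategies can be compared on the same data.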
Noisy values are changes from the original values and therefore consist of measurement errors or of values considerably different from most other values in the dataset, known as outliers. Examples are negative values in an attribute that should only be positive, or an unexplained change in the behavior of an attribute's values. A few observations were removed by specialist decision.
Noisy values can be handled by inspection with manual correction, or by automatic identification and cleaning implemented by algorithms that smooth or cancel noise.
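For the automatic identification of noisy values, one common rule is Tukey's interquartile-range (IQR) fence, sketched below. It is offered as an example of the kind of algorithm mentioned above, not as the procedure applied to this dataset, where the removals were decided by specialists; the sample values and the multiplier `k=1.5` are assumptions.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]


# Hypothetical values; the extreme entry stands in for a noisy value.
sample = [10, 12, 11, 13, 12, 11, 10, 500]
flagged = iqr_outliers(sample)
print(flagged)
```

Flagged values would still need inspection before removal, since an extreme value can be a legitimate observation rather than an error.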
Data enrichment is the process of enhancing collected data with relevant context obtained from additional sources [22] . This dataset aggregates information from different databases that could benefit from a holistic approach. In addition, some features were elaborated in a specific way to provide information for business understanding.
Feature construction allows the elaboration of features that generate relevant information from the original data, according to the understanding of the business.
In this scenario, feature construction consisted of transforming categorical attributes into counting attributes.
This procedure was performed because an attribute that expresses a quantity has meaning in the context of business understanding, while the categorical value itself adds no information in the context of corruption. For example, a categorical attribute listing the positions that a civil servant or military member occupied in Public Administration has no meaning for this investigation. However, the number of positions he or she has occupied could indicate that the person does not have a stable condition and could represent an anomaly.
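The transformation of a categorical attribute into a counting attribute can be sketched as below. The input layout, a list of (person, position) pairs, and the choice to count distinct positions are assumptions for illustration; the identifiers and position names are hypothetical.

```python
from collections import defaultdict


def count_positions(assignments):
    """Turn (person_id, position) pairs into a count of distinct positions per person."""
    positions = defaultdict(set)
    for person_id, position in assignments:
        positions[person_id].add(position)
    return {person_id: len(held) for person_id, held in positions.items()}


# Hypothetical identifiers and position names, for illustration only.
assignments = [
    ("a1", "analyst"),
    ("a1", "coordinator"),
    ("a1", "analyst"),
    ("b2", "teacher"),
]
position_counts = count_positions(assignments)
print(position_counts)
```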
For machine learning research, it is essential to address the data imbalance problem. The feature of interest, which should be the dependent (target) variable of this investigation (C.CorruptionTG), has 428 records in the class of interest and 302,608 records in the dominant class, a proportion of 1:707; in percentage terms, the class of interest is 0.14% of the population.
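The imbalance figures can be checked with a few lines of arithmetic. The two class counts are taken from the text; the variable names are illustrative.

```python
minority = 428        # records in the class of interest (from the article)
majority = 302_608    # records in the dominant class (from the article)

total = minority + majority
ratio = majority / minority
share_percent = 100 * minority / total

print(f"total records: {total}")
print(f"imbalance ratio: 1:{ratio:.0f}")
print(f"class of interest: {share_percent:.2f}%")
```

The total of 303,036 matches the number of civil servants, military personnel, and pensioners reported under Data sources.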
C.CorruptionTG is a dichotomous variable that should be analyzed against the other available variables to identify risk factors that could be addressed to mitigate corruption in public administration.
Possible ways of dealing with this scenario are explained by Zhu et al. [23], who suggest two kinds of solutions to the problem of learning on imbalanced datasets: data-level solutions and algorithm-level solutions.
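To make the two families of solutions concrete, the sketch below shows one simple representative of each: random undersampling as a data-level solution and inverse-frequency class weights as an algorithm-level solution. These are generic, widely used techniques chosen for illustration and are not necessarily the ones proposed by Zhu et al. [23]; the 0/1 encoding of C.CorruptionTG and the toy data are assumptions.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Algorithm-level: weight each class by N / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


def random_undersample(rows, label_key, seed=0):
    """Data-level: keep as many rows of each class as the rarest class has."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    sampled = []
    for group in by_class.values():
        sampled.extend(rng.sample(group, smallest))
    return sampled


# Toy data with a skewed class distribution (2 positives out of 100 rows).
rows = [{"C.CorruptionTG": 1}] * 2 + [{"C.CorruptionTG": 0}] * 98
weights = balanced_class_weights([row["C.CorruptionTG"] for row in rows])
balanced = random_undersample(rows, "C.CorruptionTG")
```

Undersampling discards most of the dominant class, while class weights keep all records and instead make errors on the class of interest more costly during training.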
To comply with the GDPR legislation, the attributes that could identify a person were removed, making the data anonymized.
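Removing identifying attributes can be sketched as a simple projection that drops those fields from every record. The field names in `IDENTIFYING_FIELDS` are hypothetical, since the article does not list which attributes were removed.

```python
# Hypothetical identifying fields; the article does not list the removed attributes.
IDENTIFYING_FIELDS = {"name", "tax_id", "registration_number"}


def anonymize(rows, identifying_fields=IDENTIFYING_FIELDS):
    """Return copies of the rows without the identifying fields."""
    return [
        {key: value for key, value in row.items() if key not in identifying_fields}
        for row in rows
    ]


# Made-up record for illustration only.
raw = [{"name": "Jane Doe", "tax_id": "000", "E.Income": 3000.0, "C.CorruptionTG": 0}]
anonymized = anonymize(raw)
print(anonymized)
```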

Ethics Statement
The authors declare that they have observed all ethical requirements for publication in Data in Brief.