SPNeoDeath: A demographic and epidemiological dataset having infant, mother, prenatal care and childbirth data related to births and neonatal deaths in São Paulo city Brazil – 2012–2018

The SPNeoDeath dataset includes births and deaths of infants during the neonatal period in São Paulo city between 2012 and 2018, containing more than 1.4 million records. The dataset was created from the Brazilian information systems SINASC and SIM, for births and deaths respectively. SINASC comprises demographic and epidemiological data on the infant, mother, prenatal care and childbirth. SIM collects information about mortality and is used as the basis for the calculation of vital statistics, such as the neonatal mortality rate. SIM was used only to label SINASC records in which death occurred within the first 28 days of life. SPNeoDeath has 23 variables covering socioeconomic maternal condition features, maternal obstetrics features, newborn related features and previous care related features, plus a label feature describing whether or not the subject survived the first 28 days of life. To build the dataset, DBF files were downloaded from the DATASUS FTP repository and converted to CSV format using the R programming language, and the CSV files were then processed using the Python programming language. Features with incorrect values and unknown information were removed.

© 2020 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license.
(http://creativecommons.org/licenses/by/4.0/)

Specifications Table

Subject
Public Health and Health Policy

Specific subject area
Demographic and epidemiological data for the infant, mother, prenatal care and childbirth of Births and Neonatal Deaths

Type of data
text/csv

How data were acquired
Official records of the national healthcare system.

Data format
Mixed (raw, analysed and filtered).

Parameters for data collection
Demographic and epidemiological data from the infant, mother, prenatal care and childbirth of Births and Neonatal Deaths from São Paulo city Brazil between the years of 2012 and 2018

Description of data collection
The data were extracted from SINASC and SIM. SINASC collects information on births occurring throughout the national territory, in both the public and private health sectors as well as in households, and collection is carried out at the municipal level. Its main instrument is the declaration of live births (DN): right after the birth, in the place where the birth occurred, a properly trained health professional must fill in all the fields of the DN. SIM collects death information and uses the death declaration form (DO).

Value of the Data
• SPNeoDeath is a dataset that provides more than 1.4 million samples representing births and deaths in the city of São Paulo, Brazil, between 2012 and 2018.
• The dataset is intended to support research focused on understanding neonatal mortality (NM) and its associated factors, providing a set of 24 features associated with NM, divided into 3 main groups: (1) socioeconomic maternal conditions features, (2) maternal obstetrics features and,

Data description
The SPNeoDeath dataset is based on secondary data on births and deaths of infants (neonatal period only, i.e., when the child died within the first 28 days of life) from the city of São Paulo, Brazil, between 2012 and 2018, comprising 1,427,906 rows and 24 columns. The data came from the Mortality Information System (SIM, Sistema de Informação de Mortalidade) and the National Information System on Live Births (SINASC, Sistema de Informação de Nascidos Vivos), both from DATASUS (Health Informatics Department of the Brazilian Ministry of Health). SINASC is fed using the Live Birth Statement (DNV, Declaração de Nascido Vivo) [1]. It comprises demographic and epidemiological data on the infant, mother, prenatal care and childbirth. Similarly, the Death Certificate (DO, Declaração de Óbito) is the document used to collect information about mortality, and it is used as the basis for the calculation of vital statistics, such as the Brazilian neonatal mortality rate. SIM has the main goal of supporting the collection, storage and management of death records in Brazil [2], and was used to label the SINASC records in which death occurred within the first 28 days of life, using the DNV as the association key, since it is a common field in both systems.
Each sample in our final dataset comprises a set of features from SINASC and a label feature describing whether or not the subject survived the first 28 days of life. The other 23 features can be categorized into four groups: (a) socioeconomic maternal conditions features: mother's age, years of schooling, marital status and race/skin color; (b) maternal obstetrics features: number of live births, number of previous fetal losses, number of previous pregnancies, number of normal and caesarean labors and type of pregnancy; (c) newborn related features: birth weight, number of pregnancy weeks, Apgar score at 1st minute, Apgar score at 5th minute, congenital anomaly and type of presentation of the newborn; and (d) previous care related features: number of prenatal consultations, labor type, childbirth care and Robson 10-group classification. A detailed description of the features is shown in Table 1.
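As an illustration of how the label feature relates to the vital statistics mentioned above, the following minimal sketch computes a neonatal mortality rate, conventionally expressed as neonatal deaths per 1,000 live births. The column name `neonatal_death` and its 0/1 encoding are assumptions made for this example, not the names used in the published CSV, and the data are synthetic.

```python
import pandas as pd

# Synthetic labels: 997 survivors (0) and 3 neonatal deaths (1).
# The column name and 0/1 encoding are hypothetical.
sample = pd.DataFrame({"neonatal_death": [0] * 997 + [1] * 3})


def neonatal_mortality_rate(labels: pd.Series) -> float:
    """Return neonatal deaths per 1,000 live births.

    Every row is assumed to represent one live birth, and a value of 1
    is assumed to mark a death within the neonatal period.
    """
    return 1000 * labels.sum() / len(labels)


rate = neonatal_mortality_rate(sample["neonatal_death"])
print(rate)
```

Because every row of the dataset originates from a live birth record, the number of rows serves as the denominator in this sketch.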
A brief overview of the distribution of feature values is presented here using graphs. Since relevant differences can be observed between the two classes, survivors and neonatal deaths, the graphs show values separately for each class. For the quantitative continuous features maternal age, newborn weight and gestational weeks, histograms are presented in Fig. 1, boxplot quartiles in Fig. 2 and data distribution in

Experimental design, materials and methods
The raw data from SINASC and SIM can be obtained directly from the DATASUS website. Originally, the files are in DBF format, a standard database file format used by the dBASE database management system. To read the DBF files and convert them to CSV format, a library from the R programming language was used. The CSV files were then loaded into a Python development environment, and all data manipulation was performed using the Pandas library. The SPNeoDeath dataset is available in CSV format.
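The loading step can be sketched as follows, assuming one converted CSV file per year. The in-memory CSV text, the helper `load_yearly_csvs` and the column names other than NUMERODN are hypothetical stand-ins for the real exports. Identifier and date columns are read as strings so that leading zeros are preserved.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two yearly CSV exports; real files would be
# read from disk after the DBF-to-CSV conversion.
csv_2012 = io.StringIO(
    "NUMERODN,DTNASC,PESO\n0001,05012012,3200\n0002,17032012,2850\n"
)
csv_2013 = io.StringIO("NUMERODN,DTNASC,PESO\n0003,09022013,3400\n")


def load_yearly_csvs(sources):
    """Read each CSV, keeping identifiers and dates as strings, and stack them."""
    frames = [
        pd.read_csv(src, dtype={"NUMERODN": str, "DTNASC": str})
        for src in sources
    ]
    return pd.concat(frames, ignore_index=True)


births = load_yearly_csvs([csv_2012, csv_2013])
print(births.shape)
```

Reading the key column as text is a design choice of this sketch: a numeric parse would silently drop leading zeros and could break a later join on that key.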
The SINASC and SIM datasets are not initially linked, so to associate birth and death records, a simple join between the datasets was performed using a variable common to both systems, the Number of Live Birth Statement (NUMERODN). Even though filling out the DNV and the DO is mandatory, there is a significant deficiency in data quality arising from situations such as data loss when records are sent from hospitals to the city health offices, fields filled with incorrect values and information unknown to the person answering.
After the join, a new field was added to the resulting dataset to label each sample as a neonatal death (death occurring within the first 28 days of life) or not. This was achieved by calculating the difference between the birth date (from SINASC) and the death date (from SIM).
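The linkage and labelling steps described above can be sketched with Pandas as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the column names DTNASC and DTOBITO, the `ddmmyyyy` date format, the label name `neonatal_death` and the strict `< 28` day threshold are all assumptions, and the records are synthetic. Only NUMERODN is named in the text.

```python
import pandas as pd

# Synthetic birth records (SINASC-like) and death records (SIM-like).
sinasc = pd.DataFrame({
    "NUMERODN": ["A1", "A2", "A3", "A4"],
    "DTNASC": ["01012015", "10022015", "15032015", "20042015"],
})
sim = pd.DataFrame({
    "NUMERODN": ["A2", "A4"],
    "DTOBITO": ["12022015", "25062015"],
})


def label_neonatal_deaths(births, deaths, max_days=28):
    """Left-join deaths onto births and flag deaths before `max_days` of life."""
    # Keep every birth; births with no matching death get a missing death date.
    # validate raises an error if a key appears more than once in `deaths`.
    merged = births.merge(
        deaths, on="NUMERODN", how="left", validate="many_to_one"
    )
    birth = pd.to_datetime(merged["DTNASC"], format="%d%m%Y")
    death = pd.to_datetime(merged["DTOBITO"], format="%d%m%Y")
    age_days = (death - birth).dt.days  # NaN where no death was recorded
    is_neonatal = age_days.notna() & (age_days >= 0) & (age_days < max_days)
    merged["neonatal_death"] = is_neonatal.astype(int)
    return merged


labeled = label_neonatal_deaths(sinasc, sim)
print(labeled[["NUMERODN", "neonatal_death"]])
```

In this synthetic example, record A2 dies 2 days after birth and is labelled 1, while A4 dies 66 days after birth and, like the records with no death entry, is labelled 0.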
SIM data are used only for labelling: for each SINASC record, SIM data were used to assign the sample to the dead or alive class, making it possible to construct a large annotated dataset. After the linkage between SIM and SINASC, the key used in the join operation was removed from the resulting dataset, as well as many other fields that could be used to reidentify individuals. Since SIM data serve only to label the dataset, all SIM fields were also removed from the final dataset after this process.
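A minimal sketch of this clean-up step is shown below. Which fields count as identifying is an assumption made for illustration; the lists `SIM_FIELDS` and `IDENTIFYING_FIELDS`, the helper `strip_linkage_fields` and all column names except NUMERODN are hypothetical.

```python
import pandas as pd

# Synthetic linked records after labelling.
linked = pd.DataFrame({
    "NUMERODN": ["A1", "A2"],
    "DTNASC": ["01012015", "10022015"],
    "DTOBITO": [None, "12022015"],
    "PESO": [3200, 1450],
    "neonatal_death": [0, 1],
})

# Hypothetical field lists: columns originating from SIM, and columns
# that could help reidentify an individual (including the join key).
SIM_FIELDS = ["DTOBITO"]
IDENTIFYING_FIELDS = ["NUMERODN", "DTNASC"]


def strip_linkage_fields(df):
    """Drop SIM-derived and potentially identifying columns, keeping the label."""
    return df.drop(columns=SIM_FIELDS + IDENTIFYING_FIELDS)


final = strip_linkage_fields(linked)
print(list(final.columns))
```

The label column survives because it is derived from the linkage and is not itself a SIM field.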
As mentioned, missing or inconsistent data are common in Brazilian public health data, mostly because handwritten forms are filled in incorrectly. Rows having fields with inconsistent values were removed, and missing values were handled with a general approach for demographic studies, based on the approaches of similar studies [3][4][5]. All the features had less than 12% of missing values and basically, two different techniques

Ethics statement
This paper uses publicly available, de-identified data (SIM and SINASC) and was deemed exempt from approval by a human research ethics committee.