Revisiting social vulnerability analysis in Indonesia data

This paper presents a dataset on social vulnerability in Indonesia. The dataset covers several dimensions drawn from previous studies. The data were compiled mainly from the 2017 National Socioeconomic Survey (SUSENAS) conducted by BPS-Statistics Indonesia. We applied the survey weights to obtain estimates consistent with the multistage sampling design. We also obtained additional information on population size and population growth from BPS-Statistics Indonesia's 2017 population projection. Furthermore, we provide the distance matrix and the population counts as supplementary information for running Fuzzy Geographically Weighted Clustering (FGWC). The data can support further analysis of social vulnerability to inform disaster management. The data can be accessed at https://raw.githubusercontent.com/bmlmcmc/naspaclust/main/data/sovi_data.csv.


Specifications
Subject: Geography
Specific subject area: Disaster management and risk reduction, social vulnerability
Type of data: Table
How data were acquired: The data were acquired from the 2017 National Socioeconomic Survey from BPS-Statistics Indonesia and the Indonesia 2013 geospatial map. Instruments: RStudio
Data format: Raw, Analyzed, Filtered
Parameters for data collection: We used district-level data and matched the existing data with the districts available in the map.
Description of data collection: We collected the raw data of the 2017 National Socioeconomic Survey from BPS. Subsequently, we aggregated the data using the appropriate rules and applied the survey weights to reflect the sampling method. Moreover, we derived the distance matrix by processing the district-level map.
Data

Value of the Data
• The dataset provides development and disaster indicators for 511 districts in Indonesia, together with the distance matrix between districts.
• The dataset can be used to compare and evaluate the development of districts in Indonesia, with social vulnerability as one possible context for that evaluation.
• The availability of the dataset can help policymakers initiate responses to natural disasters by considering regional development and conditions.
• The dataset can support deeper study of regional development and hazard resilience in future work, specifically using a spatial approach.
• The dataset can be combined with data from other fields, such as public health and transportation, to obtain a deeper understanding of regional development in various contexts.

Data Description
Indonesia is prone to various natural disasters because it lies on the Pacific Ring of Fire, at the meeting point of the world's three main tectonic plates [1]. Therefore, all districts in Indonesia are prone to natural disasters such as earthquakes, tsunamis, and volcanic eruptions. Social vulnerability plays an important role in analyzing the impact of disasters; it refers to a community's susceptibility to damage from natural hazards, which affects its ability to recover [2]. Social vulnerability studies at the national level in Indonesia have emerged since Siagian et al. [3]. Nasution et al. [4] also analyzed social vulnerability by clustering districts in Indonesia using FGWC with the Intelligent Firefly Algorithm (IFA). The method is available in an R package called naspaclust [5]. This study disseminates the dataset used in Nasution et al.'s research, entitled "Revisiting Social Vulnerability Analysis in Indonesia: an optimized spatial clustering approach" [4]. That study analyzed 511 districts, obtained by calibrating the data against the 2013 geographic map of Indonesia. The calibration was needed because the number of districts differed between the two years (511 districts in 2013 and 514 districts in 2017). As a result, it was essential to map the 2017 districts onto the 2013 districts to obtain spatial information. Based on the expansion history, the districts that needed to be adjusted were Buton (now South Buton and Central Buton) and Muna (now Muna and West Muna). The primary data source used in the research was the 2017 National Socio-Economic Survey (SUSENAS) [6]. Meanwhile, population and growth data were obtained from Indonesia's population projection in 2017 [7]. Table 1 and Table 2 show the description of the variables in the dataset and a sample of the data from seven districts, respectively.
Beyond social vulnerability analysis, the data can be used for many purposes. For example, they can be used to analyze development conditions in Indonesia at the district level. Such an analysis could offer a more specific understanding of conditions in Indonesia. Moreover, the dataset can be used to identify priority areas based on the available indicators, particularly social vulnerability. The characteristics of districts in Indonesia tend to differ, so a deeper analysis is necessary for policymaking. In addition, the distance matrix could be used to perform spatial analysis of interregional development. Lastly, this dataset can be combined with datasets from other fields to build a deeper multidisciplinary analysis, particularly of social vulnerability in the context of other sectors.

Brief information about SUSENAS
The National Socioeconomic Survey (SUSENAS) is a survey conducted by BPS-Statistics Indonesia to collect primary data on household welfare through social and economic characteristics. The data were collected by directly interviewing households selected through multi-stage sampling (see [8] for details). Many crucial indicators are estimated from SUSENAS data, mainly per capita expenditure, poverty, and the Gini ratio. Other indicators such as education, health, and demographic characteristics are also calculated from this survey. The estimates are usually produced annually at the district level. As a result, the data are useful as a basis for national and regional development and planning.
To obtain the social vulnerability-related variables, we selected the relevant information from the SUSENAS questionnaire (see [6] for more details). The variables and associated questions can be seen in Table 3. There were three components in estimating the SUSENAS indicators: the region, the relevant data, and the weight. The data aggregation was done by using the weight in a cross-tabulation between the regions and the data. This study used the dplyr [10] and descr [11] packages to transform and cross-tabulate the raw data, respectively.
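The weighted aggregation described above can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the authors' dplyr/descr code; the function name weighted_share, the field names, and the household records are hypothetical. It assumes each household record carries a district code, a yes/no indicator, and a survey weight, and it estimates the weighted percentage of households with the indicator in each district.

```python
from collections import defaultdict

def weighted_share(records, region_key, flag_key, weight_key):
    """Return {region: weighted percentage of records whose flag is true}."""
    totals = defaultdict(float)   # sum of weights per region
    flagged = defaultdict(float)  # sum of weights where the flag is set
    for row in records:
        w = float(row[weight_key])
        totals[row[region_key]] += w
        if row[flag_key]:
            flagged[row[region_key]] += w
    # Skip regions whose total weight is zero to avoid division by zero.
    return {r: 100.0 * flagged[r] / totals[r] for r in totals if totals[r] > 0}

# Hypothetical household records: codes, flags, and weights are made up
# for illustration and are not taken from SUSENAS.
households = [
    {"district": "A", "no_electricity": True,  "weight": 120.0},
    {"district": "A", "no_electricity": False, "weight": 80.0},
    {"district": "A", "no_electricity": False, "weight": 200.0},
    {"district": "B", "no_electricity": True,  "weight": 50.0},
    {"district": "B", "no_electricity": True,  "weight": 150.0},
    {"district": "B", "no_electricity": False, "weight": 200.0},
]

shares = weighted_share(households, "district", "no_electricity", "weight")
```

In this toy example the unweighted share for district A would be one household in three, whereas the weighted share is 120 out of a total weight of 400, which shows why the weight matters for district-level estimates.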

Distance matrix formation
The distance matrix in this study was constructed from a geographic map of Indonesia in 2013. The map was in shapefile format, a file format that stores geometric locations and geospatial information. Consequently, we needed to pre-process the file into numerical form to construct the distance matrix. The distance matrix was also constructed using R, specifically the packages rgeos, rgdal, and sp [12,13]. First, we read the shapefile using the readOGR function, followed by gCentroid to obtain each district's center coordinates along with its region code. Subsequently, the distance matrix was calculated using the spDists function, which returns great-circle distances in kilometres for longitude/latitude coordinates. Then, we matched the district codes from the map to the district codes in the SUSENAS data. Due to the difference in the number of districts, the unmatched districts in the SUSENAS data were joined with their parent districts in 2013. Finally, the distance matrix was matched with the available district codes.
Figure 1 shows the distribution of social vulnerability characteristics in Indonesia in 2017. Box plots were used to assess the distribution of the data. In Figure 1, all variables are expressed as percentages, except population, which is shown on a log scale to simplify its distribution. The colors in the figure represent the regions listed in the plot legend. The pink box plot represents the distribution of national-level data in Indonesia.
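The step from district centroids to a distance matrix, as described in this section, can be sketched as follows. This is an illustrative Python sketch, not the authors' R code: it swaps in the haversine formula on a spherical Earth, so its values may differ slightly from those returned by spDists. The function names, the Earth radius constant, the district codes, and the centroid coordinates are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def distance_matrix(centroids):
    """Return (sorted codes, square matrix of pairwise distances in km)."""
    codes = sorted(centroids)
    matrix = [[haversine_km(*centroids[a], *centroids[b]) for b in codes]
              for a in codes]
    return codes, matrix

# Hypothetical district codes with (longitude, latitude) centroids.
centroids = {
    "D1": (106.8, -6.2),
    "D2": (112.7, -7.3),
    "D3": (140.7, -2.5),
}

codes, dist = distance_matrix(centroids)
```

Keeping the sorted code list alongside the matrix makes it straightforward to align the rows and columns with the district codes in the indicator table, which is the matching step the text describes.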

Data Condition
Figure 1 shows that almost all variables have outliers. This indicates inequality among regions in Indonesia in the context of social vulnerability. The detailed results support this, showing different interregional characteristics such as demography. Eastern Indonesia, namely Maluku and Papua, tends to have low female and elderly populations. In contrast, the share of children and the family size in these regions tend to be higher (also with relatively high dispersion) than in the other regions. The population (in log) was asymmetrically distributed, and the dispersion of population growth was asymmetric as well. The Java, Bali, and Nusa Tenggara region had smaller population growth due to the large population in each district. Moreover, the same pattern was followed by the rest of the regions. Maluku and Papua had the most problems among the regions in Indonesia. The non-electricity, poverty, illiteracy, and non-sewer variables in these two areas were higher than in the other regions. All regions tend to have a high percentage of households with no disaster training. Unfortunately, most of these regions were also prone to disaster. On the other hand, some regions had lower percentages of disaster-prone households, which were considered outliers (Sumatra and the Java, Bali, and Nusa Tenggara region). Regarding housing, the Sumatra region had the highest percentage of people renting houses.
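As an illustration of how box-plot outliers such as those discussed above can be flagged, the following Python sketch applies the conventional 1.5 × IQR rule. The source does not state which rule or quartile definition was used to draw Figure 1, so both are assumptions here; the function name tukey_outliers and the sample values are hypothetical, and the quartiles come from statistics.quantiles with the inclusive method (Python 3.8 or later), which can differ slightly from the hinges used by some plotting tools.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartiles by linear interpolation over the observed range.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical district-level poverty percentages (not from the dataset).
poverty = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 40.0]

flagged = tukey_outliers(poverty)
```

Here the quartiles are 3.5 and 6.5, so the fences are -1 and 11 and only the value 40.0 is flagged.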

Ethics Statement
There is no conflict of interest. The data are available in the public domain.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.