Longitudinal small and medium enterprise (SME) data on survival, research and development (R&D) investment, and patent applications in Korea's innovation clusters from 2008 to 2014

This article contains survey data from 588 firms on 1) their length of survival, 2) information related to technological innovation, such as research and development (R&D) investment, research manpower, and the number of patent applications, and 3) other basic data on firm size and affiliated industry sector. The dataset was extracted from firms located in three different innovation cluster regions of Korea. All the data in this article are based on firm-level questionnaires in the innovation cluster regions, with the exception of the firm survival information, which was extracted from the National Tax Service of Korea, and the industry information, which came from “Statistics Korea”. The related research article using the current dataset was published under the following title: “Does R&D investment increase SME survival during a recession?” (Jung et al., 2018).


Data
The dataset covers a cross-section of 588 firms that were selected because they participated in multiple surveys during 2008–2014. Only firms with fewer than 500 employees are included. Although data collection in Korea's innovation clusters began in 2006, the dataset in this data article contains only the collections from 2008 onward, because the associated research [1] required data following the 2008 financial crisis. The dataset is valuable because it is hard to obtain innovation-related information and business closure information together for small and medium enterprises (SMEs). For small firms, researchers can obtain such information only if the firms are publicly listed on a stock market. Each firm's data contain the descriptive variables shown in Table 1. Table 2 describes the variables whose names carry a year postfix. Each variable in Table 2 is a series of seven yearly variables; for example, Sales 09 is the firm's total sales in 2009, and the series runs from Sales 08 to Sales 14. The dataset in the first supplementary file, described in Table 2, is a specific "wide-file" form for executing SPSS Cox regression with time-varying covariates. According to Guo [2], SPSS and SAS require the "wide-file" format, in which time-varying covariates are organized as several variables, whereas Stata requires the "long-file" format, in which each subject occupies more than one data line to hold the varying covariates. To share the dataset with a wider academic community, we provide a separate supplementary file with "year" as an independent variable for those who prefer a standardized panel data format without the year postfixes (and for Stata users). These twin supplementary files constitute the main dataset files of this article. In addition, a separate supplementary file (the third supplementary file) on industry concentration and industry growth during the period is also presented in this data article and is illustrated in Table 3. Finally, the fourth supplementary file contains an exemplary SPSS processing script for survival analysis.

Value of the data
The dataset presents rare information on the lifespan of 588 firms over 72 months from 2008 to 2014, including basic information on firm size measured in terms of sales and employees. This dataset is useful for the survival analysis of small and medium enterprises whose information is not publicly available.
The dataset contains longitudinal firm-level data on small and medium enterprises (SMEs). Most of this information is related to innovation, such as patent applications, annual R&D investment (total R&D and internal R&D use), and annual R&D manpower. Consequently, the dataset is particularly useful for those who study innovative SMEs and the performance of SMEs.
Information on the innovative venture business certificates issued by the Korean government and on export status also provides valuable opportunities to compare traditional with innovative SMEs and domestic with non-domestic SMEs.
The data could be useful for future research on regional innovation clusters, as they contain the regional codes of the three innovation clusters.
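As a rough illustration of the two layouts described in the Data section, the following Python sketch expands one wide-format record into long-format rows, one per year. The column pattern follows the "Sales 08" to "Sales 14" naming mentioned there; the firm_id key, the wide_to_long helper, and the toy values are hypothetical and are not taken from the supplementary files.

```python
# Hypothetical column names modelled on the "Sales 08 ... Sales 14" pattern;
# the real supplementary files may use different names or separators.
YEARS = range(2008, 2015)  # 2008..2014 inclusive


def wide_to_long(wide_row, stems, id_key="firm_id"):
    """Expand one wide-format firm record into one record per year."""
    long_rows = []
    for year in YEARS:
        suffix = f"{year % 100:02d}"  # 2008 -> "08"
        record = {id_key: wide_row[id_key], "year": year}
        for stem in stems:
            # A year column that is absent from the wide record becomes None.
            record[stem] = wide_row.get(f"{stem} {suffix}")
        long_rows.append(record)
    return long_rows


# Toy record with invented values, not taken from the dataset.
firm = {"firm_id": 1, "Sales 08": 100.0, "Sales 09": 120.0, "Sales 14": 90.0}
rows = wide_to_long(firm, stems=["Sales"])
for row in rows:
    print(row)
```

Each firm yields seven long-format rows, so "year" becomes an ordinary variable instead of being encoded in the variable names.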

Experimental design, materials and methods
The Korean government is keen to create a venture ecosystem in the innovation cluster regions, and government agencies conduct annual surveys whose target sample is all member organisations in the regions. Firm selection was based on two criteria: firm size and multiple responses. Firms with fewer than 500 employees at any time during the survey period (2008–2014) were included. The size criterion was chosen with a view to a potential wider international comparison study in the future (e.g. definitions of manufacturing SMEs: 200 employees in the EU and 500 in the U.S.). As for multiple responses, firms with a response frequency of five or more, and exit firms with two or more, were included in the dataset, because the associated research article [1] conducted a survival analysis that required longitudinal data. There are five innovation clusters under the management of government agencies, but only three were included. The other two innovation cluster regions (Busan and Jeonbuk) were excluded because of a lack of data collection before 2013, so they cannot satisfy the multiple-response condition of participating in five or more annual surveys. Finally, as very young firms are prone to fail, firms founded after 2005 were excluded. Upon our request for technology- and performance-related questionnaire variables, the government agencies administering the three innovation cluster regions provided coded survey results for the selected variables, in which individual firms cannot be identified. The survey data cannot be accessed without special arrangement.
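The selection rules above can be sketched as a simple filter. This is only an illustration under stated assumptions: the record fields (region, founded, exited, survey_years, employees_by_year), the cluster labels, and the toy firms are hypothetical, and "fewer than 500 employees at any time" is read here as every reported headcount being below 500, which is one possible interpretation.

```python
EXCLUDED_REGIONS = {"Busan", "Jeonbuk"}  # clusters without data before 2013


def is_selected(firm):
    """Return True if a (hypothetical) firm record passes the selection rules."""
    headcounts = [n for n in firm["employees_by_year"].values() if n is not None]
    # Exit firms need two or more responses, other firms five or more.
    min_responses = 2 if firm["exited"] else 5
    return (
        firm["region"] not in EXCLUDED_REGIONS
        and firm["founded"] <= 2005            # firms founded after 2005 are dropped
        and bool(headcounts)
        and max(headcounts) < 500              # assumption: below 500 in every year
        and len(firm["survey_years"]) >= min_responses
    )


# Toy records with invented values, not taken from the dataset.
firms = [
    {"id": "A", "region": "Cluster 1", "founded": 1999, "exited": False,
     "survey_years": [2008, 2009, 2010, 2011, 2012],
     "employees_by_year": {2008: 40, 2009: 45, 2010: 50, 2011: 52, 2012: 60}},
    {"id": "B", "region": "Cluster 2", "founded": 2003, "exited": True,
     "survey_years": [2008, 2009],
     "employees_by_year": {2008: 12, 2009: 9}},
    {"id": "C", "region": "Busan", "founded": 2001, "exited": False,
     "survey_years": [2013, 2014],
     "employees_by_year": {2013: 30, 2014: 31}},
    {"id": "D", "region": "Cluster 3", "founded": 2007, "exited": False,
     "survey_years": [2008, 2009, 2010, 2011, 2012],
     "employees_by_year": {2008: 5, 2009: 6, 2010: 8, 2011: 9, 2012: 9}},
    {"id": "E", "region": "Cluster 1", "founded": 2000, "exited": False,
     "survey_years": [2008, 2010, 2012],
     "employees_by_year": {2008: 70, 2010: 75, 2012: 80}},
]

selected = [f["id"] for f in firms if is_selected(f)]
print(selected)
```

In this toy example only firms A and B pass: C is in an excluded region, D was founded after 2005, and E has too few responses for a surviving firm.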
To process the dataset, we obtained the export sales and venture certificate statuses of individual firms and converted them into dummy variables for each year. The standard industry codes of the firms were assigned according to those obtained from the 2012 and 2008 surveys; the dataset contains the 2-digit KSIC code [3]. The growth of the industry was obtained from the Korea Statistics Office's website (KOSIS) [4] and follows a simple calculation, $(\text{Industry production}_i - \text{Industry production}_{i-1}) / \text{Industry production}_{i-1}$, where $i$ denotes the year. The concentration ratio of the industry was calculated from the three-firm concentration ratios (CR3) that appear in the Korea Development Institute (KDI)'s biannual "market structure report", produced under the oversight of the government (Korea Fair Trade Commission). The report presents the associated CR3 Excel data file at the 5-digit sub-industry level for the most recent two years, so it was possible to [...]. Korea's National Tax Service (NTS) compiles business closures, and the information is easily accessible through the CRETOP® credit information agency. The exact closure date was obtained but is coded by month in the current dataset to specify the lifespan of each firm.
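Two of the processing steps above lend themselves to a short sketch: the year-on-year industry growth calculation and the coding of a closure date as a lifespan in months. The function names, the toy production figures, and in particular the observation start date are assumptions made for illustration; the article does not state the month from which lifespans are counted.

```python
from datetime import date


def industry_growth(production_by_year):
    """Year-on-year growth: (P_i - P_{i-1}) / P_{i-1} for each year i."""
    growth = {}
    for year in sorted(production_by_year):
        previous = production_by_year.get(year - 1)
        if previous:  # skip years with no (or zero) previous-year production
            growth[year] = (production_by_year[year] - previous) / previous
    return growth


def lifespan_in_months(start, closure):
    """Whole calendar months between the start month and the closure month."""
    return (closure.year - start.year) * 12 + (closure.month - start.month)


# Toy figures with invented values, not taken from KOSIS.
production = {2008: 100.0, 2009: 90.0, 2010: 99.0}
growth = industry_growth(production)
print(growth)

# Assumed observation start (January 2008) and an invented closure date.
months = lifespan_in_months(date(2008, 1, 1), date(2010, 6, 15))
print(months)
```

The first year in a series has no growth value because there is no previous year to compare against.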
The essential value of the data lies in the monthly records of firm survival from 2008 to 2014. While processing the dataset, survival patterns were checked against the cumulative survival function graphs; the graphs by region and by firm status are presented in Figs. 1 and 2, respectively.
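For readers who want to reproduce cumulative survival curves from monthly lifespans, a minimal sketch follows. The article does not state which estimator produced the graphs; the Kaplan-Meier product-limit estimator is used here as a common choice, and the durations and closure flags are invented toy values.

```python
def kaplan_meier(durations, events):
    """Return [(month, S(month))] at each month with at least one closure.

    durations: months each firm was observed.
    events: True if the firm closed at that month, False if it was censored
            (still operating when observation ended).
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        closures = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if closures:
            # Product-limit step: multiply by the share surviving month t.
            survival *= 1 - closures / at_risk
            curve.append((t, survival))
        # Closed and censored firms both leave the risk set after month t.
        at_risk -= leaving
    return curve


# Toy data with invented values: two closures and three censored firms.
durations = [12, 30, 30, 72, 72]
events = [True, True, False, False, False]
curve = kaplan_meier(durations, events)
for month, s in curve:
    print(month, round(s, 3))
```

Computing the curve separately for each region or firm status would give group-wise curves comparable in spirit to those in Figs. 1 and 2.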
Basic statistics for closed (exit) firms and surviving firms are presented in Table 4. The 58 closed SMEs are smaller and show lower levels of technological activity than the 530 surviving SMEs. The dataset summarized in Table 1 was prepared for "SPSS Cox regression with time-varying covariates", where the time-varying covariates are organized as several variables.
The major utility of the "wide format" dataset of Table 2 is as follows. After the dataset was loaded into SPSS, a time program (an SPSS script file) similar to the one used for the related research article [1] was constructed following the method of the tutorial [5] and then executed. Table 5 presents exemplary survival regression output from the SPSS script. Among the four variables, size (positive impact) and industry concentration (negative impact) are the most important for SME survival. The script for the exemplary analysis of Table 5 is also provided as a separate supplementary file (DiB_coxreg.sps in the zip file) in this data article.

Acknowledgments
Jungtae Hwang gratefully acknowledges the support of Hallym University HRF-201603-010 for both utilities and processing the current article.

Transparency document
Transparency document associated with this article can be found in the online version at https://doi.org/10.1016/j.dib.2019.103967.