The dataset of Japanese patents and patents’ holding firms in green vehicle powertrains field

In 2020, the Government of Japan declared “2050 carbon neutral” and launched a long-term strategy to create a “virtuous cycle of economy and environment”.1 Japanese firms possess many technologies that contribute to decarbonization, so it is important to expand investment in the development of green (environmental) technology. As automobiles are major contributors to greenhouse gas emissions [1], the technological shift in vehicle powertrain systems is an attempt to reduce emissions such as carbon dioxide and nitrogen oxides [2]. Patent data, in turn, are regarded as the most reliable measure of business performance in applied research and development when investigating knowledge domains or technology evolution (Wand, 1997). This paper describes a dataset of Japanese patents on vehicle powertrain systems for hybrid electric vehicles (HEV), battery electric vehicles (BEV) and fuel cell electric vehicles (FCEV). We propose a method that combines International Patent Classification (IPC) symbols and keywords to define “green” patents in the vehicle powertrain field, using patent applications filed with the Japan Patent Office from 2010 to 2019 and recorded in the EPO's PATSTAT database. Because patent analysis should take into account the social situation of each country, including its language background, we collect patent description documents (titles and abstracts) written in Japanese as well as in English. The resulting database contains the description documents of 6025 green patents and 266 patent-holding firms. Among these patents we identify 3756 HEV patents, 1716 BEV patents, and 553 FCEV patents. Data on the patent-holding firms are also appended. The full dataset may be useful to researchers who wish to conduct further research, such as natural language processing and machine learning on patent description documents, or statistical data analysis for empirical economics.



Value of the Data
• Our dataset has a high level of completeness: it includes documents in both English and Japanese.
• We propose a method that combines IPC symbols and keywords to define "green" patents in vehicle powertrain systems.
• Our dataset makes it possible to survey Japanese firms' financial data and the patents they hold simultaneously.
• Our dataset can be merged with further information or linked to other databases.
• Our dataset is useful to researchers who wish to conduct further research such as natural language processing, machine learning, and statistical analysis.
• Our dataset is meaningful for forecasting the development of new technologies and encouraging more environmental innovation.

Data Description
The dataset covers Japanese patents on vehicle powertrain systems for HEV, BEV and FCEV. We define "green" patents in the vehicle powertrain field using patent applications filed with the Japan Patent Office from 2010 to 2019 and recorded in the EPO's PATSTAT database. The data are organized into several sheets according to their attributes. The 1st "patent" sheet holds the original data, that is, all the information we collected from PATSTAT. The 2nd "consolidated accounting" sheet and the 3rd "non-consolidated accounting" sheet report the financial conditions of the 266 "green" patent-holding firms in fiscal year 2021. We collected the financial (banking) data of the patent-holding firms that are available in their annual securities reports. 2 In the 4th sheet we categorized the patent-holding firms into nine groups by company size. Finally, the last two sheets of the dataset provide comparison tables of keywords and firm names in English and Japanese.
Table 1 and Figs. 1-3 are based on the 1st "patent" sheet of the dataset. As shown in Table 1, our database includes 6025 "green" patents and 266 green patent-holding firms. The patents comprise 3756 HEV patents, 1716 BEV patents, and 553 FCEV patents. Fig. 1 shows that the number of HEV patents rose to a peak in 2013 and has trended downward since, reaching its lowest level in 2017. The trend for BEV patents is almost the same as that for HEV patents, but with a peak in 2011. FCEV patents are the fewest and show a relatively flat trend.

Table 2 lists the financial indicators commonly used by patent-holding firms in the 2nd "consolidated accounting" and 3rd "non-consolidated accounting" sheets of our dataset; these indicators are useful for assessing the performance of the firms' innovative activities.
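The per-category totals and yearly trends described above amount to simple counts over the "patent" sheet. A minimal sketch follows, assuming each row carries a powertrain category and a filing year; the keys `vehicle_type` and `filing_year` and the sample rows are hypothetical and are not the dataset's actual column headers or contents.

```python
from collections import Counter

# Hypothetical records standing in for rows of the "patent" sheet.
# The keys "vehicle_type" and "filing_year" are illustrative names,
# not the dataset's actual column headers.
patents = [
    {"vehicle_type": "HEV", "filing_year": 2013},
    {"vehicle_type": "HEV", "filing_year": 2013},
    {"vehicle_type": "HEV", "filing_year": 2017},
    {"vehicle_type": "BEV", "filing_year": 2011},
    {"vehicle_type": "FCEV", "filing_year": 2015},
]


def count_by_type(rows):
    """Number of patents per powertrain category (the kind of total shown in Table 1)."""
    return Counter(row["vehicle_type"] for row in rows)


def count_by_type_and_year(rows):
    """Number of patents per (category, filing year) pair (the kind of series shown in Fig. 1)."""
    return Counter((row["vehicle_type"], row["filing_year"]) for row in rows)


def peak_year(rows, vehicle_type):
    """Filing year with the most patents for one category."""
    yearly = Counter(
        row["filing_year"] for row in rows if row["vehicle_type"] == vehicle_type
    )
    return max(yearly, key=yearly.get)
```

With the real sheet loaded into the same shape, `count_by_type` would be expected to return the totals reported in Table 1.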
To summarize the financial efforts devoted to the patents, we categorized the 97 firms for which financial data were collected into nine groups by company size. Fig. 4 shows a box-and-whisker plot of the number of employees. The distribution of employees is right-skewed, with 12 extreme values that reflect the diversity of these firms. We assigned the firms with extreme values to a big-size group. We then computed the bandwidth (10,755) using the plug-in approach [3, 4].
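The text does not state which rule marks a value as extreme in the box-and-whisker plot. A minimal sketch under the conventional assumption of Tukey fences (values beyond 1.5 times the interquartile range from the quartiles) is given below; the employee counts are hypothetical and are not taken from the dataset.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_extremes(values, k=1.5):
    """Split values into those inside the fences and those outside them."""
    lower, upper = tukey_fences(values, k)
    regular = [v for v in values if lower <= v <= upper]
    extreme = [v for v in values if v < lower or v > upper]
    return regular, extreme


# Hypothetical employee counts, not taken from the dataset.
employees = [120, 450, 800, 1500, 2300, 3100, 5200, 7400, 9800, 250000]
regular, extreme = split_extremes(employees)
```

The `regular` values are the ones that would enter a subsequent bandwidth calculation, while the `extreme` values would be set aside for the big-size group.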
The plug-in approach constructs an estimator, Eq. (1), of the unknown roughness $R(f^{(2)})$, where $f^{(r)}$ is defined as the $r$th derivative of the density function $f$, and $k$ is the kernel. It is well known that this estimator depends on a bandwidth and on an unknown roughness. To estimate $R(f^{(2)})$, we need to estimate $R(f^{(4)})$ to obtain an asymptotically valid bandwidth. The estimation of $R(f^{(4)})$ further requires a bandwidth $\tilde{h}$, which in turn depends on $R(f^{(6)})$. Assuming normality, the roughness of the 6th derivative of a density belonging to the $N(0, \sigma^2)$ family is estimated by $\hat{R}(\phi^{(6)})$, where $\hat{\sigma}$ is the estimate of the standard deviation of the random variable $x$. The optimal bandwidth for estimating $R(f^{(2)})$ can then be estimated by Eq. (2), where $\kappa_2(k)$ is the second moment of the kernel. Eq. (2) coincides with the estimator proposed in [4]. Using $\tilde{h}$, we can estimate the roughness $\hat{R}(f^{(2)})$ via Eq. (1). Finally, the optimal bandwidth is constructed by Eq. (3). After excluding all extreme values, we calculated the optimal bandwidth using Eq. (3), and the result is 10,755.
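For orientation, the standard textbook forms of the quantities named above are written out here for a sample $x_1, \dots, x_n$ and a generic bandwidth $h$. They are offered as an assumption about what the numbered equations express, not as a verbatim restoration of them; in particular, the pilot-bandwidth formula is not reproduced because its exact form depends on the staging used in [3, 4].

```latex
\begin{align*}
% Roughness of the r-th derivative of the density
R\bigl(f^{(r)}\bigr) &= \int \bigl(f^{(r)}(x)\bigr)^{2}\,dx \\
% Kernel estimator of the roughness for a given bandwidth h
\hat{R}\bigl(f^{(r)}\bigr) &= \frac{(-1)^{r}}{n^{2}h^{2r+1}}
  \sum_{i=1}^{n}\sum_{j=1}^{n} k^{(2r)}\!\left(\frac{x_i - x_j}{h}\right) \\
% Normal-reference roughness for a N(0, sigma^2) density
\hat{R}\bigl(\phi^{(r)}\bigr) &= \frac{(2r)!}{(2\hat{\sigma})^{2r+1}\,r!\,\sqrt{\pi}} \\
% Asymptotically optimal bandwidth for a second-order kernel
h_{\mathrm{opt}} &= \left[\frac{R(k)}{\kappa_2(k)^{2}\,\hat{R}\bigl(f^{(2)}\bigr)\,n}\right]^{1/5}
\end{align*}
```

Here $R(k) = \int k(u)^2\,du$ is the roughness of the kernel itself. Setting $r = 6$ in the normal-reference formula gives $12!/\bigl((2\hat{\sigma})^{13}\,6!\,\sqrt{\pi}\bigr)$, the starting value that the recursion described above needs.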
Using this bandwidth, we obtained nine groups labelled "A" to "I" in order of increasing size; the details are in the 4th "Group (A∼I)" sheet of the dataset. The R&D expenses (billions of yen) and the ratio of R&D expenses to sales (%) of each group are summarized in Fig. 5. Small-size firms tend to have a higher ratio of R&D expenses to sales, considering their levels of R&D expenses.
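How the bandwidth translates into group boundaries is not spelled out. One plausible reading, sketched below purely as an assumption, treats consecutive multiples of the bandwidth as the boundaries on the number of employees and places everything beyond the last boundary, including the extreme values, in group "I". The function names are hypothetical.

```python
import string

BANDWIDTH = 10_755  # bandwidth reported in the text
N_GROUPS = 9        # groups "A" to "I"


def assign_group(employees, bandwidth=BANDWIDTH, n_groups=N_GROUPS):
    """Map an employee count to a size group "A".."I".

    Assumption: group boundaries are consecutive multiples of the
    bandwidth, and every count beyond the last boundary (including
    the extreme values) falls into the largest group.
    """
    if employees < 0:
        raise ValueError("employee count must be non-negative")
    index = min(int(employees // bandwidth), n_groups - 1)
    return string.ascii_uppercase[index]


def rd_ratio_to_sales(rd_expenses, sales):
    """R&D expenses as a percentage of sales (same currency unit for both)."""
    if sales <= 0:
        raise ValueError("sales must be positive")
    return 100.0 * rd_expenses / sales
```

Under this reading, a firm with 12,000 employees lands in group "B", and averaging `rd_ratio_to_sales` within each group would give the kind of ratio series plotted in Fig. 5.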

Acquire data from PATSTAT
We acquire data from PATSTAT, a worldwide patent statistical database created and maintained by the European Patent Office (EPO) 3 [5]. The data acquiring methods are listed in Table 3.

Fig. 5. R&D expenses/R&D expenses ratio to sales.

Table 3 Data acquiring methods.

| Field | Condition | Description |
| --- | --- | --- |
| … | … | Year of the application filing date |
| APPLN_KIND | A (patent) | Specification of the kind of application |
| APPLN_AUTH | JP | The competent authority, which is the national, international or regional patent office responsible for the processing of the patent application |
| APPLN_TITLE | | Title of application |
| APPLN_ABSTRACT | | Abstract of application |
| IPC_CLASS_SYMBOL | | IPC symbol (IPC 8th edition) |
| DOC_STD_NAME | | Standard name attributed to applicant and inventor names |

Table 4 Searching strategy (first column: vehicles_classification).
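The selection conditions in Table 3 can be sketched as a simple row filter. This is an illustrative sketch, not the extraction query actually used: the rows are invented, and `APPLN_FILING_YEAR` is an assumed column name for the filing year, while `APPLN_AUTH` and `APPLN_KIND` are the field names given in the table.

```python
# Hypothetical rows mimicking a PATSTAT export. "APPLN_FILING_YEAR" is an
# assumed column name for the filing year; the other keys appear in Table 3.
records = [
    {"APPLN_AUTH": "JP", "APPLN_KIND": "A", "APPLN_FILING_YEAR": 2012,
     "APPLN_TITLE": "Hybrid vehicle control device"},
    {"APPLN_AUTH": "US", "APPLN_KIND": "A", "APPLN_FILING_YEAR": 2012,
     "APPLN_TITLE": "Fuel cell stack"},
    {"APPLN_AUTH": "JP", "APPLN_KIND": "U", "APPLN_FILING_YEAR": 2015,
     "APPLN_TITLE": "Battery holder"},
    {"APPLN_AUTH": "JP", "APPLN_KIND": "A", "APPLN_FILING_YEAR": 2009,
     "APPLN_TITLE": "Electric vehicle charger"},
]


def is_target_application(row, first_year=2010, last_year=2019):
    """True for patent applications (kind A) filed at the JPO in the study period."""
    return (
        row["APPLN_AUTH"] == "JP"
        and row["APPLN_KIND"] == "A"
        and first_year <= row["APPLN_FILING_YEAR"] <= last_year
    )


selected = [row for row in records if is_target_application(row)]
```

Only the first sample row satisfies all three conditions; the others fail on authority, kind, or filing year respectively.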

Searching strategy with Python
We search the titles and abstracts of the patents acquired in the previous section using IPC classifications and keywords. The IPC classifications of green patents are taken from the IPC Green Inventory, 4 and the keywords follow earlier studies [2, 6, 7]. We also search the titles and abstracts written in Japanese; the comparison table of keywords can be found in the final sheet of our dataset.
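A minimal sketch of such a combined search is shown below. The IPC prefixes and the English and Japanese keywords are placeholders chosen for illustration, not the lists used to build the dataset, and the rule that a patent must match both an IPC prefix and a keyword is an assumption about how the two criteria are combined.

```python
# Illustrative search terms only: the IPC prefixes and keywords below are
# placeholders, not the lists used to build the dataset.
SEARCH_TERMS = {
    "HEV": {
        "ipc_prefixes": ("B60K6/", "B60W20/"),
        "keywords": ("hybrid", "ハイブリッド"),
    },
    "BEV": {
        "ipc_prefixes": ("B60L50/",),
        "keywords": ("electric vehicle", "電気自動車"),
    },
    "FCEV": {
        "ipc_prefixes": ("H01M8/",),
        "keywords": ("fuel cell", "燃料電池"),
    },
}


def normalize_ipc(symbol):
    """Remove the padding spaces found in IPC symbols, e.g. 'B60K   6/20'."""
    return "".join(symbol.split()).upper()


def classify(ipc_symbols, title, abstract, terms=SEARCH_TERMS):
    """Return the categories whose IPC prefix AND keyword both match.

    Assumption: a patent counts as "green" for a category only when it
    carries a matching IPC symbol and a matching keyword in its title
    or abstract (English or Japanese).
    """
    text = f"{title} {abstract}".lower()
    normalized = [normalize_ipc(symbol) for symbol in ipc_symbols]
    matches = []
    for label, rule in terms.items():
        ipc_hit = any(s.startswith(rule["ipc_prefixes"]) for s in normalized)
        keyword_hit = any(keyword.lower() in text for keyword in rule["keywords"])
        if ipc_hit and keyword_hit:
            matches.append(label)
    return matches
```

For example, `classify(["B60K   6/20"], "Hybrid vehicle drive unit", "")` returns `["HEV"]`, whereas the same IPC symbol with an unrelated title returns an empty list. Requiring both criteria trades recall for precision; relaxing the rule to either criterion would be a one-word change from `and` to `or`.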

Append financial data of patents' holding firms
We collected the financial (banking) data of 97 green patent-holding firms, using both consolidated and non-consolidated accounting, for fiscal year 2021. These data can be used to measure the performance of patent-holding firms and provide useful insights into the new vehicle powertrain industry. For example, previous studies on efficiency and productivity analysis such as [8, 9] analyzed both consolidated and non-consolidated data using either a parametric approach (e.g., stochastic frontier analysis [10]) or a nonparametric approach (e.g., data envelopment analysis [11]).