Imputation of missing information in worldwide patent data

We present a general method for imputing missing information in the Worldwide Patent Statistical Database (PATSTAT) and make the resulting datasets publicly available. The PATSTAT database is the de facto standard for academic research using patent data. Complete information on patents is essential to obtain an accurate picture of technological activities across countries and over time. However, the coverage of the database is far from complete. Our data imputation method exploits detailed institutional knowledge about the international patent system, and we codify it in a SQL algorithm. We provide two datasets related to the imputation of missing country codes and missing technology classification. We also release the algorithm that can be easily adapted to impute other pieces of information that are missing in PATSTAT.


© 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Table
Subject: Social Sciences (General)
Specific subject area: Innovation policies, regional studies, strategic management
Type of data

Value of the Data
• The Worldwide Patent Statistical Database PATSTAT, provided by the European Patent Office (EPO), has become the de facto standard for researchers working with patent data. A critical issue with the database, however, is that its coverage is far from complete.
• Complete patent data are crucial to delivering an accurate picture of innovation activities around the globe. Researchers and policymakers use patent data for many purposes, but often refer to a selected set of patent offices or to incomplete data. Using our data or our suggested imputation method yields a more accurate picture of innovation activities.
• Patents in the same family offer an abundant reservoir of information to fill in potential missing data. We provide a systematic approach to replenish missing pieces of information by browsing different pools of subsequent filings. The code can be easily adapted to other cases of missing information in patent data.
• In general, our data show the usefulness of imputation and the need to reflect carefully on all selection decisions when working with patent data.

Data Description
We draw on de Rassenfosse et al. (2013, 2019) [1,2], who argue that the first filing of a patent family for a given invention is the relevant entity to look at. Indeed, first filings are the first occurrences of the invention, and, loosely speaking, second filings correspond to 'replicates' of the first filings that extend patent protection to other jurisdictions. If any information is missing for the first filing, it is possible to infer it from the subsequent filings in the same family. Our algorithm detects the data gaps in the first filings and browses the relevant subsequent filings in order to fill in the missing information.
We provide three different datasets in our Dataverse (https://dataverse.harvard.edu/dataverse/imputation_worldwide_patent_data) where we have applied the algorithm in order to provide complete data on country codes of inventors and applicants and technology classification:
1. Imputation of missing technology classification in worldwide patent data.
2. Imputation of missing applicant country codes in worldwide patent data.
3. Imputation of missing inventor country codes in worldwide patent data.
All files contain application identifiers for first filings (corresponding to APPLN_ID in PATSTAT), the first filing date and year, and a column with the desired information (1. technology classification, 2. applicant country codes, or 3. inventor country codes; for details see below). The TYPE column indicates the type of the first filing (see next section). Datasets 1. and 2. also contain a PERSON_ID. This is also a PATSTAT ID; it can be used to identify inventors and applicants and to join more detailed address information from the respective PATSTAT tables.
The datasets are zipped. The unzipped files are very large (between 3 and 11 GB) and cannot be opened with conventional text editors or spreadsheet software. For inspecting the files, EmEditor (a text editor for Windows that supports large amounts of data) can be used. Suitable alternatives for Mac users can be found on the web. In any case, we suggest using a SQL database.
Missing information has been imputed from equivalents and other second filings (see next section). The SOURCE column indicates the respective source of information.
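As a minimal sketch of the suggestion to work in a SQL database, the following Python snippet streams a delimited text file into SQLite in batches, so that the file never has to fit in memory, and then tabulates records by SOURCE. The file layout (comma delimiter, header row), the table name inventor_ctry, the column name ctry_code, and the sample values for TYPE and SOURCE are assumptions made for illustration; they should be checked against the actual files.

```python
import csv
import io
import sqlite3
from itertools import islice


def load_delimited(conn, handle, table, delimiter=",", batch_size=50_000):
    """Stream a delimited text file with a header row into a new SQLite table."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    total = 0
    while True:
        # Read at most batch_size rows at a time to keep memory use bounded.
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', batch)
        total += len(batch)
    conn.commit()
    return total


# Tiny in-memory stand-in for one unzipped file (hypothetical layout and values).
sample = io.StringIO(
    "appln_id,person_id,ctry_code,type,source\n"
    "1,10,CH,priority,1\n"
    "2,11,FR,priority,2\n"
    "3,12,DE,pct,3\n"
    "4,13,FR,priority,2\n"
)

conn = sqlite3.connect(":memory:")
n_rows = load_delimited(conn, sample, "inventor_ctry", batch_size=2)

# Number of records per source of information.
by_source = conn.execute(
    "SELECT source, COUNT(*) FROM inventor_ctry GROUP BY source ORDER BY source"
).fetchall()
print(n_rows, by_source)
```

For a real file, the in-memory sample would be replaced by a handle such as open(path, newline="", encoding="utf-8"), and the in-memory database by a database file on disk.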
The information is directly retrieved from the relevant PATSTAT tables. Details, definitions and links to references can be found in the PATSTAT data catalog [3]. In particular, for the data mentioned above, we have retrieved the following fields from PATSTAT:

We codify the algorithm in SQL for PostgreSQL 9.6.6. The algorithm runs with any PATSTAT version newer than Autumn 2016, with only minor adaptations. However, careful inspection of the PATSTAT data catalog [3] of the respective PATSTAT version is warranted to adjust to changes in the data schema (minor changes such as relabeling of columns and tables can happen over time).
The SQL code can be found in our GitHub repositories:
1. https://github.com/seligerf/Imputation-of-missing-IPC-codes-and-technology-information-for-worldwide-patent-data
2. https://github.com/seligerf/Imputation-of-missing-location-information-for-worldwide-patent-data
There, you can also find code to build a "bridge" table that assigns any patent filing to its respective first filing as defined in our work.

Experimental Design, Materials and Methods
PATSTAT's coverage suffers from two significant limitations: [...] is satisfactory, important bibliographical information is missing for a significant proportion of patent filings. Fields for which information is often missing (especially for earlier years) comprise abstracts, technological classifications, citations, as well as applicant and inventor addresses and countries. Table 1 provides an overview of data available for selected patent offices, years, and fields. The share of available inventor country codes for patent applications filed at the French patent office is lower than two percent before 2003, but almost complete starting in 2009. The situation is the reverse at the Chinese patent office. Concerning address data for inventors and applicants, only data from the EPO and USPTO are available on a large scale.
A useful feature of patent data in our context is that many patent applications for the same invention (or a close enough version of it) are filed in different jurisdictions, thus forming an international patent family. Therefore, the chances are high that information missing for a focal patent can be retrieved from other members of the patent family. However, one needs detailed institutional knowledge about the patent system to understand how to fill these gaps accurately. The imputation algorithm that we propose implements a solution that exploits this knowledge. We publicly release it so that other scholars can replicate our approach, and possibly further refine it or tailor it to specific use cases.

Fig. 1 provides the algorithm's flowchart. The first step involves the creation of a table with all first filings of interest (regardless of whether the information is missing). First filings are the patent applications with the earliest application filing date within a patent family at any patent office. By default, we include first filings from patent offices from all OECD countries, including all EU28 countries (plus Switzerland and Norway), BRICS countries, the EPO, and the 'International Bureau' of the World Intellectual Property Organization (WIPO). Patent applications filed at those offices account for almost all patent activity around the world. The 'pool of first filings' constitutes source 1.
The identification of first filings requires detailed knowledge of PATSTAT and the patent system. We gather first filings from PATSTAT in the broadest sense, i.e., all applications filed for the first time for a given invention. First, we use all priority filings as defined in the strict sense, namely the 'Paris Convention' priorities. The 1883 Paris Convention for the Protection of Industrial Property allows the applicant of a first application filed in one of the contracting states to seek protection in any of the other contracting states within 12 months. We also add Patent Cooperation Treaty (PCT) filings to our pool of first filings. The PCT makes it possible to seek patent protection in a large number of countries simultaneously. Finally, we identify two other kinds of first filings: 'Parent applications' of so-called 'Application continuations'; and filings based on 'Technical relations' that define some kind of family relationship. The PATSTAT data catalog offers technical definitions [3]; more details can also be found in de Rassenfosse et al. (2019) [2]. We have included a TYPE column in our data so that it is possible to select specific types of first filings, e.g., only priority filings filed according to the Paris Convention.
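The basic definition of a first filing (the earliest application filing date within a patent family) can be illustrated with a small, self-contained sketch. It is not the released algorithm: the table appln, its columns, the single family_id column, the office list, and the office code 'XX' are hypothetical simplifications, whereas the actual identification relies on the priority, PCT, continuation, and technical-relation links described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appln (appln_id INTEGER, appln_auth TEXT, "
    "filing_date TEXT, family_id INTEGER)"
)
# Toy data: ISO dates compare correctly as strings.
conn.executemany(
    "INSERT INTO appln VALUES (?, ?, ?, ?)",
    [
        (1, "CH", "2001-03-01", 100),
        (2, "US", "2002-02-15", 100),
        (3, "FR", "1999-06-10", 200),
        (4, "EP", "2000-05-30", 200),
        (5, "XX", "2005-01-01", 300),  # hypothetical office outside the list
        (6, "US", "2005-12-01", 300),
    ],
)

# Hypothetical shortlist standing in for the default set of offices.
offices = ("CH", "FR", "US", "EP")
marks = ", ".join("?" for _ in offices)

# Keep the filing(s) with the earliest date in each family, then restrict
# to the offices of interest. Ties on the earliest date are all retained.
first_filings = conn.execute(
    f"""
    SELECT a.appln_id, a.appln_auth
    FROM appln a
    JOIN (
        SELECT family_id, MIN(filing_date) AS first_date
        FROM appln
        GROUP BY family_id
    ) f ON a.family_id = f.family_id AND a.filing_date = f.first_date
    WHERE a.appln_auth IN ({marks})
    ORDER BY a.appln_id
    """,
    offices,
).fetchall()
print(first_filings)
```

In this toy data, family 300 drops out because its earliest filing was made at an office outside the chosen list.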
Next, we create several tables that contain all necessary information to be used in the imputation when the information is not available from source 1. The imputation exploits the pool of all subsequent filings that relate to the first filings. Subsequent filings are patent applications filed in other jurisdictions than the first filing (except for continuations and technical relationships that do not constitute international patent families). In the case of PCT applications, we refer to information from the National or Regional Phase, where the applicant seeks protection at national or regional offices. If the information is not directly available (source 1), the algorithm will first look into direct equivalents of the first filing (source 2). These are subsequent filings that refer to exactly one first filing in a given office. The number of first filings they refer to can be retrieved from the PATSTAT tables mentioned above. If several equivalents exist, we select the equivalent with the earliest filing date.
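The selection of direct equivalents can be sketched as follows. This is a simplified illustration under stated assumptions, not the released code: the tables links and appln and their columns are hypothetical, and each row of links records that a subsequent filing refers to a first filing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (subsequent_id INTEGER, first_id INTEGER)")
conn.execute("CREATE TABLE appln (appln_id INTEGER, filing_date TEXT)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [(11, 1), (12, 1), (13, 1), (13, 2), (14, 2)],
)
conn.executemany(
    "INSERT INTO appln VALUES (?, ?)",
    [
        (11, "2002-01-10"),
        (12, "2001-11-05"),
        (13, "2001-06-01"),
        (14, "2003-02-02"),
    ],
)

chosen = conn.execute(
    """
    WITH equivalents AS (
        -- Subsequent filings that refer to exactly one first filing.
        SELECT subsequent_id
        FROM links
        GROUP BY subsequent_id
        HAVING COUNT(DISTINCT first_id) = 1
    ),
    dated AS (
        SELECT l.first_id, l.subsequent_id, a.filing_date
        FROM links l
        JOIN equivalents e ON e.subsequent_id = l.subsequent_id
        JOIN appln a ON a.appln_id = l.subsequent_id
    )
    -- For each first filing, keep the equivalent with the earliest date.
    SELECT d.first_id, d.subsequent_id
    FROM dated d
    WHERE d.filing_date = (
        SELECT MIN(d2.filing_date) FROM dated d2 WHERE d2.first_id = d.first_id
    )
    ORDER BY d.first_id
    """
).fetchall()
print(chosen)
```

Filing 13 refers to two first filings, so it is not a direct equivalent even though it has the earliest date; in the scheme described here it would only be considered among the other subsequent filings.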
If the information is not available from equivalents, the algorithm will look into other subsequent filings (source 3) and again select the filing with the earliest filing date. If the information cannot be retrieved from source 3, it is declared missing, i.e., the respective patent filing cannot be used in the statistical analysis.

Table 2
Share of available information for inventor countries before and after imputation (sources 1 to 3).

Table 3
Share of available information on IPC before and after imputation.
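The three-source cascade described in this section, and the kind of before/after share reported in the tables, can be illustrated with a toy example. The sketch below rests on assumptions: the tables first_filings and candidates, their columns, and the data are hypothetical, and within each source it picks the earliest filing that actually carries the information, which is one possible reading of the selection rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE first_filings (appln_id INTEGER, filing_date TEXT, ctry_code TEXT)"
)
# source = 2 for direct equivalents, 3 for other subsequent filings.
conn.execute(
    "CREATE TABLE candidates (first_id INTEGER, source INTEGER, "
    "filing_date TEXT, ctry_code TEXT)"
)
conn.executemany(
    "INSERT INTO first_filings VALUES (?, ?, ?)",
    [
        (1, "2000-01-01", "CH"),
        (2, "2000-02-01", None),
        (3, "2000-03-01", None),
        (4, "2000-04-01", None),
    ],
)
conn.executemany(
    "INSERT INTO candidates VALUES (?, ?, ?, ?)",
    [
        (2, 2, "2000-12-01", None),
        (2, 2, "2001-01-15", "FR"),
        (2, 3, "2000-06-01", "US"),
        (3, 2, "2001-02-01", None),
        (3, 3, "2001-05-01", "DE"),
        (3, 3, "2001-03-01", "JP"),
        (4, 3, "2001-01-01", None),
    ],
)

rows = conn.execute(
    """
    WITH pool AS (
        -- The first filing's own value counts as source 1.
        SELECT appln_id AS first_id, 1 AS source, filing_date, ctry_code
        FROM first_filings
        UNION ALL
        SELECT first_id, source, filing_date, ctry_code FROM candidates
    ),
    usable AS (
        SELECT * FROM pool WHERE ctry_code IS NOT NULL
    )
    -- Lowest source number wins; within a source, the earliest filing date.
    SELECT f.appln_id,
           (SELECT u.ctry_code FROM usable u WHERE u.first_id = f.appln_id
            ORDER BY u.source, u.filing_date LIMIT 1) AS ctry_code,
           (SELECT u.source FROM usable u WHERE u.first_id = f.appln_id
            ORDER BY u.source, u.filing_date LIMIT 1) AS source
    FROM first_filings f
    ORDER BY f.appln_id
    """
).fetchall()

# Share of first filings with a country code before and after imputation.
original = conn.execute("SELECT ctry_code FROM first_filings").fetchall()
share_before = sum(r[0] is not None for r in original) / len(original)
share_after = sum(r[1] is not None for r in rows) / len(rows)
print(rows, share_before, share_after)
```

First filing 4 has no usable value in any source and stays missing, so it would be excluded from a statistical analysis that needs the country code.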