Functional performance improvement data and patent sets for 30 technology domains with measurements of patent centrality and estimations of the improvement rate

This article accompanies the study presented in Triulzi et al. (2020) [1]. It briefly describes and makes available the data on functional performance for 30 technology domains, their patent sets, the measurements of patent centrality and the method used to estimate the yearly technology performance improvement rate (TIR) that underlie that study. Some of these data (the performance time series and the patent lists for 28 domains) were collected by other authors for previous studies but were not previously available to the public. The measurements of patent centrality and other patent-based indicators for the 30 domains, and for 5,259,906 utility patents granted by the United States Patent and Trademark Office between 1976 and 2015, are novel data contributed by Triulzi et al. (2020) [1]. Here we organize, describe and make available the collection of data in its entirety. This allows anyone interested to replicate the study or to use the method to estimate the improvement rate of any technology for which a set of patents can be identified. For a detailed description of the data and methods, see Triulzi et al. (2020) [1].

• Time series of functional performance for 28 different technologies were acquired by Christopher L. Benson and Christopher L. Magee by searching reputable sources (scientific articles, magazines, industry reports, etc.) for long series of performance data points for as many technologies as could be found. The method is described in Benson (2014) [2] and Benson and Magee (2015a) [3].
• Performance data for Hybrid Corn were collected by Maryam Barry, Giorgio Triulzi and Christopher L. Magee by analysing multiple sources: patent data in which field trials were described and yield data from different US states. The collection method is described at length in Barry et al. (2017) [5].
• Patent sets for 28 of the 30 domains were collected by Christopher L. Benson and Christopher L. Magee by applying a novel Classification Overlapping Method described in Benson and Magee (2013) [6] and (2015b) [7].
• The patent set for Magnetic Materials was collected by Subarna Basnet through an application of the same method. The process is described in Basnet (2016) [4] .
• The patent set for Hybrid Corn was collected by Maryam Barry, Giorgio Triulzi and Christopher L. Magee through a combined use of patent classes and keywords. The method is described in Barry et al. (2017) [5].
• Raw data on patent information and citation relationships for 5,259,906 patents granted by the United States Patent and Trademark Office (USPTO) between 1976 and 2015 were downloaded from patentsview.org.
• Normalized and unnormalized patent-based measures, such as patent centrality, the number of citations received or the age of the cited patents, that are tested as predictors of the improvement rate were calculated following the methodology described in [1].

Data format
• Raw (performance time series)
• Processed (patent-based indicators)
• Analyzed (empirical yearly technology improvement rates and estimated ones based on patent data)

Parameters for data collection
• Data on performance time series were collected from various sources (scientific articles, specialized websites, industrial magazines or reports) according to the criteria of availability of long time series and credibility of the source.
• Patent data only include utility patents granted by the USPTO between 1976 and 2015.

Description of data collection
• Performance data were copied or downloaded from the sources described in the section "How data were acquired" of this table.
• Patent data were downloaded using two different platforms: Patsnap (to retrieve patent numbers according to queries following the COM method described in Benson and Magee (2013 [6] and 2015b [7])) and Patentsview (to download the data used to compute the different variables described in this article).

Value of the data
• … [3] as benchmarks for predicting power. The data can also be used to analyse the internal structure of relationships between inventions, inventors or assignees within a domain, over time or across domains.
• The added value of the patent data provided, compared to patents retrieved for given technology classes (using patent classification systems such as the International Patent Classification or the Cooperative Patent Classification), is that our data are grouped into technology domains, whose definition includes artefacts that achieve the same function and use the same scientific principles, as opposed to commonly used classification systems, which rely on only one of the two criteria.

Performance time series
In Table 1, we summarize the information on the data used to empirically measure the TIR for 30 technologies. For each technology, the table reports how many data points the performance time series has and its year range, as well as the performance variable measured by the time series. We also report the data source of each time series, namely the paper in which it was first used; there the reader can find more information on how each series was collected. The time series are available in the file "performance_time_series.csv", which contains 398 performance observations in total, for all 30 domains over time, and five columns (Year, Data, Domain, Metric and Units).

Fig. 1 shows, using four examples, how the empirical TIR was estimated from the time series described in Table 1. Log-linear plots of the performance variable against time were made and a linear fit of the data was performed. The slope of the line is the TIR (which corresponds to the rate parameter of an exponential curve). As explained by Benson (2014) [2] and Benson and Magee (2015a) [3], the estimation of the empirical TIR (second column of Table 2) is obtained by looking only at record-breaking data points and, when the time series was long enough, only at post-1976 data points, to match the period for which patent data are available. However, in the file "performance_time_series.csv" we make all data points available.

Table 2 reports the empirical TIR for each technology domain, obtained as shown in Fig. 1, the R² of the linear fit on a log-linear plane (as a measure of the goodness of fit of the exponential hypothesis), and the TIR estimated from patent data. The latter is obtained using the method briefly summarized in Section 3. Fig. 2 shows a bar plot of the empirically observed improvement rates for the 30 technology domains (using the second column of Table 2).
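The fitting step just described can be sketched in a few lines of Python. This is an illustration of the log-linear fit, not the authors' original script: the data below are synthetic, and the function names are our own.

```python
# Sketch: estimate the empirical TIR as the slope of an ordinary
# least-squares fit of log(performance) against year.
import numpy as np

def improvement_rate(years, performance):
    """Return the slope of log(performance) vs. year (the TIR) and the R^2 of the fit."""
    years = np.asarray(years, dtype=float)
    logp = np.log(np.asarray(performance, dtype=float))
    slope, intercept = np.polyfit(years, logp, 1)
    fitted = slope * years + intercept
    ss_res = np.sum((logp - fitted) ** 2)
    ss_tot = np.sum((logp - logp.mean()) ** 2)
    return slope, 1.0 - ss_res / ss_tot

# Synthetic example: performance improving at 35% per year
years = np.arange(1976, 1996)
perf = 10.0 * np.exp(0.35 * (years - 1976))
rate, r2 = improvement_rate(years, perf)
# rate ≈ 0.35, r2 ≈ 1.0
```

With real series such as those in "performance_time_series.csv", the fit would be applied per domain (and, following the paper, restricted to record-breaking data points).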
Patent data
Table 3 contains a variable dictionary for the data included in the file "Domains_patent_info.csv" (i.e. a description of the content of each column of the file). The file contains information on different variables computed for USPTO granted patents belonging to the 30 technology domains. It has one record per patent (511,570 records in total). The file "All_patents_info.csv" includes the exact same variables listed in Table 3 for 5,259,906 USPTO utility patents granted between 1976 and 2015, except for the domain information (i.e. the first row of Table 3 does not apply).

Table 4 reports the mean values for a series of SPNP centrality-based patent variables computed for the patents in each technology domain: the average centrality of the patents cited by patents in the domain, the centrality of the domain's patents measured three years after filing, and their centrality in 2015. All three are normalized in two different ways: one through the randomization of the entire USPTO patent citation network, the other by taking the rank percentile of the value for each patent compared to other patents granted in the same year. These two normalization methods and their advantages and disadvantages are discussed at length in [1]. Data in Table 4, as presented, are available in the file "DF_means_centrality.xlsx". The file "DF_means_all_variables.xlsx" makes available means by domain for each variable described in Table 3. It has 30 rows, one per domain, and 48 columns, including the mean values for each of the variables in the rows of Table 3.

The following entries, excerpted from Table 3, illustrate the variable dictionary (variable name followed by its description):
• …: Search Path Node Pair (SPNP) centrality value measured 3 years after filing, normalized as rank percentile compared to patents filed in the same year
• meanSPNPcited_1year_before_RankPerc_by_year: average raw SPNP centrality value of the patents cited by the focal patent, normalized as rank percentile compared to patents filed in the same year
• log_meanSPNPcited_1y_before: log of the average raw SPNP centrality value of the patents cited by the focal patent
• SPNP_count_2015_randomized_zscore: SPNP centrality value as of December 2015, normalized as a z-score compared to 1000 randomizations
• meanSPNPcited_1year_before_randomized_zscore: average SPNP centrality value of the patents cited by the focal patent, normalized as a z-score compared to 1000 randomizations
• SPNP_count_t2_randomized_zscore: SPNP centrality value measured 2 years after filing, normalized as a z-score compared to 1000 randomizations
• SPNP_count_t3_randomized_zscore: SPNP centrality value measured 3 years after filing, normalized as a z-score compared to 1000 randomizations
• SPNP_count_t5_randomized_zscore: SPNP centrality value measured 5 years after filing, normalized as a z-score compared to 1000 randomizations
• SPNP_count_t8_randomized_zscore: SPNP centrality value measured 8 years after filing, normalized as a z-score compared to 1000 randomizations
• SPNP_count_2015_randomized_zscore_RPbyYear: SPNP centrality value as of December 2015, normalized as the rank percentile of the z-score value generated by the randomization process, compared to patents filed in the same year
• SPNP_count_t2_randomized_zscore_RPbyYear: SPNP centrality value measured 2 years after filing, normalized as the rank percentile of the z-score value generated by the randomization process, compared to patents filed in the same year
• SPNP_count_t3_randomized_zscore_RPbyYear: SPNP centrality value measured 3 years after filing, normalized as the rank percentile of the z-score value generated by the randomization process, compared to patents filed in the same year
• SPNP_count_t5_randomized_zscore_RPbyYear: SPNP centrality value measured 5 years after filing, normalized as the rank percentile of the z-score value generated by the randomization process, compared to patents filed in the same year
• SPNP_count_t8_randomized_zscore_RPbyYear: SPNP centrality value measured 8 years after filing, normalized as the rank percentile of the z-score value generated by the randomization process, compared to patents filed in the same year
• meanSPNPcited_1year_before_randomized_zscore_RPbyYear: average SPNP centrality value of the patents cited by the focal patent, normalized as the rank percentile of the z-score value generated by the randomization process, compared to patents filed in the same year
• bwd_self_cit: number of citations made by the patent that were directed to patents assigned to the same organization (harmonized assignee name must have the exact same spelling)
• …: share of the total number of citations received within 3 years that come from other patents with the same assignee (harmonized assignee name must have the exact same spelling)
• count_citations_made_RANK_PERC_BY_YEAR: total number of backward citations made, normalized as a rank percentile compared to patents filed in the same year
• CITE3byOthers: number of citations received within 3 years from filing from patents whose assignee differs from that of the focal patent
• CITE3byOthers_RANK_PERC_BY_YEAR: number of citations received within 3 years from filing from patents whose assignee differs from that of the focal patent, normalized as a rank percentile compared to patents filed in the same year

Fig. 3 shows the scatter plot of the observed improvement rate for each domain (the second column in Table 2) against the domain's mean centrality of the patents cited by the domain's patents (the second column in Table 4).
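Per-domain means like those in "DF_means_all_variables.xlsx" can be reproduced from the patent-level file with a simple groupby. The column names in this toy stand-in are illustrative (the domain column's actual header should be checked against the file); the averaging logic is the point.

```python
# Sketch: compute domain means from patent-level records, one row per patent.
import pandas as pd

# Toy stand-in for "Domains_patent_info.csv"; column names are illustrative.
patents = pd.DataFrame({
    "Domain": ["Solar PV", "Solar PV", "Batteries"],
    "meanSPNPcited_1year_before_RankPerc_by_year": [0.90, 0.70, 0.40],
    "CITE3byOthers": [12, 8, 3],
})

# One row per domain, mean of every numeric variable -- the same layout
# as the 30-row means file described above.
means = patents.groupby("Domain").mean(numeric_only=True)
```

Applied to the real file, `patents = pd.read_csv("Domains_patent_info.csv")` followed by the same groupby yields the 30-row table of means.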
The figure clearly highlights the strength of the relationship, which is used in [1] to train a regression that can estimate the improvement rate of any technology domain for which a reliable set of patents can be identified.
Finally, Fig. 4 shows the data processing and analysis flowchart, to help visualize the process followed, which is described in Section 3.1.

Experimental Design, Materials, and Methods
Patent sets for the 30 technology domains were used to compute several patent variables, which, in turn, were tested as predictors of the yearly technology improvement rate (TIR). Each variable was computed in its raw form and in a normalized form. The variables were then included as independent variables in a regression that estimates TIRs. The full description of the methods can be found in [1]; here we report a synthesis. Fig. 4 summarizes the process followed to create the datasets and process the information. For 28 of the 30 technology domains, we used patent sets provided by Benson and Magee (2015a) [3], which they retrieved using the Classification-Overlapping Method (COM) described in Benson and Magee (2013 [6] and 2015b [7]). The list of patents belonging to Magnetic Materials was provided by Basnet (2016) [4], and the one for Hybrid Corn was retrieved by Barry et al. (2017) [5]. Patent identifiers (i.e. grant numbers) for these 30 sets were retrieved from Patsnap (https://www.patsnap.com/). Then, basic information on filing and grant years, classifications and citations (made and received) was downloaded from Patentsview (https://www.patentsview.org). We then removed re-issued patents, applications and non-utility patents from the lists. After that, we computed raw and normalized versions of the variables described in Table 3 and tested a subset of them, selected on theoretical grounds, as candidate predictors of TIRs through a Monte Carlo cross-validation (MCCV) exercise (see the next section). The subset is described in [1]. Here, we make all the computed variables publicly available, in case users would like to experiment with them.
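One of the normalizations used throughout the variable dictionary, the rank percentile compared to patents filed in the same year, can be sketched as follows. The column names here are illustrative, not the headers of the distributed files.

```python
# Sketch: rank-percentile normalization by filing year -- each patent's raw
# value is replaced by its percentile rank among patents filed the same year.
import pandas as pd

df = pd.DataFrame({
    "filing_year": [1990, 1990, 1990, 1991, 1991],
    "spnp": [5.0, 1.0, 3.0, 2.0, 8.0],
})

# pct=True gives ranks in (0, 1]; ties receive the average rank.
df["spnp_rank_perc_by_year"] = df.groupby("filing_year")["spnp"].rank(pct=True)
```

So within the 1990 cohort the values 5.0, 1.0, 3.0 map to 1.0, 1/3 and 2/3 respectively, and normalization never mixes patents from different filing years.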

Estimation of improvement rate
For each variable included in the file "Domains_patent_info.csv" and for each technology domain, we computed, year by year, the mean value including only patents granted up to that year. We then performed a Monte Carlo cross-validation exercise in which we randomly sampled half of the 30 domains (creating a training set), trained a regression with that single variable as the predictor of the improvement rate, and then tested the ability of the regression to predict the improvement rate for the testing set of the remaining half of the domains. We did this for all years up to 2015. This exercise allowed us to determine which two centrality variables were the predictors that ensured the most accurate estimation of the improvement rate while being the least reliant on the domains included in the training set or on the period of time over which the mean patent variables were computed. Finally, we estimated the full regression coefficients using all data at our disposal (i.e. all domains and patents from 1976 to 2015) and selecting only the best predictor. That regression, combined with the data in the file "All_patents_info.csv", can then be used to estimate the improvement rate for technology domains for which we only have patent data and no empirical observation of their functional performance. The estimating equation and its coefficients can be found in [1].
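The MCCV loop just described can be sketched as follows. The data are synthetic (in the actual study the predictor is a domain-mean centrality variable and the response is the empirical TIR), and the split sizes match the 15/15 division of the 30 domains described above.

```python
# Sketch: Monte Carlo cross-validation with a single-predictor regression.
# Repeatedly split the 30 domains in half, fit on one half, score on the other.
import numpy as np

rng = np.random.default_rng(0)
n_domains = 30
x = rng.uniform(0.2, 0.9, n_domains)           # stand-in for a mean centrality variable
y = 0.5 * x + rng.normal(0, 0.02, n_domains)   # synthetic improvement rates

errors = []
for _ in range(200):
    idx = rng.permutation(n_domains)
    train, test = idx[:15], idx[15:]
    slope, intercept = np.polyfit(x[train], y[train], 1)   # train regression
    pred = slope * x[test] + intercept                     # predict held-out domains
    errors.append(np.mean((pred - y[test]) ** 2))          # out-of-sample MSE

mean_mse = float(np.mean(errors))
```

Running this over all variables and all yearly cut-offs, as in the paper, identifies the predictor whose out-of-sample error is both lowest and most stable across random splits.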

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that have, or could be perceived to have, influenced the work reported in this article.