Is Europe Falling Behind in Data Mining? Copyright's Impact on Data Mining in Academic Research

With the diffusion of digital information technology, data mining (DM) is widely expected to increase the productivity of all kinds of research activities. Based on bibliometric data, we demonstrate that the share of DM-related research articles in all published academic papers has increased substantially over the last two decades. We develop an ordinal categorization of countries according to essential aspects of the copyright system affecting the costs and benefits of DM research. We demonstrate that in countries in which data mining for academic research requires the express consent of rights holders, data mining makes up a significantly smaller share of total research output. To our knowledge, this is the first time that an empirical study has identified a significant negative association between copyright protection and innovation. We also show that within countries where DM requires express consent by rights holders, there is an inverse relationship between rule of law indicators and the share of DM-related articles in all research articles.


| INTRODUCTION
This paper discusses the effect of copyright on data mining (DM) by academic researchers. Hand et al. (2001) broadly define DM as "the discovery of interesting, unexpected or valuable structures in large datasets." With the proliferation of digital data, DM is widely expected to increase the productivity of many types of research activities and to become a main driver of economic growth (Einav & Levin, 2014; OECD, 2014, 2015; Varian, 2014). For an overview of DM applications in various aspects of the economy, see Dean (2014). That DM already has commercial value and contributes to economic growth is easily illustrated. DM plays an important role in eliciting value from data. For instance, according to a recent report for the European Commission, the "overall impact of the data market on the economy as a whole" in 27 European Union (EU) Member States was €325 billion in 2019, up 7% since 2018, and accounting for 2.6% of GDP (Cattaneo et al., 2020, p. 13). What is more, among the 10 most valuable companies in 2015 according to Fortune 500 (Gandel, 2016), at least two were founded quite recently as suppliers of "free" online services (Alphabet, formerly Google, ranked second and Facebook ranked fifth) and initially relied on the collection and analysis of user data for generating rapidly growing revenues. Academic research is another area in which DM is expected to foster value creation, and as we will show, DM has been the topic of an increasing share of total academic research output over the last two decades.
Copyright relates to a trade-off regarding DM. Effective copyright protection should increase the supply of potential DM input works but can also increase the costs of using existing data for those not holding relevant copyrights. DM by means of digital information and communication technology (ICT) technically requires the reproduction of input works and may thus fall under copyright, even if only aspects of individual input works are relevant for a DM project. We analyze bibliometric data to establish how various copyright policies affect the application of DM in academic research. We show that in countries in which DM for academic research requires the express consent of rights holders, DM-related articles make up a significantly smaller share of total research output.
How copyright is applied to DM will continue to affect many academic researchers in coming decades. The evidence presented in this paper relates in particular to a policy debate in the EU. Under current EU legislation, DM requires prior authorization of rights holders even if the potential user has lawful access to the research articles and databases in question (Directive 2001/29/EC, 2001 3 and 5). The situation will change with the implementation, by June 7, 2021, of Article 3 of Directive (EU) 2019/790, which expressly allows text mining and DM to take place for the purposes of scientific research carried out by research organizations and cultural heritage institutions. The United States has a more permissive copyright policy regarding DM, and recent rulings seem to confirm greater scope for DM without express consent by rights holders. Other countries like the United Kingdom and several Asian countries have recently introduced relatively permissive legislation, the application of which will probably be defined further in the courts.
As yet, the situation is uncertain for many academic researchers and other stakeholders.

| THEORY
DM is a novel technology to conduct research. According to standard economic theory, researchers will conduct DM as long as expected returns exceed the opportunity costs of the best alternative allocation of researchers' resources. The uptake of DM should be affected by demand conditions, the price and characteristics of inputs (including suitable data) as well as of related goods and services, the conditions of production, competition, and government policies and regulations including relevant aspects of intellectual property (IP).
However, incentive schemes for academic research often diverge from typical markets (Dasgupta & David, 1994) so that an application of production theory is not entirely straightforward. In particular, there is no conventional demand formation and thus only incomplete market coordination. Academic research has public good attributes, and in many territories, it is largely financed through public means.
Academic researchers' returns depend less on sales of research output but come in the form of research funding and long-term employment with prestigious universities or research institutes. These types of returns hinge on peer recognition for which the publication record is central.
We assume that researchers seek to maximize the (quality-adjusted) number of articles they publish by employing the most efficient technologies available to them. As with any new technology, there may be uncertainty, and DM uptake per country may be affected by the specific characteristics of domestic researchers and research organizations. Nevertheless, in the aggregate, the choices of researchers between various technologies should provide the best available indication of the optimal allocation of resources under specific circumstances within countries. This paper documents the effect of copyright law on this choice.
Like other types of IP, economists often address copyright as a means to mitigate market failure in the private provision of goods with public good attributes (Arrow, 1962; Novos & Waldman, 1984; Samuelson, 1954). The explicit aim is to promote the supply of valuable copyright works by endowing those investing in the development of relevant works with temporary market power. Effective copyright protection has ambiguous effects on the supply of new creative works: on the one hand, it increases returns to rights holders; on the other, stronger copyright protection increases the total cost of input works to potential DM researchers due to higher prices and greater transaction costs compared with a situation where data are available without an explicit, additional license from the rights holders (Landes & Posner, 1989). From a welfare economic perspective, copyright thus fights fire with fire: it mitigates one source of market failure (underprovision of public goods) with another (market power and underutilization of public goods).
Our empirical work is based on several related assumptions. First, DM is often conducted by researchers who are not the rights holders of all adequate data. Second, DM by academic researchers increases in the quantity and quality of supply of suitable data. Third, academic DM decreases in the costs of accessing relevant data. Fourth, effective copyright protection affects the supply of suitable data and/or the full economic costs of accessing data and conducting DM. We thus hypothesize that variations in relevant copyright policy between countries will affect the amount of DM by researchers residing in those countries. Because copyright has ambivalent effects on follow-up use of protected works, the direction of copyright's effect on DM is unclear at the outset. Tables 1 and 2 give an overview of variables used in this paper.
The Boolean searches on the Web of Science (WoS) database were defined by three simultaneous restrictions: (1) "data mining" entered in inverted commas in the field "Topic"; (2) a country name according to the format used on WoS in the field author's "Address," which relates to the country of residence of the first or main author; and (3) a year of publication in the field "Year Published." Search results were further restricted by ticking the option "Articles" in the user interface of WoS, so that results only contain academic journal articles rather than conference proceedings, book reviews, and the like. For each country and year, we recorded the number of different items in the WoS database that fulfill these search criteria. We also collected data on the total number of research articles published for the same set of countries and years to generate the variable "Research Output." Search parameters were the same as reported above, except that no "Topic" was specified. This brought up 23,802,650 articles for the entire panel. Over the 22 years covered, 0.77‰ of all articles had DM as a topic.

| DATA
In our empirical analysis, for each country and year, we used the ratio of "DM Output" and "Research Output" as the dependent variable, multiplied by 1000 to avoid dealing with very small fractional numbers. This variable is referred to as "DM Share." We thus mitigate one of the major problems in using bibliometric data to assess coun- Yearly scores for "DM Output" and "DM Share" have increased substantially since 1992. See Figures 1 and 2 for an illustration and Appendix A for an overview of the data by country.
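The construction of the dependent variable can be illustrated with a minimal sketch. The function name and the example counts below are hypothetical and are not taken from the paper's data; the sketch only assumes that "DM Share" is the ratio of "DM Output" to "Research Output" for one country-year, multiplied by 1000.

```python
def dm_share(dm_output: int, research_output: int) -> float:
    """Return DM articles per 1000 research articles (per mille).

    dm_output:       number of articles with "data mining" as a topic
    research_output: total number of articles for the same country and year
    """
    if research_output <= 0:
        raise ValueError("research_output must be positive")
    if dm_output < 0 or dm_output > research_output:
        raise ValueError("dm_output must lie between 0 and research_output")
    return 1000 * dm_output / research_output


# Hypothetical country-year: 45 DM articles out of 60,000 articles in total.
print(dm_share(dm_output=45, research_output=60_000))
```

Scaling by 1000 only changes the unit of the coefficients in later regressions (per mille instead of a raw fraction); it does not affect significance levels.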
We classify countries according to the type of copyright law that applies to DM, similar to an approach pioneered by Ginarte and Park (1997). See the Supporting Information for a detailed discussion.
We use two aspects of the copyright system: (1) whether copyright exceptions or limitations are in place that could apply to DM by academic researchers who have lawful access to potential input works and (2) whether there is relevant case law specifying the applicability of existing exceptions and limitations. Table 3 gives an overview of the four country categorizations from 1992 to 2014 according to DM-related copyright law. Table 4 presents descriptive data regarding "DM Share" in the copyright categories.
There is often a discrepancy between IP law and social practice, because IP is hard to enforce. 7 We therefore incorporate a "Rule of Law" indicator as reported by the Worldwide Governance Indicators (WGI) project (World Bank, 2015a, 2015b) and documented in Kaufmann et al. (1999) and Kaufmann et al. (2010). This indicator is defined as "the extent to which agents have confidence in and abide by the rules of society" (World Bank, 2015b), including the quality of contract enforcement and property rights. We use it as a proxy for the level of enforcement of quasi-property rights such as copyright. 8 We further use GDP per capita, population size, and broadband penetration as control variables. The raw data are available in a replication data set.
To prepare our econometric analysis, Figure 3 displays differences between the average "DM Share" of 23 countries in the "Consent Required" copyright category and 14 countries with more permissive copyright legislation (mostly "Probably Required" and "Probably not during any year for which data is available. The gray line represents 11 countries with "Rule of Law" scores lower than 1.2. Between 1996, when DM publications gradually became more numerous, and 2014, all countries from the "Consent Required" category display relatively low "DM Share" (−37% with "Rule of Law" > 1.2 and −28% with "Rule of Law" < 1.2). Since 2005, "Consent Required" countries with low "Rule of Law" seem to be catching up with countries from other copyright categories, and in 2014, their average "DM Share" was 8% below that in countries in different copyright categories. "Consent Required" countries with higher "Rule of Law" do not exhibit any consistent trend towards catching up. In 2014, the "DM Share" in "Consent Required" countries with a high rule of law was 37% lower than in countries in other copyright categories. This descriptive analysis provides some indication that DM by academic researchers is sensitive to observed variations in copyright law and that this effect is moderated by the rule of law within countries. We address these issues more systematically in the econometric analysis in Section 5.

| RESEARCH DESIGN
We adopt quasi-experimental research designs, with "DM Share" as dependent variable, the copyright category "Consent Required" as control group, and other copyright categories as treatments. There is no verifiable random assignment of treatments across our panel. 9 We thus construct several complementary quasi-experiments, each with its own strengths and weaknesses as a means to test for the effect of copyright on DM research (Meyer, 1995;Shadish et al., 2002). To mitigate challenges to validity, we also make use of control variables, multilevel models, interactions between independent variables, and difference-in-difference (DID) models exploiting the panel structure of our data. The specific quasi-experimental setups and their relative merits are discussed in Section 5.

| Multilevel regressions with the full copyright categorization
In a first quasi-experimental design, we use all four copyright categories with "Consent Required" as reference category/control group.
There are virtually no pretest observations, as only one territory switched from "Consent Required" to any other copyright category (England in 2014). Observations were excluded when a territory was nonclassifiable for any year. In the time period covered, eight countries switched between other copyright categories: six countries switched at various times from "Probably Required" to "Probably not Required"; Japan switched from "Probably Required" to "Not Required" in 2014 (see Table 3). Thus, bias due to self-selection and simultaneity is a concern in this first setup, but it does capture relatively much variation in copyright law. control variables, because of missing data for instance on "Rule of Law" and "Broadband." 10 As expected, the "Not Required" category rarely yields significant coefficients: it contains merely six observations and we report it only for completeness.
"Probably not Required" yields significant positive coefficients in Models 1a, 1b, 2a (p < .01), and 2b (p < .05). This suggests that a more permissive copyright framework is associated with more DM research.
"Probably Required" only yields significant coefficients at the .05 level in Models 1a and 2a, without random effects. Coefficients for "Probably not Required" are consistently larger than for "Probably Required." Results are in line with our ordinal categorization: there is a stronger and more reliably significant coefficient for the category that differs more from the reference category "Consent Required." These results suggest that DM share is lower in countries in the "Consent Required" category than in countries with more permissive DM-related copyright.
In Models 2a and 2b, the log-transformed total "Research Output" has a positive and significant coefficient. Countries with a high share of DM articles in total research output also tend to have larger total research output. There is no indication that DM would reduce incentives for other types of research within the same country. 11 EU membership captured by the variable "EU15" has no significant effect. Apparently results hold throughout the "Consent Required" category.

| Multilevel regressions with a binary copyright categorization
In a second setup, we use a binary distinction between "Consent Required" countries (control group) and countries in all other copyright categories, referred to as "Not Definitely Required," as a single treatment group. As in the first setup, there are no useful pretest observations, but there is a closer approximation of random assignment. The distinctive feature of countries in the "Consent Required" category is that there is a closed list of copyright exceptions or limitations that does not contain any provisions for DM. Thus, in these countries, DM without specific consent by rights holders is definitely in breach of copyright. All other countries have at least an open list of copyright exceptions and limitations that could apply to DM, see either. Furthermore, the dependent variable "DM Share" has a score of 0 for all countries any time before the period investigated, so that there is no concern with prior trends. The major challenges to validity in this setup are omitted variable bias and the crude categorization of copyright into two types only, which does not fully capture all relevant variations in the treatment on which information is available.
FIGURE 3 The difference in "DM Share" (‰) between "Consent Required" countries and countries in all other copyright categories, subject to "Rule of Law" (unclassifiable countries excluded)
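The recoding from the four-level categorization to the binary treatment dummy can be sketched as follows. The category labels are those used in the text; the function name and the 0/1 coding are illustrative assumptions rather than the authors' actual variable definitions.

```python
# The four ordinal copyright categories named in the text, from most to
# least restrictive with respect to DM by academic researchers.
CATEGORIES = (
    "Consent Required",
    "Probably Required",
    "Probably not Required",
    "Not Required",
)


def not_definitely_required(category: str) -> int:
    """Return 0 for the control group and 1 for the pooled treatment group.

    Assumed coding: "Consent Required" is the reference category (0);
    all other classifiable categories are pooled into one dummy (1).
    """
    if category not in CATEGORIES:
        # Unclassifiable country-years are excluded rather than coded.
        raise ValueError(f"unclassifiable category: {category!r}")
    return 0 if category == "Consent Required" else 1


for name in CATEGORIES:
    print(name, "->", not_definitely_required(name))
```

Pooling three categories into one dummy discards the ordinal information, which is the "crude categorization" trade-off noted above.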
In Table 6 coefficients for "Not Definitely Required" are consistently positive. Without random effects in Models 1a and 2a, "Not Definitely Required" yields significant positive coefficients (p < .01).
With random effects and thus better control for constant, unobserved country differences, Model 1b yields no significant effect of "Not Definitely Required." Model 2b with further controls but fewer observations yields a weakly significant coefficient for the same dummy variable (p < .1). These results suggest that there is a weak positive effect on "DM Share" when academic researchers are not definitely obliged to acquire specific consent of rights holders to conduct lawful DM. Results are not as conclusive as in Table 5. This may be due to the greater number of observations for the "Probably Required" category (222), which differs less from "Consent Required" and has no consistent effect according to results displayed in Table 5, than for the "Probably not Required" category (57) and "Not Required" (6), which differ more from "Consent Required." There could also be less bias due to self-selection and simultaneity in this quasi-experimental setup than in the results presented in Table 5. However, incorporating interactions between copyright categories and "Rule of Law" leads to a different result, as discussed in the following section.

| Multilevel regressions with interaction terms between copyright categories and "Rule of Law"
Among our control variables, of particular interest is "Rule of Law" as a proxy for the enforcement of and cultural propensity to adhere to legal norms. Greater rule of law should make copyright law more effective. To test for this, Table 7 includes a multiplicative interaction between copyright categories and the rule of law indicator. In these models, the coefficients of the variables that constitute the interaction (the categories of copyright regulation and "Rule of Law") are no longer to be interpreted as unconditional marginal effects. 13 The main coefficients of interest in these models are those for the interaction terms, which illustrate any moderating effect of "Rule of Law" on the association between "DM Share" and copyright categories. copyright categories are hardly significant. This suggests that where "Rule of Law" is 0, and thus lower than in most countries in our panel, copyright protection has no effect on "DM Share." However, as "Rule of Law" increases, countries in the "Consent Required" reference category exhibit a lower "DM Share." Overall, the results in Table 7 suggest that "Rule of Law" moderates the effect of restrictive copyright law. In particular the combination of strong copyright law and strong rule of law reduces academic researchers' DM performance.  Table 7, it plots the marginal effects (due to interaction) of "Rule of Law" on the coefficient for "DM Share" for countries in the "Not Definitely Required" copyright category with "Consent Required" as reference category. There is no significant difference for countries with low levels of "Rule of Law." With "Rule of Law" scores of about 0.6 and higher (just above the scores of Italy and Malaysia for much of the time period covered), countries in the "Not Definitely Required" category exhibit significantly higher "DM Share" than "Consent Required" countries.
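How a multiplicative interaction changes the reading of the coefficients can be made concrete with a short sketch. It assumes a simplified linear specification, DM Share = a + b_treat·T + b_rol·RoL + b_inter·(T × RoL), where T is the "Not Definitely Required" dummy. The coefficient values are hypothetical and are not estimates from Table 7; the actual models also include further controls and varying country intercepts.

```python
def conditional_effect(b_treat: float, b_inter: float, rule_of_law: float) -> float:
    """Effect of the treatment dummy on DM Share at a given Rule of Law score.

    With an interaction term, the coefficient b_treat alone is the effect
    only where rule_of_law == 0; elsewhere the effect is
    b_treat + b_inter * rule_of_law.
    """
    return b_treat + b_inter * rule_of_law


# Hypothetical coefficients, chosen only for illustration.
B_TREAT = -0.1   # effect of the dummy where "Rule of Law" equals 0
B_INTER = 0.5    # change in that effect per unit of "Rule of Law"

for score in (0.0, 0.6, 1.2, 1.8):
    effect = conditional_effect(B_TREAT, B_INTER, score)
    print(f"Rule of Law = {score:.1f}: conditional effect = {effect:+.2f}")
```

Whether the conditional effect is statistically distinguishable from zero at a given score also depends on the variances and covariance of the two coefficient estimates, which this sketch leaves out.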

| The effects of switching between copyright categories
In a fourth quasi-experimental setting, we document the effects of several switches (treatments) from "Probably Required" to "Probably not Required." Only this type of switch has occurred frequently enough for us to meaningfully address its consequences; see Table A2 for a list of the switching countries and some of their characteristics. 14 Table 8 reports the results for DID regressions with two dummies: "Switch-Yes" marking all full calendar years after a switch in copyright category and "Switcher-Yes" marking all countries that underwent the relevant switch at some point in time. We use two nonequivalent control groups to check whether results are consistent.
First, in Models 1a and 1b, we use all 13 countries that were initially in the "Probably Required" copyright category (except for Japan, which underwent a different type of switch). Second, in Models 2a and 2b, we use all 37 countries within our panel that can be classified into copyright categories (excluding Japan and England, which underwent other switches). With these two panels, we can isolate the effects of switching on "DM Share" controlling for (1) prior trends in countries that switched from "Probably Required" to "Probably not Required" over the period investigated; (2) pretreatment and posttreatment changes in nonswitching countries from the "Probably Required" category; and (3) changes in all countries for which data are available. In contrast to the other quasi-experimental setups reported on in Tables 5-7, here there are useful pretest observations. DID is a relatively effective means to mitigate challenges due to endogeneity.
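The logic of the DID comparison can be sketched with a two-group, two-period example. This is a deliberate simplification: the regressions in Table 8 use country-year panel data with switches at different dates and further controls, whereas the sketch below only compares group means. All numbers and names are hypothetical.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post) -> float:
    """Difference-in-differences of group means.

    (change among switchers) minus (change among nonswitchers), which
    nets out both the constant level difference between the groups and
    the trend common to both groups.
    """
    change_treated = mean(treated_post) - mean(treated_pre)
    change_control = mean(control_post) - mean(control_pre)
    return change_treated - change_control


# Hypothetical "DM Share" values (per mille) for country-years.
switchers_before = [0.4, 0.6]      # "Switcher-Yes" = 1, "Switch-Yes" = 0
switchers_after = [1.0, 1.2]       # "Switcher-Yes" = 1, "Switch-Yes" = 1
nonswitchers_before = [0.3, 0.5]   # "Switcher-Yes" = 0, earlier years
nonswitchers_after = [0.6, 0.8]    # "Switcher-Yes" = 0, later years

estimate = did_estimate(
    switchers_before, switchers_after, nonswitchers_before, nonswitchers_after
)
print(round(estimate, 3))
```

In this example the switchers start from a higher level than the nonswitchers, which corresponds to a positive "Switcher-Yes" coefficient; the DID estimate removes that level difference before attributing any change to the switch.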
The main independent variable of interest is "Switch-Yes," which yields significant and positive coefficients in all models (p < .01). (This also holds where Japan and England, which underwent other switches, are included.) Switches from "Probably Required" to "Probably not Required" are associated with greater growth in "DM Share." Furthermore, "DM Share" is consistently higher for countries that underwent this specific switch ("Switcher-Yes"; p < .01). This gives some indication of reversed causality: countries with higher "DM Share" have been more likely to switch to a copyright category with less obligation for researchers to attain specific rights holder consent for DM. Nevertheless, switching has a significant positive effect with this control. Figure 5 illustrates the effect of switching with all controls and 95% confidence intervals. Overall, there is strong evidence that the share of DM research in total research output increases where researchers do not need to acquire specific consent by rights holders.
TABLE 6 Regressions with "DM Share" as dependent variable and the binary copyright categorization between "Consent Required" and "Not Definitely Required"

| DISCUSSION
This paper documents an inverse association between copyright strength and DM uptake: countries in which academic researchers must acquire the express consent of rights holders to conduct lawful DM exhibit a lower share of DM research output in their total research output. That result transpires reasonably consistently across a number of complementary quasi-experiments. This implies that an application of copyright exceptions or limitations that establish the right to mine for academic researchers-if they have lawful access to input works and irrespective of explicit rights holder consent-boosts DM research. In this section, we discuss four potential challenges to the validity of this interpretation of our results.

| Measurement validity
We employ a plain method to identify relevant articles, and no more definitive measure of the number of DM publications is available for comparison. We further discuss the dependent variable and measurement validity in the Supporting Information. Among other things, we document there that "data mining" is the most popular and central term used for the research practice addressed in this paper and that there is no indication that other (combinations of) search terms would have improved the validity of our results in terms of measuring the trend in the number of relevant articles per country and copyright category.
TABLE 7 Regressions with "DM Share" as dependent variable and interactions between copyright and "Rule of Law"
FIGURE 4 Marginal effect of "Rule of Law" on the coefficient for the "Not Definitely Required" copyright category with "DM Share" as dependent variable (including 95% confidence intervals)
TABLE 8 Difference-in-difference (DID) regressions with "DM Share" as dependent variable regarding switches from "Probably Required" to "Probably not Required" (panels: all countries initially in the "Probably Required" category, except Japan; all countries, except England and Japan)
We have encountered three specific criticisms regarding measurement validity. First, articles to do with the definitive research methods of DM will not be identified in our data collection if they do not prominently feature the expression "data mining." However, for our regression results to validly reflect any association between DM and copyright, no perfect absolute measure of the number of articles concerned with DM practices is required. For our purpose it is sufficient that omission or exclusion error in our variable "DM Output" is reasonably constant at least across copyright categories, because we do control for constant country differences with "Research Output" and varying country intercepts. To be sure, a formal proof that this holds is not feasible.
Second, researchers in countries with restrictive copyright could have an incentive not to prominently signal that they conducted potentially copyright-infringing data collection practices by including the term "data mining" in the title, abstract, or key words. However, we can at least control for any constant propensity of "hiding" by including varying country intercepts.
Third, with our identification method, we can classify many articles at low cost, but we cannot distinguish between applications of DM and conceptual or methodological papers. Copyrights regarding potential input works can directly affect incentives to apply DM.
Copyright should have less of an effect on researchers' conceptual or methodological work on DM. If so, our data underestimates the effect of copyright on applied DM research. The inverse association between copyright strength and DM output would be more pronounced where only DM applications are concerned. Therefore, this concern constitutes no major challenge to the validity of our results.

| Endogeneity and omitted variable bias
As discussed in Sections 4 and 5, with our data, no single quasi-experimental design can provide entirely conclusive results, and our strategy is to construct several complementary quasi-experiments to mitigate that challenge. Our main results are clearly significant where we use relatively refined copyright categorizations (Table 5) and where we check for the consequences of relevant changes in copyright law (switches), including controls for self-selection and simultaneity (Table 8). Our results are weaker (p < .1 with all controls) where we exclude self-selection and simultaneity by classifying countries into a binary and virtually permanent distinction into "Consent Required" and "Not Definitely Required" countries according to relevant copyright law (Table 6). However, where we include the interaction between "Rule of Law" and copyright, we attain the reasonably clear result that the combination of the "Consent Required" copyright category with high "Rule of Law" scores (as observed in most EU Member States) leads to lower DM activity by academic researchers (Table 7).
With country panels, omitted variable bias is hard to exclude.
There are already challenges in the application of economic theory to specify determinants of research output and the adoption of new research technologies, because incentives for publishing academic articles are not shaped in conventional markets. What is more, many available indicators do not perfectly correspond to one specific theoretical determinant. Nevertheless, we do have good controls for the most important factors determining DM uptake by academic researchers. With our dependent variable "DM Share" (the quotient of the number of DM articles and the total number of articles published by authors from a country), we not only have an effective control for the resources available for domestic research and the productivity of domestic researchers (in terms of articles produced relative to research resources). With this derived dependent variable and varying country intercepts, we also have some control for constant, unobserved country differences in incentives for DM uptake, for instance, due to different compositions of research activities within countries that could affect the efficient scale and scope of DM. Furthermore, broadband penetration should be correlated with the costs and quality of ICT and the propensity and skills of residents to use digital ICT. Competition between researchers may be positively correlated with our control variable "Population" assuming that there is a disutility of researchers to relocate to another country. The availability of relevant input works irrespective of copyright should also be positively correlated with country size, assuming that researchers are more likely to use data on domestic phenomena.
FIGURE 5 Effects of switching according to the difference-in-difference (DID) analyses in Table 8 and with 95% confidence intervals
No satisfactory indicators are available on some potential determinants of "DM Share," for instance the costs of tradable DM inputs such as specialized ICT hardware and software and to some extent even labor. For these, we can only control for constant country differences, which is more effective for determinants that change slowly over time within countries. Our panel mostly consists of large, diversified economies, which makes substantial and sudden changes within reasonably populated copyright categories less probable.
Furthermore, the extensive integration of many of the economies studied here makes it improbable that prices of inputs would diverge very substantially over time between the treatment group(s) and the control group.
Other specific determinants for which we have no controls are the following. First, changes in the share of various academic disciplines in countries' research activities or academic cultures to do with technological innovation could affect "DM Share" but are unlikely to trend rapidly over time. Second, tastes and preferences of researchers regarding innovative research methods could be somewhat controlled for by data on the demographics of academic researchers. To the best of our knowledge, there is no such data available for a sufficient number of countries. Third, there are no data available on how DM would affect working conditions of researchers, except for the effect of greater productivity on career prospects and legal risks associated with copyright captured by our main predictor. Perhaps the most worrying omission is that no suitable data are available on targeted funding of academic DM within specific territories.
There is potential missing data bias, as data on all controls are only available for 55% of yearly observations from countries that could be classified according to their DM-related copyright law. However, due to the high degree of statistical significance and power of our main results, the probability is high that the main results hold in spite of any remaining omitted variable bias. Although the coefficients of determination in our regressions are in a respectable range, it is noteworthy that these are deflated, because we incorporate our main control variable "Research Output" in our dependent variable "DM Share."
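The construction of "DM Share" and the role of varying country intercepts can be illustrated with a minimal sketch. It assumes that the share is the plain quotient of DM articles and all articles per country-year, and that country intercepts are equivalent to subtracting each country's mean share (the within transformation). The function names, the country labels, and the counts are hypothetical and are not data from this paper; the units in which the paper reports "DM Share" may differ.

```python
from collections import defaultdict


def dm_share(dm_articles, total_articles):
    """Quotient of DM articles and all articles for one country-year."""
    if total_articles <= 0:
        raise ValueError("total_articles must be positive")
    return dm_articles / total_articles


def demean_by_country(rows):
    """Subtract each country's mean share (within transformation).

    rows: list of (country, year, share) tuples.
    """
    by_country = defaultdict(list)
    for country, _year, share in rows:
        by_country[country].append(share)
    means = {c: sum(v) / len(v) for c, v in by_country.items()}
    return [(c, y, s - means[c]) for c, y, s in rows]


# Hypothetical counts, for illustration only (not data from the paper):
# (country, year, DM articles, all articles)
counts = [
    ("A", 2000, 10, 1000),
    ("A", 2001, 30, 1000),
    ("B", 2000, 5, 2000),
    ("B", 2001, 15, 2000),
]

shares = [(c, y, dm_share(dm, total)) for c, y, dm, total in counts]
within = demean_by_country(shares)
```

After demeaning, only the variation within each country over time remains, which is why constant, unobserved country differences drop out, while determinants that change within countries do not.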

| Cross-country effects
According to the literature on patents, even where there is no positive effect of IP on domestic innovation, IP may still increase "technological transfer," that is, the influx of new ideas from other countries (Branstetter et al., 2006; Hall & Harhoff, 2012; Helpman, 1993; Jarvorcik, 2004). However, pure information goods suitable for DM are less excludable than patents and the underlying technologies.
Then, strong domestic copyright protection may inhibit transfer and use of input works into countries, whereas valuable data will be accessible in territories with less copyright protection. High protection countries may get the worst of both worlds: extensive unauthorized use of domestically produced data abroad and high costs of conducting DM domestically.
In talks with DM practitioners, we were even told that it is common practice to deliberately locate DM activities in territories with weak de facto copyright protection and to seek out suitable partners from such territories in international DM cooperations. Therefore, it is not clear whether a strong DM performance of some countries is self-sufficient or whether it is due to strategic decisions by researchers and/or free riding on data produced in other territories. To investigate this further requires a content analysis of DM-related research output.
In particular, future research should establish to what extent input works for DM research come from countries with more or less restrictive copyright regulation. This is beyond the scope of this paper.

| Generalizability
No database on research output is comprehensive in the sense that it would cover all valuable research output. Greater coverage is not even necessarily better. Publications in top journals are typically regarded as many times more valuable than publications in lower ranking journals. 20 There is no widely accepted metric of value that would allow for valid weighting of publications across all disciplines. 21 WoS is the most selective of the major research databases and provides the standard assessment of impact factors of journals (the Science Citation Index). 22 We rely on their inclusion criteria to cover a reasonably stable share of the most valuable total research output and DM research output per copyright category.
For the purpose of measuring countries' research output, WoS has no superior alternative (Burnham, 2006). Scopus is the only reputable alternative database covering virtually all academic disciplines (Falagas et al., 2008). 23 Scopus initially did not fully cover publications prior to 1996, however (Archambault et al., 2009). For publications after 1996, the coverage of Scopus and WoS overlap to a great extent, and Archambault et al. (2009)  There is a sizable literature on the extent of bias in WoS (as well as Scopus) due to uneven coverage across countries, languages, and disciplines. However, it is common practice for prestigious academic journals publishing in any language to include English abstracts. Thus, language bias should be weaker in our data than if we had assessed full articles with search terms in English. Furthermore, the evidence is that any bias in WoS has been rather stable over time for any period systematically investigated; see Mongeon and Paul-Hus (2016) for a recent summary of the literature. Therefore, varying country intercepts should provide a reasonable control for the combined effect of biased coverage in WoS and stable country characteristics.
Our results may be less valid regarding the humanities and some social sciences. On the one hand, in these disciplines, book publications often have a relatively greater weight in determining individual researchers' career prospects, and these are not covered in our data.
On the other hand, journals in these disciplines are relatively less likely to be included by WoS (or Scopus) (Mongeon & Paul-Hus, 2016).
Another aspect of this is that qualitative empirical research is not covered well by our data. The data in this article cover (quantitative) DM and not qualitative methods of text mining.

| CONCLUSIONS
DM is the topic of an increasing number of academic journal articles.
Copyright protection of data in EU Member States is relatively strong.
We document that this is associated with less DM research output by academic researchers.
DM research often draws on many input works to which others hold copyrights. In virtually all EU Member States, as well as a couple of other countries, there are no relevant exceptions or limitations to copyright, so that DM requires express consent of rights holders.
With this regulation, academic DM research has fallen behind developments in other territories. The benefits of allowing DM for all users, who have lawful access to data, seem to be greater than any adverse effects of weaker copyright protection on the creation of new input works for DM. Our results suggest that there has been market failure regarding the licensing of data for academic DM. Copyright does not appear to attain its ostensible goal of fostering innovation in this particular context. To our knowledge, this study is the first to empirically document an adverse net effect of IP on innovation, in the sense that there is strong evidence for stricter copyright hindering the wide adoption of novel ways to build on copyright works and generate derivative works.
As new technologies mature, early leadership can give rise to stable advantages, so that the stakes during formative years are high. For as long as DM continues to offer productivity increases in academic research, researchers in the EU and other territories with similar copyright law risk becoming less competitive because of greater copyright restrictions for this novel type of research.
Our results do provide a better evidence base for policy than has been available so far. The results of several complementary quasi-experiments presented in this paper are reasonably consistent.
Nevertheless, there is clearly scope for further research. For instance, the identification of DM output in this paper is efficient but plain. Furthermore, cross-country effects require further attention, as data produced in one territory may be analyzed elsewhere.
connotation regarding a type of malpractice in applied statistical data analysis (e.g., Sullivan et al., 2001; Feelders, 2002; Rockey & Temple, 2016). This is not the common use of the term and not what this paper is about. In a random sample of 250 DM-related articles from our corpus, none used the term exclusively in this sense; see the Supporting Information, Section C.1.1. Furthermore, DM is typically associated with data analysis of structured, quantitative, or nominal data, rather than text mining that concerns processing of unstructured/qualitative data and may be a preparatory step for DM in text and data mining (TDM) research projects.
Table 1.
11 The addition of "Research Output" in the model could raise multicollinearity issues with "GDP/capita" and "Population." However, the correlation between these variables and total research output is low in our data (.37 and .31, respectively) so that collinearity is unlikely.

ACKNOWLEDGMENTS
12 Simultaneity or reverse causation could bias results if an observed or expected relatively strong performance regarding domestic DM research (or lobbying by stakeholders) would have affected changes in national copyright legislation over the period studied.
13 For instance, the coefficient of the "Probably not Required" constitutive term (0.865) represents the effect of this type of copyright regulation only when rule of law is zero (about the level of Turkey and India in many years). In our panel, there are only four observations in the data with rule of law between −0.01 and 0.01 (there are no exact zero matches), which are South Africa in 1996, Argentina in 1997, and Brazil in 2010 and 2011.
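The reading of the constitutive term in note 13 can be written out as a worked equation. This is a sketch under assumed notation: $C_{it}$ is a dummy for the "Probably not Required" category, $R_{it}$ is the rule of law score, $\alpha_i$ is the country intercept, and $\beta_C$, $\beta_R$, and $\beta_{CR}$ are hypothetical labels for the respective coefficients. Only the value 0.865 for the constitutive term is taken from the text; the remaining regressors are abbreviated by dots.

```latex
\begin{align}
\text{DM Share}_{it} &= \alpha_i + \beta_C\, C_{it} + \beta_R\, R_{it}
  + \beta_{CR}\,(C_{it} \times R_{it}) + \dots + \varepsilon_{it} \\
\frac{\Delta\, \text{DM Share}_{it}}{\Delta C_{it}} &= \beta_C + \beta_{CR}\, R_{it},
  \qquad \text{which equals } \beta_C = 0.865 \text{ only at } R_{it} = 0 .
\end{align}
```

At any other rule of law score, the effect of the copyright category shifts by $\beta_{CR}$ times that score, which is why the constitutive term alone should not be read as an average effect.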
comprehensive list of scholarly/academic periodicals (Ulrich's periodical directory with over 60,000 journals), WoS and Scopus cover a greater share of the journals on natural sciences, engineering, and biomedical research than of those on the social sciences and humanities. It is unclear whether this is justified by consistent and adequate quality criteria.
20 WoS only covers journals that have continuously operated for several years, employ effective peer reviewing, and have been reasonably frequently cited in other high-quality research publications. By using WoS, we thus focus on those publications that have been deemed by authors, editors, and reviewers to be original and valuable enough to merit publication in reasonably prestigious periodicals. We regard that to be an advantage over alternative databases with lower quality thresholds.
21 Citation counts vary widely across disciplines and entail the problem that many very prestigious journals are only read/comprehensible for a small number of experts in a specific research area.
22 Researchers in many academic disciplines have strong incentives to publish their highest quality works in journals covered by WoS and preferably in the most reputable journals, with reputation hinging to a large extent on this impact factor.

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.
How to cite this article: Handke C, Guibault L, Vallbé J-J. Copyright's impact on data mining in academic research. Manage Decis

(Especially Taiwan exhibits extremely high scores for "DM Share" over later years. Due to missing data for this country on the variables "Rule of Law" and "Broadband," Taiwan is not included in any model with control variables.) It is a common assertion that developing countries have an interest in lower levels of intellectual property (IP) protection (Lorenczik & Newiak, 2012) and that this affects de jure or de facto variations between countries in IP protection (Eicher & Garcia-Penalosa, 2007). However, according to our data, highly developed countries also exhibit greater adoption of novel research methods with lower levels of copyright protection. The majority but not all countries that did switch had high "DM Share" compared with countries from the same initial copyright category or the global average.
(All switches were into copyright categories with fewer obligations for academic researchers to clear rights specifically for data mining, and six out of eight switches occurred from "Probably Required" to "Probably not Required.") This provides some indication that self-selection or simultaneity is a concern in particular when comparing the "Probably Required" category to the "Probably not Required" category.
Finally, the five countries that could not be classified according to our copyright categories also exhibit low "DM Share" scores on average (0.44), albeit with considerable variance.