Estimating outcomes in the presence of endogeneity and measurement error with an application to R&D 

We adopt a Bayesian econometric technique to address issues of endogeneity and measurement error when estimating outcomes, while also tackling censoring. We motivate our study with the theoretical framework laid out by Dasgupta and Stiglitz [1980], which highlights the endogeneity issue when investigating the relationship between market structure and innovation. We apply our method to estimate the R&D expenditures of Chinese manufacturing firms to highlight the importance of these econometric issues. Reduced-form results suggest a nonlinear relationship between market concentration and R&D expenditures, while our approach suggests a strictly positive relationship, consistent with canonical theoretical models built on oligopolistic competition.


Introduction
Estimating outcomes using ordinary least squares (OLS) in the presence of endogeneity and/or measurement error can result in biased and inaccurate findings. For example, industrial economists have long considered the relationship between incentives to innovate and market structure in trying to determine market conditions that encourage research and development (R&D) activity. Understanding this relationship is important as R&D by firms leads to innovation, from which companies can obtain economic rents. 1 Schumpeter [1942] hypothesized that large firms in concentrated markets are likely to innovate. Many researchers have since studied this relationship, now known as the Schumpeterian hypothesis. 2 One takeaway from this research is that there is still a need to account for issues of simultaneity between market structure and R&D. For example, Cohen [2010] noted "an area where this literature on the tie between R&D and firm size is relatively mute is the endogeneity of firm size with respect to R&D and innovation." Further, as mentioned by Aghion et al. [2005], accurately accounting for the number of competitors is also challenging and can lead to measurement error. Empiricists have documented (using Wu-Hausman tests) that, in such instances, the orthogonality conditions required for OLS estimation do not hold. 3 In a recent paper, Li et al. [2021] also mention the bias in OLS estimators. This bias can be exacerbated when using a sample of survey data to construct variables controlling for agglomeration or market concentration ratios, where least squares estimation may be inappropriate and richer econometric strategies are needed.
Hence, in this paper, we adopt an alternative approach based on econometric techniques developed by Schennach [2005, 2014] that addresses endogeneity and measurement error while paying attention to censoring issues (for example, when firms do not invest in any R&D efforts). Thus, our econometric contribution is twofold in the sense that we (i) deal with measurement error in critical variables of the model, and (ii) deal with endogeneity, in this case of rivals' R&D, market concentration, and the number of rival firms. We use a coherent and unified framework by combining Entropic Latent Variable Integration via Simulation (ELVIS), which deals with measurement error in nonlinear regressors, with Bayesian Exponentially Tilted Empirical Likelihood (BETEL), which deals generically with moment conditions using instruments; we organize these in a general formulation that can, in turn, be estimated using fast Markov Chain Monte Carlo (MCMC) methods.

1 R&D efforts can focus either on reducing the cost of producing a product (R&D related to process innovations) or on improving the end product, as well as on introducing new products (R&D related to product innovations), by building directly upon existing products or by introducing new varieties.
2 Cohen [2010] recapped empirical research that has considered hypotheses in the Schumpeterian tradition, updating his previous surveys (Cohen and Levin [1989] as well as Cohen [1995]). In addition, Gilbert [2006] surveyed literature specifically related to the Schumpeterian hypotheses around market structure and firm size, while Ahuja et al. [2008] surveyed related research in the management literature.
3 See, for example, Levin and Reiss [1984] as well as Levin et al. [1985].
Considering the literature on R&D, the common approaches researchers have adopted-recognizing the potential simultaneity between innovative efforts and market concentration-include instrumenting for concentration or estimating multi-equation models that treat concentration and R&D as endogenous. 4 Blundell et al. [1999] proposed an approach for addressing the endogeneity issue concerning market structure and R&D by using lagged variables in panel data. They found that concentrated industries produce fewer innovations though, within an industry, larger firms generate more innovations. One challenge with this approach is that researchers may have access to limited data (for example, a panel of data might span only a couple of years) which might preclude the ability to use lagged variables.
We apply our estimator to Chinese firm-level data for which a short panel exists so constructing lagged variables removes important observations. The focus of our research is on inputs or efforts-R&D expenditures-and not on output or productivity of R&D (for example, patents per dollar of R&D spending). Our data is firm-level and firms are assigned to industries at the four-digit level under the Standard Industrial Classification (SIC) codes. 5 Reduced-form (OLS) results suggest a nonlinear relationship between measures of market concentration and R&D expenditures. This finding is consistent with those of the literature, though we apply this approach to a new data set.
Estimating analogous models using our hybrid approach leads to similar estimates for a number of covariates that are plausibly exogenous. However, our hybrid Bayesian approach leads to very different results for the factors that we argue are endogenous or plagued by measurement error, like measures of market concentration. Specifically, our approach suggests a strictly positive relationship exists in our data: the more concentrated the industry, the higher the expenditures on R&D. A positive relationship is consistent with canonical theoretical models built on oligopolistic competition, such as that of Dasgupta and Stiglitz [1980], and offers support for Schumpeter's hypothesis. Further, we follow our estimation results by shutting down elements of our approach and by performing MCMC exercises which encourage confidence in our approach when issues of endogeneity and measurement error are present in the data.

4 As an example of the former, Levin et al. [1985] instrumented for market concentration (the four-firm concentration ratio) in estimating the effect of market structure on R&D intensity. For the latter, Levin and Reiss [1984, 1988] as well as Connolly and Hirschey [1984] used a four-equation system specifying relationships between profits, R&D, advertising, and concentration, allowing for nonlinearities.
5 As detailed in the survey by Cohen [2010], data collection on R&D efforts and output measures in the United States is often not disaggregated enough to consider. Instead, research has focused primarily on data from Canada and Europe (for example, Community Innovation Survey data).
Our hybrid approach addresses a number of issues: censoring, measurement error in explanatory variables, and endogeneity. The latter two challenges are particularly important as measurement errors can result not only from poor data but when variables are constructed using a subset of all firms while endogeneity is prevalent in similar applications. For example, one firm's R&D expenditures might depend on the R&D expenditures of rival firms. However, that aggregate value can also be plagued by measurement error if all firms are not observed or if firms are not correctly classified into industries. This issue could be exacerbated, especially if they participate in multiple industries but variable construction is built around only their primary industry.
We acknowledge that endogeneity can be dealt with in a generalized method of moments (GMM) framework using moment conditions. However, there is increasing evidence that GMM can behave erratically in finite samples (see the initial work of Hansen et al. [1996] for example) and its behavior is not ideal when instruments are weak or invalid. Hence, our hybrid approach that is based on the BETEL of Schennach [2005] can be thought of as a Bayesian version of GMM which addresses the shortcomings of a GMM strategy. In this framework, although we cannot entirely solve the general problem of weak or invalid instruments, we have, at least, a principled way to test these assumptions. In our application, we use MCMC methods to provide access to the posterior implied by moment conditions, as suitable instruments are available.
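To fix ideas, the inner optimization at the heart of the exponentially tilted empirical likelihood can be sketched in a few lines. This is a minimal illustration on a generic matrix of moment conditions evaluated at a candidate parameter, not the paper's implementation; the function name `etel_weights` and the toy mean restriction are ours.

```python
import numpy as np
from scipy.optimize import minimize

def etel_weights(g):
    """Exponential-tilting weights for a set of moment conditions.

    g : (n, m) array whose i-th row is g(z_i, theta) at a candidate theta.
    Solves the inner problem lambda* = argmin (1/n) sum_i exp(lambda' g_i)
    and returns p_i proportional to exp(lambda*' g_i); BETEL then treats
    prod_i p_i as a nonparametric likelihood for theta.
    """
    n, m = g.shape
    res = minimize(lambda lam: np.mean(np.exp(g @ lam)), np.zeros(m),
                   method="BFGS")
    w = np.exp(g @ res.x)
    return w / w.sum()

# Toy check: a mean restriction E[z - mu] = 0 for z ~ N(1, 1).
rng = np.random.default_rng(0)
z = rng.normal(1.0, 1.0, 500)
p_true = etel_weights((z - z.mean()).reshape(-1, 1))   # mu at the sample mean
p_wrong = etel_weights((z - 0.0).reshape(-1, 1))       # badly misspecified mu
```

At the sample mean the moment is already satisfied, so the weights are essentially uniform; at a misspecified value the tilting must distort the weights heavily, which is what penalizes such parameter values in the BETEL posterior.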
Measurement errors complicate the analysis considerably. There is no standard approach to the problem, though Lewbel [1997] proposed one solution in which functions of the data can be used as instruments in multiple-stage least-squares regression using higher moments of the data. We use the ELVIS method of Schennach [2014] and specialized MCMC algorithms based on Girolami and Calderhead [2011] to access the posterior of the model. ELVIS deals explicitly with measurement error problems and can be embedded into BETEL, allowing us to address these issues by employing an estimator (from the classical perspective) that has both good asymptotic and finite-sample properties.
Our paper is structured as follows: in Section 2, we appeal to a simple theoretical model which highlights the simultaneity between market structure and R&D investment choices. That model motivates our econometric concerns and allows us to demonstrate the inadequacy of OLS. In Section 3, we discuss the firm-level panel data, which we summarize, noting interesting R&D investment patterns and relationships. We initially set aside our econometric concerns by formally estimating empirical models in Section 4 using standard techniques, including the instrumental variables (IV) method, to establish baseline results. Our hybrid Bayesian estimation strategy, meant to address the concerns around censoring, measurement error, and simultaneity, is articulated and estimated in Section 5, where we find important differences in the main takeaways. In Section 6, we present simulation evidence that allows us to compare our approach with nested models, which include OLS as well as specifications that account for only measurement error or only endogeneity. Lastly, in Section 7, we summarize and conclude our research.

Conceptual framework
In this section, we simplify the classic Dasgupta and Stiglitz [1980] model. The first-order condition for output, letting s ≡ q/Q denote a firm's market share, can be transformed to yield the Lerner Index (LI),

(P − MC)/P = s/ε(Q),

where ε(⋅) is the price elasticity of demand. 6 In a model with identical firms, s* = 1/n*, so equation (3) can be rewritten as

(P − MC)/P = 1/(n* ε(Q*)),   with per-firm output q* = Q*/n*,

which makes clear that an increase in the number of firms reduces the output of each individual firm. However, the marginal benefit of R&D expenditures is proportional to the output level of an individual firm as reflected in condition (1). Thus, as the number of firms increases, each individual firm reduces output, which reduces the marginal benefit from R&D expenditures, meaning R&D spending falls for firms active in the industry. The model then suggests that individual firms in industries with more rivals will spend less on R&D; the lower the number of competitors (the more concentrated the market in our symmetric model), the more firms will individually spend on R&D.
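The comparative static driving this discussion can be illustrated numerically. The functional forms below are ours and purely illustrative (a unit-cost schedule c(x) = c0/(1 + x) and a unit marginal cost of R&D), not Dasgupta and Stiglitz's; the point is only that when per-firm output is Q/n, the first-order condition for cost-reducing R&D implies per-firm R&D falls as the number of rivals rises.

```python
import numpy as np

def optimal_rd(q, c0=2.0):
    """Per-firm R&D for a firm with output q (illustrative functional forms).

    The firm maximizes q * (c0 - c0 / (1 + x)) - x over R&D x; the
    first-order condition q * c0 / (1 + x)^2 = 1 gives
    x*(q) = sqrt(q * c0) - 1 whenever that is positive.
    """
    return max(np.sqrt(q * c0) - 1.0, 0.0)

# Symmetric industry: per-firm output is Q/n, so more rivals mean less
# output per firm and hence a smaller marginal benefit from R&D.
Q = 100.0
rd_by_n = {n: optimal_rd(Q / n) for n in (2, 5, 10, 50)}
```

Holding industry output Q fixed is a simplification; in the full model Q* also adjusts with n, but the per-firm mechanism above survives.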
Moreover, aggregating across firms and using zero-profit conditions means

n* x* / (P(Q*) Q*) = 1/(n* ε(Q*)),

which relates the amount of R&D spending in an industry (n* firms each spending x*) to the share of industry sales.
An important insight of this model is that industrial concentration and research intensity are simultaneously determined-measures of market concentration and R&D efforts are endogenous. Firm investment decisions are simultaneous in this model, which means including variables capturing rivals' R&D efforts in empirical work will imply the variables are correlated with error terms.

Data
In our application, we use yearly data from the Annual Survey of Industrial Firms (ASIF). 7 Lu and Tao [2009] noted that the upward trend in the number of manufacturing firms in the sample is due to the rapid growth in manufacturing sectors during the sample period, increasing the number of firms with annual sales exceeding the five million RMB threshold for inclusion in the ASIF dataset.
In addition to information on the total expenditures on R&D, the data include information on production activities (employment, capital, intermediate inputs, sales), balance sheet statements (current and total assets, liabilities, inventories, financing costs, taxes paid, operating costs, profits), and firm characteristics (industry classification for primary and secondary products, location, ownership type).
Moreover, the ASIF data contain firm-level trade data which allow us to distinguish exporters from non-exporters. We define the variables which we work with in Table A.1 and present pairwise correlations in Table A.2, both in Appendix A.
For each firm in the ASIF data set, the location information details its address as well as the name of the city, district, and province where it is located. Naturally, more precise information on a firm's location allows for more accurate construction of agglomeration and other geography-based variables; see, for example, Rosenthal and Strange [2003]. We define geographic units at the 2010 zip-code level, which is matched with each firm address in the data and identifies 31,046 areas where manufacturing firms are located. 8 We also have information on each firm's primary industry code at the four-digit SIC level, for which we see 525 different industries represented in our data. This is important as empirical researchers in this literature have found that industry fixed effects explain a large share of variance in R&D-related dependent variables; for example, see Scherer [1967] and Wilson [1977]. Geroski [1990] even found that his fundamental result of a positive relationship between competition and innovation was reversed if industry effects were not included.

7 These data were also used by Bai et al. [2006], Cai and Liu [2009], as well as Lu and Tao [2009].
In our approach, we go beyond this by controlling not just for industry effects but, because we have a panel of data, for firm-level fixed effects. Moreover, for a given firm, we construct industrial agglomeration measures not only at the zip-code level, but within a given SIC code for each zip code in a given year. Note, however, that while this is an important link between theoretical and empirical work, the potential for measurement error is now much greater for two reasons: (i) variable construction is at the industry level; and (ii) only a subset of potential rivals is observed. The former stems from firms being classified in the data as participating in a primary industry, which is certainly convenient for empirical work, but difficult to think about for multi-product firms, which are then omitted from other industries in which they may be active. 9 The latter stems from smaller firms not being included in the data, and hence never being included in variables concerning rivals' behavior that are computed from the raw data.
We view both of these issues as important sources of measurement error.
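As a concrete sketch of this variable construction, the two rivals' measures can be built by grouping firms into industry-location-year cells. The column names below are hypothetical stand-ins for the ASIF fields; note how both measures mechanically depend on which firms appear in the survey, which is exactly the measurement-error concern just raised.

```python
import pandas as pd

# Hypothetical firm-year records; `sic4` and `zip_code` stand in for the
# four-digit industry and 2010 zip-code identifiers in the ASIF data.
df = pd.DataFrame({
    "firm":     ["A", "B", "C", "D", "E"],
    "year":     [2005, 2005, 2005, 2005, 2005],
    "sic4":     ["2611", "2611", "2611", "3312", "3312"],
    "zip_code": ["100080", "100080", "100080", "100080", "200030"],
    "rd":       [10.0, 0.0, 5.0, 8.0, 2.0],
})

g = df.groupby(["sic4", "zip_code", "year"])["rd"]
size = g.transform("size")
total = g.transform("sum")

# Rivals are the *other* firms in the same industry-location-year cell;
# both measures are computed only from firms observed in the survey.
df["n_rivals"] = size - 1
df["rivals_avg_rd"] = (total - df["rd"]) / (size - 1).where(size > 1)
```

Firm A has two observed rivals (B and C) with average R&D of 2.5, while firm D has no observed rivals in its cell, so its rivals' average is undefined rather than zero.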
In Figure 1 we map total R&D expenditures within each district in China in 2005 (Panel A) and 2007 (Panel B), respectively. Comparing the two maps demonstrates substantial growth in innovative efforts across these two years in our data.
Both figures make clear that the bulk of R&D investments occur along the coast and eastern parts of China, consistent with where most of the economic activity occurs. The maps demonstrate both increased efforts within a number of districts that were already active in R&D spending, as well as the proliferation of these efforts to new districts which see R&D investments from their constituent firms from 2005-2007. Still, a number of districts, particularly in central and western China, see little R&D activity; these districts are often geographically much larger, but house few manufacturing firms. In Figure 2, we plot the locations of all firms in our data which meet the requirements discussed previously and are used in our forthcoming analysis.

8 Due to growth in China, its administrative boundaries of cities, zip codes, counties, and even provinces have changed in the last thirty years. We geo-code each address at the 2010 zip-code level, which is when boundaries are most disaggregated. Fixing these geographic units maintains the same area definitions throughout our analysis.
9 One element of a dataset that could help address this concern from the raw data would be R&D expenditures that are disaggregated and somehow associated with the various products a firm produces, so that R&D investments could be "allocated" towards the different industries. This is not the case in our data nor in most R&D-based datasets with which we are familiar.
In Table 1, we report summary statistics for important variables in our data.
Specifically, we report the average and standard deviation (the latter below, in parentheses) for these variables. The LI is firm- and time-specific and has several advantages over indicators such as market share, firm concentration ratios, or the Herfindahl-Hirschman Index (HHI).
These other measures rely more directly on precise definitions of geographic and product markets (Aghion et al. [2005]). While we know the SIC code that a firm operates in (allowing us to understand its output market), it can be difficult to define a specific geographic area relevant for its competition, as more than 23% of firms operate in international markets (the dummy variable Exporter takes a value of one if the firm exports product outside of China, and zero otherwise). One challenge with constructing the LI via (9) is that some firms report negative operating profits in a given year. In Table 1, we provide summary statistics both for the relevant sample given the considerations previously noted, and for the restricted sample in which we only consider observations with nonnegative operating profits, so that the LI measure is properly characterized.
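A minimal sketch of the LI construction, assuming the common implementation of the price-cost margin as operating profit over sales (the paper's exact equation is not reproduced in this excerpt); observations with negative operating profit are set to missing, mirroring the restricted sample in Table 1.

```python
import numpy as np

def lerner_index(operating_profit, sales):
    """Firm-level price-cost margin proxy: operating profit over sales.

    Assumed implementation for illustration only. Observations with
    negative operating profit (or nonpositive sales) are returned as
    NaN so that the LI stays properly characterized on [0, 1].
    """
    op = np.asarray(operating_profit, dtype=float)
    s = np.asarray(sales, dtype=float)
    return np.where((op >= 0) & (s > 0), op / s, np.nan)

li = lerner_index([5.0, -2.0, 40.0], [100.0, 50.0, 80.0])
```

The second firm reports a negative operating profit, so its LI is dropped rather than recorded as a negative margin.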
In Figure 3, we plot the log of R&D expenditures (using the right-hand axis) against the LI measure. The LI, which is bound between zero and one, takes on an average value of 0.202 and the median value is 0.046 conveying that many markets are quite competitive, though the average is being brought up by a few very concentrated industries. This observation is made clear by a second depiction within this figure which plots a histogram over the LI measure to give a better sense of this distribution (using the left-hand axis). Given there is substantial mass near the boundaries, the confidence interval on the log of R&D expenditures is widest in the middle. The inverted-U shape depicted in Figure 3 is consistent with what others in the literature have found (see, for example, Levin et al. [1985]) and matches the trend that Aghion et al. [2005] observed at the industry level. The inverted-U relationship suggests that R&D expenditures are highest in modestly concentrated industries-on average, there are 2-3 firms in industries where the inverted-U obtains its maximum suggesting these industries can best be characterized by oligopoly.
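The inverted-U reading of Figure 3 can be replicated mechanically: fit a cubic polynomial of log R&D on the LI and locate its interior peak. The data below are simulated by us purely for illustration, not the ASIF data; the simulated peak location is arbitrary.

```python
import numpy as np

# Simulated stand-in for Figure 3: an inverted-U relationship between
# the LI (on [0, 1]) and log R&D, with an interior maximum.
rng = np.random.default_rng(1)
li = rng.uniform(0.0, 1.0, 2000)
log_rd = 2.0 + 6.0 * li - 7.5 * li**2 + rng.normal(0.0, 0.3, li.size)

# Fit a cubic polynomial (as in the empirical specifications below)
# and locate its maximum on a grid over the unit interval.
coefs = np.polyfit(li, log_rd, deg=3)
grid = np.linspace(0.0, 1.0, 1001)
peak = grid[np.argmax(np.polyval(coefs, grid))]
```

With these simulated coefficients the true peak sits at LI = 0.4, and the fitted cubic recovers a nearby interior maximum rather than a corner.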
Given the prevalence of firms with much lower LI, perhaps unsurprisingly, firms in our data compete in industries with higher competition-a firm faces on average about 5.8 rivals in our data. These firms have about 200 employees and have been in business for over eight years on average. About a quarter of all firms produce two outputs, but less than 10% produce three or more different products. In our analysis, we consider the firm to compete in the industry that corresponds with the four-digit SIC code of its primary output. 10 With these data in mind, we apply empirical models that are consistent in spirit with past work in this literature and discuss estimation results in the next section. This allows us to establish some benchmark findings which can be contrasted with results from our hybrid Bayesian estimation strategy which we offer in Section 5.

Descriptive Regression Results
Given the trends presented in the previous section, we seek to evaluate the relationship between market power, competition, and R&D efforts as measured by expenditures.
10 In considering Chinese firms, one may wonder about the ownership of these enterprises. About 94% of firms in our data have no governmental ownership; the Chinese government has a minority stake in 1.2% of firms, and a majority stake in 4.8% of firms. We account for this structure in our empirical work going forward.
In these specifications, we recognize that the R&D investments of a given firm may depend on factors that are firm-, industry-, location-, and time-specific. We derived the LI from our theoretical model in equation (3) and presented its empirical counterpart in (6). We proxy for competition by counting the number of firms operating in the same industry, in the same geographic location (zip code), during the same year. We model the effects of the LI and of the number of rivals as cubic polynomials of their respective arguments to allow for richer relationships, given the inverted-U pattern observed in Figure 3. 11 As suggested by the theoretical model presented earlier, we also consider that firms may be investing in R&D strategically in equilibrium, so these decisions may depend on the R&D investments of rivals: firms operating in the same industry as a given firm, within the same location, at the same time. Industry is specified at the four-digit SIC code, location corresponds to a zip code, and time to a year in the data.
The theory presented highlights that there are endogeneity concerns with respect to the regressors related to the Lerner Index (LI), number of rivals, and rivals' average R&D. These variables are simultaneously determined with a firm's own R&D expenditures in the motivating theoretical structure.
Beyond this, we include covariates that are either specific to a firm and observed to vary over time (for example, the number of employees at a firm, age of the firm, exporter status, type of ownership, and whether the firm is a multi-product firm) or specific to a given location at a point in time (for example, the number of national and provincial universities in a region). 12 These variables are exogenous regressors potentially important in explaining variance in firms' own R&D choices. In our models we also include firm and time fixed effects. The inclusion of firm fixed effects implies that identification of the model is driven by changes within a firm across years in the data. Since each firm is assigned to only one industry (that of its primary output) and industry does not change over our sample, these firm effects capture industry effects, which are critical to account for.
The error term is independently distributed and is firm-, rivals-, location-, and time-specific, reflecting influences on (log) R&D from factors not included in our model. 13 All of these variables are included in all models we estimate.
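Computationally, including firm and time fixed effects is equivalent to a within transformation. The sketch below is our own illustration: it iteratively removes firm and year means, after which OLS on the demeaned variables reproduces the two-way fixed-effects estimates, with identification coming from within-firm variation across years as described above.

```python
import numpy as np

def two_way_demean(x, firm_id, year_id, iters=50):
    """Iteratively remove firm and year means (within transformation).

    For a balanced panel this converges quickly; the result has (near)
    zero mean within every firm and within every year, which is what
    including both sets of fixed-effect dummies accomplishes.
    """
    x = np.asarray(x, dtype=float).copy()
    for _ in range(iters):
        for ids in (firm_id, year_id):
            for g in np.unique(ids):
                mask = ids == g
                x[mask] -= x[mask].mean()
    return x

# Toy balanced panel: 10 firms observed for 5 years each.
rng = np.random.default_rng(6)
firm = np.repeat(np.arange(10), 5)
year = np.tile(np.arange(5), 10)
xd = two_way_demean(rng.normal(size=50), firm, year)
```

In practice one would demean the dependent variable and every regressor the same way before running OLS; packages automate this, but the transformation itself is no more than the loop above.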
We present baseline estimates in Table 2, which provides OLS coefficient estimates from six different specifications nested by the empirical model presented in equation (7). In the first column, we focus primarily on the relationship between LI and R&D expenditures, controlling for a firm's rivals' spending as well as the firm's size, age, and exporter status. Given we represent the LI by a polynomial, interpreting a given coefficient directly is somewhat difficult. As such, we depict the estimates from column (6) in Figure 4, Panel A which confirms a nonlinear relationship with an interior peak.
Given our log-log specification concerning rivals' average R&D expenditures, that coefficient can be interpreted as an elasticity, which suggests that if the average of a firm's rivals' R&D spending increases by 1%, a firm's own R&D investments would increase by 6.6% in column (1). The larger a firm, as measured by the number of employees, the more it invests in R&D: a 1% increase in employment corresponds with an 11.1% increase in R&D expenditures. Age is not significant, but being an exporter means R&D spending is nearly 10% higher than for non-exporters.
In column (2), we present estimates from a model in which we omit our LI measure, but include a cubic relationship concerning the logarithm of the number of rivals in the market as a measure of competition. Plotting the estimates shows the relationship between the number of rivals and the logarithm of a firm's R&D spending is nearly linear: as the number of rivals increases by one firm, log(R&D) spending decreases by 0.0012. The effects of other covariates are nearly identical to those detailed above. In fact, the estimates in column (3) suggest inclusion of both our LI measure and the competition measure with the same set of covariates leaves our estimates, and their statistical significance, essentially unchanged. Pushing harder on this, if we include controls for multi-product firms and the number of higher education schools in a zip code (as a proxy for how many researchers may live in the area), all results remain, as reflected by the estimates in column (4). Given we consider Chinese data in which the government occasionally holds an ownership stake in some firms, we wanted to make sure that our results are robust to this structure. In column (5), we consider a model in which we include directly the share of government ownership in the firm. That covariate is not significant and our earlier discussion continues to apply.
Lastly, in column (6), we include two dummy variables to capture whether the government is a minority or majority owner (with no government ownership, which represents the case for 94% of the firms in our data, as the omitted category). While there is weak evidence that when the government owns a minority stake in the firm, there is an increase in R&D expenditures by 8.7% (significant at the 10% level), all of the other effects (those of which we are primarily interested in) remain unchanged.
As our simple theoretical model based on Dasgupta and Stiglitz [1980] showed, R&D and market structure are determined simultaneously in equilibrium. Our reduced-form approach does not address these endogeneity concerns, as covariates are likely to be correlated with the error term in equation (7) due to strategic behavior, as our model presented earlier highlighted. Additionally, variables like the LI and the number of employees are likely to be endogenous. A common approach to addressing endogeneity would be to adopt an IV regression model. To gauge the importance of endogeneity within a comparable framework, we instrument for market concentration, competition, and rivals' R&D.
The set of instruments that we use can be partitioned into two categories: those that are included and those that are excluded. The exogenous regressors used in the reduced-form model (firm's age, status as an exporter, variables capturing the number of products the firm manufactures, its ownership structure, as well as the variables accounting for the number of regional and national universities in the same area, along with firm fixed effects) are considered to be included instruments. When it comes to excluded instruments, practitioners might use something like lagged versions of the independent variables (e.g., Arellano-Bond, which is not ideal given our short panel), or introduce new variables that can serve as instruments. For example, in our application, one might consider local/county tax rates or the distance to the nearest port. We estimate an IV regression that corresponds to model (6) using GMM and present the coefficient results in column (7) of Table 2. 14 Additionally, for easier comparison to the OLS estimates, Figure 4A depicts the estimated polynomials of the LI on the log of R&D expenditures under both approaches. The polynomial again suggests a nonlinear relationship between the measure of market concentration and R&D expenditures, though many of the coefficients themselves lose significance in the IV framework. Regardless, our primary point is that accounting for endogeneity seems to suggest very different relationships between market structure and R&D expenditures.
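As a concrete reference point for the IV column, the 2SLS estimator (which coincides with GMM under a particular weighting matrix in the just-identified case) can be written in a few lines. The data are simulated with a single endogenous regressor and one generic excluded instrument; this is our illustration, not one of the candidate instruments (tax rates, distance to port) named above.

```python
import numpy as np

def two_sls(y, X, Z):
    """Two-stage least squares: instrument the columns of X with Z.

    Z stacks included (exogenous) and excluded instruments. The first
    stage projects X onto Z; the second stage regresses y on the fitted
    values, purging the endogenous variation.
    """
    Xh = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # first-stage fitted values
    return np.linalg.solve(Xh.T @ X, Xh.T @ y)   # second-stage coefficients

# Simulated endogeneity: x is correlated with the error u, z is not.
rng = np.random.default_rng(2)
n = 5000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + 0.6 * u + rng.normal(size=n)       # endogenous regressor
y = 1.0 + 2.0 * x + u                            # true slope is 2.0
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_iv = two_sls(y, X, Z)
```

On this design OLS is biased away from the true slope (the regressor loads on the error), while the IV estimate sits near it, which is the qualitative pattern a Wu-Hausman comparison would detect.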
Of course, our reduced-form approach is in and of itself somewhat misspecified given the potential for fitted values to be negative. A strategy to deal with this in a reduced-form spirit would be to consider censored regression models, such as a Tobit-like model in which the latent outcome y_i^* = x_i'β + ε_i is observed as y_i = max(0, y_i^*). Alternatively, a reduced-form approach might employ something like the Poisson pseudo-maximum likelihood (PPML) approach suggested by Santos Silva and Tenreyro [2006], who apply the estimator to trade flow data. We opted to use firm fixed effects in these specifications, which would be computationally difficult when considering PPML or a censored specification. Regardless, we recognize the censoring issue and address it in our Bayesian strategy. 15 In addition, R&D expenditures of rival firms is, clearly, a noisy variable, perhaps measured with error given we do not observe all rival firms (only firms with annual sales of at least five million RMB). The measurement error problem is compounded by the fact that the specification we (and the literature) consider is nonlinear in the variables. Our main contribution is to adopt an alternative estimation strategy based on recent advancements in econometrics aimed at dealing with measurement error and endogenous regressors. We apply this alternative and estimate a corresponding model in the next section, which can be compared with the baseline regression results from Table 2.
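For completeness, a Tobit-like censored regression of the kind just described can be estimated by maximum likelihood. The sketch below is a generic implementation on simulated data, not the paper's specification (which, with firm fixed effects, would be computationally harder, as noted).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, y, X):
    """Negative log-likelihood of a Tobit model censored at zero.

    Censored observations contribute P(y* <= 0); uncensored ones
    contribute the normal density of the observed outcome.
    """
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta
    ll = np.where(
        y <= 0,
        norm.logcdf(-xb / sigma),                   # censored at zero
        norm.logpdf((y - xb) / sigma) - log_sigma,  # observed outcome
    )
    return -ll.sum()

# Simulated censored data with known coefficients (0.5, 1.5), sigma = 1.
rng = np.random.default_rng(3)
n = 4000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.maximum(X @ np.array([0.5, 1.5]) + rng.normal(size=n), 0.0)
res = minimize(tobit_negloglik, np.zeros(3), args=(y, X), method="BFGS")
beta_hat = res.x[:2]
```

Unlike OLS on the censored outcome, the Tobit MLE recovers the latent-index coefficients even though a sizable share of observations is piled up at zero.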

Bayesian Estimation Strategy
In this section, we adopt a framework which integrates a strategy for dealing with measurement error concerns (ELVIS) with a Bayesian version of GMM (BETEL) to deal with endogeneity issues. To take account of measurement error in nonlinear variables (like the polynomials involving the LI and rivals' R&D, as well as the number of rivals) which also have endogeneity issues, we adapt two strategies proposed by Schennach [2005, 2014]. ELVIS is a general method to convert a model defined by moment conditions that involve both observed and unobserved variables into equivalent moment conditions that involve only observable variables, allowing us to address measurement error concerns. BETEL takes these moment conditions, combined with instruments, and establishes the correct Bayesian posterior to be used with them.
As such, our contribution is twofold, in the sense that we (i) deal with measurement error in critical variables of the model, 16 and (ii) deal with endogeneity of rivals' R&D. We use a coherent and unified framework organized by combining ELVIS with BETEL in a general formulation which, in turn, can be estimated using fast MCMC-based methods. Remember, we must also address the challenge that our dependent variable is censored. To understand how our integrated framework tackles each of these issues, we discuss each of them in more detail to provide an overview of our empirical strategy. We expand on this summary in Appendix B, where we provide technical characterizations of ELVIS and BETEL.

15 All our Bayesian techniques go through using the likelihood function of (8) and flat priors for the parameters. We take account of measurement error in the non-zero observations, but we assume the left-censoring values at zero are correct. This assumption is not restrictive, as a zero means the firm did not disclose information which, in effect, means that these variables are truly zero.
16 For example, measurement error may exist in the raw data or in the computation of the number of unique industries, the number of unique firms, the R&D data itself, operating profit, financial cost, sales, the LI, the number of rivals or rivals' R&D, the number of employees, or the age of the firm.
Step 1: The model in (8) is a censored regression model of the form $y_i^* = x_i'\beta + \varepsilon_i$, $\varepsilon_i \mid x_i \sim \text{i.i.d. } N(0, \sigma^2)$, $i = 1, \ldots, n$, where, in the interest of simplicity, we have omitted multiple subscripts and nonlinearities. In this expression, $y_i = \max(0, y_i^*)$, $x_i$ is a vector of regressors (which we assume, momentarily, are not measured with error), $\beta$ is a parameter vector, and $\varepsilon_i$ is an error term, normally distributed with zero mean and standard deviation $\sigma$. Here, $y_i$ is observed and, for simplicity but without loss of generality, $y_i > 0$ when R&D is observed (i.e., log R&D is positive). Suppose, again without loss of generality, that $y^* = [y_1, \ldots, y_m, y_{m+1}^*, \ldots, y_n^*]'$ so that the first $m < n$ observations contain data for which R&D is available.
Given $\beta$, $\sigma$, and $y$, the missing values $y_i^*$ can be generated from a normal distribution $N(x_i'\beta, \sigma^2)$, truncated so that the outcome is negative. This characterization corresponds to the Gibbs sampler data augmentation scheme (see, for example, Chib [1992]). The Gibbs sampler draws successively from the so-called full conditional posterior distributions $\beta \mid \sigma, y^*$, $\sigma \mid \beta, y^*$, and, of course, $y^* \mid \beta, \sigma$. This generates an MCMC sample that converges to the posterior of the model. So, the crucial step of censoring can be easily dealt with using Gibbs sampling with data augmentation.
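The data augmentation scheme in Step 1 can be sketched in a few lines. The following is a minimal illustration, not the Fortran 77 implementation used in the paper: a Gibbs sampler for a regression left-censored at zero under flat priors, with hypothetical names (`tobit_gibbs`, `y`, `X`).

```python
import numpy as np
from scipy.stats import truncnorm

def tobit_gibbs(y, X, n_iter=2000, burn=500, seed=0):
    """Gibbs sampler with data augmentation for a regression left-censored
    at zero, under flat priors. Returns draws of (beta, sigma^2)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    cens = y <= 0                               # censored observations
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y                    # OLS starting value
    sig2 = 1.0
    y_star = y.astype(float).copy()
    draws = []
    for it in range(n_iter):
        # 1) impute latent outcomes: N(x'beta, sig2) truncated above at 0
        mu_c = X[cens] @ beta
        sd = np.sqrt(sig2)
        y_star[cens] = truncnorm.rvs(-np.inf, (0.0 - mu_c) / sd,
                                     loc=mu_c, scale=sd, random_state=rng)
        # 2) beta | y*, sig2 ~ N((X'X)^{-1} X'y*, sig2 (X'X)^{-1})
        beta = rng.multivariate_normal(XtX_inv @ X.T @ y_star, sig2 * XtX_inv)
        # 3) sig2 | beta, y* ~ InvGamma(n/2, SSR/2) under a flat prior
        ssr = np.sum((y_star - X @ beta) ** 2)
        sig2 = 1.0 / rng.gamma(n / 2.0, 2.0 / ssr)
        if it >= burn:
            draws.append(np.r_[beta, sig2])
    return np.array(draws)
```

On simulated Tobit data, the post-burn-in posterior means recover the generating coefficients.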
Step 2: As some regressors may be measured with error, we write $y_i^* = x_i^{*\prime}\beta_1 + w_i'\beta_2 + \varepsilon_i$, $\varepsilon_i \mid w_i \sim \text{i.i.d. } N(0, \sigma^2)$, $i = 1, \ldots, n$. The model can be expressed through moment conditions $E[g(x^*, w, \theta)] = 0$, where $g$ is a $q$-dimensional vector of nonlinear measurable functions depending on the parameter $\theta \in \Theta \subseteq \Re^p$. In this general notation, the unobserved random vector $x^*$ takes values in $\mathcal{X}^* \subseteq \Re^{d_{x^*}}$ and the observed random vector $w$ takes values in $\mathcal{W} \subseteq \Re^{d_w}$.
Intuitively, the unobservables can be eliminated from the moment condition by averaging the function $g(x^*, w, \theta)$ over $x^*$. ELVIS is a method to convert a model defined by moment conditions that involve both observed and unobserved variables into equivalent moment conditions that involve only observable variables by "integrating out" the unobservables.
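To illustrate the "integrating out" idea in the simplest possible setting (this is not ELVIS itself, which constructs the required least-favorable density internally and does not assume the noise variance is known), consider estimating $E[x^{*2}]$ when only a contaminated $w = x^* + u$ is observed, with $u \sim N(0, \tau^2)$ and $\tau$ known. The infeasible moment in the unobservable can be replaced by an equivalent moment in observables alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 50_000, 0.5
x_star = rng.normal(1.0, 1.0, size=n)        # latent, error-free regressor
w = x_star + rng.normal(0.0, tau, size=n)    # observed, contaminated version

# Infeasible moment involving the unobservable: E[x*^2 - theta] = 0
theta_infeasible = np.mean(x_star ** 2)

# Naive moment ignoring measurement error: E[w^2 - theta] = 0
theta_naive = np.mean(w ** 2)                # biased upward by tau^2

# Equivalent moment in observables only: E[w^2 - tau^2 - theta] = 0
theta_corrected = np.mean(w ** 2) - tau ** 2
```

Here `theta_corrected` recovers $E[x^{*2}]$ up to sampling noise, while `theta_naive` is inflated by exactly $\tau^2$: the corrected moment involves only observables yet has the same expectation as the infeasible one.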
In our case, the function $g(\cdot)$ from (12) takes the form given in Schennach [2014]. The only differences relative to Schennach [2014] are that (i) we have additional, plausibly exogenous variables that can act as instruments, and (ii) these instruments are orthogonal to the measurement errors in the contaminated regressors.
The key is that ELVIS delivers a new, equivalent set of moment conditions based on observables alone, say $\tilde g(y, w, \theta) = 0$. Technical details of this ELVIS procedure are explained in Appendix B.1.
Step 3: Given the new moment conditions $\tilde g(y, w, \theta) = 0$, one can use a number of different empirical strategies. One approach to estimating such models would be to employ standard GMM; we instead use BETEL, which delivers a valid Bayesian posterior based on these moment conditions.

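A minimal sketch of the exponentially tilted empirical likelihood (ETEL) object that BETEL builds on, assuming the moment functions are supplied as an (n, q) array; the function name `log_etel` and the plain BFGS inner solver are illustrative choices, and the overflow safeguards of a serious implementation are omitted:

```python
import numpy as np
from scipy.optimize import minimize

def log_etel(theta, g):
    """Log exponentially tilted empirical likelihood at theta.
    g(theta) must return an (n, q) array of moment evaluations."""
    G = g(theta)
    n, q = G.shape
    # Inner (dual) problem: minimize (1/n) sum_i exp(lambda' g_i(theta))
    res = minimize(lambda lam: np.mean(np.exp(G @ lam)),
                   np.zeros(q), method="BFGS")
    w = np.exp(G @ res.x)
    p = w / w.sum()          # exponentially tilted probabilities
    return np.sum(np.log(p))
```

A BETEL posterior would combine `exp(log_etel(theta, g))` with a prior inside an MCMC loop. With the single moment $g_i(\theta) = x_i - \theta$, `log_etel` is maximized at the sample mean, where the tilting parameter is zero and all probabilities equal $1/n$.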
Bayesian Empirical Results
We implement our estimation strategy in Fortran 77 using routines from netlib and GNU libraries. We estimate models of the form provided in equation (7) under the specifications in Table 2; we provide analogous results in Table 4. Comparing the coefficient estimates in Tables 2 and 4 shows stark differences. The effects of rivals' average R&D, the size of the firm (as proxied by employment), exporter status, and being a multi-product firm are all reasonably similar. However, and importantly, the estimates for the factors we have argued suffer from endogeneity change substantially. As an example, consider the polynomial representation of the LI measure.
Again, because polynomial coefficients are not very convenient to interpret, we depict the fitted polynomials under the Bayesian estimates in Figure 4, Panel B. The Bayesian estimates suggest that the more market power a firm has, the more it invests in R&D, a trend supportive of Schumpeter's hypothesis and the theoretical model we presented earlier, and contrasting with the nonlinear relationships suggested by our reduced-form results (as well as those of other researchers in this literature) presented in Figure 4, Panel A.
To support our creation of instruments using transformations of the included exogenous variables, and in the spirit of computing a J-statistic, we consider the Hansen-Sargan criterion within our hybrid (BETEL-based rather than traditional GMM) approach for each model involving these instruments. 18 To be clear, J-statistics cannot be directly computed within our hybrid approach. As such, consider the following: suppose we have the moment conditions $E[\tilde g(y, w, \theta)] = 0$ and the associated Hansen-Sargan criterion $J(\theta)$. We then need to test $H_0: J(\theta) = 0$. In turn, we compute $p$-values for the Hansen-Sargan criterion for each model we consider (these $p$-values correspond to the asymptotic chi-squared distribution of the test statistic). We fail to reject the null hypothesis that the overidentifying restrictions are valid at the standard (5%) level for every model.
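For readers who want the mechanics of the underlying J-statistic, a standard textbook-GMM (not BETEL-specific) computation of the Hansen-Sargan criterion and its asymptotic chi-squared p-value might look as follows; `hansen_j_pvalue` is a hypothetical helper name:

```python
import numpy as np
from scipy.stats import chi2

def hansen_j_pvalue(G, n_params):
    """Hansen-Sargan overidentification test. G is an (n, q) array of
    moment function evaluations at the estimated parameter vector."""
    n, q = G.shape
    g_bar = G.mean(axis=0)
    S = (G - g_bar).T @ (G - g_bar) / n       # estimated moment covariance
    J = float(n * g_bar @ np.linalg.solve(S, g_bar))
    df = q - n_params                          # over-identifying restrictions
    return J, chi2.sf(J, df)
```

Valid moments (mean near zero) yield a moderate J and an unremarkable p-value, while a violated moment condition drives J up and the p-value toward zero.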

The Importance of Measurement Error and Endogeneity
While our empirical strategy sought to comprehensively address these econometric challenges, it is helpful to know how important measurement error and endogeneity are on their own. In this section we do three things: we first address the issues individually by ignoring one step of our approach, which shows that measurement error is the most critical issue in our data; we then consider Monte Carlo simulations that deepen our understanding of these issues; lastly, we consider some robustness exercises.

Ignoring Measurement Error or Endogeneity
Our hybrid approach sought to address multiple issues at the same time. It is likely that both measurement error and endogeneity are at play, and perhaps compound each other. Regardless, to gauge the relative importance of these issues, and to help understand whether differences between our estimates and the reduced-form results stem from measurement error or endogeneity, we consider ignoring one of these issues by shutting down one element of our hybrid approach. In Table 5, we present our baseline regression results from OLS and GMM in columns (1) and (2), respectively. In column (3), we ignore measurement error and address only endogeneity by adopting a BETEL approach that does not use the ELVIS-modified moment conditions. In column (4), we present estimates from using only ELVIS while ignoring endogeneity. In column (5), we present our full hybrid estimates. We have also included the polynomial fits from the BETEL-only and ELVIS-only models in Panel B of Figure 4, alongside the hybrid approach. In both the figure and the point estimates for the LI-related terms, BETEL appears closer to our hybrid approach. However, the Bayes factors at the bottom of Table 5 offer relatively more support for ELVIS than for BETEL when considering the overall fit of the model (the full hybrid model has by far the most support). To investigate this, we consider some Monte Carlo and robustness exercises.

Monte Carlo Simulations
To help readers understand the important complications characterizing the environment we study, we complement our estimates with Monte Carlo simulations motivated by our empirical specifications. We draw observations from our data to construct newly simulated datasets that we then estimate by imposing (or not imposing) various moment restrictions when using our hybrid estimation strategy. Conceptually, this is attractive as our full-blown estimation strategy essentially nests OLS which obtains when certain moment conditions are not imposed; likewise, we can impose a subset of the moment conditions to partially address issues like endogeneity or measurement error which helps shed light on the important complications that OLS neglects.
We generate data for a number of firms ($N$) and years ($T$), and we take the matrix of covariates from the data by sampling data for randomly selected firms. We assume true competition is not contaminated with noise, and we generate a contaminated version by adding a normal random variable with zero mean and standard deviation equal to its observed value (16.78 is the standard deviation of the number-of-rivals variable from Table 1). We include firm and year effects in all cases, and we assume that the functional forms of $f(\cdot)$ and $g(\cdot)$ in (7) are unknown but can be approximated using $h(z, r) = \sum_{j=0}^{J} \sum_{k=0}^{K} \beta_{jk}\, z^j r^k$, where $z$ is the LI, $r$ is the log number of rivals, $J$ and $K$ are the orders of the polynomial, and the $\beta_{jk}$ are unknown coefficients. In fact, the truth is $h(z, r) = f(z) + g(r)$, where $f(\cdot)$ and $g(\cdot)$ are second-order polynomials whose coefficients, like the coefficients of all other covariates, we take from column (6) of Table 4. In our Monte Carlo experiment the true polynomial orders $(J, K)$ are treated as unknown and we assume $J, K \in \{1, \ldots, 5\}$. In turn, we generate the outcomes using the coefficient estimates from the Bayesian results (again, column (6) of Table 4) and adding a normal error with zero mean and standard deviation equal to the standard deviation of the Bayesian residuals.
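A data-generating process in this spirit can be sketched as follows. The polynomial coefficients, variable ranges, and the function name `simulate_panel` are illustrative (the paper takes coefficients from column (6) of Table 4); only the additive second-order structure and the rivals contamination (standard deviation 16.78) follow the description above.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_panel(n_firms=500, n_years=5, sd_rivals=16.78, sd_resid=1.0):
    """One simulated data set: additive truth h(z, r) = f(z) + g(r) with
    second-order polynomials, plus a contaminated rivals measure."""
    n = n_firms * n_years
    li = rng.uniform(0.0, 1.0, size=n)               # Lerner index
    rivals = rng.integers(1, 80, size=n).astype(float)
    log_rivals = np.log(rivals)
    f = 0.8 * li + 0.4 * li ** 2                     # illustrative f
    g = 0.5 * log_rivals - 0.1 * log_rivals ** 2     # illustrative g
    y = f + g + rng.normal(0.0, sd_resid, size=n)
    rivals_obs = rivals + rng.normal(0.0, sd_rivals, size=n)  # contaminated
    return li, rivals, rivals_obs, y
```

An estimator is then fit to each simulated data set using `rivals_obs` in place of `rivals`, so the contamination is present exactly where the design intends it.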
To minimize computational costs, we use 60,000 MCMC iterations and omit the first 10,000 to mitigate start-up effects, if any. We are interested in whether we can recover the truth about the functional forms $f(\cdot)$ and $g(\cdot)$.
It turns out that ELVIS works even when measurement error is substantial, which is reassuring, as this is one of the potential problems present in our data set. When the true model ($J = K = 2$) is not selected, the models most often selected are "overfitted" models in which $J$ and/or $K$ exceeds 2. This tendency disappears as the sample size increases. We present summary results for various sample sizes (in terms of the number of firms and years in the simulated data) in Table 6. In Tables 8 and 9, we shut down (drop/ignore) only the measurement error or the endogeneity moment conditions, respectively, and replicate the estimation process. While correcting for one of these concerns alone improves estimates, important biases remain if both issues are not addressed.

Robustness checks
Other researchers have noted that a firm's age or exporting status is also affected by a firm's productivity; see, for example, Olley and Pakes [1996]. A specification test that we can consider is to investigate whether $f$ and $g$ are separable in equation (7), as opposed to a more general model, say $h(\text{LI}, \text{competition})$. One way to do this is to assume that $h$ can be approximated nonparametrically. In turn, letting $\hat m_{(-i)}(\cdot)$ denote a leave-observation-$i$-out kernel estimator of the conditional mean, based on a kernel in $\Re^2$ with bandwidth parameter $h$, we can construct the sample analogue of the distance between the separable and the general specifications. The resulting test statistic rejects $H_0$ (separability) when it is large. To obtain asymptotic critical values, Horowitz [2006] shows that the asymptotic distribution is a mixture of $\chi^2$ distributions, where the weights depend on the eigenvalues of a certain integral operator.
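The leave-observation-i-out kernel estimator at the heart of the test can be sketched as follows, here a Nadaraya-Watson estimator with a product Gaussian kernel; the function name `loo_nw` and the specific kernel are illustrative choices, not the ones used in the paper.

```python
import numpy as np

def loo_nw(z, r, y, h):
    """Leave-one-out Nadaraya-Watson estimate of E[y | z, r] at each
    sample point, using a product Gaussian kernel with bandwidth h."""
    dz = (z[:, None] - z[None, :]) / h
    dr = (r[:, None] - r[None, :]) / h
    K = np.exp(-0.5 * (dz ** 2 + dr ** 2))   # product Gaussian kernel
    np.fill_diagonal(K, 0.0)                 # leave observation i out
    return (K @ y) / K.sum(axis=1)
```

A specification test then compares these unrestricted fits with fits from the restricted, additively separable model; under separability the two should differ only by sampling noise.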
This test has power against local alternatives whose distance from the null-hypothesis model is $O(n^{-1/2})$, where $n$ is the sample size. We implement these tests by randomly sampling 10,000 firms at a time, and we do this 1,000 times. We use 60,000 MCMC iterations and omit the first 10,000 to mitigate possible start-up effects. We use the same instruments as in our empirical application to keep things consistent and to inform our empirical model specification. The results are reported in Table 11. Based on the Bayes factors in favor of model (7) as well as the Horowitz [2006] test, the specification in column (6) of Table 4 passes the diagnostic tests. We also apply the tests to different sub-samples to investigate whether there is some form of misspecification.
From these results it does not seem that the specification in column (6) of Table 4 has substantial problems. A standard criticism of the Bayes factor is that it depends too heavily on the prior and does not work well with improper priors. For this reason, we also use the Hyvärinen [2005] score as implemented in Shao et al. [2019]. The results are reported in Table 12. The large values of the Hyvärinen [2005] score in favor of our specification suggest that the model is, indeed, better than the other specifications.
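As a rough illustration of the quantity involved (this is not the Shao et al. [2019] implementation, and sign conventions differ across papers; in the convention below a smaller total is better, so a comparison in the spirit of Table 12 would be based on differences of such totals), the Hyvärinen score of a Gaussian predictive density is available in closed form:

```python
import numpy as np

def hyvarinen_score_gauss(y, mu, sigma):
    """Sum of Hyvarinen scores of a N(mu, sigma^2) predictive density at
    the observations y (smaller is better in this sign convention):
    S(y) = 2 d^2/dy^2 log p(y) + (d/dy log p(y))^2."""
    d1 = -(y - mu) / sigma ** 2      # first derivative of the log density
    d2 = -1.0 / sigma ** 2           # second derivative (constant)
    return np.sum(2.0 * d2 + d1 ** 2)
```

Because the score depends on the predictive density only through derivatives of its logarithm, it is unaffected by the normalizing constant, which is what makes it usable with improper priors.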

Conclusion
In this paper, we adopt a hybrid approach to jointly address censoring, measurement error, and endogeneity. We apply our econometric approach to examine the connections between firm size, market structure, and R&D efforts, relationships that have been of interest to economists since (at least) Schumpeter [1942].
Researchers have long recognized that these relationships are difficult to disentangle because of endogeneity concerns. We motivate our setting by adapting a simple theoretical model based on the work of Dasgupta and Stiglitz [1980], in which the determinants of R&D investment decisions, market structure, and the number of firms active in a market are simultaneously determined. In addition, measurement error is a likely concern, as it induces correlation between explanatory variables and the error term in econometric models. To these points, Aghion et al. [2014] noted, "Moreover, clean and direct measures of innovation and competition are usually not available in field data, which can lead to the additional problem of measurement error." We apply our econometric approach to Chinese manufacturing data that have been widely used in the literature. As we do not employ field data nor observe all firms in the population, our measures of rival firms' expenditures on R&D suffer from mismeasurement. Compounding these issues is the fact that the underlying relationships are likely nonlinear. Indeed, our OLS estimates suggest a nonlinear relationship between market structure (as measured by the LI) and the amount a firm spends on R&D. This finding is consistent with what other researchers have found using data from different time periods, countries, and industries. When we estimate the same model under our hybrid approach, combining BETEL with ELVIS in a coherent and unified framework estimated using fast MCMC-based techniques, we find a positive relationship between market concentration and R&D expenditures, suggesting that firms in more concentrated industries invest more in R&D. This finding is consistent with Schumpeter's original hypothesis as well as with the simple theoretical model we use to highlight issues of simultaneity.
We provide Monte Carlo evidence that supports our approach and emphasizes the presence of both measurement error and endogeneity in the data and helps explain why OLS is ill-suited for estimation in this setting.
The natural direction for future research would be to apply the hybrid Bayesian framework we have suggested to other types of problems or other data sets. The differences between the estimates obtained under a traditional strategy and those from one that corrects for the issues we have highlighted suggest that concerns about simultaneity and measurement error are nontrivial, meaning previous findings may be worth revisiting. The R&D literature usually focuses on one of two things: effort (often measured by expenditures, as we have done) and output (often measured by patents or innovations). Our approach could be extended to outcome-based models and measures, where endogeneity concerns remain and variables are likely to be even more plagued by measurement error. More generally, we think this approach will be helpful in environments with interdependencies among decision makers. For example, if players in a game maximize expected payoffs but there is an unobservable disturbance (seen by players but not by the econometrician) that enters expected payoffs in nonlinear and even nonmonotone ways, our approach can be extended. We have considered that here with a focus on R&D, but other traditional examples include price-setting games, firm advertising decisions, product differentiation models, or tariff setting. While many researchers use a strategy that requires lagged variables to address these concerns, we feel our approach is particularly attractive for short panels with a large cross-section of observations.

Notes to tables: The Bayes factor has been re-normalized to 1 using the specification in Column 1. The true model is second-order polynomials for f and g in (7). ARMSE and AM are computed as noted in footnote 6 of the main text. AM stands for "alternative model" and corresponds to the model selected most often when the true model is not selected.
Table 12: Hyvärinen score in favor of the model in (7) (1)

Sample                                  Score
10,000 randomly selected firms (3)      17.55 (4.12-44.2) (2)
Entire data set                         12.37
First half of data set                  13.21
Second half of data set                 11.40
Lower 10% of rivals                     12.32
Upper 10% of rivals                     12.45
Lower 25% of rivals                     14.32
Upper 25% of rivals                     14.20

Notes: (1) Against a general neural network formulation whose number of nodes is determined by the BF.
(2) Median value and 99% confidence interval. For the Horowitz [2006] test we use the same moment conditions as in our implementation of ELVIS. (3) We implement these tests by randomly sampling 10,000 firms at a time, and we do this 1,000 times. We use 60,000 MCMC iterations and omit the first 10,000 to mitigate possible start-up effects. ..., $i = 1, \ldots, n$, which is trivial to compute. What is not trivial is to determine an importance density $q(x^* \mid w, \lambda)$ which, by choice of the parameters $\lambda \in \Lambda \subseteq \Re^{d_\lambda}$, can approximate (B14) across all values of $x^*$, $w$, and $\theta$; see Danielsson and Richard [1993].

B.3. Markov Chain Monte Carlo
We used the algorithm proposed by Girolami and Calderhead [2011] (GC) to update the draws for $\theta$. The algorithm uses local information about both the gradient and the Hessian of the log-posterior distribution at the existing draw. A Metropolis test is used to accept or reject the candidate draw, and the GC algorithm moves considerably faster than alternatives. The GC algorithm is started at the first-stage GMM estimator and the MCMC is run until convergence. Depending on the model and the subsample, this takes 5,000 to 10,000 iterations; for safety, we run 10,000. We then run another 150,000 MCMC iterations, omitting the first 50,000, to obtain final results for posterior moments, densities of parameters, and functions of interest. We have found that the GC algorithm performs vastly better than the standard Metropolis-Hastings algorithm, with much smaller autocorrelations.
The Langevin diffusion is given by the stochastic differential equation $d\theta(t) = \tfrac{1}{2}\nabla_\theta \log \pi(\theta(t))\, dt + dB(t)$, where $\pi(\cdot)$ denotes the posterior and $B(t)$ is standard Brownian motion. Finally, the step size is selected so that the acceptance rate of the algorithm is neither too small nor too large; we tuned it using a target acceptance rate of approximately 25%.
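A stripped-down relative of the GC sampler, the Metropolis-adjusted Langevin algorithm with a constant metric, can be sketched as follows (the GC algorithm additionally uses Hessian information to define a position-dependent metric); all names here are illustrative, and this is not the paper's Fortran implementation:

```python
import numpy as np

def mala(log_post, grad_log_post, theta0, step, n_iter, seed=0):
    """Metropolis-adjusted Langevin algorithm: proposals follow a
    discretized Langevin diffusion and are corrected by a Metropolis test."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    draws, accepted = [], 0
    for _ in range(n_iter):
        # drift toward higher posterior density, plus Gaussian noise
        mu = theta + 0.5 * step ** 2 * grad_log_post(theta)
        prop = mu + step * rng.normal(size=theta.shape)
        mu_back = prop + 0.5 * step ** 2 * grad_log_post(prop)
        # asymmetric proposal: include both forward and reverse densities
        log_q_fwd = -np.sum((prop - mu) ** 2) / (2 * step ** 2)
        log_q_back = -np.sum((theta - mu_back) ** 2) / (2 * step ** 2)
        log_alpha = (log_post(prop) - log_post(theta)) + (log_q_back - log_q_fwd)
        if np.log(rng.uniform()) < log_alpha:
            theta, accepted = prop, accepted + 1
        draws.append(theta.copy())
    return np.array(draws), accepted / n_iter
```

In practice `step` is tuned by trial runs until the acceptance rate lands near the chosen target (the paper targets roughly 25% for its sampler).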