Copula-based regression models with data missing at random

https://doi.org/10.1016/j.jmva.2020.104654

Abstract

The existing literature on copula-based regression assumes that complete data are available, but this assumption is violated in many real applications. The present paper allows the regressand and regressors to be missing at random (MAR). We formulate a generalized regression model which unifies many prominent cases such as the conditional mean and quantile regressions. A semiparametric copula and the target regression curve are estimated via the calibration approach. The consistency and asymptotic normality of the estimated regression curve are proved. We show via Monte Carlo simulations that the proposed approach performs well in finite samples, while a benchmark equal-weight approach fails with substantial bias under MAR. An empirical application on revenues and R&D expenses of German manufacturing firms highlights a practical use of our approach.

Introduction

Regression is the most prevalent method for investigating the relationship between a regressand Y and regressors W. Widely used regressions include conditional mean and quantile regressions. Noh et al. [27] proposed a novel approach to estimate a conditional mean regression function by exploiting copulas, where observations are assumed to be independently and identically distributed (i.i.d.) and completely observed. Their key insight is that the loss function expressed as a conditional expectation given W can be rewritten as an unconditional expectation involving a parametric copula and nonparametric marginal distributions. The marginal distributions and the copula parameter are estimated via plug-in methods. The flexibility of the semiparametric copula alleviates model specification issues such as how to transform regressors and which cross-products of regressors to include.

Noh et al. [27] spurred extensive research on copula-based regression. Noh et al. [28] applied the method of Noh et al. [27] to quantile regression with i.i.d. or time series data that are completely observed. De Backer et al. [4] extended the method of Noh et al. [28] to quantile regression with censored data. Kraus and Czado [19] studied quantile regression with complete data, using D-vine copulas. Rémillard et al. [30] discussed the asymptotic connection between the estimators of Noh et al. [28] and Kraus and Czado [19]. Chang and Joe [3] proposed an algorithm for computing the conditional distribution function via the vine copula.

Nagler and Vatter [23] unified various copula-based regressions by formulating a general loss function which may not be continuously differentiable. Their generalized regression model includes the conditional mean regression of Noh et al. [27], the conditional quantile regression of Noh et al. [28], and the asymmetric least squares of Newey and Powell [26] as special cases. The unified framework enhances the systematic interpretation of the various existing regressions.

A potential issue left open in the existing literature on copula-based regression is that complete data are assumed to be available. There is also a vast literature in which the copula itself is the primary target of estimation [11], but most papers in that literature still assume that complete data are available. The assumption of complete data is violated in many real applications such as applied microeconomics, corporate finance, and survey sampling. In survey sampling, for example, respondents may refuse to report personal information such as age and salary.

To relax the assumption of complete data, this paper allows both the regressand Y and the regressors W to be missing at random (MAR), a key concept originally explored by Rubin [31]. The MAR condition has been widely used in econometrics and statistics to identify the parameter of interest [22]. Hamori et al. [13] is one of the few works to deal with data missing at random in copula modeling. (Ding and Song [6] proposed an EM algorithm for estimating the Gaussian copula under the MAR condition. Emura et al. [8], Emura and Wang [9], [10] considered copula inference with truncated survival data. Guo et al. [12] studied the semiparametric estimation of copula models with nonignorable missing data.) Hamori et al. [13] use the calibration weights proposed by Chan et al. [2] for both nonparametric marginal distributions and target copula parameters. The calibration estimation is a nonparametric method that balances the empirical moments of covariates between the observed and whole groups. It does not require an explicit specification of the missing mechanism, and delivers consistent inference under MAR.

Inspired by Hamori et al. [13], this paper adopts the calibration estimation to perform copula-based regression with {Y,W} missing at random. As in Nagler and Vatter [23], we formulate a generalized regression model which unifies many prominent regressions. A semiparametric copula and the target regression curve are estimated via the calibration approach. The consistency and asymptotic normality of the estimated regression curve are proved. Our simulation study shows that the proposed approach performs well in finite samples, while a benchmark equal-weight approach fails with substantial bias under MAR.

As an empirical application, we regress the R&D expenses of German manufacturing firms onto their revenues. The revenue is observed for all 500 firms considered, while the R&D expense is observed for only 125 firms. The vast majority of the 375 firms with missing R&D have small revenues. The calibration approach delivers a plausible empirical result by assigning sufficiently large weights to firms with small revenues. The benchmark equal-weight approach, by contrast, delivers a misleading result by discarding the 375 firms with missing R&D and assigning a uniform weight to the remaining 125 firms. This contrast highlights how the calibration approach achieves valid inference under the MAR mechanism.

The rest of this paper is organized as follows: Section 2 explains our basic framework and notation. The calibration estimation is proposed in Section 3, and its large sample properties are derived in Section 4. Variance estimators and confidence intervals are constructed in Section 5. In Section 6, data-driven selection of a tuning parameter K for sieve basis functions is discussed. The simulation study is performed in Section 7. The empirical application is presented in Section 8. Conclusions are provided in Section 9. Proofs of main theorems are collected in Technical Appendices. In the Online Supplement [14], omitted technical and numerical details are provided.

Generalized regression model with missing data

Let $Y$ be a regressand, and let $W = (W_1, \ldots, W_d)$ be $d$-dimensional regressors. Consider the generalized regression model:

$$a_0(w) = \arg\min_{a \in \mathbb{R}} E\left[ L\{g(Y) - a\} \mid W = w \right], \tag{1}$$

where $w = (w_1, \ldots, w_d)$; $L(\cdot)$ is a pre-specified loss function whose derivative, denoted as $L'(\cdot)$, exists almost everywhere; and $g(Y)$ is a known function of $Y$. $L(\cdot)$ is not required to be continuously differentiable.
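As a rough numerical illustration of the model above, the following sketch minimizes an empirical loss over a grid of candidate values. It makes several simplifying assumptions that are not from the paper: the conditioning on W = w is dropped (so the minimizer is an unconditional location), g(Y) = Y, and the helper names `square_loss`, `check_loss`, and `argmin_loss` are hypothetical. Under the square loss the minimizer should be close to the sample mean, and under the standard quantile check loss with level 0.5 it should be close to the sample median.

```python
import numpy as np

def square_loss(v):
    # L(v) = v^2, which targets the mean
    return v ** 2

def check_loss(v, tau):
    # Standard quantile check loss: v * (tau - 1(v < 0))
    return v * (tau - (v < 0))

def argmin_loss(y, loss, grid):
    # Empirical risk (sample average of the loss) for each candidate a
    risks = np.array([loss(y - a).mean() for a in grid])
    return grid[np.argmin(risks)]

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=5000)   # illustrative data, g(Y) = Y
grid = np.linspace(0.0, 5.0, 2001)          # candidate values of a

a_mean = argmin_loss(y, square_loss, grid)
a_median = argmin_loss(y, lambda v: check_loss(v, 0.5), grid)

print("square loss minimizer:", a_mean, "sample mean:", y.mean())
print("check loss minimizer: ", a_median, "sample median:", np.median(y))
```

A grid search is used only to keep the sketch dependency-free; the check loss is not differentiable at zero, which is consistent with the remark that the loss need not be continuously differentiable.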

Let 1(A) be the indicator function which equals 1 if event A occurs and 0 otherwise. The formulation (1) includes many prominent cases:

  • $L(v) = v^2$ and g

Calibration weights

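This paper adopts the covariate balancing principle of Chan et al. [2] and Hamori et al. [13] to estimate $\{N\pi_j(X_i)\}^{-1}$ and $\{N\eta(X_i)\}^{-1}$. Their key insight is that the following equations hold for any integrable function $u(X)$ and $j \in \{0, 1, \ldots, d\}$:

$$E\left[ T_{ji} \frac{1}{\pi_j(X_i)} u(X_i) \right] = E\{u(X_i)\}, \qquad E\left[ 1(T_{0i} = \cdots = T_{di} = 1) \frac{1}{\eta(X_i)} u(X_i) \right] = E\{u(X_i)\}. \tag{12}$$

The estimator of $\{N\pi_j(X)\}^{-1}$, denoted as $\hat{p}_{jK}(X)$, should satisfy the sample counterpart of (12):

$$\sum_{i=1}^{N} T_{ji}\, \hat{p}_{jK}(X_i)\, u_K(X_i) = \frac{1}{N} \sum_{i=1}^{N} u_K(X_i),$$

where $u_K(X) = \{u_{K,1}(X), \ldots, u_{K,K}(X)\}$ is a known sieve basis function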
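The balancing condition above can be illustrated with a small numerical sketch. The construction below is an assumption of this example rather than the authors' stated implementation: it takes weights of the exponential-tilting form p_i = exp(-λ'u_K(X_i) - 1)/N and chooses λ by Newton's method on a concave objective whose first-order condition is exactly the sample balancing equation. The function name `calibration_weights`, the polynomial basis, and the simulated missingness mechanism are all hypothetical.

```python
import numpy as np

def calibration_weights(T, U, max_iter=100, tol=1e-10):
    """T: (N,) 0/1 observation indicators; U: (N, K) basis values u_K(X_i).
    Returns p with sum_i T_i * p_i * U_i approximately equal to mean_i U_i."""
    N, K = U.shape
    target = U.mean(axis=0)

    def objective(lam):
        # Concave objective; its gradient is the balancing gap
        v = U @ lam
        return np.mean(-T * np.exp(-v - 1.0) - v)

    lam = np.zeros(K)
    for _ in range(max_iter):
        w = T * np.exp(-U @ lam - 1.0)
        grad = (w @ U) / N - target
        if np.max(np.abs(grad)) < tol:
            break
        hess = -(U * w[:, None]).T @ U / N       # negative definite
        step = np.linalg.solve(hess, grad)
        t = 1.0
        # Backtrack until the objective does not decrease
        while objective(lam - t * step) < objective(lam) and t > 1e-8:
            t /= 2.0
        lam = lam - t * step
    return np.exp(-U @ lam - 1.0) / N

# Illustrative data: one covariate, missingness depends on X only
rng = np.random.default_rng(0)
N = 2000
X = rng.uniform(-1.0, 1.0, size=N)
T = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-(0.5 + X)))).astype(float)
U = np.column_stack([np.ones(N), X, X ** 2])     # assumed basis with K = 3

p = calibration_weights(T, U)
balance_gap = (T * p) @ U - U.mean(axis=0)
print("max balancing gap:", np.max(np.abs(balance_gap)))
```

No model for the missingness probability is fitted here; the weights are determined only by the moment-balancing requirement, which mirrors the remark that the calibration approach does not require an explicit specification of the missing mechanism.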

Large sample properties

It is evident from (22)–(23) that the proposed estimator $\hat{a}(w)$ depends on $\hat{p}_{jK}(X)$, $\hat{q}_K(X)$, and $\hat{\theta}_K$. Large sample properties of $\hat{p}_{jK}(X)$, $\hat{q}_K(X)$, and $\hat{\theta}_K$ are established by Hamori et al. [13], and we provide a brief review of their results in Section 4.1. Then we establish large sample properties of the proposed estimator $\hat{a}(w)$ in Section 4.2.

Variance estimation and confidence interval

The asymptotic normality of $\hat{a}(w)$ established in Theorem 2 has a direct implication for constructing the confidence interval of $a_0(w)$. Evidently, the 95% confidence interval for $a_0(w)$ is given by

$$\left[ \hat{a}(w) - 1.96 \times \widehat{SE}\{\hat{a}(w)\},\; \hat{a}(w) + 1.96 \times \widehat{SE}\{\hat{a}(w)\} \right], \tag{25}$$

where $\widehat{SE}\{\hat{a}(w)\} = N^{-1/2} \hat{\sigma}(w)$ is the standard error of $\hat{a}(w)$; $\hat{\sigma}(w)$ is a consistent estimator for the asymptotic standard deviation $\sigma(w)$, which is defined in Theorem 2. Constructing the confidence interval (25) essentially requires the consistent variance
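The interval formula above amounts to simple arithmetic once the point estimate and the estimated asymptotic standard deviation are available. The sketch below assumes both are already computed; the function name `confidence_interval` and the numerical inputs are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def confidence_interval(a_hat, sigma_hat, n, z=1.96):
    # Standard error: N^{-1/2} times the estimated asymptotic standard deviation
    se = sigma_hat / math.sqrt(n)
    return a_hat - z * se, a_hat + z * se

# Made-up inputs: estimate 2.0, sigma_hat 3.0, N = 900, so SE = 3.0 / 30 = 0.1
lower, upper = confidence_interval(a_hat=2.0, sigma_hat=3.0, n=900)
print(lower, upper)
```

With these inputs the half-width is 1.96 × 0.1 = 0.196, so the interval is approximately [1.804, 2.196].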

Data-driven selection of tuning parameters

The calibration approach requires a selection of the tuning parameter $K$. More generally, the calibration weights can be computed with distinct tuning parameters as $\{\hat{p}_{0,K_0}(X), \ldots, \hat{p}_{d,K_d}(X), \hat{q}_{K_\eta}(X)\}$. The asymptotic theory of the proposed estimator permits various values for $K = (K_0, \ldots, K_d, K_\eta)$. This poses a dilemma for applied researchers who have only one finite sample and would like some guidance on the selection of $K$. Several data-driven selection methods are discussed in [20], [21] among others.

Monte Carlo simulation

In this section, we perform Monte Carlo simulations in order to evaluate the finite sample performance of the calibration approach. Here we focus on a benchmark scenario which has one regressor (d=1) and one covariate (r=1) in order to conserve space. In Sections 6.2 and 6.3 of the Online Supplement [14], we discuss two extended scenarios for completeness: the extended scenario I which has one regressor (d=1) and two covariates (r=2) and the extended scenario II which has two regressors (d=2)
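To give a feel for why equal weighting can fail under MAR, here is a toy simulation. It is not the paper's simulation design: the target is simply the mean of Y rather than a regression curve, the data-generating process is made up, and the reweighted estimator uses the true observation probabilities (which would be unknown in practice and which the calibration approach avoids having to model). All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20000

X = rng.normal(size=N)                 # covariate driving the missingness
Y = X + rng.normal(size=N)             # true mean of Y is 0
pi = 1.0 / (1.0 + np.exp(-X))          # P(Y observed | X), so MAR holds given X
T = rng.uniform(size=N) < pi           # True where Y is observed

# Equal-weight benchmark: average over observed units only
equal_weight = Y[T].mean()

# Reweighting observed units by the inverse observation probability
w = 1.0 / pi[T]
reweighted = np.sum(w * Y[T]) / np.sum(w)

print("equal-weight estimate:", equal_weight)
print("reweighted estimate:  ", reweighted)
```

Because units with large X are observed more often and also tend to have large Y, the equal-weight average is pulled upward, while reweighting restores the balance between the observed group and the whole sample.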

Empirical application

In this section, we analyze the relationship between research and development (R&D) expenses and revenues of German manufacturing firms in 2017. R&D plays a key role in the manufacturing industry, and we use revenues as a proxy for firm size. Intuitively, a larger manufacturing firm should have a larger R&D expense, hence the two variables should be positively correlated. A primary goal of this study is to check if that is indeed the case in Germany, with explicit attention to missing

Conclusion

The existing literature on copula-based regression models assumes that complete data are available. This assumption is violated in many real applications such as applied microeconomics, corporate finance, and survey sampling. In the present paper, the regressand and regressors are allowed to be missing at random. We formulate a generalized regression model which unifies many prominent cases such as the conditional mean and quantile regressions. The calibration estimator of the regression

CRediT authorship contribution statement

Shigeyuki Hamori: Validation, Formal analysis, Writing - review & editing, Supervision, Project administration. Kaiji Motegi: Software, Formal analysis, Investigation, Visualization, Writing - review & editing. Zheng Zhang: Conceptualization, Methodology, Investigation, Writing - original draft, Writing - review & editing.

Acknowledgments

We are grateful to the Editor-in-Chief, Dietrich von Rosen, an anonymous Associate Editor, and two anonymous referees for their insightful comments and suggestions. We also thank Marcus Chambers, Daisuke Nagakura, Teruo Nakatsuma, Tatsuyoshi Okimoto, and Naoya Sueishi, seminar participants at Kobe University, the University of Essex, and Keio University, and conference participants at the 2019 Japanese Joint Statistical Meeting for their helpful comments. The first author, Shigeyuki Hamori, is

References (33)

  • Chan K.C.G. et al. Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. J. R. Stat. Soc. Ser. B Stat. Methodol. (2016)
  • De Backer M. et al. Semiparametric copula quantile regression for complete or censored data. Electron. J. Stat. (2017)
  • Dette H. et al. Some comments on copula-based regression. J. Amer. Statist. Assoc. (2014)
  • Efron B. et al. An Introduction to the Bootstrap. (1994)
  • Genest C. et al. A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika (1995)
  • Guo F. et al. Semiparametric estimation of copula models with nonignorable missing data. J. Nonparametr. Stat. (2020)