Consequences of Model Misspecification for Maximum Likelihood Estimation with Missing Data

Researchers are often faced with the challenge of developing statistical models with incomplete data. Exacerbating this situation is the possibility that either the researcher’s complete-data model or the model of the missing-data mechanism is misspecified. In this article, we create a formal theoretical framework for developing statistical models and detecting model misspecification in the presence of incomplete data where maximum likelihood estimates are obtained by maximizing the observable-data likelihood function when the missing-data mechanism is assumed ignorable. First, we provide sufficient regularity conditions on the researcher’s complete-data model to characterize the asymptotic behavior of maximum likelihood estimates in the simultaneous presence of both missing data and model misspecification. These results are then used to derive robust hypothesis testing methods for possibly misspecified models in the presence of Missing at Random (MAR) or Missing Not at Random (MNAR) missing data. Second, we introduce a method for the detection of model misspecification in missing data problems using recently developed Generalized Information Matrix Tests (GIMT). Third, we identify regularity conditions for the Missing Information Principle (MIP) to hold in the presence of model misspecification so as to provide useful computational covariance matrix estimation formulas. Fourth, we provide regularity conditions that ensure the observable-data expected negative log-likelihood function is convex in the presence of partially observable data when the amount of missingness is sufficiently small and the complete-data likelihood is convex. Fifth, we show that when the researcher has correctly specified a complete-data model with a convex negative likelihood function and an ignorable missing-data mechanism, then its strict local minimizer is the true parameter value for the complete-data model when the amount of missingness is sufficiently small. 
Our results thus provide new robust estimation, inference, and specification analysis methods for developing statistical models with incomplete data.


Introduction
Researchers are often faced with the challenge of developing statistical models with incomplete data (Little and Rubin 2002; Molenberghs et al. 2014; Rubin 1976). Exacerbating this situation is the possibility that the researcher's complete-data model or model of the missing-data mechanism is misspecified. The objective of this article is to formally explore the consequences of model misspecification in the presence of incomplete (missing) data for statistical models that utilize maximum likelihood estimation (MLE) (Fomby and Hill 2003).
Model Misspecification. The problems of estimation and inference in the presence of model misspecification are important for several reasons. First, model misspecification may be present in many, if not most, situations, and so robust methods that address the assumption of correct specification are necessary (White 1980, 1982, 1994; Golden 1995, 1996, 2000, 2003). While a correctly specified model is always desirable, in many fields such as econometrics, medicine, and psychology, some degree of model misspecification may be inevitable despite the researcher's best efforts (e.g., White 1980, 1982, 1994). Thus, the development and application of robust methods (Golden et al. 2013, 2016; Henley et al. 2019) that address the challenges posed by model misspecification (e.g., White 1980, 1982, 1994) has been and continues to be an active area of research (e.g., see Fomby and Hill 2003; Hardin 2003, for relevant reviews). Second, situations arise where the Quasi-Maximum Likelihood Estimates (QMLE) converge to the true parameter value despite the presence of model misspecification. For example, the QMLE can be shown to be consistent for the true parameter value for both linear and nonlinear exponential family regression models even though only the conditional expectation of the response variable given the predictors (covariates) is correctly specified (e.g., Gourieroux et al. 1984; Royall 1986; Wedderburn 1974; Wei 1998; White 1994, Corollary 5.5, p. 67). Consistent parameter estimation of the true parameter values of the researcher's model in the complete data case may also occur for misspecified models where: (i) heteroscedasticity is present (e.g., Verbeek 2008, sec. 6.3), (ii) the random effects distribution is misspecified in linear hierarchical models (e.g., Verbeke and Lesaffre 1997), or (iii) correlations among dependent observations are misspecified (e.g., Hosmer and Lemeshow 2000, pp. 315-317; Liang and Zeger 1986; Wall et al. 2005; Vittinghoff et al. 2012). Third, in more complicated missing data situations, consistent estimation of the true parameter values is possible in linear structural equation models even though only the first two moments have been correctly specified (e.g., Arminger and Sobel 1990), and in longitudinal time-series modeling even though dependent observations are approximately modeled as independent (Parzen et al. 2006; Zhao et al. 1996).

Maximum Likelihood Estimation for Models with Partially Observable Data
Representing Partially Observable Data Generating Processes. In the selection model framework for representing partially observable data generating processes (Rubin 1976; Little 1994; Molenberghs et al. 1998; Little and Rubin 2002), it is assumed that nature creates a complete-data record (observation) by sampling from the complete-data Data Generating Process (DGP). The complete-data record containing the observation's values is then decimated by a pattern of missingness sampled from the missing-data mechanism, thus hiding those values in the complete-data record. Rubin (1976) defined three types of missing-data mechanisms. A missing-data mechanism is termed Missing At Random (MAR) when the probability distribution of the pattern of missingness is functionally dependent only upon the observed data. A special case of MAR, called Missing Completely at Random (MCAR), occurs when the probability distribution of the pattern of missingness is not functionally dependent on either observed or unobserved data. Missing-data mechanisms that are not MAR are termed Missing Not at Random (MNAR), also called NMAR. The probability distribution of the pattern of missingness for an MNAR missing-data mechanism is functionally dependent on unobservable data.
A strategy for representing the pattern of missing values in data sets that supports utilizing maximum likelihood estimation is to create a collection of $d$ binary indicator variables, $h_i = [h_{i,1}, \ldots, h_{i,d}]^T$, for the $i$th data record $x_i = [x_{i,1}, \ldots, x_{i,d}]^T$, where $h_{i,k} = 1$ indicates that the $k$th element of $x_i$, $x_{i,k}$, is observable and $h_{i,k} = 0$ indicates that the $k$th element of $x_i$ is not observable. When $h_{i,k} = 0$, $x_{i,k}$ is typically set equal to a constant such as zero (Allison 2001; Groenwold et al. 2012). This method has been called the missing-indicator method (Groenwold et al. 2012) and has also been called the dummy variable adjustment method (Allison 2001). It provides a useful method for identifying the presence or absence of all the information in the data set and thus is applicable to representing MCAR, MAR, and MNAR missing-data mechanisms. In practice, the researcher often does not have specific knowledge for modeling the joint distribution of the complete-data representation and the missing data indicator variables, which further increases the likelihood of model misspecification. Groenwold et al. (2012) provide some explicit empirical examples illustrating the challenge of correctly applying this approach.
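As a minimal sketch of the missing-indicator representation, the following Python fragment builds the indicator records and the zero-filled data records. It assumes, purely for illustration, that missing entries are encoded as NaN in a NumPy array; the function name `missing_indicator` and the `fill_value` argument are hypothetical and are not part of the cited methods.

```python
import numpy as np

def missing_indicator(records, fill_value=0.0):
    """Return (x_filled, h) where h[i, k] = 1 if records[i, k] is observed
    and h[i, k] = 0 if it is missing (assumed to be encoded as NaN)."""
    records = np.asarray(records, dtype=float)
    h = (~np.isnan(records)).astype(int)
    # Unobservable entries are replaced by a constant, as in the text.
    x_filled = np.where(h == 1, records, fill_value)
    return x_filled, h

# Three records with d = 3 variables each; NaN marks a missing value.
data = np.array([[1.2, np.nan, 3.0],
                 [np.nan, np.nan, 0.5],
                 [4.1, 2.2, 1.0]])
x_filled, h = missing_indicator(data)
```

Each row of `h` is one indicator record, so the pair (`x_filled`, `h`) retains which entries were actually observed even after the constant has been substituted.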
Overview of Parameter Estimation in the Presence of Partially Observable Data. If the missing-data mechanism is MCAR, then maximum likelihood estimation can be utilized by first applying listwise deletion (Allison 2001; King et al. 2001), also known as complete-case analysis (Little and Rubin 2002), which involves simply removing data records (observations) containing missing values from the data set (also see Groenwold et al. 2012). The resulting data set, containing no missing values, is then used for statistical modeling. A problem with using listwise deletion to handle an MCAR missing-data mechanism is that the standard errors of the parameter estimates for the researcher's observable-data model may be larger because the information contained in records with missing values has been removed from the data set. However, a more serious issue with the listwise deletion method is that for MAR data the maximum likelihood estimates may be biased (e.g., Allison 2001; Ibrahim et al. 2005; King et al. 2001).
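The bias that listwise deletion can introduce under a MAR mechanism can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the bivariate Gaussian data, the deterministic missingness rule (the second variable is hidden whenever the always-observed first variable is positive), the NaN encoding, and the helper name `listwise_delete` are all hypothetical choices, not taken from the cited references.

```python
import numpy as np

def listwise_delete(records):
    """Keep only the records (rows) that contain no missing (NaN) values."""
    records = np.asarray(records, dtype=float)
    return records[~np.isnan(records).any(axis=1)]

rng = np.random.default_rng(0)
n = 20000
x1 = rng.normal(size=n)                # always observed
x2 = x1 + rng.normal(size=n)           # complete-data mean of x2 is 0
# MAR: whether x2 is missing depends only on the observed value x1.
x2_obs = np.where(x1 > 0, np.nan, x2)
data = np.column_stack([x1, x2_obs])

complete = listwise_delete(data)
cc_mean_x2 = complete[:, 1].mean()     # complete-case estimate of E[x2]
```

Because only records with a non-positive first variable survive deletion, the complete-case mean of the second variable is pulled well below its complete-data mean of zero in this setup.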
If the missing-data mechanism is MAR, then maximum likelihood estimates can be obtained by using Expectation Maximization (EM) (Allison 2001; Dempster et al. 1977; Little and Rubin 2002; McLachlan and Krishnan 1997) or multiple imputation (MI) (Efron 1994; Groenwold et al. 2012; Ibrahim et al. 2005; Robins and Wang 2000; Rubin 1987, 1996; Wang and Robins 1998). These algorithms allow for estimation and inference in the presence of missing data while only requiring that the researcher specify the complete-data model. In these situations, it can be shown that the MAR missing-data mechanism is safely "ignorable" in the sense that all of the available information in the data set is used for the purposes of computing unbiased maximum likelihood estimates without the possibility of the inflated standard errors (e.g., Little and Rubin 2002, p. 19) that listwise deletion may cause.
In many important cases, the researcher may have a specific theory of how the complete data were generated, but does not have a strong theory of how the complete data were decimated by the missing-data mechanism. In such situations, it is not uncommon when developing models on incomplete data for researchers to assume an ignorable missing-data mechanism (MCAR, MAR). Further, an MNAR missing-data mechanism is not completely verifiable using only incomplete data and so cannot be detected without additional assumptions. In addition, Molenberghs et al. (2008) showed that an MNAR model may be replaced with a MAR model that fits the observed data exactly. Further, while statistical tests exist for checking the MCAR mechanism on the data set (e.g., Little 1988; see Rhoads 2012 for a review), tests for the MAR assumption against the MNAR alternative require additional assumptions to be refutable (Breunig 2019; Jaeger 2006; Rhoads 2012). Nonetheless, in practice, much statistical modeling still relies on the assumption of an ignorable missing-data mechanism, and thus robust methods that offer improved estimation and inference approaches to deal with models of ignorable missing-data mechanisms are of critical importance. Our theoretical framework provides the foundation for utilizing new model misspecification testing methods (Golden et al. 2013, 2016; Henley et al. 2019) to improve statistical modeling on incomplete data in the presence of both ignorable and nonignorable missing data processes.

Prior Work on Misspecification in Missing Data Models
Although the consequences of model misspecification have been investigated for many years (e.g., White 1982, 1994; Fomby and Hill 2003; Chen and Swanson 2013; Golden et al. 2013, 2016), a detailed investigation into the consequences of model misspecification for statistical models in the presence of missing data, the topic addressed by this article, remains an open area of research. It is important to emphasize that when missing data is present, one must not only consider the possibility of misspecification in the complete-data model, but also the possibility of misspecification of the missing-data mechanism. For example, the complete-data model may be correctly specified, but the assumption that the missing-data mechanism is ignorable may be incorrect. An important robust method for characterizing the asymptotic distribution of the QMLEs in the presence of model misspecification is the sandwich covariance matrix estimator (e.g., Huber 1967; White 1982, 1994). For example, Arminger and Sobel (1990) used the sandwich covariance matrix estimator for the purpose of characterizing the asymptotic distribution of the QMLEs for linear structural equation models in the presence of missing data. Robins and Wang (2000, pp. 114-115) and Sung and Geyer (2007, pp. 991-92) also used the sandwich covariance matrix estimator as the basis for their analysis of the asymptotic behavior of multiple imputation estimation. Yuan (2009, pp. 1901 ff.) discusses the relation of the sandwich covariance matrix estimator to missing data problems and model misspecification for Gaussian models. Kashner et al. (2010) used a misspecification-robust difference-in-differences binary logistic regression model with the sandwich covariance matrix estimator applied to "naturally" missing observational data from before-after study designs.
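For readers unfamiliar with the sandwich covariance matrix estimator, the following Python sketch shows its generic form: the inverse of the average Hessian of the negative log-likelihood, multiplied on both sides of the average outer product of the per-record gradients, and divided by the sample size. The function name `sandwich_covariance`, its argument layout, and the toy Gaussian location model with an assumed unit variance are illustrative assumptions, not the implementation used in any of the cited works.

```python
import numpy as np

def sandwich_covariance(scores, hessian):
    """Sandwich estimate A^{-1} B A^{-1} / n of the estimator covariance.

    scores:  (n, r) per-record gradients of the negative log-likelihood,
             evaluated at the quasi-maximum likelihood estimate.
    hessian: (r, r) average Hessian of the negative log-likelihood (A).
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    b = scores.T @ scores / n          # B: average outer product of gradients
    a_inv = np.linalg.inv(np.asarray(hessian, dtype=float))
    return a_inv @ b @ a_inv / n

# Toy model: records assumed N(theta, 1), so the per-record negative
# log-likelihood is 0.5 * (x - theta)**2 up to an additive constant.
x = np.array([1.0, 2.0, 3.0, 6.0])
theta_hat = x.mean()                    # estimate under this model
scores = (theta_hat - x).reshape(-1, 1) # gradient with respect to theta
hessian = np.array([[1.0]])             # second derivative is 1
cov = sandwich_covariance(scores, hessian)
```

In this toy case the sandwich estimate equals the sample variance of the data divided by the sample size, so it remains sensible even when the assumed unit variance is wrong, whereas the inverse-Hessian estimate alone would report 1/n.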

A Framework for Understanding Misspecification in Missing Data Models
This article provides a formal framework for characterizing models of missing data with ignorable missing-data mechanisms when either the complete-data model is misspecified or the missing-data mechanism is misspecified. First, we provide sufficient regularity conditions on the researcher's complete-data model to characterize the asymptotic behavior of maximum likelihood estimates in the simultaneous presence of both missing data and model misspecification. These results are then used to derive robust hypothesis testing methods for possibly misspecified models in the presence of ignorable or nonignorable missing-data mechanisms. Second, a method for the detection of model misspecification in missing data problems is discussed using recently developed Generalized Information Matrix Tests (GIMT) (Golden et al. 2013, 2016; also see Cho and White 2014; Cho and Phillips 2018; Huang and Prokhorov 2014; Ibragimov and Prokhorov 2017; Prokhorov et al. 2019; Schepsmeier 2015, 2016; Zhu 2017). Third, we provide regularity conditions for the Missing Information Principle (MIP) to hold in the presence of model misspecification in order to provide useful computational covariance matrix estimation formulas. Fourth, we provide regularity conditions that ensure the missing data expected negative log-likelihood function for models with ignorable missing-data mechanisms is convex on the parameter space if and only if the fraction of information loss function on the parameter space does not exceed one. Fifth, we provide regularity conditions that ensure that when: (i) the researcher has correctly specified a probability model for partially observable data as a complete-data model with an ignorable missing-data mechanism, and (ii) the missing data expected negative log-likelihood is convex on the parameter space, then a strict local minimizer of the missing data expected negative log-likelihood is the unique true parameter value for the complete-data model.
To our knowledge, explicit regularity conditions for the new theorems presented here are not readily available in the published scientific literature. Further, methods for testing for model misspecification in the presence of missing values with the GIMT method have not been discussed in the published scientific literature. In the final section of this article, the key results of the stated theorems and their relevance for practical data analysis problems with the use of new model misspecification tests are presented. Sketches of the key proofs, which are based upon conventional arguments, may be found in Appendix A.

Assumptions
In this section, we provide the assumptions of a formal framework for characterizing models of missing data with assumed ignorable missing-data mechanisms when either the complete-data model or the missing-data mechanism is possibly misspecified.

Data Generating Process Assumptions
Assumption 1. I.I.D. Partially Observable Data Generating Process. Let $(X_i, H_i)$, $i = 1, 2, \ldots$ be a sequence of independent and identically distributed (i.i.d.) random vectors where $(X_i, H_i)$ has a common Radon-Nikodým probability density $p_{x,h} : \mathbb{R}^d \times L^d \to [0, \infty)$ with $L \equiv \{0, 1\}$.
In regression modeling applications, the first element of the $d$-dimensional real vector $x_i$ (a realization of $X_i$) is a value of the outcome variable for a regression model associated with the $i$th data record while the remaining elements of $x_i$ are values for the predictor variables associated with the $i$th data record, $i = 1, \ldots, n$. The $i$th observed data indicator record $h_i$ (a realization of $H_i$) is a $d$-dimensional binary vector defined so that its $j$th element is 1 if the $j$th element of $x_i$ is observable and 0 otherwise, $i = 1, \ldots, n$. Let $x^n \equiv [x_1, \ldots, x_n]$, $x^n \in \mathbb{R}^{dn}$. Let $h^n \equiv [h_1, \ldots, h_n]$, $h^n \in L^{dn}$. Given the full partially observable record $(x, h)$, let the number of observable elements of $x$ be defined such that $\rho(h) \equiv (1_d)^T h$ for all $h \in L^d$, where $\rho : L^d \to \{0, 1, 2, \ldots, d\}$ and $1_d$ denotes a $d$-dimensional column vector of ones. For convenience, let $\rho_h \equiv \rho(h)$. Also define the observable-data selection matrix $s(h)$ generated by $h$ as a matrix with $\rho_h$ rows and $d$ columns such that the $k$th element of the $j$th row of $s(h)$ is equal to 1 if the $j$th non-zero element in $h$ is the $k$th element in $h$, and is equal to 0 otherwise. Let $y_h : \mathbb{R}^d \to \mathbb{R}^{\rho_h}$ be defined such that $y_h(x) = s(h)x$. Thus, the $\rho_{h_i}$-dimensional column vector $y_i \equiv y_{h_i}(x_i)$ contains the observable components associated with the $i$th data record, $i = 1, \ldots, n$. Let $Y_i \equiv y_{H_i}(X_i)$, $i = 1, \ldots, n$, so that the pairs $(Y_i, H_i)$ are independent and identically distributed by Assumption 1. Let $y^n \equiv [y_1, \ldots, y_n]$ and $h^n \equiv [h_1, \ldots, h_n]$ be realizations of $Y^n \equiv [Y_1, \ldots, Y_n]$ and $H^n \equiv [H_1, \ldots, H_n]$ respectively. Let $(y^n, h^n)$ denote the observed data sample.
Now define the unobservable-data selection matrix $\bar{s}(h)$ generated by $h$ as a matrix with $d - \rho_h$ rows and $d$ columns such that the $k$th element of the $j$th row of $\bar{s}(h)$ is equal to 1 if the $j$th zero element in $h$ is the $k$th element in $h$, and is equal to 0 otherwise. Let $z_h : \mathbb{R}^d \to \mathbb{R}^{d - \rho_h}$ be defined such that $z_h(x) = \bar{s}(h)x$ for $h \in L^d \setminus \{1_d\}$. Thus, the $(d - \rho_{h_i})$-dimensional column vector $z_i \equiv z_{h_i}(x_i)$ contains the unobservable components associated with the $i$th observed data record, $i = 1, \ldots, n$. Let $Z_i \equiv z_{H_i}(X_i)$, $i = 1, \ldots, n$.
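The two selection matrices can be constructed directly from an indicator record by keeping the appropriate rows of the identity matrix. The following Python sketch illustrates this construction; the function name `selection_matrices` and the example record are hypothetical and serve only to make the definitions concrete.

```python
import numpy as np

def selection_matrices(h):
    """Return (s, s_bar) for a binary indicator record h of length d.

    s     has one row per observable element (rho_h rows, d columns);
    s_bar has one row per unobservable element (d - rho_h rows, d columns).
    """
    h = np.asarray(h, dtype=int)
    identity = np.eye(h.size, dtype=int)
    s = identity[h == 1]       # j-th row picks the j-th non-zero element of h
    s_bar = identity[h == 0]   # j-th row picks the j-th zero element of h
    return s, s_bar

h = np.array([1, 0, 1])
x = np.array([1.2, 7.0, 3.0])
s, s_bar = selection_matrices(h)
rho_h = int(np.ones(h.size, dtype=int) @ h)  # number of observable elements
y = s @ x        # observable components of the record
z = s_bar @ x    # unobservable components of the record
```

Here the record has two observable elements, so `s` has two rows and `s_bar` has one; `y` collects the first and third elements of `x` and `z` collects the second.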
Finally, note that the Radon-Nikodým density representation in Assumption 1 is used so that this theory is applicable not only to situations where the complete data vector $X_i$ consists of discrete or absolutely continuous random variables, but also to situations where the complete data vector $X_i$ includes both discrete and absolutely continuous random variables. In fact, the Radon-Nikodým density $p_{x,h}$ is also applicable to situations where the elements of $X_i$ are constructed from combinations of both discrete and absolutely continuous random variables. In the special case where $X_i$ is a vector consisting of only discrete random variables, $p_{x,h}$ may be interpreted as a probability mass function.
A common representational convention in the literature (e.g., Little and Rubin 2002; Orchard and Woodbury 1972, p. 699; Rubin 1976, p. 584; Schenker and Welsh 1988, p. 1553) is to assume that a realization $(x_i, h_i)$ of an observation can be represented as $(z_i, y_i)$, which consists of an observable component vector $y_i$ and an unobservable component vector $z_i$ ($i = 1, \ldots, n$). Our Assumption 1 makes the dependence on $h_i$ more explicit and clearly shows how one can apply standard i.i.d. asymptotic statistical theory to support the analysis of missing data problems. Integrating the complete-data density $p_x$ over the unobservable components gives the density of the observable components, which can be written using a more implicit compact notation as $p_{y_h}(y_h) \equiv \int p_x(x)\, d\nu_{z_h}(z_h)$. The density $p_{y_h} : \mathbb{R}^{\rho_h} \to [0, \infty)$ specifies the conditional probability distribution of the random vector $y_h(X) = s(h)X$ given a particular observed data indicator record $h$. The observable-data density $p_{y_h,h}$ specifies the joint probability distribution of the observed data record $(Y, H)$ that includes the pattern of missingness $H$ as well as the observable data component $Y$.
Additionally, define the missing-data mechanism density $p_{h|x} \equiv p_{x,h}/p_x$. The missing-data mechanism is called a MAR missing-data mechanism if there exists a function $p_{h|y_h}$ such that $p_{h|x}(h \mid x) = p_{h|y_h}(h \mid y_h(x))$ for all $h \in L^d$. A conditional density $p_{h|x}$ that is not a MAR missing-data mechanism is called an MNAR missing-data mechanism. These definitions are consistent with the discussion in Little and Rubin (2002, pp. 11-12; also see Rubin 1976), but are formulated for the specific i.i.d. case considered here.

Probability Model Assumptions
Let supp X denote the support of X.

Assumption 2. Parametric Densities. (i) Let $\Theta$ be a compact and non-empty subset of $\mathbb{R}^r$.
The approximating complete-data density $f(\cdot\,; \theta)$ specifies the likelihood of a data record in the case where no component of the data record is missing for each $\theta$ in the parameter space $\Theta$. A set of complete-data densities indexed by the parameter vector $\theta$ specifies the researcher's complete-data model: $M_c \equiv \{f(\cdot\,; \theta) : \theta \in \Theta\}$. Note that the researcher's approximation to the missing-data mechanism (specified by the density $q_{h|x}$) is MAR so that $q_{h|x}$ is not functionally dependent upon the unobservable-data record component $z_h(x)$. Furthermore, $q_{h|x}$ is a constant on the parameter space $\Theta$. These are the two conditions that define an ignorable missing-data mechanism (e.g., Little and Rubin 2002, p. 119; also see Heitjan 1994; Kenward and Molenberghs 1998; Nielsen 1997; Rubin 1976). When $q_{h|x}$ is MAR, it will be convenient to define the function $q_{h|y_h}$ such that $q_{h|x}(h \mid x) = q_{h|y_h}(h \mid y_h(x))$ for all $(x, h) \in \mathbb{R}^d \times L^d$.
In contrast to Assumption 3, a non-ignorable missing-data mechanism corresponds to a situation where either: (i) $q_{h|x}$ is functionally dependent upon the unobservable-data record component $z_h(x)$, or (ii) $q_{h|x}$ is not a constant on the complete-data model parameter space $\Theta$. Every model of an MNAR missing-data mechanism is non-ignorable. However, it is possible for a researcher to postulate a non-ignorable MAR or MCAR missing-data mechanism.
In order to specify a missing-data probability model, the researcher constructs a best approximating model of the DGP density $p_{x,h}$ using the approximating complete-data density $f(\cdot\,; \theta)$ for each $\theta$ in $\Theta$ together with the approximating missing-data mechanism $q_{h|x}$. In many practical missing data applications, the missing-data probability model is specified only implicitly, since the researcher explicitly provides only the complete-data density $f(\cdot\,; \theta)$ and implicitly assumes an ignorable missing-data mechanism.
Let $q_{x,h} \equiv f\, q_{h|x}$ specify the researcher's best approximating density of the DGP density $p_{x,h}$. Let $\mathcal{H} \subseteq \{0, 1\}^d$ denote the set of all permissible missing data patterns (i.e., the support of $H_i$). The density $q_{y_h}(\cdot\,; \theta)$ is intended to approximate the observable-data density $p_{y_h}$. The researcher's observable-data model of the data generating process is the set $M_o \equiv \{q_{y_h}(\cdot\,; \theta) : \theta \in \Theta\}$. The density $q_{z_h|y_h,h} = (q_{h|x}/q_{h|y_h})(f/q_{y_h})$ specifies the researcher's model of the distribution of the missing data given the observable data and the missing-data pattern $h$. The observable-data model $M_o$ is called a correctly specified observable-data model if the observable-data DGP density $p_{y_h} \in M_o$ holds $\nu_{y_h}$-a.e. for all $h \in \mathcal{H}$; otherwise $M_o$ is a misspecified observable-data model.
If $p_x \in M_c$, then this is a sufficient, but not necessary, condition for ensuring $M_c$ is correctly specified. In particular, the statement that $p_x \in M_c$ holds $\nu_x$-a.e. is a formal way of acknowledging that it is possible for another density $\ddot{p}$ to have the property that its corresponding cumulative distribution function is exactly the same as the cumulative distribution function for the DGP density $p_x$ even though $\ddot{p}(x) \neq p_x(x)$ for all $x$ in a set on which the sigma-finite measure $\nu_x$ vanishes. For example, let $p_x(x) = 1/2$ for all $|x| \le 1$ with $p_x(x) = 0$ for all $|x| > 1$. Let $\ddot{p}(x) = 1/2$ for all $|x| < 1$ with $\ddot{p}(x) = 0$ for all $|x| \ge 1$. The cumulative distribution functions for $p_x$ and $\ddot{p}$ are identical even though $p_x \neq \ddot{p}$.
A missing-data probability model may be misspecified if either: (i) the complete-data model $M_c$ is misspecified, (ii) the missing-data mechanism $q_{h|x}$ is misspecified, or (iii) both the complete-data model $M_c$ and the missing-data mechanism $q_{h|x}$ are misspecified.
In regression modeling, a complete-data record $x$ is commonly partitioned such that $x = [R, u]$ where $R$ is the regression model response variable and $u$ contains the predictor variables for the regression model. The complete-data probability model specified by $f$ then factors into a regression model for $R$ given $u$ and a model $f_u$ for the predictors.
Thus, misspecification of the researcher's complete-data probability model in a regression modeling application may be due to misspecification of the regression model, the conditional missing predictor variable model, or both. In practice, the researcher's conditional missing predictor model is specified by densities of the form $f_{u_{\text{miss}}|u_{\text{obs}}} \equiv f_u / f_{u_{\text{obs}}}$ where $f_{u_{\text{obs}}}$ is the marginal distribution for the predictors that are fully observable according to the researcher's missing-data probability model. Additional discussion of conditional missing predictor models may be found in Chen (2004) and Ibrahim et al. (1999).

Likelihood Functions, Pseudo-True Parameter Values, and True Parameter Values
Note that the notation log(q) will be used throughout this article to refer to the natural log of q.
Definition 2. Complete-Data Likelihood Function. Assume Assumptions 1, 2(i), and 2(ii) hold. Given a data sample $x^n$, the complete-data likelihood function is $L^x_n(\theta) \equiv \prod_{i=1}^{n} f(x_i; \theta)$ for all $\theta \in \Theta$.
The complete-data likelihood function (e.g., White 1982, 1994; McCullagh and Nelder 1989; Dobson 2002; Little and Rubin 2002) is the usual likelihood function encountered in problems where no missing data is present. In many situations (e.g., when the complete-data probability model contains members of the linear exponential family), the complete-data expected negative log-likelihood $l^x(\cdot) : \Theta \to [0, \infty)$ is convex on the parameter space $\Theta$ (e.g., Kass and Voss 1997, pp. 14-19). In such situations, a strict local minimizer of $l^x$ is the unique global minimizer on the parameter space. For more complicated probability models where $l^x$ has multiple strict local and global minimizers, the parameter space $\Theta$ can sometimes be defined to contain exactly one strict local minimizer of $l^x$.
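Definition 3. Complete-Data True Parameter Value. Assume that Assumptions 1, 2(i), and 2(ii) hold. A global minimizer of the complete-data negative average log-likelihood function $l^x_n$ on the parameter space $\Theta$ is called a complete-data quasi-maximum likelihood estimator. A global minimizer of the complete-data expected negative log-likelihood $l^x$ is called a complete-data pseudo-true parameter value. A parameter value $\theta_0 \in \Theta$ defined such that $f(x; \theta_0) = p_x(x)$ for all $x \in \operatorname{supp} X$ is called a complete-data true parameter value.
Note that when misspecification is present, it is possible that the complete-data true parameter value may not exist because the complete-data model is not capable of representing the complete-data DGP.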
For a missing-data probability model, the missing-data likelihood function is:
$$L^{y,h}_n(\theta) = \prod_{i=1}^{n} \int f(x_i; \theta)\, q_{h|x}(h_i \mid x_i)\, d\nu_{z_{h_i}}(z_i). \tag{1}$$
When the researcher's missing-data mechanism model specified by $q_{h|x}(h \mid x)$ is an ignorable missing-data mechanism model (i.e., see Assumption 3) so that $q_{h|x}(h \mid x) = q_{h|y_h}(h \mid y_h(x))$, then (1) may be rewritten as:
$$L^{y,h}_n(\theta) = \prod_{i=1}^{n} q_{h|y_h}(h_i \mid y_i) \int f(x_i; \theta)\, d\nu_{z_{h_i}}(z_i),$$
or equivalently as:
$$L^{y,h}_n(\theta) = \prod_{i=1}^{n} q_{h|y_h}(h_i \mid y_i)\, q_{y_{h_i}}(y_i; \theta). \tag{2}$$
The likelihood $L^{y,h}_n$ in (2) shows that when the researcher assumes an ignorable missing-data mechanism model, the global maximizers of $L^{y,h}_n$ are only functionally dependent upon the observable data components $y^n \equiv [y_1, \ldots, y_n]$ generated from combining the complete-data patterns $x^n \equiv [x_1, \ldots, x_n]$ and missing-data patterns $h^n \equiv [h_1, \ldots, h_n]$. The global maximizers of $L^{y,h}_n$ are not functionally dependent on either the researcher's choice of missing-data mechanism $q_{h|y_h}$ or an explicit representation of the specific patterns of missingness $h^n$ (Rubin 1976; Schafer 1997, pp. 11-12; Little and Rubin 2002, p. 119). Furthermore, note that although the researcher has assumed an ignorable missing-data mechanism model for the likelihood $L^{y,h}_n$ in (2), this assumption does not imply that the data generating process is actually MAR because the researcher's assumption of an ignorable missing-data mechanism model may be wrong. Thus, maximizing the likelihood $L^{y,h}_n$ in (2) to estimate the parameter values $\theta$ when the data generating process is MNAR is quasi-maximum likelihood estimation (e.g., White 1982, 1994). These remarks thus motivate the following definition of the observable-data likelihood function (e.g., Schafer 1997, pp. 11-12; Little and Rubin 2002, p. 119) that is central to the objectives of this article.
Definition 4. Observable-Data Likelihood Function. Assume that Assumptions 1, 2(i), 2(ii), and 3 hold. Given an observed data sample $(y^n, h^n)$, the observable-data likelihood function is $L_n(\theta) \equiv \prod_{i=1}^{n} q_{y_{h_i}}(y_i; \theta)$ and the observable-data negative average log-likelihood is $l_n(\theta) \equiv -(1/n) \log L_n(\theta)$ for all $\theta \in \Theta$. The observable-data expected negative average log-likelihood $l : \Theta \to [0, \infty)$ is the expectation of $l_n$ with respect to the DGP.
The observable-data negative average log-likelihood $l_n$ is typically not convex on the parameter space $\Theta$ even if the complete-data probability model specified by $f(x; \theta)$ is a member of the linear exponential family (Orchard and Woodbury 1972; Louis 1982). The non-convex property of $l_n$ is a consequence of the Missing Information Principle (see Louis 1982, and Theorem 3), which states that the Hessian of the observable-data negative average log-likelihood is the difference of two positive semidefinite matrices. McLachlan and Krishnan (1997, pp. 91-95; also see Schafer 1997, pp. 52-55; Murray 1977) provide some helpful empirical examples of non-convex missing-data likelihood functions. Thus, in the presence of missing data, it is not unusual for multiple local minimizers, ridges, or multiple global minimizers of the negative average observable-data log-likelihood function to exist. To apply the asymptotic theory in such situations, it may be possible to choose the parameter space $\Theta$ such that $l : \Theta \to [0, \infty)$ is a convex function on $\Theta$ and has a unique global minimizer on $\Theta$.
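To make the observable-data negative average log-likelihood concrete, the sketch below evaluates it for one illustrative complete-data model. The multivariate Gaussian complete-data density, the NaN encoding of unobservable entries, and the function name `observed_data_nll` are assumptions made only for this example; they are not the models analyzed in this article. For a Gaussian, integrating out the unobservable components leaves a Gaussian density on the observable components whose mean and covariance are the corresponding sub-vector and sub-matrix.

```python
import numpy as np

def observed_data_nll(x, h, mu, sigma):
    """Negative average log-likelihood of the observable components,
    assuming a Gaussian complete-data model N(mu, sigma) and an
    ignorable missing-data mechanism (the indicators h only select
    which components of each record enter the likelihood)."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=int)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    total = 0.0
    for xi, hi in zip(x, h):
        obs = hi == 1
        k = int(obs.sum())
        if k == 0:
            continue  # no observable components: contributes log(1) = 0
        resid = xi[obs] - mu[obs]
        cov = sigma[np.ix_(obs, obs)]       # covariance of observed part
        _, logdet = np.linalg.slogdet(cov)
        quad = resid @ np.linalg.solve(cov, resid)
        total += 0.5 * (k * np.log(2.0 * np.pi) + logdet + quad)
    return total / x.shape[0]

mu = np.zeros(2)
sigma = np.eye(2)
x = np.array([[0.0, np.nan],    # second component unobservable
              [0.0, 0.0]])      # fully observed record
h = np.array([[1, 0],
              [1, 1]])
nll = observed_data_nll(x, h, mu, sigma)
```

Note that the function never uses a model of the missing-data mechanism itself, which mirrors the fact that the mechanism factor in (2) does not depend on the parameter vector under the ignorability assumption.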
Definition 5. Observable-Data True Parameter Value. Assume that Assumptions 1, 2(i), and 2(ii) hold. A global minimizer of the observable-data negative average log-likelihood function $l_n$ on the parameter space $\Theta$ is called an observable-data quasi-maximum likelihood estimator. A global minimizer of the observable-data expected negative log-likelihood $l$ is called an observable-data pseudo-true parameter value $\theta^*$. A parameter value $\theta^*_0 \in \Theta$ defined such that for all $y_h \in \operatorname{supp} Y_h$: $q_{y_h}(y_h; \theta^*_0) = p_{y_h}(y_h)$ for each $h \in \mathcal{H}$ is called an observable-data true parameter value.
The observable-data quasi-maximum likelihood estimator is the parameter value that maximizes the likelihood of the observable data in terms of the assumptions associated with the researcher's proposed probability model. In addition (see Equations (1) and (2)), when the DGP missing-data mechanism is MAR the observable-data pseudo-true parameter value is semantically interpretable as identifying the probability distribution in the researcher's observable-data probability model that is most similar to the probability distribution that generated the observed data (i.e., the distribution specified by the density $p_{y_h,h}$) using the Kullback-Leibler Information Criterion (e.g., White 1982, 1994; Kullback and Leibler 1951). If the DGP missing-data mechanism is MAR, the complete-data probability model is correctly specified, and the observable-data expected negative log-likelihood is convex, then the observable-data true parameter value and the complete-data true parameter value are identical (see Proposition 3(iii) of this article).
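The link between minimizing an expected negative log-likelihood and the Kullback-Leibler Information Criterion can be made explicit with a short, generic derivation. The symbols $W$, $p$, $q(\cdot\,;\theta)$, and $D$ below are placeholders introduced only for this sketch (they stand for a generic observed record, its DGP density, a model density, and the criterion), and the expectations are assumed to be finite.

```latex
\begin{align*}
D\bigl(p \,\|\, q(\cdot\,;\theta)\bigr)
  &\equiv \mathrm{E}_p\!\left[\log \frac{p(W)}{q(W;\theta)}\right] \\
  &= \mathrm{E}_p\bigl[\log p(W)\bigr]
     \;-\; \mathrm{E}_p\bigl[\log q(W;\theta)\bigr].
\end{align*}
% The first term does not depend on \theta, so
\begin{equation*}
\arg\min_{\theta \in \Theta} D\bigl(p \,\|\, q(\cdot\,;\theta)\bigr)
  \;=\; \arg\min_{\theta \in \Theta}
        \Bigl\{-\mathrm{E}_p\bigl[\log q(W;\theta)\bigr]\Bigr\}.
\end{equation*}
```

Read in this way, a pseudo-true parameter value picks out the member of the model that is closest to the data generating density in the Kullback-Leibler sense, whether or not the model is correctly specified.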
Since the researcher is assuming an ignorable missing-data mechanism, this means that the observable-data pseudo-true parameter value is calculated without incorporating knowledge of the observed patterns of missingness and without incorporating an explicit missing-data mechanism into the researcher's probability model. These assumptions correspond to potentially serious misspecification errors when the DGP missing-data mechanism is MNAR.

Moment Assumptions
Let ∇ denote the gradient operator with respect to θ ∈ Θ yielding an r-dimensional column vector. Let ∇ 2 denote the Hessian operator with respect to θ ∈ Θ yielding an r × r symmetric matrix. Let the norm, · , of an r-dimensional vector θ be defined such that: be defined such that for all y ∈ supp y h (X) and for all θ ∈ Θ: for all x ∈ supp X and for all θ ∈ Θ.

Assumption 4. Domination Conditions. For each
(d) each element of $\nabla^2 \log q_{y_h}$ is dominated on $\Theta$ with respect to $p_{y_h}$; and (ii) there exists a finite positive number $K$ such that for all $x \in \operatorname{supp} X$ and for all $\theta \in \Theta$: $f(x; \theta) \le K p_x(x)$.
Assumption 4 holds under fairly general conditions. Assumption 4(i) corresponds to standard maximum likelihood regularity assumptions applied to the observable probability model representation (e.g., White 1982, 1994; Serfling 1980). Assumption 4(ii) is a relatively weak condition which states that the likelihood assigned to an environmental event by the researcher's complete-data probability model must have a (generous) upper bound determined by the likelihood of that event in the environment.
Simple verifiable conditions for ensuring Assumption 4 holds are: (i) the DGP is bounded (i.e., $|X_i| \le K$ with probability one for some finite number $K$), (ii) $\log f(x; \theta)$ is piecewise continuous in its first argument and twice continuously differentiable in its second argument, (iii) the parameter space $\Theta$ is a closed and bounded set, and (iv) each realization $x$ that is possible under the researcher's model (i.e., $f(x; \theta) > 0$ for all $\theta \in \Theta$) is also possible in the statistical environment with likelihood greater than some positive number $\varepsilon$ (i.e., $p_e(x) > \varepsilon$).
Assumption 5(i) is an identifiability assumption that is commonly made for the purposes of establishing consistency and asymptotic normality for maximum likelihood estimators in the completely observable case (e.g., White 1982). Assumption 5(i) will fail, for example, if the complete-data model contains redundant or irrelevant parameters (e.g., White 1982). However, the nature of the actual missing-data mechanism may also cause Assumption 5(i) to fail. For example, there may not be sufficient information in the observed data to uniquely specify how all predictor variables in a regression model covary.
To state our next assumption, let g y h (y; θ) ≡ −∇ log q y h (y; θ) and write A(θ) ≡ ∇ 2 l(θ) and Assumption 6(ii) is used in order to apply the Lindeberg-Levy Central Limit Theorem (e.g., White 1984, Theorem 5.2). Violations of Assumption 6(i) and Assumption 6(ii) may be interpreted as analogous to the presence of multicollinearity in the special case of linear regression modeling.

Theorems
In this section, the explicit regularity conditions developed in Section 2 are used to formulate and prove key theorems applicable to missing data models comprised of a possibly misspecified complete-data model and a missing-data mechanism that is possibly misspecified as ignorable. Many of the results are applicable to DGPs with either MAR or MNAR missing-data mechanisms.
Theorem 1 establishes conditions for the QMLE to converge to the pseudo-true parameter value $\theta^*$ even if the complete-data model is misspecified. Theorem 2(i) establishes conditions for the quasi-maximum likelihood estimates to have an asymptotic multivariate Gaussian distribution centered at $\theta^*$ with a sandwich covariance matrix even if the complete-data model is misspecified. In addition, Theorem 2(ii) establishes that the Hessian and OPG covariance matrices equal each other and the sandwich covariance matrix when the observable-data model is correctly specified, so that they may differ when it is misspecified. This latter result is important for two reasons. First, it suggests alternative methods for estimating the QMLE covariance matrix. Second, Theorem 2(ii) implies that when the Hessian and OPG covariance matrices are not equal, the observable-data model is misspecified, regardless of whether the researcher has correctly specified the missing-data mechanism as ignorable and even if the data are MNAR.
Theorem 3 provides explicit regularity conditions for the Missing Information Principle of Orchard and Woodbury (1972; also see Louis 1982) to hold for ignorable missing-data mechanisms. In addition, Theorem 3 introduces several new forms of the Missing Information Principle that are important for interpreting and computing the covariance matrix of the missing-data maximum likelihood estimates in terms of the researcher's complete-data model. Proposition 2 establishes conditions for computationally tractable formulas for consistent gradient and covariance matrix estimators. Theorem 4 describes how the shape of the complete-data expected likelihood and the amount of missing information influence the convexity of the expected observable-data negative log-likelihood function for both MAR and MNAR environments. Proposition 3 summarizes some key results regarding local and global identifiability of the observable-data pseudo-true parameter values, observable-data true parameter values, and complete-data true parameter values.

Quasi-Maximum Likelihood Estimation for Possibly Misspecified Missing Data Models
Let the missing-data gradientḡ n : R r → R r be defined such that for all θ ∈ Θ and for all (y n ,

Proposition 1. Missing-Data Average Negative Log Likelihood Function and Gradient Estimation.
Assume that Assumptions 1, 2(i), 2(ii), 2(iii), 4(i)a, and 5 hold. Then as n → ∞, l n (· ; Y n , H n ) → l and g n (· ; Y n , H n ) → g uniformly on Θ with probability one. In addition, l and g are continuous on Θ.
Theorem 1 provides primitive conditions for ensuring the consistency of the missing-data quasi-maximum likelihood estimator $\hat{\theta}_n$ for missing-data models involving assumed ignorable missing-data mechanisms in the possible presence of model misspecification.

QMLE Asymptotic Distribution for Possibly Misspecified Missing Data Models
Now consider the asymptotic distribution of $\hat{\theta}_n$, which is a unique minimizer of $l_n$ on a closed and bounded parameter space $\Theta$ that contains $\theta^*$ in its interior.
Theorem 2(i) (whose proof follows directly from the methods of Huber 1967; White 1982, 1994) provides explicit regularity conditions delivering the asymptotic distribution of the quasi-maximum likelihood estimator for possibly misspecified models in the presence of missing data. The covariance matrix of the quasi-maximum likelihood estimate is specified by the missing-data robust covariance matrix (Robins and Wang 2000; Clayton et al. 1998; Arminger and Sobel 1990):

$$C^* = (A^*)^{-1} B^* (A^*)^{-1}. \qquad (6)$$

Wang and Robins (1998, p. 937) and Robins and Wang (2000, pp. 114-15) simply assume that the conclusions of Theorem 1 and Theorem 2 hold in their development of an asymptotic statistical theory of parameter estimation using both single and multiple imputation methods for possibly misspecified models with ignorable missing-data mechanisms. Thus, Theorems 1 and 2 are also useful for providing a set of primitive assumptions for the missing-data asymptotic theory of imputation methods developed by Robins and Wang (2000) and Wang and Robins (1998).
Theorem 2(ii) provides explicit regularity conditions supporting the assertion that if the observable-data probability model is correctly specified (regardless of the correct specification of the researcher's missing-data mechanism), then the Information Matrix Equality ($A^* = B^*$) holds so that $C^*$ can be replaced with either the missing-data Hessian covariance matrix formula (e.g., Little and Rubin 2002; Meng and Rubin 1991; Jamshidian and Jennrich 2000):

$$C^* = (A^*)^{-1} \qquad (7)$$

or the missing-data OPG (Outer-Product Gradient) covariance matrix formula (e.g., Berndt et al. 1974; also see McLachlan and Krishnan 1997, p. 122):

$$C^* = (B^*)^{-1}. \qquad (8)$$

Note that the second part of Theorem 2, 2(ii), states that when (6), (7), and (8) are not equivalent (i.e., $C^* \neq (A^*)^{-1}$ and $C^* \neq (B^*)^{-1}$), then the missing-data Hessian covariance matrix formula in (7) and the missing-data OPG covariance matrix formula in (8) are not correct. In this situation, the formula for the missing-data robust covariance matrix $C^*$ in (6) must be used instead of (7) and (8) in order to ensure reliable statistical inferences for possibly misspecified probability models whose parameters are estimated from missing data.

Validity of Missing Information Principles When Model Misspecification Is Present
The Uniform Law of Large Numbers (e.g., Jennrich 1969, Theorem 2) with Assumptions 1, 2, and 4 provides a variety of convenient ways to estimate $A^*$, $B^*$, and $C^*$. Let $\hat{g}_{y_h} = -\int \nabla \log f(x; \hat{\theta}_n)\, \psi_{z_h|y_h}(z_h \mid y_h; \hat{\theta}_n)\, d\nu_{z_h}(z_h)$. Estimate $(B^*)^{-1}$ using the missing-data OPG (Outer Product Gradient) covariance matrix estimator $\hat{B}_n^{-1}$. The missing-data Hessian covariance matrix estimator is $\hat{A}_n^{-1}$, where $\hat{A}_n \equiv \nabla^2 l_n(\hat{\theta}_n; y_n, h_n)$. A convenient estimator for the missing-data robust covariance matrix $C^*$ is given by the missing-data robust covariance matrix estimator $\hat{C}_n \equiv \hat{A}_n^{-1} \hat{B}_n \hat{A}_n^{-1}$. However, in practice, it is desirable to obtain computational formulas for estimating the asymptotic covariance matrix of the parameter estimates $C^*$ which are expressed in terms of the first and second derivatives of the complete-data probability model $M_c \equiv \{f(x; \theta) : \theta \in \Theta\}$ rather than the observable-data probability model $M_o$. In this section, such formulas will be developed for situations where the observable-data model may be misspecified, following the classical derivation of the Missing Information Principle, which does not explicitly assume that model misspecification may be present (Woodbury 1971; Orchard and Woodbury 1972; also see Louis 1982, eq. 3.1, and McLachlan and Krishnan 1997, p. 100).
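The three estimators can be sketched numerically for a deliberately simple case. The example below is an illustration under stated assumptions, not the article's procedure: the postulated complete-data model is a pair of independent unit-variance Gaussians with unknown mean vector, missing entries are coded as `nan`, and the mechanism is treated as ignorable, so each record contributes only its observed coordinates to the observable-data negative log-likelihood. The data values and variable names are hypothetical.

```python
import numpy as np

# Hypothetical records; nan marks a missing entry in the second coordinate.
Y = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [7.0, np.nan],
              [9.0, 4.0],
              [11.0, 0.0]])
n = Y.shape[0]
mask = ~np.isnan(Y)

# QMLE for this model: the mean of the observed entries in each coordinate.
theta_hat = np.nanmean(Y, axis=0)

# Per-record gradient of the observable-data negative log-likelihood:
# -(y_j - theta_j) for observed coordinates and 0 for missing ones.
grads = -np.where(mask, Y - theta_hat, 0.0)

# Hessian estimator: average of the per-record Hessians diag(mask_i).
A_hat = np.diag(mask.mean(axis=0))
# OPG estimator: average outer product of the per-record gradients.
B_hat = grads.T @ grads / n
# Robust (sandwich) covariance matrix estimator.
A_inv = np.linalg.inv(A_hat)
C_hat = A_inv @ B_hat @ A_inv

print(np.linalg.inv(A_hat))  # Hessian covariance estimate
print(np.linalg.inv(B_hat))  # OPG covariance estimate
print(C_hat)                 # sandwich covariance estimate
```

Because the hypothetical data are far from unit variance, the three estimates disagree noticeably here, which is the situation in which only the sandwich form remains appropriate.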

Proposition 2. Missing-Data Gradient Computation Formulas.
Assume that Assumptions 1, 2, 3, and 4 hold. Then, (i) and (ii)

Proposition 2(i) provides an expression useful for computing the OPG missing-data covariance matrix estimator $\hat{B}_n(\hat{\theta}_n; Y_n, H_n)^{-1}$. Moreover, when (9) is used to evaluate $\bar{g}_n(\theta; y_n, h_n)$, then we obtain Equation (2.13) from Orchard and Woodbury (1972; also see Woodbury 1971) and Equation (3.1) from Louis (1982). Proposition 2(ii) establishes that the conditional independence relation $q_{z_h|y_h,h} = q_{z_h|y_h}$ holds and provides the convenient computational formula for an ignorable-type complete-data generation model:

We now provide formal conditions for Orchard and Woodbury's Equation (2.15) (Orchard and Woodbury 1972; also see Woodbury 1971) and Louis's (1982) Equation (3.2) to hold by investigating the structure of $\bar{A}_n(\hat{\theta}_n; Y_n, H_n)$. When they exist, let and

When Assumption 3 holds, then $\dot{A}_n(\theta; y_n, h_n)$ can be interpreted as the Hessian conditional complete-data information matrix and $\ddot{A}_n(\theta; y_n, h_n)$ can be interpreted as the Hessian conditional missing information matrix. Note that in the special case where all records are complete, $\ddot{A}_n(\theta; y_n, h_n)$ vanishes, so the trace of $\ddot{A}_n(\theta; y_n, h_n)$ may be interpreted as a measure of the amount of missing information. When it exists, let:

When they exist, let
which is referred to as the OPG conditional complete-data information matrix when Assumption 3 holds. Inspecting (13), we see that $\dot{B}_n(\theta; y_n, h_n)$ is positive semidefinite for all $\theta$. Let Then, when it exists, which is referred to as the OPG conditional missing information matrix when Assumption 3 holds. The trace of the positive semidefinite matrix $\ddot{B}_n(\cdot\,; y_n, h_n)$ defined in (15) may be interpreted as a measure of the amount of missing data. In the special case where there are no missing data, the trace of (15) vanishes.

$\bar{B}(\theta) = \dot{B}(\theta) - \ddot{B}(\theta).$
Theorem 3 provides explicit regularity conditions for the Missing Information Principle (MIP) to hold for the case where either (or both) the missing-data model is possibly misspecified as ignorable or the researcher's complete-data model is misspecified. The formula $\bar{A} = \dot{A} - \ddot{A}$ in Equation (16), which expresses the observable-data Hessian as the Hessian conditional complete-data information matrix minus the Hessian conditional missing information matrix, corresponds to the MIP presented in Equation 2.15 from Orchard and Woodbury (1972; also see Jank and Booth 2003; McLachlan and Krishnan 1997, pp. 101-3, 111-13; Meng and Rubin 1991). Substituting the relation in (18) into Equation (16) yields an alternative form of the MIP discussed by Louis (1982, eq. 3.2), which is $\bar{A}_n = \dot{A}_n - \ddot{B}_n$, where $\ddot{B}_n$ is the OPG conditional missing information matrix. Both of these forms of the MIP are valid in the presence of an observable-data model which is possibly misspecified.
If the observable-data model is correctly specified, then the results of Theorem 2(ii) imply that $\bar{A}_n = \bar{B}_n$, which may be combined with the results of Theorem 3 to obtain two additional MIPs that are valid when the observable-data model is correctly specified: $\bar{A}_n = \dot{B}_n - \ddot{B}_n$ and $\bar{A}_n = \dot{B}_n - \ddot{A}_n$, where the single-dot and double-dot matrices denote the conditional complete-data and conditional missing information matrices in Hessian ($A$) or OPG ($B$) form. To our knowledge, these four specific forms of the MIP and their validity in the presence of model misspecification have not been discussed in the literature.
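A small worked check may help fix ideas. The sketch below verifies the Louis-type form of the MIP (observable-data Hessian equals conditional complete-data information minus the conditional covariance of the complete-data score) for a single record of a bivariate Gaussian with known covariance and unknown mean whose second coordinate is missing. The covariance values and all variable names are assumptions for illustration, and the model is taken to be correctly specified, so this is a special case rather than the general misspecified setting treated by Theorem 3.

```python
import numpy as np

# Assumed known covariance of the bivariate Gaussian complete-data model.
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
sigma_inv = np.linalg.inv(sigma)

# Hessian of the complete-data negative log-likelihood with respect to the
# mean is sigma_inv for every x, so its conditional expectation is sigma_inv.
a_complete = sigma_inv

# The complete-data score is -sigma_inv (x - mu). Given x1, only x2 is random,
# with conditional variance sigma22 - sigma12^2 / sigma11.
cond_var = sigma[1, 1] - sigma[0, 1] ** 2 / sigma[0, 0]
cov_x_given_x1 = np.array([[0.0, 0.0],
                           [0.0, cond_var]])
b_missing = sigma_inv @ cov_x_given_x1 @ sigma_inv

# Observable-data Hessian: x1 alone is N(mu1, sigma11), so only mu1 is informed.
a_observable = np.array([[1.0 / sigma[0, 0], 0.0],
                         [0.0, 0.0]])

mip_holds = np.allclose(a_observable, a_complete - b_missing)
print(mip_holds)
```

The zero row and column of the observable-data Hessian reflect that a record with only the first coordinate observed carries no direct information about the second mean.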

Detection of Model Misspecification in the Presence of Missing Data
It is important to note that the contrapositive of Theorem 2(ii) may be used as the basis for the detection of model misspecification in the observable-data model, since Theorem 2(ii) implies that a failure of the Information Matrix Equality indicates the presence of model misspecification in the researcher's observable-data model. Thus, Theorem 2(ii) suggests a new approach to checking for model misspecification in the presence of missing data for models with assumed ignorable missing-data mechanisms. If the missing-data Hessian covariance matrix estimator $\hat{A}_n^{-1}$ and the missing-data OPG covariance matrix estimator $\hat{B}_n^{-1}$ are asymptotically different, then this indicates the presence of model misspecification in the observable-data model.
More specifically, the general theoretical framework for developing Generalized Information Matrix Tests (GIMTs) for the detection of model misspecification (Golden et al. 2013, 2016) may be applied to construct a wide range of entirely new misspecification tests for detecting model misspecification in observable-data models that assume an ignorable decimation mechanism. Further, this method is valid for both MAR and MNAR environments regardless of whether the missing-data mechanism has been correctly specified as ignorable.
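The comparison underlying this idea can be illustrated informally. The sketch below is not a GIMT and computes no test statistic; it only contrasts the Hessian and OPG estimators for a scalar Poisson rate model fitted to simulated counts with entries deleted completely at random. The simulation settings, the helper name `opg_to_hessian_ratio`, and the use of a simple ratio as a discrepancy measure are all assumptions made for illustration.

```python
import numpy as np


def opg_to_hessian_ratio(counts):
    """Ratio B_hat / A_hat for a Poisson rate model fitted to observed entries.

    Per-record negative log-likelihood (up to a constant): rate - x * log(rate).
    Averages are taken over observed records only; averaging over all records
    would rescale A_hat and B_hat by the same factor and leave the ratio unchanged.
    """
    observed = counts[~np.isnan(counts)]
    rate_hat = observed.mean()                     # QMLE of the rate
    grads = 1.0 - observed / rate_hat              # per-record gradient
    a_hat = np.mean(observed / rate_hat ** 2)      # Hessian estimator
    b_hat = np.mean(grads ** 2)                    # OPG estimator
    return b_hat / a_hat


rng = np.random.default_rng(0)
n = 20000
missing = rng.random(n) < 0.3                      # assumed MCAR deletion

poisson_counts = rng.poisson(3.0, size=n).astype(float)
overdispersed_counts = rng.negative_binomial(3, 0.5, size=n).astype(float)
poisson_counts[missing] = np.nan
overdispersed_counts[missing] = np.nan

ratio_correct = opg_to_hessian_ratio(poisson_counts)
ratio_misspecified = opg_to_hessian_ratio(overdispersed_counts)
print(ratio_correct, ratio_misspecified)
```

A ratio near one is consistent with the Information Matrix Equality, while a ratio well above one for the overdispersed counts reflects misspecification of the Poisson model. A formal decision would require the sampling distribution supplied by a specification test.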
In practice, researchers are often interested in whether the complete-data model, rather than the observable-data model, is misspecified. However, misspecification of the observable-data model when an ignorable missing-data mechanism is postulated implies that either the complete-data model is misspecified or the missing-data mechanism is MNAR. Thus, Theorem 2(ii) in conjunction with the GIMT framework (Golden et al. 2013, 2016) provides a method for the detection of misspecification of complete-data probability models and also provides a method for detecting the presence of an MNAR data generating process in situations where the complete-data model is known, in fact, to be correctly specified. Nonetheless, it is important to emphasize that the GIMT method is only capable of detecting some types of misspecification of the researcher's complete-data model. For example, suppose that the complete-data model was misspecified and the missing-data mechanism was correctly specified as ignorable. Situations may exist where the presence of the missingness "hides" misspecification in the observable-data model and thus the complete-data model is not identified as misspecified. This occurs because the missing-data mechanism sometimes renders the presence of model misspecification unobservable. Further, this method for detecting model misspecification by checking the Information Matrix Equality cannot directly detect misspecification of the missing-data mechanism because the Information Matrix Equality is not functionally dependent upon the missing-data mechanism. However, misspecification of the missing-data mechanism as ignorable may be indirectly detected by the GIMT method in the special case where the complete-data model is known to be correctly specified and misspecification is detected in the observable-data model.
In this situation, the correctly specified complete-data model (alternative) serves to provide the necessary distribution assumption (Jaeger 2006; Molenberghs et al. 2008; Rhoads 2012) to test for the presence of a nonignorable missing-data mechanism.

exists. The quantity $\xi^*_B \equiv \xi_B(\theta^*)$ is called the OPG fraction of information loss. (iii) The robust fraction of information loss function $\xi_C : \Theta \to R$ is defined such that for all θ ∈ Θ: ξ C (θ) = λ max C(θ) C(θ) where (θ) when C(θ) exists. The quantity $\xi^*_C \equiv \xi_C(\theta^*)$ is called the robust fraction of information loss.

Dempster et al. (1977; also see McLachlan and Krishnan 1997; and Little and Rubin 2002, p. 177) discuss the Hessian fraction of information loss. In particular, they define the fraction of information loss as the largest eigenvalue of A The fraction of information loss function will typically take on non-negative values that are no greater than one on the parameter space because the parameter space is usually chosen so that the expected negative observed-data log-likelihood function is convex on the parameter space. However, in regions of the parameter space where the expected negative observed-data log-likelihood is not convex (e.g., saddle points), the fraction of information loss function can take on values that are greater than one (Dempster et al. 1977, p. 10; McLachlan and Krishnan 1997, p. 107).

ξ A on Γ is the set of non-negative real numbers.
(ii) Assume that A is positive definite on a non-empty open convex subset Γ of Θ. Both ξ A (θ) ≤ 1 or .. ξ A (θ) ≤ 1 for all θ ∈ Γ if and only if l is convex on Γ. In addition, the range of ξ A and .. ξ A on Γ is the set of non-negative real numbers.
The assumption that A is positive definite on the parameter space is not very restrictive in practice. Under typically assumed regularity conditions, A will be positive definite on the parameter space when the complete-data negative log-likelihood is strictly convex on the parameter space. For example, standard regularity conditions will typically ensure the complete-data negative log-likelihood is strictly convex when the complete-data model is a member of the linear exponential family (Kass and Voss 1997, pp. 14-19). The additional condition ξ A ≤ 1 may be semantically interpreted to mean that the observable-data expected negative log-likelihood will be convex if and only if the fraction of information loss function (i.e., amount of "missingness" in the data) is not too large over the parameter space.
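To make the role of the fraction of information loss concrete, the sketch below computes it for a bivariate Gaussian mean model with known covariance in which a fixed fraction of records is missing the second coordinate. The definition used, the largest eigenvalue of the inverse complete-data information times the missing information (in the spirit of Dempster et al. 1977), together with the covariance values, the missingness fraction, and all variable names, are assumptions made for illustration; the model is taken to be correctly specified.

```python
import numpy as np

# Assumed known covariance and assumed fraction of records missing x2.
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
sigma_inv = np.linalg.inv(sigma)
missing_fraction = 0.3

# Information about the mean carried by a record with only x1 observed.
info_x1_only = np.diag([1.0 / sigma[0, 0], 0.0])

# Expected complete-data information and expected missing information.
a_complete = sigma_inv
a_missing = missing_fraction * (sigma_inv - info_x1_only)

# Observable-data Hessian via the Missing Information Principle.
a_observable = a_complete - a_missing

# Fraction of information loss: largest eigenvalue of a_complete^{-1} a_missing.
eigs = np.linalg.eigvals(np.linalg.inv(a_complete) @ a_missing)
xi_a = float(np.max(eigs.real))

# Convexity check: smallest eigenvalue of the observable-data Hessian.
min_eig = float(np.linalg.eigvalsh(a_observable).min())
print(xi_a, min_eig)
```

In this assumed setting the fraction of information loss equals the fraction of records with a missing coordinate, and because it is below one the observable-data Hessian remains positive definite.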
Proposition 3. Identifiability. Assume that Assumptions 1, 2(i), 2(ii), and 4(i)(a) hold. Let Γ be a non-empty open subset of the parameter space Θ. Assume that the observable-data expected negative log-likelihood l is a convex function on Γ. Let θ * ∈ Γ be a strict local minimizer of l. Then the following assertions hold.
(i) The minimizer θ * is the unique global minimizer of l on Γ.
(ii) If the missing-data mechanism $p_{h|x}$ is MAR and the observable-data model is correctly specified on Γ, then the unique global minimizer θ* is the unique observable-data true parameter value for l on Γ.
(iii) If the missing-data mechanism $p_{h|x}$ is MAR and the complete-data model is correctly specified on Γ, then the unique global minimizer θ* is both the observable-data true parameter value and the complete-data true parameter value for l on Γ.
The assumption that θ * ∈ Γ is a strict local minimizer of l means that Assumption 5 holds for θ * . In summary, Theorem 4 provides regularity conditions that ensure the observable-data expected negative log-likelihood is convex on the parameter space provided the complete-data expected negative log-likelihood is strictly convex and the amount of missingness is sufficiently small. Given that the observable-data expected negative log-likelihood is strictly convex, then it follows (with some additional regularity conditions) from Proposition 3(iii) that for MAR data any strict local minimizer of the observable-data expected negative log-likelihood is the unique global minimizer and that global minimizer corresponds to the unique complete-data true parameter value.

Summary and Conclusions
In practice, it is challenging to develop probability models involving missing data where either the researcher's complete-data model or the missing-data mechanism may be misspecified. To directly address this issue, a formal theoretical framework that explicitly discusses the consequences of model misspecification has been introduced, which encompasses robust estimation, inference, and specification analysis methods for models involving missing data. The main results of our theoretical framework, summarized in Table 1, follow from the key assumption of an i.i.d. partially observable data generating process (see Assumption 1) that may be either of type MAR or MNAR. Another key assumption is that the probability model of the i.i.d. missing DGP has been modeled by the researcher as a complete-data model with a postulated ignorable missing-data mechanism. Thus, our framework is specifically designed to investigate the consequences of misspecification in cases that also include the situation where the missing DGP is an MNAR mechanism, but the researcher assumes that it is a MAR mechanism.

Table 1. Key theoretical results for probability models with assumed ignorable missing-data mechanisms.

Consistency (Theorem 1). The QMLE is a consistent estimator of the pseudo-true parameter values for observable-data probability models with an assumed ignorable missing-data mechanism in the presence of a missing DGP specified by a MAR or MNAR missing-data mechanism.

Asymptotic Distribution (Theorem 2(i)). The asymptotic distribution of the QMLE is Gaussian with covariance matrix $C^* = (A^*)^{-1} B^* (A^*)^{-1}$ for observable-data probability models with an assumed ignorable missing-data mechanism in the presence of a missing DGP specified by a MAR or MNAR missing-data mechanism.

Misspecification Detection (Theorem 2(ii)). A GIMT may be used to detect the presence of misspecification in the observable-data probability model with an assumed ignorable missing-data mechanism in the presence of a missing DGP that is a MAR or MNAR missing-data mechanism. If this observable-data probability model is misspecified, this implies the complete-data probability model is misspecified when the missing-data mechanism is correctly specified as ignorable.

Missing Information Principles (Theorem 3). Let $l_n(\theta) = -n^{-1} \sum_{i=1}^{n} \log q_{y_{h_i}}(y_i; \theta)$ denote the observable-data negative average log-likelihood. The Hessian of $l_n(\theta)$ in the presence of possible model misspecification may be estimated using either $\bar{A}_n = \dot{A}_n - \ddot{A}_n$ or $\bar{A}_n = \dot{A}_n - \ddot{B}_n$, where the single-dot and double-dot matrices denote the conditional complete-data and conditional missing information matrices in Hessian ($A$) or OPG ($B$) form. If, in addition, either the observable-data or the complete-data model is correctly specified, then the Hessian of $l_n(\theta)$ may be estimated using either $\bar{A}_n = \dot{B}_n - \ddot{B}_n$ or $\bar{A}_n = \dot{B}_n - \ddot{A}_n$.

Identifiability (Proposition 3). Assume that the observable-data negative log-likelihood is convex on a convex region, Γ, of the parameter space with a unique global minimizer in the interior of Γ. Assume that the observable-data model is correctly specified and the missing-data mechanism is correctly specified as ignorable. Then that global minimizer is the observable-data model true parameter value. If, in addition, the complete-data model is correctly specified on Γ, then the unique global minimizer on Γ is the complete-data model true parameter value.

Fraction of Information Loss (Theorem 4). If the amount of missing data as measured by the Fraction of Information Loss is small and the complete-data model negative log-likelihood is strictly convex on a convex region of the parameter space, then with appropriate regularity conditions the observable-data negative log-likelihood will be convex on that convex region of the parameter space.
Estimation. Theorem 1 establishes that the quasi-maximum likelihood estimator (QMLE) is a consistent estimator of the pseudo-true parameter values of the observable-data model with an assumed ignorable missing-data mechanism (MCAR, MAR). Further, QMLEs are shown to be consistent for the observable-data model in the presence of an MNAR missing-data mechanism. Our framework not only characterizes the asymptotic behavior of the quasi-maximum likelihood estimators in the presence of model misspecification and missing data, but also provides conditions for those estimators to converge to the true parameter values for the complete-data model. When the amount of missing data as measured by the Fraction of Information Loss is small and the complete-data model negative log-likelihood is strictly convex on a convex region of the parameter space, then the observable data negative log-likelihood will be convex on the same region of the parameter space (Theorem 4). This key result supports the Identifiability Proposition 3 that shows when the observable-data model or the complete-data model true parameter values may be estimated.
Inference. In our framework, the correct specification of the complete-data model always implies correct specification of the observable-data model. Therefore, a key theoretical result provided in Theorem 2(ii) is that if the complete-data probability model is correctly specified and the missing-data mechanism is correctly specified as ignorable, then any of the robust missing-data sandwich covariance matrix estimator $\hat{C}_n$, the missing-data Hessian covariance matrix estimator $\hat{A}_n^{-1}$, or the missing-data OPG covariance matrix estimator $\hat{B}_n^{-1}$ may be used for the purposes of estimating the covariance matrix of the observable-data pseudo-true parameter estimates. However, in general, only the missing-data sandwich covariance matrix estimator $\hat{C}_n$ can be used to obtain unbiased estimates of the covariance matrix of the observable-data pseudo-true parameter estimates. For this reason, it is recommended that researchers always use the robust missing-data sandwich covariance matrix estimator instead of the missing-data Hessian covariance matrix estimator or the missing-data OPG covariance matrix estimator.
In addition, the Missing Information Principle (MIP) provides a computationally useful way of expressing the gradient and Hessian of the observable-data likelihood in terms of the gradient and Hessian of the complete-data likelihood. In practice, the gradient and Hessian of the complete-data likelihood are usually more available. We show that MIP as described by Louis (1982) and the MIP formula described by Dempster et al. (1977) are equivalent and valid for the special case where the researcher correctly postulates an ignorable missing-data mechanism but may possibly misspecify the complete-data model. We also provide two additional new MIPs which are valid when the researcher's observable-data probability model is correctly specified. These results are summarized in Theorem 3.
Specification Analysis. Our theory supports a new approach for the detection of model misspecification in missing data problems using the results of Theorem 2(ii) with the Generalized Information Matrix Test (GIMT) methods of Golden et al. (2013, 2016). Under the assumption that the missing-data mechanism is possibly misspecified, but postulated as ignorable (MAR), the GIMT method for specification testing can be used to detect the presence of model misspecification in the observable-data model. In practice, these results serve to elucidate the consequences of how ignorable and nonignorable missing-data mechanisms may affect a complete-data model, which may be either correctly specified or misspecified (White 1982, 1994), when the researcher has postulated an ignorable mechanism. Table 2 depicts the relationships of missing-data mechanisms to the specification of the complete-data model when the observable-data model has been determined to be misspecified and the researcher's model of the missing-data mechanism is postulated as ignorable. As shown, it can be concluded that the complete-data model is misspecified when the observable-data model is misspecified in the presence of a MAR mechanism. Notably, in the special case where the complete-data model is known to be correctly specified, the presence of an MNAR missing-data mechanism may be detected within this framework. This result follows as a consequence of an ignorable missing-data mechanism not affecting the specification of the observable-data model. Thus, rejecting the null hypothesis for a specification test on the observable-data model evidences the presence of a nonignorable missing-data mechanism (MNAR). Such a test may also be viewed as testing the missing at random (MAR) hypothesis (Jaeger 2006; Lu and Copas 2004; Rhoads 2012).
Finally, when the complete-data model may possibly be misspecified, the determination that the observable-data model is misspecified based on specification testing leads to the conclusion that either the complete-data model is misspecified or the missing-data mechanism is MNAR.
1 Researcher's model of missing-data mechanism is postulated as ignorable.
2 Maximum likelihood estimates obtained by maximizing the observable-data likelihood (Equation (3)).
3 Generalized Information Matrix Tests (GIMT) (Golden et al. 2013, 2016) may be applied to detect misspecification in the observable-data model (Theorem 2(ii)).
4 GIMT provides a statistical test for detecting if the partially observable DGP has an MNAR missing-data mechanism in situations where the complete-data model is known to be correctly specified.
In conclusion, our framework provides a unified theory, which formalizes methods that provide insights into the robustness of maximum likelihood estimation, inference, and specification analysis when misspecification is present in the researcher's complete-data model or the missing-data ignorability assumption is incorrect. These results will assist researchers to more explicitly understand the consequences of the inevitable occurrence of model misspecification in missing data analysis problems, as well as the use of appropriate robust methods, when pursuing the goal of developing correctly specified models.

gives for each θ ∈ Θ and then differentiating twice to obtain:

$$\int \nabla^2 q_{y_h}(y; \theta)\, d\nu_{y_h}(y) = 0_{r \times r}$$

which after some algebra becomes:

$$\int \left[ \nabla \log q_{y_h}(y; \theta)\, \nabla \log q_{y_h}(y; \theta)^T + \nabla^2 \log q_{y_h}(y; \theta) \right] q_{y_h}(y; \theta)\, d\nu_{y_h}(y) = 0_{r \times r} \qquad (A1)$$

If the observable-data probability model is correctly specified, then a $\theta^*_0$ exists such that $q_{y_h}(y_h(x); \theta^*_0) = p_{y_h}(y_h(x))$ a.e.-$\nu_{y_h}$ for each $h \in H$. It then follows that $q_{y_h}(y_h(x); \theta^*_0) = p_{y_h}(y_h(x))$ may be substituted into (A1) to obtain: $B^* - A^* = 0_{r \times r}$.

Proof of Proposition 2.
(i) The gradient of l n (θ; y n , h n ),ḡ n (θ; y n , h n ) ≡ ∇l n (θ; y n , h n ), is: and the result follows.
(ii) Using the definitions of q z h |y h ,h and q y h ,h in Section 2.2, substitute and use the ignorability Assumption 3 q h|x (h|x) = η h|y h h y h (x) to obtain: Thus, q z h |y h ,h (z h (x) y h (x), h; θ) = ψ z h |y h (z h (x) y h (x); θ).
To show the final part of (ii), take expectations of (17) and use Assumption 4 with the Dominated Convergence Theorem (e.g., Bartle 1966, Corollary 5.7, p. 45) to ensure the expectations exist and obtain: $\bar{B} = \dot{B} - \ddot{B}$.
Proof of Theorem 4. Let the vector-valued function λ : R r×r → R r be defined such that λ(Q) is an ordered list of the real eigenvalues of a real symmetric matrix Q. Two r-dimensional square matrices Q and R will satisfy λ(Q) = λ(R) if there exists a non-singular matrix T such that T −1 QT = R (Franklin 1968, Theorem 1, p. 76). Theorem 4(ii) will be proved first and then used to prove Theorem 4(i).
Since Assumptions 1, 2, and 4 hold then Theorem 3 implies A = A − A and A = B so that: where the relation A = B was used to obtain the right-hand side of (A10). A) are less than or equal to one. Finally note that the Hessian of l, A, is positive semidefinite on the non-empty open convex set Γ if and only if l is convex on Γ (see Proposition 5 of Luenberger 1984, p. 180). This establishes the first part of Theorem 4(ii).
To show the second part of Theorem 4(ii), note that the matrix A A are also non-negative and real. Since A is continuous on Θ and Assumptions 1, 2, and 4 hold, it follows that there exists a sufficiently small non-empty open convex set Γ containing $\theta^\dagger$ such that A takes on values as close to $A(\theta^\dagger)$ as desired. Thus, Γ can be chosen so that for all u ∈ U: if $u^T A(\theta^\dagger) u > 0$, then $u^T A(\theta) u > 0$ for all θ ∈ Γ. Theorem 4(i) then follows from Theorem 4(ii).
Proof of Proposition 3. (i) Since l is convex on the non-empty open convex set Γ, and θ* is a strict local minimizer of l on Γ, then θ* is the unique global minimizer of l on Γ (Bazarra et al. 2006, pp. 125-26).
(ii) If the observable-data model is correctly specified on Γ, then the observable-data true parameter value is in Γ. Since the missing DGP density is MAR, every observable-data true parameter value is a global minimizer of l on Γ. By Proposition 3(i), θ* is the unique global minimizer of l on Γ, which implies the global minimizer θ* is the unique observable-data true parameter value.
(iii) If the complete-data model is correctly specified on Γ, then the complete-data true parameter value is in Γ. If there exists a complete-data true parameter value $\theta_0$ so that $f(x; \theta_0) = p_x(x)$ (a.e.-$\nu_x$), then $\int f(x; \theta_0)\, d\nu_{z_h}(z_h) = \int p_x(x)\, d\nu_{z_h}(z_h)$ and thus $q_{y_h}(y_h(x); \theta_0) = p_{y_h}(y_h(x))$ for all x in the support of X and for all h ∈ H. Thus, correct specification of the complete-data model on Γ implies correct specification of the observable-data model on Γ. By the assumption that the missing DGP density is MAR, the correct specification of the observable-data model, Proposition 3(i), and Proposition 3(ii), it follows that $\theta_0$ is the unique global minimizer θ* of l on Γ.