MLE with datasets from populations having shared parameters

We consider maximum likelihood estimation with two or more datasets sampled from different populations that share some parameters. Although additional datasets with shared parameters can increase statistical accuracy, this paper shows that heterogeneity among the populations must be handled properly for estimation and inference to be correct. Asymptotic distributions of maximum likelihood estimators are derived both in regular cases, where the usual regularity conditions are satisfied, and in some non-regular situations. A bootstrap variance estimator for assessing the performance of estimators and/or making large-sample inference is also introduced and evaluated in a simulation study.


Introduction
With advanced technologies in data collection and storage, in modern statistical analyses we often have multiple datasets as independent samples from different populations having shared parameters. Typically, one of these datasets is primary, with carefully collected data from a population of interest. The other datasets are from external sources, such as data from other studies, administrative records, and publicly available information from the internet.
On one hand, the fact that populations share common parameters provides a great opportunity for increasing statistical accuracy by utilizing multiple datasets instead of a single dataset. On the other hand, because of the difference in data collection, study purpose and/or time of investigation, heterogeneity often exists among populations so that we cannot simply combine all datasets into a single large dataset to run analysis, but must develop or modify statistical methodology to correctly utilize multiple datasets. The research on analysis with multiple datasets fits into a general framework of data integration (Kim et al., 2021;Lohr & Raghunathan, 2017;Merkouris, 2004;Rao, 2021;Yang & Kim, 2020;Zhang et al., 2017;Zieschang, 1990).
In this article, we study maximum likelihood estimation (MLE) for independent datasets with parametric populations sharing some (not necessarily all) parameters. For simplicity of presentation, we focus on the case of two independent datasets. The main idea and result can be extended to multiple datasets. Our research can also be extended to semi-parametric estimation, such as empirical likelihood or Cox regression for survival data.
Throughout, we consider two independent random samples. One random sample of size n, resulting in a dataset {X_1, ..., X_n}, is sampled from a parametric population with probability density f(x, θ, φ) (for either continuous or discrete x), where f is a known function and θ and φ are unknown parameter vectors. Another random sample of size m, resulting in a dataset {Y_1, ..., Y_m}, is sampled from a population with probability density g(y, θ, ϕ), where g is a known function and θ and ϕ are unknown parameter vectors. Note that X_i and Y_j can be vectors. The shared parameter θ can be either the main parameter vector of interest or a nuisance parameter vector, and φ and ϕ are the other parameter vectors in the two populations.
Let ϑ denote the vector with θ, φ, and ϕ as sub-vectors. In Section 2, we derive the maximum likelihood estimator (MLE) of ϑ based on two datasets, which is expected to be asymptotically more efficient than each MLE based on a single dataset, since more data are used for estimating the shared parameter θ, a component of ϑ. The asymptotic normality of the MLE of ϑ is established when the densities f and g satisfy regularity conditions that are typically assumed for MLEs. Applications to location-scale problems are discussed in Section 3, where we also present a situation in which f or g does not satisfy the regularity conditions. Section 4 contains an example in which regularity conditions do not hold and the MLE is not asymptotically normal. The common mean of a discrete data problem is considered in Section 5. Section 6 is devoted to the scenario where an additional uncertainty exists in the second population density g. To handle the situation where asymptotic normality of the MLE of ϑ is not available, we introduce a bootstrap variance estimator in Section 7 and provide some simulation results to examine finite-sample performance.

MLEs with two datasets
The following are regularity conditions for probability density p(x, ϑ) (with a fixed ϑ) of a continuous or discrete random variable/vector X, typically assumed for MLEs in parametric populations (Shao, 2003).
(R1) For every x in the range of X, p(x, ϑ) is twice continuously differentiable with respect to ϑ in an open set of the Euclidean space with a fixed dimension.
(R2) The set {x : p(x, ϑ) > 0} is the same for all ϑ, and the order of differentiation with respect to ϑ and integration with respect to x can be interchanged, so that
∫ (∂/∂ϑ) p(x, ϑ) dx = 0 and ∫ (∂²/∂ϑ∂ϑ') p(x, ϑ) dx = 0,
where C' denotes the transpose of a vector or matrix C and the integral should be replaced by an appropriate summation when X is discrete.
(R3) The Fisher information matrix −E{(∂²/∂ϑ∂ϑ') log p(X, ϑ)} exists and is positive definite.
(R4) For any given ϑ, there exists a positive number c_ϑ and a positive function h_ϑ such that E{h_ϑ(X)} < ∞ and
sup_{γ: ‖γ − ϑ‖ < c_ϑ} ‖(∂²/∂γ∂γ') log p(x, γ)‖ ≤ h_ϑ(x) for all x in the range of X,
where ‖·‖ is any norm.
In this section, we assume that both f and g satisfy regularity conditions (R1)-(R4). When some regularity conditions are not satisfied, the problem has to be handled case by case. See, for example, the problem of normal and Laplace distributions in Section 3.2 and the problem of two truncation distributions in Section 4.
The log likelihood function of ϑ based on the two samples is
ℓ(ϑ) = Σ_{i=1}^n log f(X_i, θ, φ) + Σ_{j=1}^m log g(Y_j, θ, ϕ),
and the score function is s(ϑ) = ∂ℓ(ϑ)/∂ϑ. If ϑ̂ is a solution to the score equation s(ϑ) = 0, then we call ϑ̂ an MLE of ϑ, although traditionally an MLE is defined as a maximizer of ℓ(ϑ) over the range of ϑ, and a ϑ̂ satisfying s(ϑ̂) = 0 may not be a maximizer. A solution to the score equation often does not have an explicit form, even when the MLE from each single population has one.
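To make the score-equation formulation concrete, the following sketch solves a joint score equation numerically in a toy model that is not taken from the text: two samples sharing a common mean, with the two variances assumed known. All model choices here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration (not the paper's example): X_i ~ N(theta, 1) and
# Y_j ~ N(theta, 4) share the mean theta; the variances are treated as known.
# The joint score is s(theta) = sum(x - theta)/1 + sum(y - theta)/4.
theta_true = 2.0
x = rng.normal(theta_true, 1.0, size=200)
y = rng.normal(theta_true, 2.0, size=300)

def score(theta):
    return np.sum(x - theta) / 1.0 + np.sum(y - theta) / 4.0

# Newton's method on the score equation; since the score is linear in theta
# here, one step from any starting value reaches the exact root.
theta_hat = 0.0
for _ in range(5):
    ds = -(len(x) / 1.0 + len(y) / 4.0)  # derivative of the score
    theta_hat -= score(theta_hat) / ds

# In this toy case the MLE also has a closed form: the precision-weighted
# average of the two sample sums.
closed = (x.sum() / 1.0 + y.sum() / 4.0) / (len(x) / 1.0 + len(y) / 4.0)
print(theta_hat, closed)
```

In more realistic shared-parameter models the score is nonlinear and the Newton loop (or a quasi-Newton routine) is the part that remains, while the closed form disappears.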
Under regularity conditions (R1)-(R4), E{s(ϑ)} = 0 and
I(ϑ) = −E{∂²ℓ(ϑ)/∂ϑ∂ϑ'}
is the Fisher information matrix contained in the two samples. Let I_f(ϑ) and I_g(ϑ) denote the Fisher information matrices based on a single observation from f and from g, respectively, with zeros filled in for the blocks corresponding to the parameters not appearing in the density in question. Then
I(ϑ) = n I_f(ϑ) + m I_g(ϑ) = n{I_f(ϑ) + a I_g(ϑ)}
is positive definite, where a = m/n and, without loss of generality, we assume that m = an for a fixed a > 0. It can be seen that I(ϑ) is increasing in a, in the sense of the ordering A ≥ B for two non-negative definite matrices A and B if and only if A − B is non-negative definite.
Using the standard argument in asymptotic theory, e.g., Theorem 4.17 in Shao (2003), we obtain the following result.
Theorem 2.1: Assume (R1)-(R4) and that m = an with a remaining fixed as n increases. Then, with probability tending to 1 as n → ∞, there exists an MLE ϑ̂ (depending on n) such that P{s(ϑ̂) = 0} → 1 and
√n (ϑ̂ − ϑ) →_d N(0, Ī(ϑ)^{−1}),   (1)
where Ī(ϑ) = n^{−1} I(ϑ) (which does not depend on n because m = an), →_d denotes convergence in distribution, and N(C, D) is the normal distribution with mean vector C and covariance matrix D.
The asymptotic result (1) enables us to assess the performance of ϑ̂ and to carry out large-sample statistical inference on ϑ or any of its components θ, φ, and ϕ. When some of the regularity conditions (R1)-(R4) are not satisfied, however, we may apply the bootstrap method (see Sections 3.2 and 7 for the normal and Laplace problem) or directly derive the asymptotic distribution of ϑ̂ (see Section 4 for the problem of two truncation distributions).

Application to location-scale problems
An application of our general result in Section 2 is to the case where
f(x, θ, φ) = φ^{−1} f₀((x − θ)/φ) and g(y, θ, ϕ) = ϕ^{−1} g₀((y − θ)/ϕ)
for two continuous probability density functions f₀ and g₀ on the real line, i.e., both populations are in location-scale families. We have several scenarios, depending on which of the location and scale parameters are shared.
In location-scale problems it is often true that I_θφ(θ, φ) = 0 and I_θϕ(θ, ϕ) = 0 and, hence, the inverse of I(ϑ) can be easily obtained. For example, if both f and g are continuously differentiable and symmetric about 0, then it follows from Example 3.9 in Shao (2003) that both I_θφ(θ, φ) and I_θϕ(θ, ϕ) vanish.
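The vanishing of the location-scale cross term can be checked by Monte Carlo; the sketch below does this for a normal density, an illustrative choice of symmetric density not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check (an illustration, not the paper's derivation) that the
# location-scale cross term of the Fisher information vanishes when the
# density is symmetric.  Here the density is N(mu, sigma^2).
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=400_000)

score_mu = (x - mu) / sigma**2                      # d/dmu log f(x)
score_sigma = ((x - mu)**2 - sigma**2) / sigma**3   # d/dsigma log f(x)

# The (mu, sigma) entry of the information is E[score_mu * score_sigma];
# by symmetry it is exactly 0, and the estimate below should be near 0.
cross = np.mean(score_mu * score_sigma)
print(cross)
```

The exact value is a third central moment divided by σ⁵, which is zero for any symmetric density, so the Monte Carlo estimate hovers around zero up to sampling noise.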
In the following we consider a special case in detail.

Normal and Laplace densities with a single scale parameter
Suppose that f(x, θ) = (√(2π) θ)^{−1} e^{−x²/(2θ²)}, x ∈ (−∞, ∞), which is the normal distribution N(0, θ²), and that g(y, θ) = (2θ)^{−1} e^{−|y|/θ}, y ∈ (−∞, ∞), which is the Laplace distribution (also called the double exponential distribution) with mean zero and standard deviation √2 θ. The two densities share the common scale parameter θ > 0. The MLEs of θ based on data from f and g alone are, respectively,
θ̂_N = {n^{−1} Σ_{i=1}^n X_i²}^{1/2} and θ̂_E = m^{−1} Σ_{j=1}^m |Y_j|.
In this particular case, we can obtain an explicit form of the MLE θ̂ of θ based on all data from the two samples. The log likelihood based on the two samples is
ℓ(θ) = −(n + m) log θ − (2θ²)^{−1} Σ_{i=1}^n X_i² − θ^{−1} Σ_{j=1}^m |Y_j| + constant,
and the score function is
s(θ) = −(n + m)/θ + θ^{−3} Σ_{i=1}^n X_i² + θ^{−2} Σ_{j=1}^m |Y_j|.
Setting s(θ) = 0 and using the form of the MLE from each sample, we obtain the quadratic equation
(1 + a)θ² − a θ̂_E θ − θ̂_N² = 0.
Since θ > 0 and only one root is positive, the MLE of θ is
θ̂ = {a θ̂_E + (a² θ̂_E² + 4(1 + a) θ̂_N²)^{1/2}}/{2(1 + a)}.   (2)
Note that θ̂ is a nonlinear function of θ̂_N and θ̂_E. In general, the MLE of the shared parameter based on two datasets is not a simple function of the separate MLEs based on each single dataset.
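The explicit MLE (2) can be checked numerically: the following sketch (the true θ and the sample sizes are arbitrary choices) computes θ̂_N, θ̂_E, and the combined MLE, and verifies that the combined MLE solves the joint score equation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Normal and Laplace samples sharing a scale theta; theta_true, n, m are
# arbitrary illustrative choices.
theta_true, n, m = 1.5, 400, 600
x = rng.normal(0.0, theta_true, size=n)
y = rng.laplace(0.0, theta_true, size=m)   # numpy's scale b gives sd sqrt(2)*b

a = m / n
theta_N = np.sqrt(np.mean(x**2))   # MLE from the normal sample alone
theta_E = np.mean(np.abs(y))       # MLE from the Laplace sample alone

# Positive root of (1 + a) t^2 - a*theta_E*t - theta_N^2 = 0, i.e. formula (2).
theta_hat = (a * theta_E
             + np.sqrt(a**2 * theta_E**2 + 4 * (1 + a) * theta_N**2)) / (2 * (1 + a))

# The combined MLE should annihilate the joint score function.
def score(t):
    return -(n + m) / t + np.sum(x**2) / t**3 + np.sum(np.abs(y)) / t**2

print(theta_hat, score(theta_hat))
```

The printed score value is zero up to floating-point error, and θ̂ lands between θ̂_N and θ̂_E only by coincidence; in general it is a nonlinear combination of the two.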
To derive the asymptotic distribution of θ̂, we can apply the general result (1), because regularity conditions (R1)-(R4) are satisfied by f and g. Since θ̂ has an explicit form, we can also derive it directly. Because the X_i's and Y_j's are independent and m = an,
√n (θ̂_N − θ) →_d N(0, θ²/2) and √n (θ̂_E − θ) →_d N(0, θ²/a),
and the two are asymptotically independent. Then, writing θ̂ = h(θ̂_N, θ̂_E) with
h(t, u) = {au + (a²u² + 4(1 + a)t²)^{1/2}}/{2(1 + a)},
the delta method, e.g., Theorem 1.12 in Shao (2003), gives the asymptotic variance of √n (θ̂ − θ) as (∂h/∂t)² θ²/2 + (∂h/∂u)² θ²/a evaluated at t = u = θ, where
∂h/∂t = 2/(a + 2) and ∂h/∂u = a/(a + 2) at t = u = θ.
This leads to the following result.
Corollary 3.1: Assume that m = an with a remaining fixed as n increases. Then, as n → ∞,
√n (θ̂ − θ) →_d N(0, θ²/(a + 2)).
The asymptotic relative efficiency of θ̂_N with respect to θ̂ is 2/(a + 2), which is decreasing in a and bounded between 0 and 1. The asymptotic relative efficiency of θ̂_E with respect to θ̂ is a/(a + 2), which is increasing in a and bounded between 0 and 1.
Now suppose both densities contain an additional common location parameter μ, i.e., f is the density of N(μ, θ²) and g is the Laplace density with mean μ and standard deviation √2 θ. For the parameter vector ϑ = (μ, θ)', the log likelihood is
ℓ(ϑ) = −(n + m) log θ − (2θ²)^{−1} Σ_{i=1}^n (X_i − μ)² − θ^{−1} Σ_{j=1}^m |Y_j − μ| + constant.
Although ℓ(ϑ) is not everywhere differentiable in μ, it is concave in μ and, hence, the MLE μ̂ of μ exists, although it does not have an explicit form. The MLE of θ is given by (2) with θ̂_N and θ̂_E replaced by, respectively,
{n^{−1} Σ_{i=1}^n (X_i − μ̂)²}^{1/2} and m^{−1} Σ_{j=1}^m |Y_j − μ̂|.
The asymptotic distribution of ϑ̂ = (μ̂, θ̂)' cannot be obtained from (1), since g does not satisfy conditions (R1)-(R4). For assessing the performance of ϑ̂ and/or making inference, we recommend a bootstrap method, which is discussed in Section 7 and studied by simulation.

Application to two truncation distributions
Let f(x, θ) and g(y, θ) be positive density functions on the interval (0, θ) and zero outside (0, θ), where θ > 0 is an unknown scale parameter common to both populations, and f and g are known when θ is known. The likelihood is
ℓ(θ) = {∏_{i=1}^n f(X_i, θ)}{∏_{j=1}^m g(Y_j, θ)} I_{{X_(n) < θ}} I_{{Y_(m) < θ}},
where I_A is the indicator of the event A, X_(n) = max(X_1, ..., X_n) and Y_(m) = max(Y_1, ..., Y_m). This likelihood is not everywhere differentiable in θ, but it can be seen that the MLE of θ is θ̂ = max(X_(n), Y_(m)), a maximizer of the likelihood. This is an example in which regularity conditions (R1)-(R4) in Section 2 are not satisfied, so that result (1) does not hold. The MLE θ̂ is not even asymptotically normal. In the following we directly derive the asymptotic distribution of θ̂.
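A minimal numerical sketch of this MLE, taking f = g = the uniform density on (0, θ) as an assumed special case:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two uniform(0, theta) samples sharing the endpoint theta; theta_true and the
# sample sizes are arbitrary illustrative choices.
theta_true, n, m = 2.0, 500, 500
x = rng.uniform(0.0, theta_true, size=n)
y = rng.uniform(0.0, theta_true, size=m)

# The MLE based on both datasets is the larger of the two sample maxima.
theta_hat = max(x.max(), y.max())
print(theta_hat)

# The MLE always undershoots theta, and n*(theta - theta_hat) is approximately
# exponential, so the error is O(1/n) rather than the usual O(1/sqrt(n)).
print(n * (theta_true - theta_hat))
```

Note that θ̂ ≤ θ by construction, which is why the asymptotic distribution in this section is one-sided rather than normal.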
It follows from the result in Example 2.34 of Shao (2003), the independence of the X_i's and Y_j's, and m = an that
n(θ − X_(n)) →_d ε₁/f(θ, θ) and n(θ − Y_(m)) = a^{−1} m(θ − Y_(m)) →_d ε₂/{a g(θ, θ)},
where ε₁ and ε₂ are independent random variables with the same exponential distribution having density e^{−x}, x > 0. Because θ̂ = max(X_(n), Y_(m)),
n(θ − θ̂) = min{n(θ − X_(n)), n(θ − Y_(m))} →_d min[ε₁/f(θ, θ), ε₂/{a g(θ, θ)}].
From the independence of ε₁ and ε₂, for any t > 0,
P(min[ε₁/f(θ, θ), ε₂/{a g(θ, θ)}] > t) = e^{−t f(θ,θ)} e^{−t a g(θ,θ)} = e^{−t{f(θ,θ) + a g(θ,θ)}}.
This leads to the following result.
Corollary 4.1: Under the conditions of this section and m = an, as n → ∞, n(θ − θ̂) converges in distribution to the exponential distribution with density λ e^{−λx}, x > 0, where λ = f(θ, θ) + a g(θ, θ).
Inference on θ can be made using this asymptotic result. The asymptotic relative efficiency of the MLE X_(n) based on the first dataset with respect to the MLE θ̂ based on two datasets is {1 + a g(θ, θ)/f(θ, θ)}^{−2}, which is decreasing in a and bounded between 0 and 1. The asymptotic relative efficiency of the MLE Y_(m) based on the second dataset with respect to the MLE θ̂ based on two datasets is {1 + a^{−1} f(θ, θ)/g(θ, θ)}^{−2}, which is increasing in a and bounded between 0 and 1.

Application to Poisson and binomial samples
Here we consider a discrete data problem, where X_i has the Poisson distribution with mean θ, Y_j is binary with P(Y_j = 1) = θ, and θ ∈ (0, 1) is the shared parameter. Let X̄ be the sample mean of the X_i's and Ȳ be the sample mean of the Y_j's. The score function based on the two samples is
s(θ) = n(X̄ − θ)/θ + m{Ȳ/θ − (1 − Ȳ)/(1 − θ)}.
Setting s(θ) = 0, we obtain the score equation
θ² − (1 + a + X̄)θ + (X̄ + aȲ) = 0.
Since the score equation is a quadratic equation, it has two solutions if and only if
(1 + a + X̄)² − 4(X̄ + aȲ) > 0.
The solution with the + sign in front of the square root is always larger than 1, outside the range (0, 1) for θ in this problem. Hence, we conclude that the MLE of θ is
θ̂ = min[{(1 + a + X̄) − ((1 + a + X̄)² − 4(X̄ + aȲ))^{1/2}}/2, 1].
The minimum with 1 is taken because 0 < θ < 1. Again, the MLE θ̂ is a nonlinear function of the separate MLEs, X̄ and Ȳ.
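The closed-form MLE above can be sketched as follows; the true θ and the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Poisson sample with mean theta and binary sample with success probability
# theta, sharing theta in (0, 1); theta_true, n, m are illustrative choices.
theta_true, n, m = 0.4, 500, 1000
x = rng.poisson(theta_true, size=n)
y = rng.binomial(1, theta_true, size=m)

a = m / n
xbar, ybar = x.mean(), y.mean()

# Quadratic score equation t^2 - b*t + c = 0; the "+" root always exceeds 1,
# so the MLE is the "-" root, capped at 1 because theta lies in (0, 1).
b = 1 + a + xbar
c = xbar + a * ybar
root = (b - np.sqrt(b**2 - 4 * c)) / 2
theta_hat = min(root, 1.0)

# theta_hat should solve the joint score equation on (0, 1).
def score(t):
    return n * (xbar - t) / t + m * (ybar / t - (1 - ybar) / (1 - t))

print(theta_hat, score(theta_hat))
```

The printed score value is zero up to floating-point error whenever the cap at 1 is not active.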
The asymptotic distribution of θ̂ can be derived using the delta method but, because regularity conditions (R1)-(R4) are satisfied, it also follows as a corollary of Theorem 2.1 in Section 2.
Corollary 5.1: Under the Poisson and binary assumptions for the two datasets and m = an, as n → ∞,
√n (θ̂ − θ) →_d N(0, θ(1 − θ)/(1 − θ + a)).
The asymptotic relative efficiency of the MLE X̄ based on the first dataset with respect to the MLE θ̂ based on two datasets is (1 − θ)/(1 − θ + a), which is decreasing in a and bounded between 0 and 1. The asymptotic relative efficiency of the MLE Ȳ based on the second dataset with respect to the MLE θ̂ based on two datasets is a/(1 − θ + a), which is increasing in a and bounded between 0 and 1.

MLEs with two samples and an additional uncertainty
In this section, we consider a scenario in which the first sample is obtained under a controlled study so that we know the form of probability density f (x, θ, φ), but the form of g(y, θ , ϕ) for the second sample has an additional uncertainty, because the second sample may be obtained through a past study and/or public records. We assume that the additional uncertainty comes from an unknown parameter ζ taking two possible values, 0 and 1, i.e., the probability density of the second sample is g (y, θ , ϕ, ζ ), where ζ = 0 or 1 and g is still a known density when θ , ϕ, and ζ are known.
How do we derive the MLE of ϑ = (θ', φ', ϕ')'? If ζ is known, then the MLE can be obtained using the method in Section 2. Since ζ takes only two values, if ζ̂ is a consistent estimator of ζ, i.e.,
P(ζ̂ = ζ) → 1,   (3)
then we obtain the MLE of ϑ as
ϑ̂ = ϑ̂(ζ̂),
where ϑ̂(0) and ϑ̂(1) are the MLEs under ζ = 0 and ζ = 1, respectively. A suggested consistent estimator of ζ is the MLE of ζ based on the second sample, the Y_j's. Let θ̂(ζ) and ϕ̂(ζ) be the MLEs of θ and ϕ, respectively, based on the Y_j's, when the value of ζ is fixed. Then the MLE of ζ is
ζ̂ = argmax_{ζ ∈ {0,1}} Σ_{j=1}^m log g(Y_j, θ̂(ζ), ϕ̂(ζ), ζ).
The following result gives the asymptotic distribution of the MLE ϑ̂.
Theorem 6.1: Assume condition (3), that the conditions of Theorem 2.1 hold for each fixed value of ζ, and that m = an. Then, as n → ∞,
√n (ϑ̂ − ϑ) →_d N(0, Ī(ϑ, ζ)^{−1}),
where Ī(ϑ, ζ) = n^{−1} I(ϑ, ζ) and I(ϑ, ζ) is the Fisher information as defined in Section 2 under the true value of ζ.
Condition (3) has to be checked for each particular problem. The following is an example. Suppose that f(x, θ) is the density of N(0, θ²), g(y, θ, 0) is the same normal density of N(0, θ²), but g(y, θ, 1) is the Laplace density with mean zero and standard deviation √2 θ given in Section 3.1. In other words, sample one is from the main study, whereas sample two is from an external source in which the data may follow the same distribution as sample one or may deviate from it. The parameters φ and ϕ are absent in this example.
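The selection of ζ̂ from the second sample can be sketched for this normal-versus-Laplace example; the true θ, the sample size, and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

# Estimate zeta from the second sample alone: fit the MLE of theta under each
# candidate density for Y, then pick the zeta with the larger maximized log
# likelihood.  Here the data are generated from the Laplace model (zeta = 1).
theta_true = 1.0
y = rng.laplace(0.0, theta_true, size=2000)

def loglik(zeta, data):
    if zeta == 0:                          # N(0, theta^2)
        t = np.sqrt(np.mean(data**2))      # MLE of theta when zeta = 0
        return (-len(data) * np.log(np.sqrt(2 * np.pi) * t)
                - np.sum(data**2) / (2 * t**2))
    t = np.mean(np.abs(data))              # MLE of theta when zeta = 1 (Laplace)
    return -len(data) * np.log(2 * t) - np.sum(np.abs(data)) / t

zeta_hat = max((0, 1), key=lambda z: loglik(z, y))
print(zeta_hat)
```

Because the expected per-observation log-likelihood gap between the true and the misspecified model is strictly positive, ζ̂ picks the correct value with probability tending to 1, which is exactly condition (3).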
In this example, when ζ = 0, we can simply combine the two samples, and the MLE of θ is {(Σ_{i=1}^n X_i² + Σ_{j=1}^m Y_j²)/(n + m)}^{1/2}; on the other hand, when ζ = 1, the MLE of θ is given by (2). Condition (3) can be checked for this example.
The bootstrap simulation study of Section 7, for the normal and Laplace problem with unknown location, gives the following results.
Table 1. Results from 1000 simulations for the normal-Laplace problem with location μ = 1 and scale θ = 1 (n = m = 100; SD = standard deviation; (μ̂, θ̂) = the MLE of (μ, θ) based on the X_i's and Y_j's; (X̄, θ̂_X) = the MLE of (μ, θ) based on the X_i's; (Ȳ, θ̂_Y) = the MLE of (μ, θ) based on the Y_j's; the bootstrap SD estimate uses B = 500).
(2) The MLE θ̂ of θ does not have a negligible bias, although its performance is acceptable with sample size n + m = 200, and its SD is slightly smaller than the SD of θ̂_X. The large bias of θ̂ mainly comes from the large bias of the MLE θ̂_Y for the Laplace dataset, which has both large bias and large SD. (3) The bootstrap SD estimator performs very well for all estimators (see the rows "SD by simulation" and "mean of SD estimates by simulation" in Table 1), even when the point estimator has non-negligible bias.
The histogram of the 1000 values of μ̂ from the simulation is shown in Figure 1, together with a Q-Q plot. The result suggests that μ̂ is asymptotically normal, although such a theoretical result has not been established.
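The bootstrap SD estimator behind Table 1 can be sketched as follows; for brevity this uses the shared-scale normal and Laplace problem of Section 3.1 (an assumed simplification without the location parameter, with B = 500 as in the table).

```python
import numpy as np

rng = np.random.default_rng(6)

# Bootstrap SD for the two-sample MLE: resample each dataset separately,
# recompute the MLE on each pair of resamples, and take the standard
# deviation of the B replicates.  The estimator is the explicit
# normal-Laplace MLE of the shared scale theta from (2).
n, m, B = 100, 100, 500
x = rng.normal(0.0, 1.0, size=n)
y = rng.laplace(0.0, 1.0, size=m)

def mle(xs, ys):
    a = len(ys) / len(xs)
    tN = np.sqrt(np.mean(xs**2))
    tE = np.mean(np.abs(ys))
    return (a * tE + np.sqrt(a**2 * tE**2 + 4 * (1 + a) * tN**2)) / (2 * (1 + a))

reps = np.empty(B)
for b in range(B):
    xb = rng.choice(x, size=n, replace=True)   # resample within each dataset
    yb = rng.choice(y, size=m, replace=True)
    reps[b] = mle(xb, yb)

boot_sd = reps.std(ddof=1)
print(mle(x, y), boot_sd)
```

Resampling the two datasets separately preserves the two-population structure; pooling them before resampling would mix the populations and bias the variance estimate.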