Approximations to distribution of median in stratified samples

We consider an Edgeworth type approximation to the distribution function of the sample median in the case of stratified samples drawn without replacement. We give an explicit expression for this approximation, and also its empirical version based on the bootstrap. We compare their accuracy with that of the normal approximation by numerical examples.


Introduction and results
Consider a population $\mathcal{X} = \{x_1, \dots, x_N\}$ of size $N$. We assume without loss of generality that $x_1 \le \dots \le x_N$. Let $\mathcal{X}$ be divided into $h \ge 1$ nonoverlapping strata $\mathcal{X} = \mathcal{X}_1 \cup \dots \cup \mathcal{X}_h$, where $\mathcal{X}_k = \{x_{k,1}, \dots, x_{k,N_k}\}$. Clearly, $N = N_1 + \dots + N_h$. For convenience, we also assume that $x_{k,1} \le \dots \le x_{k,N_k}$. Let $\mathbb{X}_k = \{X_{k,1}, \dots, X_{k,n_k}\}$ be a simple random sample of size $n_k < N_k$ drawn without replacement from the stratum $\mathcal{X}_k$. We assume that the samples $\mathbb{X}_1, \dots, \mathbb{X}_h$ are independent. Write $\mathbb{X} = \mathbb{X}_1 \cup \dots \cup \mathbb{X}_h$ and denote $n = n_1 + \dots + n_h$. Denote the distribution function of the stratum $k$ and its empirical analogue by
$$F_{N,k}(x) = \frac{1}{N_k} \sum_{i=1}^{N_k} I\{x_{k,i} \le x\} \quad \text{and} \quad F_{n,k}(x) = \frac{1}{n_k} \sum_{i=1}^{n_k} I\{X_{k,i} \le x\},$$
respectively. Here $I\{\cdot\}$ is the indicator function. Then the distribution function of the population $\mathcal{X}$ and its estimator are
$$F_N(x) = \sum_{k=1}^{h} \frac{N_k}{N} F_{N,k}(x) \quad \text{and} \quad F_n(x) = \sum_{k=1}^{h} \frac{N_k}{N} F_{n,k}(x),$$
respectively. Consider the population median defined as $F_N^{-1}(0.5) = \inf\{x : F_N(x) \ge 0.5\}$. Define its estimator $X_{\mathrm{med}} = F_n^{-1}(0.5) = \inf\{x : F_n(x) \ge 0.5\}$. Denote $\sigma^2 = \mathrm{Var}\, X_{\mathrm{med}}$. In the present paper we are interested in approximations to the distribution function $F_{\mathrm{med}}(x) = P\{X_{\mathrm{med}} - E X_{\mathrm{med}} \le x\sigma\}$.

The asymptotic normality of the median $X_{\mathrm{med}}$ under stratified simple random sampling (STSRS) without replacement was considered in [4,5]. Here we present an Edgeworth expansion for $F_{\mathrm{med}}(\cdot)$ and its empirical analogue. Our approach is based on Hoeffding's (orthogonal) decomposition $X_{\mathrm{med}} = E X_{\mathrm{med}} + L + Q + R$ constructed in [1] for general symmetric statistics based on STSRS samples drawn without replacement. Here $L$ and $Q$ are called the linear and quadratic parts of the decomposition, and $R$ is a remainder term. In the case of $U$-statistics, where $R \equiv 0$, Edgeworth expansions were constructed and their second-order correctness was shown in [2]. Thus we expect that, if $R$ is negligible, those Edgeworth expansions will also approximate $F_{\mathrm{med}}(\cdot)$ well. In particular, we propose to approximate $F_{\mathrm{med}}(\cdot)$ by the Edgeworth type approximation $G(\cdot)$ obtained in [2], referred to below as (1).
Here $\Phi'(x)$ denotes the derivative of the standard normal distribution function $\Phi(x)$, and the parameters $\alpha$ and $\kappa$ of (1) are defined with $\tau_k^2 = n_k(1 - n_k/N_k)$. The moments entering these parameters, established in [2], are based on the functions (3)-(5), where for $1 \le i \le N-1$ we write $\Delta_i = x_{i+1} - x_i$, and which involve the probabilities given in (6) and in Proposition 1 below. Note that expressions (3)-(5) are obtained directly from (11) in [1], using the definitions of expectation and conditional expectation, applying the summation by parts formula (in the case of the expectation), and noting that, by definition, $p_N = 0$ and $p_0 = 1$, and so forth.
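The estimator $X_{\mathrm{med}}$ introduced above can be illustrated with a minimal Python sketch. It assumes the usual stratum weights $N_k/N$ in the estimated distribution function $F_n$; the function names and the toy data are hypothetical and serve only as an illustration. Exact rational arithmetic is used so that the comparison with 0.5 is not affected by rounding.

```python
from fractions import Fraction


def weighted_ecdf(samples, stratum_sizes, x):
    """F_n(x): stratum empirical distribution functions weighted by N_k / N (assumed weights)."""
    N = sum(stratum_sizes)
    return sum(
        Fraction(N_k, N) * Fraction(sum(1 for v in sample if v <= x), len(sample))
        for sample, N_k in zip(samples, stratum_sizes)
    )


def stratified_median(samples, stratum_sizes):
    """X_med = inf{x : F_n(x) >= 0.5}; F_n jumps only at observed values."""
    for x in sorted(v for sample in samples for v in sample):
        if weighted_ecdf(samples, stratum_sizes, x) >= Fraction(1, 2):
            return x
    raise ValueError("samples must be non-empty")


# Hypothetical toy data: two strata of sizes N_1 = 4 and N_2 = 2,
# with samples of sizes n_1 = 2 and n_2 = 1.
samples = [[1, 3], [10]]
stratum_sizes = [4, 2]
print(stratified_median(samples, stratum_sizes))
```

In this toy case $F_n(1) = 1/3$ and $F_n(3) = 2/3$, so the sketch returns 3.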
Denote by $\binom{i}{j}\binom{N-i}{n-j} \big/ \binom{N}{n}$ the probability that a hypergeometric random variable with parameters $N$, $n$ and $i$ attains the value $j$.
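The hypergeometric probabilities mentioned above are easy to evaluate exactly. The following is a minimal sketch, assuming the standard parametrization in which $N$ is the population size, $n$ the sample size and $i$ the number of marked elements; the helper names `binom` and `hypergeom_pmf` are hypothetical. Out-of-range binomial coefficients are treated as zero.

```python
from fractions import Fraction
from math import comb


def binom(b, a):
    """Binomial coefficient 'b choose a', taken to be 0 if a < 0 or a > b."""
    if a < 0 or a > b:
        return 0
    return comb(b, a)


def hypergeom_pmf(N, n, i, j):
    """P{H = j} for H hypergeometric: n draws without replacement from N elements, i of them marked.

    The roles of N, n and i are an assumption about the parametrization.
    """
    return Fraction(binom(i, j) * binom(N - i, n - j), binom(N, n))


# Example: N = 5, n = 2, i = 3; the probabilities over all j sum to one.
probs = [hypergeom_pmf(5, 2, 3, j) for j in range(0, 3)]
print(probs, sum(probs))
```

Returning zero for impossible values of $j$ lets sums over $j$ run over a fixed index set without case distinctions.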
The variance $\sigma^2$ of $X_{\mathrm{med}}$ in (1) is then expressed through these probabilities. Next we give explicit expressions for the conditional probabilities.
Proposition 1. The conditional probabilities are expressed explicitly in the following three cases:
(i) for $1 \le k \le h$ and $1 \le s \le N_k$;
(ii) for $1 \le k \le h$ and $1 \le s < r \le N_k$;
(iii) for $1 \le k < u \le h$ and $1 \le s \le N_k$, $1 \le r \le N_u$.

Proof. Calculations of all conditional probabilities are based on the same arguments as the derivation of (6) in [4]. For each of the cases (i)-(iii) we only need to consider, under the fixed conditions, a few different positions of $x_i$. Note that the set $T$ is the same for all probabilities, since we use the convention that $\binom{b}{a} = 0$ if $a < 0$, as well as the convention that $\binom{b}{a} = 0$ if $a > b$. □

Empirical approximation. The parameters $\alpha = \alpha(\mathcal{X})$, $\kappa = \kappa(\mathcal{X})$ and $\sigma^2 = \sigma^2(\mathcal{X})$ defining approximation (1) are usually unknown characteristics of the population $\mathcal{X}$. Thus they should be estimated in practice. In [4], a convenient plug-in rule was proposed for the estimation of the parameter $\sigma^2$, where the strata distribution functions were replaced by their corresponding empirical versions. However, this rule is not convenient for the estimation of $\alpha$ and $\kappa$. Another way is to replace the population parameters by their jackknife estimators, see [2]. But it is well known that in the case of the sample median (or other empirical quantiles) jackknife estimators often fail.
Here we consider the finite population bootstrap of [3]. Let $\eta = \eta(\mathcal{X})$ be any characteristic of the population $\mathcal{X}$. For $1 \le k \le h$ write $N_k = m_k n_k + l_k$, where $0 \le l_k < n_k$. Given the sample $\mathbb{X}_k$ drawn from the stratum $\mathcal{X}_k$, construct an empirical stratum $\mathcal{X}_k^*$ by combining $m_k$ copies of $\mathbb{X}_k$ with a simple random sample without replacement $\mathbb{Y}_k = \{Y_{k,1}, \dots, Y_{k,l_k}\}$ of size $l_k$ from $\mathbb{X}_k$. Then $\mathcal{X}^* = \mathcal{X}_1^* \cup \dots \cup \mathcal{X}_h^*$ is an empirical (bootstrap) population, and the bootstrap estimator of $\eta$ is then defined as
$$\hat\eta = E\bigl(\eta(\mathcal{X}^*) \mid \mathbb{X}\bigr). \qquad (8)$$
Thus we have the bootstrap estimators $\hat\alpha$, $\hat\kappa$ and $\hat\sigma^2$ of $\alpha$, $\kappa$ and $\sigma^2$. However, it is difficult to obtain their explicit expressions. Therefore, we apply Monte Carlo (MC) approximations for the parameters we are interested in. In particular, let $\mathcal{X}^{*(1)}, \dots, \mathcal{X}^{*(B)}$ be $B$ empirical populations constructed independently as described above, i.e., we select, randomly and with replacement, $B$ empirical populations from all $\prod_{k=1}^{h} \binom{n_k}{l_k}$ possible ones. Then the MC approximation to (8) is
$$\frac{1}{B} \sum_{b=1}^{B} \eta\bigl(\mathcal{X}^{*(b)}\bigr).$$
Finally, replacing the true parameters $\alpha$, $\kappa$ and $\sigma^2$ in (1) by their estimates $\hat\alpha$, $\hat\kappa$ and $\hat\sigma^2$, we obtain the empirical approximation $\hat G(\cdot)$ to $F_{\mathrm{med}}(\cdot)$.
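The construction of the empirical population and its Monte Carlo use can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind the reported results: the function names are hypothetical, the characteristic `eta` is any user-supplied function of the list of empirical strata, and the MC approximation is taken to be the plain average over the $B$ empirical populations.

```python
import random


def empirical_stratum(sample, N_k, rng):
    """m_k copies of the sample plus l_k values drawn from it without replacement."""
    m_k, l_k = divmod(N_k, len(sample))  # N_k = m_k * n_k + l_k with 0 <= l_k < n_k
    return list(sample) * m_k + rng.sample(list(sample), l_k)


def empirical_population(samples, stratum_sizes, rng):
    """One bootstrap population: a list of empirical strata of sizes N_1, ..., N_h."""
    return [empirical_stratum(s, N_k, rng) for s, N_k in zip(samples, stratum_sizes)]


def mc_bootstrap_estimate(eta, samples, stratum_sizes, B, rng):
    """Average of eta over B independently constructed empirical populations (assumed MC form)."""
    total = 0.0
    for _ in range(B):
        total += eta(empirical_population(samples, stratum_sizes, rng))
    return total / B


# Hypothetical toy samples; eta here is the mean of the whole empirical population.
rng = random.Random(0)
samples = [[1.0, 2.0, 3.0], [5.0, 6.0]]
stratum_sizes = [7, 5]


def population_mean(strata):
    values = [v for stratum in strata for v in stratum]
    return sum(values) / len(values)


print(mc_bootstrap_estimate(population_mean, samples, stratum_sizes, B=200, rng=rng))
```

Passing the random generator explicitly keeps the construction reproducible, which is convenient when the same empirical populations are reused for several parameters.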
The population for the two examples below consists of Lithuanian service enterprises with economic activity classified as 'combined facilities support activities'. For our purposes we take three completely sampled strata of sizes $N_1 = 25$, $N_2 = 7$ and $N_3 = 13$, and for our simulations we choose sample sizes $n_1 = 10$, $n_2 = 3$ and $n_3 = 5$. Tables 1 and 2 present simulation results for the populations $\mathcal{X}^{(1)} = \mathcal{X}_1^{(1)} \cup \mathcal{X}_2^{(1)} \cup \mathcal{X}_3^{(1)}$ and $\mathcal{X}^{(2)} = \mathcal{X}_1^{(2)} \cup \mathcal{X}_2^{(2)} \cup \mathcal{X}_3^{(2)}$, where the elements of $\mathcal{X}^{(1)}$ and $\mathcal{X}^{(2)}$ are measurements of income and number of persons employed, respectively. We use the first-quarter data of year 2011. Table 1 shows that $G(\cdot)$ significantly improves on $\Phi(\cdot)$. However, this is not the case for its empirical version $\hat G(\cdot)$, since for a large part of the samples this approximation to $F_{\mathrm{med}}(\cdot)$ is less accurate than $\Phi(\cdot)$. Table 2 shows that $G(\cdot)$ evidently outperforms $\Phi(\cdot)$.
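A comparison of this kind can be mimicked with a small simulation. The sketch below is only an illustration: the enterprise data are not reproduced here, so a synthetic skewed (lognormal) population is generated instead, the stratum weights $N_k/N$ are assumed, and $E X_{\mathrm{med}}$ and $\sigma$ are replaced by their Monte Carlo counterparts. Only the stratum and sample sizes are taken from the text, and all function names are hypothetical. The sketch estimates $F_{\mathrm{med}}(x)$ by repeated stratified sampling and prints it next to $\Phi(x)$.

```python
import random
from fractions import Fraction
from statistics import NormalDist, mean, pstdev


def stratified_median(samples, stratum_sizes):
    """inf{x : F_n(x) >= 0.5}, with F_n weighted by N_k / N (assumed weights)."""
    N = sum(stratum_sizes)
    for x in sorted(v for s in samples for v in s):
        F = sum(Fraction(N_k, N) * Fraction(sum(v <= x for v in s), len(s))
                for s, N_k in zip(samples, stratum_sizes))
        if F >= Fraction(1, 2):
            return x
    raise ValueError("samples must be non-empty")


def simulate_medians(strata, sample_sizes, reps, rng):
    """Medians of independent stratified samples drawn without replacement."""
    sizes = [len(s) for s in strata]
    return [
        stratified_median([rng.sample(s, n_k) for s, n_k in zip(strata, sample_sizes)], sizes)
        for _ in range(reps)
    ]


def mc_F_med(medians, x):
    """MC estimate of P{X_med - E X_med <= x * sigma}, using MC mean and sd."""
    mu, sigma = mean(medians), pstdev(medians)
    return sum(m - mu <= x * sigma for m in medians) / len(medians)


# Synthetic stand-in population (NOT the enterprise data), sizes as in the text.
rng = random.Random(2011)
strata = [[rng.lognormvariate(0.0, 1.0) for _ in range(N_k)] for N_k in (25, 7, 13)]
medians = simulate_medians(strata, (10, 3, 5), reps=2000, rng=rng)
for x in (-1.0, 0.0, 1.0):
    print(x, mc_F_med(medians, x), NormalDist().cdf(x))
```

For small strata the median takes only a few distinct values, so the estimated distribution function is a coarse step function; this is one reason why a smooth approximation can differ noticeably from it at individual points.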
We note that the proposed approximations may be very efficient in real surveys where we need to measure the accuracy of the median in small domains of a population (for some collections of strata) and where populations are highly skewed. With minor modifications, our formulas are applicable to any quantile.