
Quantile inference based on clustered data


Abstract

The one-sample sign test is one of the common procedures for developing distribution-free inference for a quantile of a population. A basic requirement of this test is that the observations in a sample must be independent. This assumption is violated in certain settings, such as clustered data, grouped data and longitudinal studies. Failure to account for the dependence structure leads to erroneous statistical inferences. In this study, we develop statistical inference for a population quantile of order p in either balanced or unbalanced designs by incorporating the dependence structure when the distribution of within-cluster observations is exchangeable. We provide a point estimate, develop a testing procedure and construct confidence intervals for a population quantile of order p. Simulation studies demonstrate that the confidence intervals achieve their nominal coverage probabilities. We finally apply the proposed procedure to Academic Performance Index data.



Acknowledgments

The authors would like to thank two anonymous reviewers as well as the editor for their helpful comments on an earlier version of the manuscript.

Author information


Corresponding author

Correspondence to Asuman Turkmen.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 32 KB)

Appendix


Proof of Theorem 1

We would like to minimize the variance of \(\bar{T}(\eta _{p,0})=T(\eta _{p,0})/N\) under the constraint that \(\frac{1}{N} \sum _{i=1}^M n_iw_i=1\). This constrained minimization is equivalent to minimizing the Lagrangian

$$\begin{aligned} {\varLambda }(w_1,\ldots ,w_M,\lambda )=\frac{1}{N^2} \sum _{i=1}^M w_i^2 n_i(1-p) p\left[ (1-\delta )+n_i\delta \right] + \lambda \left( \frac{1}{N}\sum _{i=1}^M n_i w_i-1\right) . \end{aligned}$$

The optimal weights, \(w_{i,o}\), must satisfy \(\frac{\partial {\varLambda }(w_1,\ldots ,w_M,\lambda )}{\partial w_i}=0\) for \(i=1,\ldots ,M\), that is,

$$\begin{aligned} \frac{2}{N^2} w_i n_i p(1-p)\left[ (1-\delta )+n_i\delta \right] + \lambda \frac{n_i}{N}=0, \quad i=1,\ldots ,M. \end{aligned}$$

Solving for \(w_i\) leads to the following M equalities

$$\begin{aligned} w_i=\frac{-\lambda N}{2p(1-p) \left[ (1-\delta )+n_i\delta \right] }, \quad i=1,\ldots , M. \end{aligned}$$
(5)

Using the Lagrangian constraint \(\frac{1}{N} \sum _{i=1}^M n_iw_i=1\), the multiplier \(\lambda \) is obtained as

$$\begin{aligned} \lambda =\frac{-2}{\sum _{i=1}^M \frac{n_i}{p(1-p)[(1-\delta )+n_i\delta ]}}. \end{aligned}$$

Inserting the above expression in Eq. (5), the optimal weights, for \(i=1,\ldots ,M\), are obtained as

$$\begin{aligned} w_{i,o}=\frac{\frac{N}{[(1-\delta )+n_i\delta ]}}{\sum _{i=1}^M \frac{n_i}{[(1-\delta )+n_i\delta ]}}. \end{aligned}$$

The variance of \(\sqrt{N}\bar{T}(\eta _{p,0})\), with the optimal weights, simplifies to

$$\begin{aligned} Var \left( \sqrt{N}\bar{T} \left( \eta _{p,0}\right) \right) = \frac{p(1-p)}{\frac{1}{N}\sum _{i=1}^M \frac{n_i}{[(1-\delta )+n_i\delta ]} } \end{aligned}$$

which reduces to

$$\begin{aligned} Var \left( \sqrt{N}\bar{T}_p(\eta _p)\right) = p(1-p)[(1-\delta )+n\delta ] \end{aligned}$$

if \(n_i=n\) for \(i=1,\ldots ,M\).
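As a numerical check, the short sketch below (Python, with hypothetical cluster sizes) computes the optimal weights \(w_{i,o}\) and the variance of \(\sqrt{N}\bar{T}(\eta _{p,0})\) given above, and verifies both the weight constraint and the balanced-design simplification.

```python
import numpy as np

def optimal_weights(n, delta):
    """Optimal cluster weights w_{i,o} from Theorem 1:
    w_{i,o} = (N / [(1-delta)+n_i*delta]) / sum_j n_j / [(1-delta)+n_j*delta]."""
    n = np.asarray(n, dtype=float)
    N = n.sum()
    d = (1.0 - delta) + n * delta
    return (N / d) / np.sum(n / d)

def var_sqrtN_Tbar(n, delta, p):
    """Variance of sqrt(N)*T-bar(eta_{p,0}) under the optimal weights:
    p(1-p) / ((1/N) * sum_i n_i / [(1-delta)+n_i*delta])."""
    n = np.asarray(n, dtype=float)
    N = n.sum()
    d = (1.0 - delta) + n * delta
    return p * (1.0 - p) / (np.sum(n / d) / N)

# Hypothetical unbalanced design
n = np.array([3, 5, 2, 7, 4])
w = optimal_weights(n, delta=0.3)
print(np.sum(n * w) / n.sum())            # weight constraint: (1/N) sum_i n_i w_i = 1
print(var_sqrtN_Tbar(n, delta=0.3, p=0.5))

# Balanced design: variance reduces to p(1-p)[(1-delta) + n*delta]
print(var_sqrtN_Tbar([4] * 10, 0.3, 0.5), 0.5 * 0.5 * ((1 - 0.3) + 4 * 0.3))
```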

Proof of Theorem 2

Let

$$\begin{aligned} U_p(a/\sqrt{N})=\frac{\bar{S}_p(\eta _p+a/\sqrt{N}) - \bar{S}_p(\eta _p)}{a/\sqrt{N}}. \end{aligned}$$

First, we consider the expected value of \(U_p(a/\sqrt{N})\)

$$\begin{aligned} E \left( U_p(a/\sqrt{N})\right) =\sum _{i=1}^M \frac{w_in_i}{N} \frac{F(\eta _p)- F \left( \eta _p+a/\sqrt{N}\right) }{a/\sqrt{N}}. \end{aligned}$$

By the constraint on the weights, \(\sum _{i=1}^M w_in_i/N=1\); the remaining factor, \(\left[ F(\eta _p)-F(\eta _p+a/\sqrt{N})\right] /(a/\sqrt{N})\), converges to \(-f(\eta _p)\) as N goes to infinity. Hence, we have

$$\begin{aligned} \lim _{N\rightarrow \infty } E\left( U_p(a/\sqrt{N})\right) = -f(\eta _p). \end{aligned}$$

We next show that the variance of \(U_p(a/\sqrt{N})\) approaches zero as N goes to infinity. Without loss of generality, assume that \(a>0\). It is easy to see that

$$\begin{aligned} U_p\left( a/\sqrt{N}\right)= & {} -\frac{\sqrt{N}}{aN} \sum _{i=1}^M w_i \sum _{j=1}^{n_i} I \left( \eta _p \le X_{ij} \le \eta _p+a/\sqrt{N}\right) \\= & {} -\frac{\sqrt{N}}{aN} \sum _{i=1}^M w_i Y^*_i \left( a/\sqrt{N}\right) , \end{aligned}$$

where

$$\begin{aligned} Y^*_i \left( a/\sqrt{N}\right) = \sum _{j=1}^{n_i} I \left( \eta _p \le X_{ij} \le \eta _p+a/\sqrt{N} \right) \end{aligned}$$

has a probability mass function \(b(y_i,n_i,1-p_{a/\sqrt{N}},\delta )\) in Eq. (2) with \(p_{a/\sqrt{N}} = F(\eta _p+a/\sqrt{N})-F(\eta _p)\). The variance of \(U_p(a/\sqrt{N})\) then becomes

$$\begin{aligned} var \left( U_p \left( a/\sqrt{N}\right) \right) = \frac{1}{a^2 N} \sum _{i=1}^M w_i^2 n_i \left( 1-\delta +n_i \delta \right) p_{a/\sqrt{N}} \left( 1-p_{a/\sqrt{N}}\right) . \end{aligned}$$

In the above equation, the expression

$$\begin{aligned} \frac{1}{a^2N} \sum _{i=1}^M w_i^2n_i[ 1-\delta +n_i\delta ] \end{aligned}$$

is finite for any N. Since F is a continuous function, \(p_{a/\sqrt{N}}\) converges to zero as N goes to infinity. Hence, the variance of \(U_p(a/\sqrt{N})\) converges to zero as N approaches infinity. This completes the proof of the point-wise convergence. Uniform convergence on a compact set follows from the fact that the estimating equation is non-increasing in its argument.
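The argument above can be illustrated numerically. The sketch below is a minimal simulation, not part of the paper: it assumes standard normal marginals and a simple exchangeable mechanism in which, with probability \(\delta \), a cluster repeats a single common draw. Using the optimal weights of Theorem 1, the average of \(U_p(a/\sqrt{N})\) over replications is close to \(-f(\eta _p)\), in line with the point-wise limit derived above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def simulate_clusters(n_sizes, delta, rng):
    """Illustrative exchangeable mechanism (an assumption, not the paper's data model):
    with probability delta a cluster repeats one common N(0,1) draw,
    otherwise its members are iid N(0,1)."""
    return [np.full(m, rng.standard_normal()) if rng.random() < delta
            else rng.standard_normal(m) for m in n_sizes]

def U_p(clusters, w, eta_p, a):
    """Empirical U_p(a/sqrt(N)) from the proof of Theorem 2
    (negative sign, consistent with the limit -f(eta_p))."""
    N = sum(len(x) for x in clusters)
    h = a / np.sqrt(N)
    s = sum(wi * np.sum((x >= eta_p) & (x <= eta_p + h)) for wi, x in zip(w, clusters))
    return -s / (N * h)

p, a, delta = 0.5, 1.0, 0.3
eta_p = norm.ppf(p)                                # population quantile of order p
n_sizes = rng.integers(2, 6, size=400)             # M = 400 clusters of sizes 2..5
d = (1 - delta) + n_sizes * delta
w = (n_sizes.sum() / d) / np.sum(n_sizes / d)      # optimal weights from Theorem 1

vals = [U_p(simulate_clusters(n_sizes, delta, rng), w, eta_p, a) for _ in range(200)]
print(np.mean(vals), -norm.pdf(eta_p))             # both close to -f(eta_p) = -0.3989...
```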

EM-Algorithm The EM-algorithm is an iterative procedure involving two steps: expectation (E-step) and maximization (M-step). Let \(p^{(0)}\) and \(\delta ^{(0)}\) be the initial values of the parameters p and \(\delta \). We set \(p^{(0)}\) to the sample \(p{\hbox {th}}\) quantile of the data and \(\delta ^{(0)}\) to 0.5.

E-step In the kth iteration of the algorithm, the expectation step computes the conditional expected values of the log-likelihood functions \(l_1(\delta ,\mathbf z )\) and \(l_2(p,\mathbf n ,\mathbf y ,\mathbf z )\) given \(\mathbf y \) and the \((k-1)\)th-step estimates of p and \(\delta \):

$$\begin{aligned} Q_1 \left( \delta |\delta ^{(k-1)},p^{(k-1)},\mathbf n , \mathbf y \right)= & {} E \left( l_1 \left( \mathbf z |\delta ^{(k-1)},p^{(k-1)},\mathbf n ,\mathbf y \right) \right) \\= & {} \sum _{i=1}^M \tau \left( y_i,p^{(k-1)},\delta ^{(k-1)}\right) log(\delta ) \\&+\,\left( M-\sum _{i=1}^M\tau (y_i,p^{(k-1)},\delta ^{(k-1)})\right) log (1-\delta ), \\ Q_2 \left( p|\delta ^{(k-1)},p^{(k-1)},\mathbf n ,\mathbf y \right)= & {} E\left( l_2 \left( \mathbf z |\delta ^{k-1},p^{k-1},\mathbf n , \mathbf y \right) \right) \\= & {} B^* \left( p^{(k-1)},\delta ^{(k-1)}\right) log(p) \\&+\,C^*\left( p^{(k-1)}, \delta ^{(k-1)}\right) log(1-p), \end{aligned}$$

where

$$\begin{aligned} B^* \left( p^{(k-1)},\delta ^{(k-1)}\right)= & {} E\left( B\left( \mathbf z |p^{(k-1)},\delta ^{(k-1)},\mathbf y ,\mathbf n \right) \right) \\= & {} \sum _{i=1}^M \left( 1-\tau \left( y_i,p^{(k-1)},\delta ^{(k-1)} \right) \right) y_i I_{A_{1,i}} (y_i)\\&+\,\sum _{i=1}^M \tau \left( y_i,p^{(k-1)},\delta ^{(k-1)}\right) \frac{y_i}{n_i}I_{A_{2,i}}(y_i), \end{aligned}$$

and

$$\begin{aligned} C^* \left( p^{(k-1)},\delta ^{(k-1)}\right)= & {} E\left( C\left( \mathbf z |p^{(k-1)},\delta ^{(k-1)},\mathbf y ,\mathbf n \right) \right) \\= & {} \sum _{i=1}^M \left( 1-\tau \left( y_i,p^{(k-1)},\delta ^{(k-1)}\right) \right) \left( n_i-y_i\right) I_{A_{1,i}}(y_i)\\&+\,\sum _{i=1}^M\tau \left( y_i,p^{(k-1)},\delta ^{(k-1)}\right) \frac{(n_i-y_i)}{n_i}I_{A_{2,i}}(y_i). \end{aligned}$$

M-step We update the kth iteration estimates by maximizing \(Q_1(\delta |\delta ^{(k-1)}, p^{(k-1)},\mathbf n ,\mathbf y )\) with respect to \(\delta \) and \(Q_2(p|\delta ^{(k-1)},p^{(k-1)}, \mathbf n ,\mathbf y )\) with respect to p. The updated kth iteration estimates are given by

$$\begin{aligned} \delta ^{(k)}= \frac{1}{M} \sum _{i=1}^M \tau \left( y_i,p^{(k-1)},\delta ^{(k-1)}\right) \end{aligned}$$

and

$$\begin{aligned} p^{(k)}= \frac{B^*\left( p^{(k-1)}, \delta ^{(k-1)}\right) }{B^*\left( p^{(k-1)}, \delta ^{(k-1)}\right) + C^*\left( p^{(k-1)}, \delta ^{(k-1)}\right) }. \end{aligned}$$

For the final estimates, the E- and M-steps are repeated until a stopping rule is satisfied. In this paper, the iteration is terminated when the difference between successive estimates is less than \(10^{-6}\).
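The pmf \(b(y_i,n_i,p,\delta )\) in Eq. (2), the weight \(\tau (y_i,p,\delta )\) and the sets \(A_{1,i}\) and \(A_{2,i}\) are defined in the main text and are not reproduced in this appendix. The sketch below is therefore only a minimal illustration of the E- and M-steps: it assumes the two-component mixture form of the correlated binomial, \(b(y;n,p,\delta )=(1-\delta )\,\mathrm{Binomial}(y;n,p)+\delta \left[ p\,I(y=n)+(1-p)\,I(y=0)\right] \), takes \(\tau \) to be the posterior probability of the fully correlated component, and takes \(A_{1,i}=\{0,1,\ldots ,n_i\}\) and \(A_{2,i}=\{0,n_i\}\). Under these assumptions the updates reduce to the expressions for \(\delta ^{(k)}\) and \(p^{(k)}\) given above.

```python
import numpy as np
from scipy.stats import binom

def corr_binom_pmf(y, n, p, delta):
    """Assumed form of Eq. (2): mixture of Binomial(n, p) (weight 1 - delta) and a
    fully correlated component putting mass p at y = n and 1 - p at y = 0."""
    spike = p * (y == n) + (1.0 - p) * (y == 0)
    return (1.0 - delta) * binom.pmf(y, n, p) + delta * spike

def tau(y, n, p, delta):
    """Posterior probability that a cluster belongs to the fully correlated component."""
    spike = p * (y == n) + (1.0 - p) * (y == 0)
    return delta * spike / corr_binom_pmf(y, n, p, delta)

def em(y, n, p0=0.5, delta0=0.5, tol=1e-6, max_iter=1000):
    """EM iteration for (p, delta) given cluster counts y_i and cluster sizes n_i."""
    y, n = np.asarray(y, float), np.asarray(n, float)
    p, delta = p0, delta0
    for _ in range(max_iter):
        t = tau(y, n, p, delta)                                # E-step
        on_A2 = (y == 0) | (y == n)                            # indicator of A_{2,i}
        B = np.sum((1 - t) * y) + np.sum(t * (y / n) * on_A2)              # B*
        C = np.sum((1 - t) * (n - y)) + np.sum(t * ((n - y) / n) * on_A2)  # C*
        p_new, delta_new = B / (B + C), np.mean(t)             # M-step
        if max(abs(p_new - p), abs(delta_new - delta)) < tol:  # stopping rule (10^-6)
            break
        p, delta = p_new, delta_new
    return p_new, delta_new

# Toy usage with hypothetical cluster counts y_i and sizes n_i
y = [0, 3, 5, 2, 0, 4]
n = [4, 5, 5, 4, 3, 4]
print(em(y, n))
```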


About this article


Cite this article

Ozturk, O., Turkmen, A. Quantile inference based on clustered data. Metrika 79, 867–893 (2016). https://doi.org/10.1007/s00184-016-0581-0

