Monotone Least Squares and Isotonic Quantiles

We consider bivariate observations $(X_1,Y_1),\ldots,(X_n,Y_n)$ such that, conditional on the $X_i$, the $Y_i$ are independent random variables with distribution functions $F_{X_i}$, where $(F_x)_x$ is an unknown family of distribution functions. Under the sole assumption that $F_x$ is isotonic in $x$ with respect to stochastic order, one can estimate $(F_x)_x$ in two ways: (i) For any fixed $y$ one estimates the antitonic function $x \mapsto F_x(y)$ via nonparametric monotone least squares, replacing the responses $Y_i$ with the indicators $1_{[Y_i \le y]}$. (ii) For any fixed $\beta \in (0,1)$ one estimates the isotonic quantile function $x \mapsto F_x^{-1}(\beta)$ via a nonparametric version of regression quantiles. We show that these two approaches are closely related, with (i) being a bit more flexible than (ii). Then, under mild regularity conditions, we establish rates of convergence for the resulting estimators $\hat{F}_x(y)$ and $\hat{F}_x^{-1}(\beta)$, uniformly over $(x,y)$ and $(x,\beta)$ in certain rectangles.


Introduction
Suppose we observe $n \ge 1$ pairs $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathbb{R}$ with random or fixed covariate values $X_1, \ldots, X_n$ in a set $\mathcal{X} \subset \mathbb{R}$ such that, conditional on $(X_i)_{i=1}^n$, the response values $Y_1, \ldots, Y_n$ are independent with $\mathbb{P}(Y_i \le y) = F_{X_i}(y)$ for $1 \le i \le n$ and $y \in \mathbb{R}$. Here $(F_x)_{x \in \mathcal{X}}$ is an unknown family of distribution functions on $\mathbb{R}$. Note that some values $X_i$ could be identical, in which case the corresponding random variables $Y_i$ have the same conditional distribution, given $(X_i)_{i=1}^n$. To facilitate our arguments, we treat $(X_i)_{i=1}^n$ as fixed from now on; our main results remain valid in random design settings with minor adjustments, as indicated later.
Our goal is to estimate the whole family $(F_x)_{x \in \mathcal{X}}$ under the sole assumption that $F_x$ is isotonic (non-decreasing) in $x$ with respect to stochastic order. This constraint can be expressed equivalently as follows:

(SO.1) For arbitrary fixed $y \in \mathbb{R}$, $F_x(y)$ is antitonic (non-increasing) in $x \in \mathcal{X}$.

In terms of quantiles: writing $Q_x(\beta)$ for any $\beta$-quantile of $F_x$, $\beta \in (0,1)$, the stochastic order assumption means that $Q_x(\beta)$ may be chosen isotonic in $x$, and we assume this throughout.
Such a constraint appears natural in several settings. For instance, an employee's income Y tends to increase with his or her age X. Other examples in which such a stochastic order is plausible are: The expenditures Y of a household for certain goods in relation to its monthly income X; the body height or weight Y of a child in relation to its age X. Stochastic ordering constraints also have applications in forecasting. For example, X 1 , . . . , X n and Y 1 , . . . , Y n could be the predicted and actual cumulative precipitation amounts on n different days, respectively, with the predictions being obtained from a numerical weather prediction model, see Henzi (2018).
With condition (SO.1) in mind, one could think about estimating the antitonic function $x \mapsto F_x(y)$ by means of monotone least squares regression, replacing the response values $Y_i$ with the indicator variables $1_{[Y_i \le y]}$. Precisely, we would set $\hat F_x(y) = \hat h(x)$ with an antitonic function $\hat h : \mathcal{X} \to [0,1]$ such that
$$\sum_{i=1}^n \bigl( 1_{[Y_i \le y]} - \hat h(X_i) \bigr)^2$$
is minimal. The solution $\hat h$ is unique on $\{X_1, \ldots, X_n\}$, and on $\mathcal{X} \setminus \{X_1, \ldots, X_n\}$ one could extrapolate it in some reasonable way. In the special case of $\mathcal{X}$ being finite, this approach has been proposed and analyzed by El Barmi and Mukerjee (2005).
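For illustration, this fit at a fixed threshold $y$ can be computed with the classical pool-adjacent-violators algorithm. The following self-contained Python sketch (with made-up indicator data; all names are hypothetical) obtains the antitonic fit by running the isotonic algorithm on the reversed sequence:

```python
def pava(y, w):
    """Weighted isotonic (non-decreasing) least squares fit via the
    pool-adjacent-violators algorithm.  y: values, w: positive weights."""
    # each block stores [weighted mean, total weight, number of points]
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # merge while the last two blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, c1 + c2])
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit

# made-up indicators 1[Y_i <= y] for a fixed y, covariates already sorted;
# an antitonic fit is an isotonic fit of the reversed sequence, reversed back
ind = [1, 0, 1, 0, 0, 1, 0, 0]
w = [1.0] * len(ind)
fhat = pava(ind[::-1], w[::-1])[::-1]   # antitonic in x, values in [0, 1]
```

Replacing the indicators by general responses gives ordinary monotone least squares; weights other than one arise when several observations share a covariate value.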
The quantile formulation of the stochastic ordering constraint suggests imitating the regression quantiles of Koenker and Bassett (1978). That is, we estimate the conditional $\beta$-quantiles $Q_x(\beta)$ by $\hat Q_x(\beta) = \hat h(x)$ with an isotonic function $\hat h : \mathcal{X} \to \mathbb{R}$ minimizing the empirical risk
$$\sum_{i=1}^n \rho_\beta \bigl( Y_i - \hat h(X_i) \bigr), \quad \text{where } \rho_\beta(z) := (\beta - 1_{[z<0]})\, z.$$
This estimator has been considered, for instance, by Poiraud-Casanova and Thomas-Agnan (2000), who showed that it coincides with an estimator of Casady and Cryer (1976) given by a certain minimax formula involving sample $\beta$-quantiles. The characterization of isotonic estimators in terms of minimax formulae has also been derived by Robertson and Wright (1980) in a rather general framework, including arbitrary partial orders on $\mathcal{X}$ and general loss functions. The goals of the present paper are to clarify the connection between these two estimation paradigms and to provide new consistency results in a suitable asymptotic framework.
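The minimax formula mentioned above can be made concrete: the smallest minimizer of the isotonic check-loss problem has components $\max_{r \le j} \min_{s \ge j}$ of sample $\beta$-quantiles over blocks of consecutive observations. A brute-force Python sketch (hypothetical helper names, small made-up data) reads:

```python
import math

def sample_quantile(vals, beta):
    """Smallest beta-quantile of a sample: min{y : empirical df(y) >= beta}."""
    v = sorted(vals)
    return v[math.ceil(beta * len(v)) - 1]

def isotonic_quantile(y, beta):
    """Smallest isotonic check-loss minimizer via the minimax formula
    q_j = max_{r<=j} min_{s>=j} (beta-quantile of y[r..s])."""
    n = len(y)
    return [max(min(sample_quantile(y[r:s + 1], beta) for s in range(j, n))
                for r in range(j + 1))
            for j in range(n)]
```

This cubic-time evaluation is for illustration only; pool-adjacent-violators-type algorithms compute the same fit far more efficiently.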
In Section 2, we give a detailed description of the estimator $(\hat F_x)_{x \in \mathcal{X}}$ based on monotone least squares and of estimators $(\hat Q_x)_{x \in \mathcal{X}}$ based on monotone regression quantiles. Then we show that the estimators $\hat Q_x$ are consistent with, and encompassed by, quantiles of the estimators $\hat F_x$, but the latter estimators allow for smoother estimated quantile curves.
In Section 3, we analyze the estimators in a suitable asymptotic framework with a triangular scheme of observations and $\mathcal{X}$ being a real interval. It turns out that under certain regularity conditions on the design points and the true distribution functions $F_x$, one can prove rates of convergence for quantities such as
$$\sup_{x \in I,\, y \in J} \bigl| \hat F_x(y) - F_x(y) \bigr| \quad \text{and} \quad \sup_{x \in I,\, \beta \in B} \bigl| \hat Q_x(\beta) - Q_x(\beta) \bigr|$$
with intervals $I \subset \mathcal{X}$, $J \subset \mathbb{R}$ and $B \subset (0,1)$.
Proofs and technical details are deferred to Section 4.

Estimation of the conditional distributions
Let $x_1 < \cdots < x_m$ be the different elements of $\{X_1, X_2, \ldots, X_n\}$, so that $m \le n$, and for $1 \le j \le m$ set $w_j := \#\{i : X_i = x_j\}$. More generally, for $1 \le r \le s \le m$ let $w_{r:s} := \sum_{j=r}^s w_j$ and let
$$F_{r:s}(y) := w_{r:s}^{-1}\, \#\{i : x_r \le X_i \le x_s,\ Y_i \le y\}$$
denote the empirical distribution function of the pooled subsample with covariates in $\{x_r, \ldots, x_s\}$. Then the unconstrained maximum likelihood estimator of $F_{x_j}(y)$ is given by
$$F_{j:j}(y) = w_j^{-1}\, \#\{i : X_i = x_j,\ Y_i \le y\}. \tag{1}$$
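In code, these unconstrained estimators are plain counting operations. The following Python sketch (hypothetical helper, made-up data) computes the empirical distribution function of the subsample whose covariates lie between the $r$-th and $s$-th distinct design points; the case $r = s = j$ is the unconstrained estimator of $F_{x_j}(y)$:

```python
def pooled_ecdf(x, y, xs, r, s, t):
    """Empirical distribution function, evaluated at t, of the responses
    whose covariate lies in {xs[r-1], ..., xs[s-1]} (1-based block r:s of
    the sorted distinct covariate values xs)."""
    sub = [yi for xi, yi in zip(x, y) if xs[r - 1] <= xi <= xs[s - 1]]
    return sum(yi <= t for yi in sub) / len(sub)

# made-up data with a tied covariate value
x = [1, 1, 2, 3]
y = [0.5, 1.5, 1.0, 2.0]
xs = sorted(set(x))   # distinct design points x_1 < x_2 < x_3
```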

Estimation of $F_x$ via monotone least squares
The estimator $F_{j:j}(y)$ in (1) by itself is rather poor, unless the corresponding subsample size $w_j$ is rather large. But in connection with our stochastic order constraint it becomes a useful tool. Note first that, for any function $h : \mathcal{X} \to \mathbb{R}$,
$$\sum_{i=1}^n \bigl( 1_{[Y_i \le y]} - h(X_i) \bigr)^2 = \sum_{j=1}^m w_j \bigl( F_{j:j}(y) - h(x_j) \bigr)^2 + \text{const},$$
where the constant does not depend on $h$, and the stochastic order assumption implies that the vector $F(y) = (F_{x_j}(y))_{j=1}^m$ belongs to the cone
$$\mathbb{R}^m_\downarrow := \{f \in \mathbb{R}^m : f_1 \ge f_2 \ge \cdots \ge f_m\}.$$
Hence one can estimate $F(y)$ by the unique least squares estimator
$$\hat F(y) := \arg\min_{f \in \mathbb{R}^m_\downarrow} \sum_{j=1}^m w_j \bigl( F_{j:j}(y) - f_j \bigr)^2.$$
The computation of $\hat F(y)$ is easily accomplished via the pool-adjacent-violators algorithm, see Robertson et al. (1988). Note also that it suffices to compute $\hat F(y)$ for at most $n-1$ different values of $y$. Precisely, if $y_1 < y_2 < \cdots < y_\ell$ are the elements of $\{Y_1, Y_2, \ldots, Y_n\}$, then $\hat F(y) = 0$ for $y < y_1$, $\hat F(y) = 1$ for $y \ge y_\ell$, and $\hat F(y) = \hat F(y_k)$ for $1 \le k < \ell$ and $y \in [y_k, y_{k+1})$.
It is well known that $\hat F(y)$ may also be represented by the following minimax and maximin formulae: for $1 \le j \le m$,
$$\hat F_{x_j}(y) = \min_{r \le j} \max_{s \ge j} F_{r:s}(y) = \max_{s \ge j} \min_{r \le j} F_{r:s}(y).$$
Finally, we extrapolate $\hat F(y)$ to an antitonic function $x \mapsto \hat F_x(y)$ on $\mathcal{X}$. We set $\hat F_x(y) := \hat F_{x_1}(y)$ for $x \le x_1$ and $\hat F_x(y) := \hat F_{x_m}(y)$ for $x \ge x_m$. For $x_{j-1} \le x \le x_j$, $1 < j \le m$, one could define $\hat F_x(y)$ by linear interpolation, but other antitonic interpolations are possible without affecting our asymptotic results; indeed, no specific interpolation is used in our proofs.
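The minimax representation can be evaluated literally. The following Python sketch (brute-force, made-up indicator data, hypothetical names) computes the antitonic least squares fit as $\min_{r \le j} \max_{s \ge j}$ of weighted block means and agrees with the pool-adjacent-violators output:

```python
def antitonic_lsq_minimax(f, w):
    """Antitonic least squares fit of values f with weights w via the
    min-max formula  F_j = min_{r<=j} max_{s>=j} weighted mean(f[r..s])."""
    m = len(f)

    def wmean(r, s):
        num = sum(w[i] * f[i] for i in range(r, s + 1))
        den = sum(w[i] for i in range(r, s + 1))
        return num / den

    return [min(max(wmean(r, s) for s in range(j, m))
                for r in range(j + 1))
            for j in range(m)]

# made-up indicators 1[Y_i <= y] at 8 increasing design points
fit = antitonic_lsq_minimax([1, 0, 1, 0, 0, 1, 0, 0], [1.0] * 8)
```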

Plug-in estimation of $Q_x$
Once we have estimated $(F_x)_{x \in \mathcal{X}}$ by $(\hat F_x)_{x \in \mathcal{X}}$ as in Section 2.1, we can easily determine corresponding quantile functions. For any fixed $\beta \in (0,1)$ and $x_j$, $1 \le j \le m$, we could determine the minimal and maximal $\beta$-quantiles
$$\hat F_{x_j}^{-1}(\beta) := \min\{y \in \mathbb{R} : \hat F_{x_j}(y) \ge \beta\}, \qquad \hat F_{x_j}^{-1}(\beta+) := \inf\{y \in \mathbb{R} : \hat F_{x_j}(y) > \beta\}.$$
Both vectors $(\hat F_{x_j}^{-1}(\beta))_{j=1}^m$ and $(\hat F_{x_j}^{-1}(\beta+))_{j=1}^m$ are isotonic, and any isotonic function on $\mathcal{X}$ lying between these two bounds is a plausible estimator of a $\beta$-quantile curve.
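Computationally, extracting these two quantiles from a step function is a one-liner each. A minimal Python sketch (hypothetical helper and inputs) is:

```python
def quantiles_from_cdf(ys, Fs, beta):
    """Minimal and maximal beta-quantiles of a distribution function
    taking the values Fs at the sorted jump points ys:
        F^{-1}(beta)  = min{y : F(y) >= beta},
        F^{-1}(beta+) = inf{y : F(y) >  beta}.
    Assumes 0 < beta < Fs[-1] and Fs non-decreasing."""
    qmin = next(y for y, F in zip(ys, Fs) if F >= beta)
    qmax = next(y for y, F in zip(ys, Fs) if F > beta)
    return qmin, qmax
```

When the $\beta$-quantile is unique, both values coincide.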

Estimation of $Q_x$ via monotone regression quantiles
Similarly as in Section 2.1, we focus on the vector $Q(\beta) := (Q_{x_j}(\beta))_{j=1}^m$, which belongs to the cone $\mathbb{R}^m_\uparrow := \{q \in \mathbb{R}^m : q_1 \le \cdots \le q_m\}$, and we propose to estimate $Q(\beta)$ by some vector in the set
$$\hat{\mathcal{Q}}(\beta) := \arg\min_{q \in \mathbb{R}^m_\uparrow} T_\beta(q), \qquad T_\beta(q) := \sum_{j=1}^m \sum_{i : X_i = x_j} \rho_\beta(Y_i - q_j).$$
Note that the function $T_\beta(\cdot)$ is convex but not strictly convex on $\mathbb{R}^m$. Hence it need not have a unique minimizer. The next result provides more precise information in terms of the minimal and maximal sample $\beta$-quantiles
$$F_{r:s}^{-1}(\beta) := \min\{y \in \mathbb{R} : F_{r:s}(y) \ge \beta\}, \qquad F_{r:s}^{-1}(\beta+) := \inf\{y \in \mathbb{R} : F_{r:s}(y) > \beta\}.$$

Lemma 2.1. Define $L, U \in \mathbb{R}^m$ via $L_j := \max_{r \le j} \min_{s \ge j} F_{r:s}^{-1}(\beta)$ and $U_j := \min_{s \ge j} \max_{r \le j} F_{r:s}^{-1}(\beta+)$, $1 \le j \le m$. Any vector $q \in \hat{\mathcal{Q}}(\beta)$ satisfies $L \le q \le U$ componentwise.
Remark 2.2. At first glance, one could suspect that any isotonic vector $q \in \mathbb{R}^m_\uparrow$ satisfying $L \le q \le U$ minimizes $T_\beta$. But this conjecture is wrong. As a counterexample, consider the case of $n = 2$ observations with $X_1 < X_2$ but $Y_1 > Y_2$. Here $m = 2$, $F_{1:1}(y) = 1_{[y \ge Y_1]}$ and $F_{2:2}(y) = 1_{[y \ge Y_2]}$, and for $q \in \mathbb{R}^2_\uparrow$,
$$T_\beta(q) = \rho_\beta(Y_1 - q_1) + \rho_\beta(Y_2 - q_2) \ge \min_{\bar q \in \mathbb{R}} \bigl( \rho_\beta(Y_1 - \bar q) + \rho_\beta(Y_2 - \bar q) \bigr),$$
with equality if, and only if, $q_1 = q_2$ and this common value is a sample $\beta$-quantile of $\{Y_1, Y_2\}$. For instance, for $\beta = 1/2$ and $q_1 \le q_2$,
$$2\, T_{1/2}(q) = |Y_1 - q_1| + |Y_2 - q_2| \ge (Y_1 - q_1) + (q_2 - Y_2) = (Y_1 - Y_2) + (q_2 - q_1) \ge Y_1 - Y_2,$$
with equality throughout precisely when $Y_2 \le q_1 = q_2 \le Y_1$.

Connection between the two estimation paradigms
Restricting the plug-in quantile estimators of Section 2.2 to the set of observed $X$-values leads to the set
$$\hat{\mathcal{Q}}^{\text{plug-in}}(\beta) := \bigl\{ q \in \mathbb{R}^m_\uparrow : \hat F_{x_j}^{-1}(\beta) \le q_j \le \hat F_{x_j}^{-1}(\beta+) \text{ for } 1 \le j \le m \bigr\}.$$
This set is closely related to the set $\hat{\mathcal{Q}}(\beta)$:

Lemma 2.3. The vectors $L$ and $U$ in Lemma 2.1 are given by $L_j = \hat F_{x_j}^{-1}(\beta)$ and $U_j = \hat F_{x_j}^{-1}(\beta+)$ for $1 \le j \le m$.

Example 2.4. The simple example in Remark 2.2 shows that $\hat{\mathcal{Q}}(\beta) \ne \hat{\mathcal{Q}}^{\text{plug-in}}(\beta)$ in general. Let us illustrate this point with a more interesting numerical example. Figure 1 shows a simulated sample of size $n = 100$, together with the true medians $F_x^{-1}(0.5)$ (green, dotted), which are unique for each $x$, the minimal and maximal estimated median curves $x \mapsto \hat F_x^{-1}(0.5)$ and $x \mapsto \hat F_x^{-1}(0.5+)$, and a piecewise linear median curve $x \mapsto \hat Q_x(0.5)$ minimizing $\int \hat q'(x)^2\, dx$ among all isotonic functions $\hat q$ between these two curves. The latter curve is a natural candidate and is smoother in $x$ than $\hat F_x^{-1}(0.5)$ or $\hat F_x^{-1}(0.5+)$, while the corresponding values of $T_{0.5}(\cdot)$ (rounded to three digits) are almost identical.

Asymptotic considerations
We provide some asymptotic properties of the estimators just introduced in the case of a real interval $\mathcal{X}$ and a triangular scheme of observations: for sample size $n \ge 2$, consider observations $(X_{n1}, Y_{n1}), \ldots, (X_{nn}, Y_{nn})$ with fixed values $X_{n1}, \ldots, X_{nn} \in \mathcal{X}$ and independent random variables $Y_{n1}, \ldots, Y_{nn}$ such that
$$\mathbb{P}(Y_{ni} \le y) = F_{X_{ni}}(y)$$
for $1 \le i \le n$ and $y \in \mathbb{R}$. The resulting estimators of $F_x(y)$ and $Q_x(\beta)$ are denoted by $\hat F_{nx}(y)$ and $\hat Q_{nx}(\beta)$, respectively. In what follows, we assume that the distribution functions $F_x$ are Hölder-continuous in $x$, at least on some subinterval of $\mathcal{X}$, and that the design points are 'asymptotically dense' within this interval. We write $\rho_n := \log(n)/n$, and $\lambda(\cdot)$ stands for Lebesgue measure. Moreover, the absolute frequency of the design points $X_{ni}$ in a set $B \subset \mathcal{X}$ is denoted by $w_n(B) := \#\{i \le n : X_{ni} \in B\}$.

(A.1) For an interval $I \subset \mathcal{X}$ and an arbitrary interval $J \subset \mathbb{R}$, there exist constants $\alpha \in (0,1]$ and $C_1 > 0$ such that
$$|F_x(y) - F_{x'}(y)| \le C_1 |x - x'|^\alpha \quad \text{for all } x, x' \in I \text{ and } y \in J.$$

(A.2) There exist constants $C_2, C_3 > 0$ and an integer $n_o$ such that, for $n \ge n_o$ and arbitrary intervals $I_n \subset I$ whose length is at least a certain multiple (involving $C_3$) of a power of $\rho_n$,
$$w_n(I_n) \ge C_2\, n\, \lambda(I_n).$$

Under these two assumptions, the estimator $\hat F_{nx}$ satisfies a certain consistency property (Theorem 3.1).
Remark 3.2 (Fixed and random design points). Assumption (A.2) is satisfied if, for instance, the design points $X_{n1}, \ldots, X_{nn}$ form an equidistant grid in $I$. On the other hand, suppose that $X_{n1}, X_{n2}, \ldots, X_{nn}$ are independent random variables with density $g$ on $\mathcal{X}$ such that $\inf_{x \in I} g(x) > 0$. Then standard results for empirical processes on the real line imply that, for any choice of $\alpha \in (0,1]$, $0 < C_2 < \inf_{x \in I} g(x)$ and $C_3 > 0$, Assumption (A.2) holds with asymptotic probability one as $n \to \infty$. Hence the conclusions of this section remain true in such a random design setting.
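The random design claim can be probed by simulation. The following Python sketch (uniform design density on $[0,1]$, constants chosen purely for illustration) checks that each not-too-short subinterval $I_n$ contains at least $C_2\, n\, \lambda(I_n)$ design points, which is the content of Assumption (A.2):

```python
import random

rng = random.Random(1)
n = 2000
x = [rng.random() for _ in range(n)]   # i.i.d. design, density g = 1 on [0, 1]
C2, delta = 0.5, 0.1                   # C2 < inf g = 1; interval length delta
ok = all(
    sum(a <= xi <= a + delta for xi in x) >= C2 * n * delta
    for a in (0.0, 0.25, 0.5, 0.9)     # a few subintervals [a, a + delta]
)
```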
Concerning estimated quantiles, we combine Assumptions (A.1) and (A.2) with a growth condition on the conditional distribution functions $F_x$: for given $0 < \beta_1 < \beta_2 < 1$, there exists a constant $C > 0$ such that
$$F_x(y_2-) - F_x(y_1) \ge C\,(y_2 - y_1)$$
for arbitrary $x \in I$ and $y_1, y_2 \in \mathbb{R}$ such that $y_1 < y_2$ and $F_x(y_1), F_x(y_2-) \in (\beta_1, \beta_2)$.

Proofs and technical details

Monotone regression
The main result of this section is Theorem 4.2. It is formulated in a general framework and is therefore of independent interest, as it provides a characterization of the solutions to monotone regression problems. An application of this theorem to monotone regression quantiles yields Lemma 2.1.

For $1 \le j \le m$, let $R_j : \mathbb{R} \to \mathbb{R}$ be convex and coercive functions, that is, $R_j(z) \to \infty$ as $|z| \to \infty$. Denote by
$$\mathcal{Q} := \arg\min_{q \in \mathbb{R}^m_\uparrow} \sum_{j=1}^m R_j(q_j)$$
the set of monotone regression solutions. These solutions have already been characterized in a nonexplicit fashion by Dümbgen and Kovac (2009).
Theorem 4.1 (Dümbgen and Kovac). The following three properties of an element $q \in \mathbb{R}^m_\uparrow$ are equivalent. Here $R_j'(q_j+)$ denotes the right-sided derivative of $R_j$ at $q_j$, whereas $R_j'(q_j-)$ denotes its left-sided one.
We now characterize $\mathcal{Q}$ more explicitly by means of minimax and maximin representations.

For all $1 \le r \le s \le m$, one easily verifies that the notation
$$[L_{r:s}, U_{r:s}] := \arg\min_{q \in \mathbb{R}} \sum_{j=r}^s R_j(q)$$
is well justified, as the right-hand side is indeed a nonempty compact interval of $\mathbb{R}$.

Theorem 4.2. Define $L, U \in \mathbb{R}^m$ via
$$L_j := \max_{r \le j} \min_{s \ge j} L_{r:s} \quad \text{and} \quad U_j := \min_{s \ge j} \max_{r \le j} U_{r:s}, \qquad 1 \le j \le m.$$
Then $L, U \in \mathcal{Q}$, and any vector $q \in \mathcal{Q}$ satisfies $L \le q \le U$ componentwise.
At present, Mühlemann et al. (2019) are extending this result to more general functions R j and arbitrary partial orders on {1, 2, . . . , m}, using a different approach.
The proof of Theorem 4.2 is built on several intermediate results.

Lemma 4.3. Let $1 \le a \le b \le m$ with $L_a = L_b$. Then $L_a = L_{r:s}$ for some $1 \le r \le a$ and $b \le s \le m$.

Proof of Lemma 4.3. Since $L_a = L_b$, we get $L_a = \max_{r \le a} \min_{s \ge b} L_{r:s} = L_{r:s}$ for some $1 \le r \le a \le b \le s \le m$.
The next lemma and its corollary provide an even more precise conclusion. Here and in what follows, we set $L_0 := -\infty$ and $L_{m+1} := +\infty$.
Similar conclusions hold for U .
In other words, if $L_{a-1} < L_a = L_b < L_{b+1}$, then $L_a$ is the smallest minimizer of $\sum_{j=a}^b R_j(q)$ over all $q \in \mathbb{R}$. To prove Lemma 4.4, we need the following result:

Proposition 4.6. Let $G_1, G_2$ be convex and coercive functions, and let $L_1$, $L_2$ and $L$ be the smallest minimizers of $G_1$, $G_2$ and $G_1 + G_2$, respectively. Then $\min(L_1, L_2) \le L \le \max(L_1, L_2)$.

Proof of Proposition 4.6. Without loss of generality let $L_1 \le L_2$. First, suppose ab absurdo that $L < L_1$. Since $L_1$ and $L_2$ are the respective smallest minimizers of $G_1$ and $G_2$, we have $G_1'(L+), G_2'(L+) < 0$. Hence $(G_1 + G_2)'(L+) < 0$, which contradicts the fact that $L$ minimizes $G_1 + G_2$. Next, since $L$ is the smallest minimizer of $G_1 + G_2$, there is no element $q < L$ with $(G_1 + G_2)'(q+) \ge 0$. So $L_2 < L$ cannot happen, because $G_1'(L_2+), G_2'(L_2+) \ge 0$ would yield $(G_1 + G_2)'(L_2+) \ge 0$.
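Proposition 4.6 is easy to check numerically: the smallest minimizer of a sum of two convex coercive functions lies between their respective smallest minimizers. A grid-based Python sketch (with two hypothetical piecewise linear convex functions) illustrates this:

```python
def smallest_minimizer(g, grid):
    """Smallest grid point attaining (up to rounding) the minimum of g."""
    vals = [g(q) for q in grid]
    mn = min(vals)
    return next(q for q, v in zip(grid, vals) if v <= mn + 1e-12)

grid = [i / 10 for i in range(51)]        # 0.0, 0.1, ..., 5.0
G1 = lambda q: abs(q - 1)                 # smallest minimizer: 1.0
G2 = lambda q: abs(q - 3)                 # smallest minimizer: 3.0
L1 = smallest_minimizer(G1, grid)
L2 = smallest_minimizer(G2, grid)
L = smallest_minimizer(lambda q: G1(q) + G2(q), grid)
# G1 + G2 is flat on [1, 3]; its smallest minimizer is 1.0,
# which indeed lies between L1 and L2
```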
Proof of Lemma 4.4. We prove the result for $1 \le a \le b \le m$ such that $L_{a-1} < L_a = L_b$. The proofs for the case $L_a = L_b < L_{b+1}$ and for $U$ are similar.

First, for $a = 1$, we have $L_1 = \max_{r \le 1} \min_{s \ge 1} L_{r:s} = \min_{s \ge 1} L_{1:s}$. Let $a > 1$ and suppose ab absurdo that $L_{a:b} < L_a$. Since $L_a = L_b$, Lemma 4.3 provides the existence of $1 \le \tilde r \le a$ and $b \le \tilde s \le m$ such that $L_a = L_{\tilde r : \tilde s}$. Hence $L_{a:b} < L_{\tilde r : \tilde s}$. Next, notice that we must have $\tilde r < a$: otherwise $\tilde r = a$ implies $L_a = \min_{s \ge b} L_{a:s} \le L_{a:b}$, contradicting $L_{a:b} < L_a$. Let us now define $G_1 := \sum_{j=\tilde r}^{a-1} R_j$ and $G_2 := \sum_{j=a}^b R_j$. Since $L_{\tilde r : a-1} < L_{\tilde r : b}$, and because $L_{\tilde r : a-1}$ and $L_{\tilde r : b}$ are the smallest minimizers of $G_1$ and $G_1 + G_2$, respectively, we necessarily have that the smallest minimizer $L_{a:b}$ of $G_2$ satisfies $L_{\tilde r : b} \le L_{a:b}$. If this were not true, $L_{\tilde r : b}$ would be strictly greater than the maximum of $L_{\tilde r : a-1}$ and $L_{a:b}$, which would contradict Proposition 4.6.
Combining all these inequalities leads to the following contradiction:
$$L_a \le L_{\tilde r : b} \le L_{a:b} < L_a.$$

Proof of Theorem 4.2. Let $a, b \in \mathcal{Q}$ and $\lambda \in (0,1)$, and define $q := (1-\lambda)a + \lambda b$. Then $q \in \mathbb{R}^m_\uparrow$ and, by convexity of the $R_j$,
$$\sum_{j=1}^m R_j(q_j) \le (1-\lambda) \sum_{j=1}^m R_j(a_j) + \lambda \sum_{j=1}^m R_j(b_j) = \min_{p \in \mathbb{R}^m_\uparrow} \sum_{j=1}^m R_j(p_j),$$
so $q \in \mathcal{Q}$, and therefore $\mathcal{Q}$ is convex.
Next, notice that for $1 \le j < m$,
$$L_j = \max_{r \le j} \min_{s \ge j} L_{r:s} \le \max_{r \le j+1} \min_{s \ge j+1} L_{r:s} = L_{j+1}.$$
Therefore $L \in \mathbb{R}^m_\uparrow$. To show that $L$ minimizes $\sum_{j=1}^m R_j(q_j)$ over all $q \in \mathbb{R}^m_\uparrow$, let $1 \le a \le b \le m$ be such that $L_{a-1} < L_a = L_b$. From Lemma 4.4, we have that $L_a \le L_{a:b}$. Because $L_{a:b}$ is a minimizer of $\sum_{j=a}^b R_j(q)$, the convexity of the latter expression yields
$$\sum_{j=a}^b R_j'(L_a-) \le 0.$$
One proves similarly that $\sum_{j=a}^b R_j'(L_b+) \ge 0$ when $L_a = L_b < L_{b+1}$. Thus, Property 2 of Theorem 4.1 is fulfilled, which shows that $L \in \mathcal{Q}$.
Finally, let $q \in \mathcal{Q}$ and suppose ab absurdo that $q_k < L_k$ for some $1 \le k \le m$. Let $1 \le a \le k$ be such that $L_{a-1} < L_a = L_k$, and let $k \le b \le m$ be such that $q_k = q_b < q_{b+1}$. Then we have $q_j \le q_k < L_k = L_j$ for all $a \le j \le k$, and $q_j = q_k < L_k \le L_j$ for all $k \le j \le b$. Therefore $q_j < L_j$ for all $a \le j \le b$.
This last property and the convexity of the $R_j$'s imply that $R_j'(q_j+) \le R_j'(L_j-)$ for all $a \le j \le b$. However, this inequality cannot be strict for any such $j$, since that would violate Property 3 of Theorem 4.1 and contradict the fact that $q \in \mathcal{Q}$. Hence we have $R_j'(q_j+) = R_j'(L_j-)$, which means that $R_j$ is linear on $(q_j, L_j) \ne \emptyset$ for all $a \le j \le b$.
From Corollary 4.5, we have that $L_k = L_{a:p_1}$. Since $L_{a:p_1}$ is the smallest minimizer of $g$ and because $g$ is linear on $(q_k, L_{a:p_1})$, the function $g$ is strictly decreasing on $(q_k, L_{a:p_1})$, has a kink at $L_{a:p_1}$, and is increasing on $(L_{a:p_1}, \infty)$. Therefore we have the strict inequality (3). For all $c = 1, 2, \ldots, t-1$, Property 2 of Theorem 4.1 implies the inequalities (4).

Combining the strict inequality in (3) with the inequalities in (4) yields a strict inequality which contradicts Property 3 of Theorem 4.1 and shows that $q$ cannot belong to $\mathcal{Q}$.

Proofs of Lemmas 2.1 and 2.3
Proof of Lemma 2.1. For $1 \le j \le m$, set
$$R_j(q) := \sum_{i : X_i = x_j} \rho_\beta(Y_i - q),$$
so that $T_\beta(q) = \sum_{j=1}^m R_j(q_j)$ and, for $1 \le r \le s \le m$, the minimal and maximal sample $\beta$-quantiles $F_{r:s}^{-1}(\beta)$ and $F_{r:s}^{-1}(\beta+)$ are the smallest and largest minimizers $L_{r:s}$ and $U_{r:s}$ of $\sum_{j=r}^s R_j(q)$ over $q \in \mathbb{R}$. The assertion now follows from Theorem 4.2.

In order to prove Lemma 2.3, two useful corollaries of Theorem 4.1 are stated.
Proof of Corollary 4.7. The left- and right-sided derivatives of $R_j$ coincide and are equal to $R_j'(f) = 2 w_j (f - F_{j:j}(y))$. It then remains to apply the appropriate inequality from the antitonic version of Theorem 4.1 to obtain the desired result.
Proof of Corollary 4.8. For $1 \le j \le m$ and $q \in \mathbb{R}$, we have
$$R_j'(q+) = w_j \bigl( F_{j:j}(q) - \beta \bigr) \quad \text{and} \quad R_j'(q-) = w_j \bigl( F_{j:j}(q-) - \beta \bigr).$$
To conclude, we apply the appropriate inequality of Theorem 4.1.
Proof of Lemma 2.3. We first show, recursively on $k$, that $\hat F_{x_k}(L_k) \ge \beta$ for $1 \le k \le m$.
Let $k = 1$ and let $1 \le s \le m$ be such that $L_1 = \cdots = L_s < L_{s+1}$. Corollary 4.8 ensures that $F_{j:s}(L_1) \ge \beta$ for any $1 \le j \le s$, so Corollary 4.7 can be applied. Let $1 \le j \le s$ be such that $f_{j-1} > f_j = \cdots = f_s$, with $f_0 := +\infty$ if needed. Then $\hat F_{x_1}(L_1) \ge \beta$.

Let now $2 \le k \le m$ be such that $\hat F_{x_j}(L_j) \ge \beta$ holds for all $1 \le j \le k-1$. Let $1 \le r \le k$ and $k \le s \le m$ with $L_{r-1} < L_r = L_k = L_s < L_{s+1}$. Corollary 4.8 yields $F_{j:s}(L_k) \ge \beta$ for any $r \le j \le s$, so Corollary 4.7 can be applied again. Then one of two cases occurs: either there exists $r \le j \le s$ such that $f_{j-1} > f_j = \cdots = f_s$, or no such $j$ exists. In both cases, $\hat F_{x_k}(L_k) \ge \beta$.

Now, Lemma 2.1 will yield the assertion if we can show that $q := (\hat F_{x_j}^{-1}(\beta))_{j=1}^m \in \hat{\mathcal{Q}}(\beta)$. Let $1 \le r \le s \le m$ be arbitrary indices such that $q_r = q_s < q_{s+1}$, and define $f := \arg\min_{h \in \mathbb{R}^m_\downarrow} \sum_{j=1}^m w_j (F_{j:j}(q_s) - h_j)^2$. The goal is to verify that $q$ fulfils the properties of Corollary 4.8.
First, by definition of $q$, we have $\hat F_{x_j}(q_s) \ge \beta$ for $r \le j \le s$ and $\hat F_{x_{s+1}}(q_s) < \beta$. The last inequality holds since otherwise $\hat F_{x_{s+1}}(q_s) \ge \beta$ would imply $q_{s+1} = \min\{y \in \mathbb{R} : \hat F_{x_{s+1}}(y) \ge \beta\} \le q_s$ and contradict the hypothesis on $s$.
Since the vector $f$ is the unique minimizer of $\sum_{j=1}^m w_j (F_{j:j}(q_s) - f_j)^2$ over $\mathbb{R}^m_\downarrow$, Corollary 4.7 can be applied. Exactly one of the following cases occurs:

• If $f_r > f_s$, let $p \ge 1$ and $r \le k_1 < \cdots < k_p < s$ be such that $f_{k_c} > f_{k_c + 1}$ for $1 \le c \le p$. Then the first inequality yields $f_s < f_{k_1} \le F_{r:k_1}(q_s)$; the second one yields $f_s < f_{k_2} \le F_{k_1+1:k_2}(q_s)$; and so on, up to the last one, which gives $f_s \le F_{k_p+1:s}(q_s)$. Combining these inequalities then yields $\beta \le f_s < F_{r:s}(q_s)$.
In both cases, the property of Corollary 4.8 in the case $q_r = q_s < q_{s+1}$ is fulfilled. The other property, concerning $q_{r-1} < q_r = q_s$, is obtained with similar arguments. Therefore, $q \in \hat{\mathcal{Q}}(\beta)$.

This shows that $\hat F_{x_j}^{-1}(\beta) = L_j$ for $1 \le j \le m$, and one proves similarly that $\hat F_{x_j}^{-1}(\beta+) = U_j$ for $1 \le j \le m$.

Asymptotics
In what follows, the dependence on $n$ does not always need to be stated, so we drop the subscript $n$ of various quantities to lighten the notation. For instance, we write $w(B)$ instead of $w_n(B)$ for $B \subset \mathcal{X}$, where $w_n(B) = \#\{i \le n : X_i \in B\}$. Furthermore, we define for $B \subset \mathcal{X}$:

The proofs make use of the following inequality, due to Bretagnolle (1980).

Theorem 4.9 (Bretagnolle). There exist universal constants $C_4, C_5 > 0$ such that, for any interval $I_o \subset \mathcal{X}$, $\eta > 0$ and $n \in \mathbb{N}$:

Corollary 4.10. With the same constants as in Theorem 4.9, we have, for all $\eta > 0$ and $n \in \mathbb{N}$, that:

Proof of Corollary 4.10. Notice that, for all $\eta > 0$ and $n \in \mathbb{N}$, we have:

Recall that $\rho_n = \log(n)/n$.

Lemma 4.11. Suppose that Assumption (A.2) is satisfied, and let $N \ge n_o$ be sufficiently large so that $I_N \ne \emptyset$. For all $n \ge N$ and $x_k \in I_n$, define:

Then the stated bounds on the minimum and maximum hold.

Proof of Lemma 4.11. Fix $n \ge N$ and $x_k \in I_n$. The two intervals are subintervals of $I$ of Lebesgue measure $\delta_n$ and are both at a distance $\delta_n$ from $x_k$. Assumption (A.2) ensures that $x_{r(k)} \in I_{r(k)}$ and $x_{s(k)} \in I_{s(k)}$, which yields:

Lemma 4.12. Suppose that Assumptions (A.1) and (A.2) are satisfied, and let $N \ge n_o$ be sufficiently large so that $I_N \ne \emptyset$. Then, for all $n \ge N$, the stated bound on the supremum over $x_k \in I_n$ and $y \in J$ holds.

Proof of Lemma 4.12. Let us fix $n \ge N$, $x_k \in I_n$ and $y \in J$. Then:

This property and Lemma 4.11 imply that:

Taking the supremum over all $x_k \in I_n$ and $y \in J$ yields the desired result. The second claim is proved similarly.
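Exponential bounds of this type can be probed by simulation. The following Python sketch checks the closely related Dvoretzky–Kiefer–Wolfowitz inequality with Massart's constant, $\mathbb{P}(\sup_y |\hat F_n(y) - F(y)| \ge \eta) \le 2 e^{-2 n \eta^2}$, for uniform samples; this substitutes DKW for Bretagnolle's bound purely for illustration, and all names are hypothetical:

```python
import math
import random

def ks_sup(n, rng):
    """sup_y |F_hat(y) - y| for an i.i.d. Uniform(0,1) sample of size n."""
    u = sorted(rng.random() for _ in range(n))
    # the supremum is attained at a jump point of the empirical df
    return max(max(abs((i + 1) / n - ui), abs(i / n - ui))
               for i, ui in enumerate(u))

rng = random.Random(0)
n, eta, reps = 50, 0.2, 200
exceed = sum(ks_sup(n, rng) >= eta for _ in range(reps))
freq = exceed / reps
bound = 2 * math.exp(-2 * n * eta ** 2)   # DKW-Massart bound, about 0.037
```

With the simulated exceedance frequency well below such a bound, the qualitative message of the exponential inequality is visible even at moderate sample sizes.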
Proof of Theorem 3.1. Let $N \ge n_o$ be sufficiently large so that $I_N \ne \emptyset$. Let $C_1, C_2, C_3$ be as in Assumptions (A.1) and (A.2), let $C_4, C_5$ be as in Corollary 4.10, let $C_6$ be as in Lemma 4.12, and let $C_7$ be such that $C_5 C_7^2 > 2$. Finally, we set:

Let us fix $n \ge N$, $x_k \in I_n$ and $y \in J$. Since $\lambda([x_{r(k)}, x_k]) \ge \delta_n$ from Lemma 4.11, Assumption (A.2) applies. From Lemma 4.12, we also get:

Therefore:

Notice that the very last set depends neither on $x_k$ nor on $y$. Hence, defining $I_n^X := I_n \cap \{x_1, \ldots, x_m\}$, we obtain, for some $\varepsilon > 0$, a bound on $\mathbb{P}\bigl( \sup_{x \in I_n^X,\, y \in J} \cdots \bigr)$.