Improved Conformalized Quantile Regression

Conformalized quantile regression is a procedure that inherits the advantages of both conformal prediction and quantile regression: quantile regression estimates the true conditional quantiles, and a conformal step on a calibration set then ensures marginal coverage. In this way, we obtain adaptive prediction intervals that account for heteroscedasticity. However, the aforementioned conformal step lacks adaptiveness, as described in (Romano et al., 2019). To overcome this limitation, instead of applying a single conformal step after estimating conditional quantiles with quantile regression, we propose to cluster the explanatory variables, weighted by their permutation importance, with an optimized k-means and to apply k conformal steps. To show that this improved version outperforms the classic version of conformalized quantile regression and is more adaptive to heteroscedasticity, we extensively compare the prediction intervals of both on open datasets.


Introduction
Conformal prediction (CP) is a set of distribution-free and model-agnostic algorithms devised to predict with a user-defined confidence and a coverage guarantee, see Eq. (1). CP is key in high-risk decision-making settings where the true output must lie in the forecast interval with high probability. Unlike Bayesian methods [1,2,3], which might only ensure asymptotic coverage guarantees, and bootstrap uncertainty estimation methods [4,5], CP provides valid coverage in finite samples for any model and any distribution. However, prediction intervals (PIs) must be as short as possible and adaptive to be informative. PIs are said to be adaptive when they take into account the uncertainty of the conditional distribution Y | X. In plain words, this means that PIs must be wide whenever the model is highly uncertain or the input is hard to predict, and narrow when the model has minimal uncertainty about the input. Unfortunately, conditional coverage, see Eq. (5), is a stronger property that CP does not assure; however, there are heuristic ways to approximate it. CP has two main strategies: (i) inductive conformal prediction (ICP) and (ii) full conformal prediction [6]. In a nutshell, the former requires data splitting and is therefore more scalable, whereas the latter avoids data splitting at the cost of refitting the model many times, thus being computationally onerous. Henceforth, in the interest of scalability, we will focus solely on ICP, also referred to as split conformal prediction. The ICP procedure is as follows:
1. Split $\mathcal{D}$ into three mutually exclusive datasets such that $\mathcal{D} = \mathcal{D}_{train} \cup \mathcal{D}_{cal} \cup \mathcal{D}_{val}$ and $N = n_{train} + n_{cal} + n_{val}$;
2. Train a Machine Learning (ML) model on the training set $\mathcal{D}_{train} = \{(x_i, y_i)\}_{i=1}^{n_{train}}$;
3. Define a heuristic notion of uncertainty given by $s(x, y)$, often referred to as the non-conformity score function;

4. For each element $(x_i, y_i) \in \mathcal{D}_{cal}$, apply the function $s$ to get $n_{cal}$ non-conformity scores $\{s_i\}_{i=1}^{n_{cal}}$;
5. Select a user-defined miscoverage error rate $\alpha$ and compute $\hat{q}$ as the $\frac{\lceil (n_{cal}+1)(1-\alpha) \rceil}{n_{cal}}$ empirical quantile of the non-conformity scores $\{s_i\}_{i=1}^{n_{cal}}$;
6. Given the $\hat{q}$ calculated in the previous step, get PIs in the out-of-sample phase given by $C(x_{val}) = \{y : s(x_{val}, y) \leq \hat{q}\}$.

Theorem 1 (Marginal coverage). Assuming i.i.d. samples, if we construct $C(x_{val})$ as indicated above, the following inequality holds for any non-conformity score function $s$ and any $\alpha \in (0, 1)$:

$$P(Y_{val} \in C(X_{val})) \geq 1 - \alpha. \quad (1)$$

Proof. Let $f_s$ and $F_s$ denote the probability density function and cumulative distribution function of the non-conformity scores, respectively. Without loss of generality and for the sake of simplicity, we assume the non-conformity scores $\{s_i\}_{i=1}^{n_{cal}}$ are sorted such that $s_1 < s_2 < \dots < s_{n_{cal}}$, so that $\hat{q} = s_{\lceil (n_{cal}+1)(1-\alpha) \rceil}$. By the i.i.d. assumption, $Z = (X, Y) \sim Q$ and therefore $s_1 = s(z_1), s_2 = s(z_2), \dots, s_{n_{cal}} = s(z_{n_{cal}})$ and $s_{val} = s(z_{val})$ are exchangeable, so

$$P(s_{val} \leq s_k) = \frac{k}{n_{cal} + 1}. \quad (2)$$

From Eq. (2) it follows that

$$P(Y_{val} \in C(X_{val})) = P(s_{val} \leq \hat{q}) = \frac{\lceil (n_{cal}+1)(1-\alpha) \rceil}{n_{cal} + 1} \geq 1 - \alpha. \quad (3)$$

Definition 1 (Conditional coverage). An ICP procedure guarantees conditional coverage if

$$P(Y_{val} \in C(X_{val}) \mid X_{val} = x) \geq 1 - \alpha, \quad \forall x. \quad (5)$$
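Steps 4–6 of the calibration phase can be sketched in a few lines of NumPy; `conformal_quantile` is our own illustrative helper, not code from the paper, and an exponential toy score distribution stands in for real calibration scores:

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected quantile of the calibration scores (step 5)."""
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n  # ceil((n_cal+1)(1-alpha)) / n_cal
    return np.quantile(scores, min(level, 1.0), method="higher")

# toy calibration scores
rng = np.random.default_rng(0)
scores = rng.exponential(size=1000)
q_hat = conformal_quantile(scores, alpha=0.1)

# fresh i.i.d. scores should fall below q_hat roughly 90% of the time
coverage = (rng.exponential(size=10000) <= q_hat).mean()
```

Taking the `"higher"` interpolation keeps $\hat{q}$ equal to one of the observed scores, matching the proof above, where $\hat{q}$ is an order statistic of the calibration scores.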

Naive method
The most basic and widely used non-conformity score function for regression problems is the absolute error,

$$s(x, y) = |y - f_\theta(x)|,$$

where $f_\theta(x)$ is the model forecast with respect to input $x$. After calculating $\hat{q}$ on the calibration set, PIs come as

$$C(x_{val}) = [f_\theta(x_{val}) - \hat{q},\; f_\theta(x_{val}) + \hat{q}].$$

As proven above, this naive method guarantees marginal coverage. However, the PI length is always equal to $2\hat{q}$ regardless of the input, hence not adaptive. In fact, this naive method is known to overcover easy inputs and to undercover hard inputs. As an example, suppose a regression problem with four groups A, B, C, and D in the proportions 0.5, 0.4, 0.05, and 0.05, where groups A and B are prone to lower regression outputs than groups C and D and also show less variation. Then, for $\alpha = 0.1$, this method might yield 0.9 marginal coverage at the cost of overcovering groups A and B (nearly 100%) and undercovering groups C and D (nearly 0%). In turn, evaluating PIs by simply looking at their mean amplitude or marginal coverage is neither sufficient nor adequate. A finer analysis must look at other statistics such as the standard deviation and quantiles of the interval widths. A high standard deviation of PI widths is not necessarily a bad sign; it may simply be the result of adapting to heteroscedasticity. In plain words, heteroscedasticity means that the variance across groups (or inputs) is not constant.
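The group imbalance can be made concrete with a small simulation of our own (the true mean stands in for a fitted model, so the residuals are pure noise): the naive score yields valid marginal coverage while overcovering the low-noise group and undercovering the high-noise one.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n, sigma):
    """Heteroscedastic toy data: y = x + noise with group-specific sigma."""
    x = rng.uniform(0, 1, n)
    return x, x + rng.normal(0, sigma, n)

# calibration data from an easy group A and a hard group B
x_a, y_a = sample(500, 0.1)
x_b, y_b = sample(500, 1.0)
scores = np.abs(np.concatenate([y_a - x_a, y_b - x_b]))  # |y - f(x)| with f(x) = x

n, alpha = len(scores), 0.1
q_hat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# evaluate per-group coverage of the fixed-width interval [f(x) - q_hat, f(x) + q_hat]
xv_a, yv_a = sample(5000, 0.1)
xv_b, yv_b = sample(5000, 1.0)
cov_a = (np.abs(yv_a - xv_a) <= q_hat).mean()  # easy group: overcovered
cov_b = (np.abs(yv_b - xv_b) <= q_hat).mean()  # hard group: undercovered
```

Marginal coverage (the average of the two groups) sits near 0.9, but group A is covered almost surely while group B falls well short of $1 - \alpha$.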

Quantile regression
Usually, in a regression problem, we attempt to find the parameters $\theta$ of a model $f_\theta : \mathbb{R}^D \rightarrow \mathbb{R}$ via the minimization of the sum of squared residuals on the training set $\mathcal{D}_{train}$:

$$\hat{\theta} = \arg\min_\theta \sum_{i=1}^{n_{train}} (y_i - f_\theta(x_i))^2 + \Omega(\theta),$$

where $\Omega(\theta)$ is a potential regularizer.
QR [8,9], however, attempts to approximate the true $\alpha \in (0,1)$ conditional quantile

$$Q_\alpha(x) = \inf\{y \in \mathbb{R} : F(y \mid X = x) \geq \alpha\}$$

of the true conditional distribution $Y \mid X$ by minimizing the following objective:

$$\hat{\theta} = \arg\min_\theta \sum_{i=1}^{n_{train}} \rho_\alpha(y_i - f_\theta(x_i)) + \Omega(\theta),$$

where $\rho_\alpha$ is the quantile loss, also known as the pinball loss due to its resemblance to the movement of a pinball, see Fig. (1). This loss function can be mathematically expressed as

$$\rho_\alpha(u) = \begin{cases} \alpha u, & u \geq 0 \\ (\alpha - 1)u, & u < 0. \end{cases}$$

Consequently, to obtain $1 - \alpha$ conditional coverage, it is suggested to train a model $f_\theta(x)$ for $\alpha/2$ and $1 - \alpha/2$ on $\mathcal{D}_{train}$. Therefore, PIs come as

$$C(x) = [\hat{Q}_{\alpha/2}(x),\; \hat{Q}_{1-\alpha/2}(x)].$$

Note that, unlike naive PIs, QR PIs are adaptive. Furthermore, depending on the application, over-forecasts and under-forecasts might have different costs; in such cases, a different combination of $\alpha$ conditional quantiles should be used to get more conservative lower or upper bounds with $1 - \alpha$ coverage. Another great virtue of QR is that it can be applied on top of any model by just changing the loss function to a pinball loss. Although the estimate $\hat{Q}_\alpha(x)$ of the true conditional quantile $Q_\alpha(x)$ yielded by QR is known to be asymptotically consistent under certain conditions [7,10,11], it rarely provides $1 - \alpha$ coverage in finite samples. To overcome this limitation, (Romano et al., 2019) [7] drew several ideas from CP and devised the so-called CQR, which we introduce in the next section.
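As a quick sanity check on $\rho_\alpha$: over constant predictions, the pinball loss is minimized at the empirical $\alpha$-quantile. A minimal NumPy sketch of our own:

```python
import numpy as np

def pinball_loss(y, y_hat, alpha):
    """Mean quantile (pinball) loss: alpha*u if u >= 0 else (alpha-1)*u, u = y - y_hat."""
    diff = y - y_hat
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))

# the constant minimizing the pinball loss is the empirical alpha-quantile
rng = np.random.default_rng(2)
y = rng.normal(size=100000)
grid = np.linspace(-3, 3, 601)
losses = [pinball_loss(y, c, 0.9) for c in grid]
best = grid[int(np.argmin(losses))]
# best should sit near the true 0.9 quantile of N(0, 1), about 1.28
```

This is exactly why training $f_\theta$ with $\rho_{\alpha/2}$ and $\rho_{1-\alpha/2}$ pushes the two outputs toward the conditional quantiles $Q_{\alpha/2}(x)$ and $Q_{1-\alpha/2}(x)$.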

Conformalized quantile regression
CQR is grounded on correcting QR intervals with ICP techniques on a calibration set $\mathcal{D}_{cal}$ to ensure marginal coverage, hence inheriting the advantages of both, i.e., adaptive intervals with a marginal coverage guarantee. Specifically, if QR bounds are consistently undercovering, then PIs must get wider. On the contrary, if QR PIs cover at a rate above $1 - \alpha$, they must be shortened. For this purpose, (Romano et al., 2019) [7] proposed the following non-conformity score function:

$$s(x, y) = \max\{\hat{Q}_{\alpha/2}(x) - y,\; y - \hat{Q}_{1-\alpha/2}(x)\}. \quad (13)$$

Subsequently, after calculating $\hat{q} = \text{Quantile}\left(s_1, s_2, \dots, s_{n_{cal}};\; \frac{\lceil (n_{cal}+1)(1-\alpha) \rceil}{n_{cal}}\right)$, PIs with a $1 - \alpha$ marginal coverage guarantee are given by

$$C(x_{val}) = [\hat{Q}_{\alpha/2}(x_{val}) - \hat{q},\; \hat{Q}_{1-\alpha/2}(x_{val}) + \hat{q}].$$

Built on the same idea, a different non-conformity score function was proposed in [12]; however, it has not been shown to outperform the non-conformity score function in Eq. (13) [13]. Note that $\hat{q} > 0$ (the most common case) results from a QR model that did not assure $1 - \alpha$ coverage, and therefore the intervals must get wider, see Fig. (2) for such a case. In the case of $\hat{q} < 0$, the QR bounds are overcovering and we can narrow the bands, i.e., the lower bound increases by $|\hat{q}|$ and the upper bound decreases by the same amount, while still ensuring $1 - \alpha$ marginal coverage.
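The calibration step of CQR fits in a few lines; this is our own sketch (helper names `cqr_calibrate`/`cqr_interval` are illustrative), using fixed bounds that deliberately undercover so that $\hat{q} > 0$:

```python
import numpy as np

def cqr_calibrate(lo_cal, hi_cal, y_cal, alpha):
    """Conformal correction q_hat from the CQR scores of Eq. (13)."""
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(scores)
    return np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

def cqr_interval(lo, hi, q_hat):
    """Widen (q_hat > 0) or narrow (q_hat < 0) the QR bounds symmetrically."""
    return lo - q_hat, hi + q_hat

# toy check: bounds [-1, 1] cover only ~68% of N(0, 1), so they must be widened
rng = np.random.default_rng(3)
y = rng.normal(size=2000)
lo, hi = np.full(2000, -1.0), np.full(2000, 1.0)
q_hat = cqr_calibrate(lo, hi, y, alpha=0.1)
lo_new, hi_new = cqr_interval(lo, hi, q_hat)
coverage = ((y >= lo_new) & (y <= hi_new)).mean()
```

Here $\hat{q}$ comes out positive and the corrected bounds reach at least $1 - \alpha$ coverage on the calibration scores by construction.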
CQR seems appealing, and indeed it is, yet the calibration step lacks adaptiveness: the correction $\hat{q}$ is a single constant that does not depend on the input in any form. It is a simple combination of the naive method and QR. Our contribution, presented in the next section, focuses on improving the calibration step to make it more adaptive.

Contribution
Our idea to improve CQR is to perform $k$ conformal steps instead, one per group, as depicted in Fig. (3). This is based on the idea that computing a separate $\hat{q}$ for each group of similar observations better approximates the conditional distribution $Y \mid X$ than a single global correction. To attain such a goal, we can compute the Euclidean distance between observations and cluster them using k-means [14]; however, on its own this is not a good heuristic to approximate the conditional distribution $Y \mid X$. Two additional conditions are needed: (1) features must be scaled, since the Euclidean distance is scale-dependent; and (2) each feature has a different predictive power upon the response variable $y$. To accommodate condition (2), when clustering the observations we must weigh each feature by its feature importance. A simple way of calculating feature importance that works for any model is permutation importance [15]. Algorithms (1), (2), and (3) comprise every step needed to successfully perform k-means, permutation importance, and our improved CQR version, respectively. In a nutshell, these are the main steps of our improved version:

(1) normalize the training data, and apply the same normalization object to the calibration and validation sets;
(2) train $f_\theta : \mathbb{R}^D \rightarrow \mathbb{R}$ with the pinball loss for $\alpha/2$ and $1 - \alpha/2$;
(3) get feature importances by means of the permutation importance algorithm, using the calibration set for evaluation;
(4) create a copy of the training, calibration, and validation sets weighted by feature importance, henceforth referred to as the clustering training set, clustering calibration set, and clustering validation set, respectively;
(5) select the best $k$ for k-means on the clustering training set and store the $k$ centroids that represent each cluster;
(6) assign each observation of the clustering calibration set to the nearest cluster/centroid;
(7) knowing which elements belong to each of the $k$ clusters given by the previous step, compute a different $\hat{q}^{(i)}$ for each cluster using Eq. (13) as the non-conformity score function on the calibration set;
(8) given a new observation $x_{val}$, find the nearest centroid $c_i$ and use the respective $\hat{q}^{(i)}$ to produce PIs as

$$C(x_{val}) = [\hat{Q}_{\alpha/2}(x_{val}) - \hat{q}^{(i)},\; \hat{Q}_{1-\alpha/2}(x_{val}) + \hat{q}^{(i)}].$$

K-means, as shown in Algorithm (1), is a clustering algorithm that attempts to minimize the objective

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - c_i\|^2, \quad (15)$$

where $k$ is the number of clusters, $S_i$ denotes cluster set $i$, $c_i$ the centroid of cluster $i$, and $\|\cdot\|$ the Euclidean norm.
Since k-means relies on the hyperparameter $k$ that defines the number of clusters beforehand, we must select a criterion to find the best candidate $k$. For this purpose, we suggest the fraction of variance explained, mathematically expressed as

$$\sigma_k = \frac{\sum_{i=1}^{k} n_i \|c_i - \mu\|^2}{\sum_{j=1}^{n} \|x_j - \mu\|^2},$$

where $n_i$ is the number of examples in cluster $i$ and $\mu$ denotes the feature mean.
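The selection of $k$ by explained variance can be sketched with scikit-learn's KMeans; `fraction_variance_explained` and `select_k` are our own illustrative names, and the three-blob toy data is an assumption for the demo:

```python
import numpy as np
from sklearn.cluster import KMeans

def fraction_variance_explained(X, labels, centroids):
    """Between-cluster sum of squares over total sum of squares."""
    mu = X.mean(axis=0)
    between = sum(
        (labels == i).sum() * np.sum((c - mu) ** 2)
        for i, c in enumerate(centroids)
    )
    total = np.sum((X - mu) ** 2)
    return between / total

def select_k(X, target=0.9, k_max=10, seed=0):
    """Smallest k whose clustering explains at least `target` of the variance."""
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        if fraction_variance_explained(X, km.labels_, km.cluster_centers_) >= target:
            return k, km
    return k_max, km

# three well-separated blobs: k = 3 should already explain > 90% of the variance
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.1, size=(100, 2)) for c in (0.0, 5.0, 10.0)])
k, km = select_k(X, target=0.9)
fve = fraction_variance_explained(X, km.labels_, km.cluster_centers_)
```

Choosing the smallest $k$ that reaches the target keeps the per-cluster calibration sets as large as possible, which matters for the finite-sample correction discussed below.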

Remarks
• Sometimes the number of observations assigned to a cluster $i$, denoted $n_i$, is very low and therefore the finite-sample correction $\frac{\lceil (n_i+1)(1-\alpha) \rceil}{n_i} > 1$. A possible solution to overcome this hurdle is to use constrained k-means [17,18] and increase the minimum number of observations per cluster. A second solution is to clip the quantile level: $\hat{q}^{(i)} \leftarrow \text{Quantile}\left(S^{(i)};\; \min\left(1,\, \frac{\lceil (n_i+1)(1-\alpha) \rceil}{n_i}\right)\right)$;
• The selection of $k$ for k-means is not limited to our approach; different criteria, e.g., the silhouette score and the elbow method [19,20,21], are also acceptable;
• Clustering algorithms other than k-means might also be adequate, as long as they take into account the essential point: clustering based on the similarity between observations and on feature importance.
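The clipping in the first remark amounts to one line; a tiny sketch of our own (`cluster_q_hat` is an illustrative name):

```python
import numpy as np

def cluster_q_hat(scores, alpha):
    """Per-cluster conformal quantile with the level clipped at 1 (first remark)."""
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

# n_i = 3 and alpha = 0.1: ceil(4 * 0.9) / 3 > 1, so the level clips to 1
small = np.array([0.2, 0.5, 0.9])
q_small = cluster_q_hat(small, alpha=0.1)  # -> 0.9, the maximum score
```

Clipping makes the tiny cluster maximally conservative (its $\hat{q}^{(i)}$ becomes the largest observed score) rather than undefined.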

Algorithm 1 K-means algorithm
Input: $k$ initial centroids $c_1, \dots, c_k$ efficiently randomly assigned as in [22]. Output: $k$ clusters given by their centroids that minimize Eq. (15)
1: Converged ← False
2: t ← 1
3: while Converged is False do
4:   Assign each observation $x$ to the cluster with the nearest centroid, $i = \arg\min_j \|x - c_j\|$
5:   Update each centroid $c_i$ as the mean of the observations assigned to cluster $i$
6:   if no assignment changed then Converged ← True else t ← t + 1
7: end while

Algorithm 2 Permutation importance
Input: a dataset $\mathcal{D}$, a model $f_\theta : \mathbb{R}^D \rightarrow \mathbb{R}$, a performance measure $M$, and $R$ repetitions. Output: $D$ feature importances $I_1, I_2, \dots, I_D$
1: Split $\mathcal{D}$ into two mutually exclusive sets, a training set $\mathcal{D}_{train} = (X_{train}, y_{train})$ and a validation set $\mathcal{D}_{val} = (X_{val}, y_{val})$
2: Train $f_\theta$ on $\mathcal{D}_{train}$, and compute the baseline error score $E_b$ on $\mathcal{D}_{val}$ using the performance measure $M$
3: for j ← 1 to D do
4:   for i ← 1 to R do
5:     Permute column $j$ of $X_{val}$ and compute the permuted error score $E_{j,i}$
6:   end for
7:   $I_j \leftarrow \frac{1}{R}\sum_{i=1}^{R} E_{j,i} - E_b$
8: end for

Algorithm 3 Improved conformalized quantile regression (ICQR)
Input: three normalized sets $\mathcal{D}_{train} = (X_{train}, y_{train})$, $\mathcal{D}_{cal} = (X_{cal}, y_{cal})$, and $\mathcal{D}_{val} = (X_{val}, y_{val})$, a miscoverage error rate $\alpha$, a desired explained variance $\sigma$, a maximum number of clusters $K$, $R$ repetitions, and a model $f_\theta : \mathbb{R}^D \rightarrow \mathbb{R}$. Output: adaptive intervals with $1 - \alpha$ coverage
1: Part 1: Estimate conditional quantiles with quantile regression, training $f_\theta$ on $\mathcal{D}_{train}$ with the pinball loss for $\alpha/2$ and $1 - \alpha/2$
2: Part 2: Compute feature importances with Algorithm (2), using $\mathcal{D}_{cal}$ for evaluation with $R$ repetitions
3: Part 3: Weigh the features of the three sets by their importances and run Algorithm (1) on the clustering training set, selecting the smallest $k \leq K$ whose explained variance reaches $\sigma$
4: Part 4: Assign each calibration observation to its nearest centroid and compute a per-cluster $\hat{q}^{(i)}$ with the non-conformity scores of Eq. (13)
5: For each $x_{val}$, find the nearest centroid $c_i$ and output $[\hat{Q}_{\alpha/2}(x_{val}) - \hat{q}^{(i)},\; \hat{Q}_{1-\alpha/2}(x_{val}) + \hat{q}^{(i)}]$
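The pieces above can be assembled into a compact end-to-end sketch. This is our own minimal rendition, not the paper's implementation: scikit-learn's GradientBoostingRegressor with pinball loss stands in for the FFNN, and scikit-learn's `permutation_importance` and `KMeans` replace Algorithms (2) and (1); the names `icqr_fit`/`icqr_predict` are illustrative, and $k$ is fixed rather than selected by explained variance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

def icqr_fit(X_tr, y_tr, X_cal, y_cal, alpha=0.1, k=3, seed=0):
    # Part 1: quantile regression for the alpha/2 and 1 - alpha/2 bounds
    lo_m = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2,
                                     random_state=seed).fit(X_tr, y_tr)
    hi_m = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2,
                                     random_state=seed).fit(X_tr, y_tr)
    # Part 2: permutation importance on the calibration set -> feature weights
    w = permutation_importance(hi_m, X_cal, y_cal, n_repeats=5,
                               random_state=seed).importances_mean
    w = np.clip(w, 0, None) + 1e-8
    # Part 3: k-means on the importance-weighted features
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_tr * w)
    # Part 4: one conformal correction per cluster, using the CQR score (Eq. 13)
    labels = km.predict(X_cal * w)
    scores = np.maximum(lo_m.predict(X_cal) - y_cal, y_cal - hi_m.predict(X_cal))
    q = {}
    for i in range(k):
        s = scores[labels == i]
        if len(s) == 0:  # empty cluster: fall back to the global scores
            s = scores
        level = min(1.0, np.ceil((len(s) + 1) * (1 - alpha)) / len(s))
        q[i] = np.quantile(s, level, method="higher")
    return lo_m, hi_m, km, w, q

def icqr_predict(fitted, X):
    lo_m, hi_m, km, w, q = fitted
    q_hat = np.array([q[i] for i in km.predict(X * w)])
    return lo_m.predict(X) - q_hat, hi_m.predict(X) + q_hat

# toy heteroscedastic data: the noise level grows with the second feature
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, (3000, 2))
y = X[:, 0] + (0.1 + X[:, 1]) * rng.normal(size=3000)
fitted = icqr_fit(X[:1000], y[:1000], X[1000:2000], y[1000:2000], alpha=0.1, k=3)
lo, hi = icqr_predict(fitted, X[2000:])
coverage = ((y[2000:] >= lo) & (y[2000:] <= hi)).mean()  # should land near 0.9
```

Because each cluster receives its own correction, clusters dominated by the high-noise region end up with a larger $\hat{q}^{(i)}$ than the low-noise ones, which is exactly the adaptiveness the paper argues for.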

Experiments
In this section we apply Algorithm (3) and compare it against CQR, QR, and Naive on the datasets shown in Table (1). We use a feedforward neural network (FFNN) [23] with three output neurons to estimate $Q_{\alpha/2}(x)$, $Q_{0.5}(x)$, and $Q_{1-\alpha/2}(x)$, respectively. This is easily achieved by having a pinball loss function with a different value of $\alpha$ for each output neuron. $\hat{Q}_{\alpha/2}(x)$, $\hat{Q}_{1-\alpha/2}(x)$, and $\hat{Q}_{0.5}(x)$ represent the estimated lower bound, upper bound, and model forecast with respect to $x$, respectively. Due to the stochastic behaviour of FFNNs, the model is trained T = 100 times to reduce the associated variance, and each of the aforementioned methods is applied on top of these 100 trained models. Each dataset in Table (1) is divided into three mutually exclusive datasets: a training set containing 50% of the data, a calibration set with 25%, and an evaluation set with 25%. Thereafter, each method is assessed on the evaluation set considering summary statistics of interval widths and coverages for $\alpha = 0.1$ and $\sigma_k = 0.9$. All the data and code, which is written in Python, can be found here.

[Tables 2–7: per-dataset interval-width and coverage summary statistics; e.g., Table 7: (Bike sharing) coverage summary statistics.]
In the first two datasets, QR is consistently undercovering; hence, the QR bounds need ICP to ensure $1 - \alpha$ coverage. ICQR is clearly more adaptive to heteroscedasticity than CQR: it has a higher standard deviation and IQR and a lower median width, while still ensuring the same coverage, as seen in Tables (2–5).
In the last dataset (bike sharing), we have the opposite case: QR is overcovering, as seen in Table (7). Consequently, we can shrink the bounds down to $1 - \alpha$ coverage with ICP. In line with the previous datasets, ICQR bounds grow more slowly than CQR bounds, as shown in Fig. (10), hence being more adaptive.

Remarks
• Naive bounds do not depend on the input in any form, nor are they adaptive: the interval amplitude is always equal to $2\hat{q}$. The small deviation seen in the above tables results from training the model 100 times to reduce the variance associated with the FFNN's random initialization.
• Evaluating PIs by just looking at their mean width is not adequate. The naive method often has the lowest mean width; however, it achieves $1 - \alpha$ coverage in a group-unbalanced way, overcovering easy inputs, and its PIs always have the same amplitude. Therefore, analyzing width quantiles gives a better perspective on conditional coverage. Adaptive methods generally have a higher mean width because certain inputs or groups carry massive uncertainty; nevertheless, their median width is usually lower.

Conclusion
In this paper, we have proposed an improved version of CQR. Results demonstrate that our version is more adaptive to heteroscedasticity; hence, ICQR is one step closer to conditional coverage. The major shortcoming of ICQR compared to CQR is the two additional steps needed to calculate permutation importance and perform k-means. Additionally, some clusters might have few observations, making the associated $\hat{q}$ biased; however, this is not a significant hindrance, as we can simply apply the constrained version of k-means and increase the minimum number of observations per cluster. Despite these minor disadvantages, ICQR offers markedly more adaptive PIs than classic CQR, which leads us to strongly endorse its use in high-stakes regression problems.