Bias-corrected methods for estimating the receiver operating characteristic surface of continuous diagnostic tests

Verification bias is a well-known problem that may occur in the evaluation of predictive ability of diagnostic tests. When a binary disease status is considered, various solutions can be found in the literature to correct inference based on usual measures of test accuracy, such as the receiver operating characteristic (ROC) curve or the area underneath. Evaluation of the predictive ability of continuous diagnostic tests in the presence of verification bias for a three-class disease status is here discussed. In particular, several verification bias-corrected estimators of the ROC surface and of the volume underneath are proposed. Consistency and asymptotic normality of the proposed estimators are established and their finite sample behavior is investigated by means of Monte Carlo simulation studies. Two illustrations are also given.


Introduction
Before applying a diagnostic test in clinical settings, a rigorous statistical assessment of its performance in discriminating the disease status from the nondisease status is necessary. For a continuous-scale test T , the diagnosis is dependent upon whether the test result is above or below a specified cut point c.
Assuming, without loss of generality, that higher test values indicate a higher likelihood of disease, a result is called positive if its value exceeds the cut point, and negative otherwise. A positive test indicates presence of the disease.
At a fixed cut point c, the accuracy of the test can be evaluated by its true positive rate (TPR) and its true negative rate (TNR), which are defined as the probabilities that the test correctly identifies the diseaded and non-diseaded subjects, respectively. The Receiver Operating Characteristic (ROC) curve is the plot of TPR versus 1-TNR by varying the cut point c. Usually, the ROC curve is monotone and lies in the upper triangle of the unit square, which consist of three vertices (0, 0), (0, 1) and (1,1). The shape of ROC curve allows to evaluate the ability of the test. For example, a ROC curve equal to a straight line joining points (0, 0) and (1, 1) represents a diagnostic tests which is the random guess. A commonly used summary measure that aggregates performance information across the range of possible cut points is the area under ROC curve (AUC). Reasonable values of AUC range from 0.5, suggesting that the test is no better than chance alone, to 1.0, which indicates a perfect test.
Clearly, the ROC curve and the AUC of a test under assessment are unknown and the statistical evaluation of the test requires suitable inferential procedures. See, for example, Zhou et al. [18] and Pepe [14] as general references. In principle, knowledge of the true disease status of the subjects under study is obtained by the most accurate available test, called gold standard (GS) test. In practice, there may be many drawbacks to implement the GS test, which can be too expensive, or too invasive, or both for regular use. Thus, often only a subset of patients undergoes disease verification and the decision to send a patient to verification typically depends on the test result and other patient characteristic. Statistical evaluations based only on data from subjects with verified disease status are typically biased. This bias is known as verification bias.
In the last fifteen years, various methods have been developed to deal with the verification bias problem, most of which assume that the true disease status, if missing, is missing at random (MAR, Little and Rubin [11]). Among the others, we cite the papers by [1], [3], [6], [7] and [15]. In particular, Alonzo and Pepe [3] proposed four types of partially parametric estimators of TPR and TNR, i.e., full imputation (FI), mean score imputation (MSI), inverse probability weighting (IPW) and semiparametric efficient (SPE) estimators.
In some medical studies, however, the disease status often involves more than two categories; for example, Alzheimer's dementia can be classified into three categories (see [4] for more details). In such situations, quantities used to evaluate the accuracy of a diagnostic test are the true class fractions (TCF's). These are well defined as a generalization of TPR and TNR. In a three-class diagnostic problem, given a pair of cut points (c 1 , c 2 ), with c 1 < c 2 , subjects are classified into class 1 if T < c 1 ; class 2 if c 1 ≤ T < c 2 ; and class 3 otherwise. The true class fractions of the test T at (c 1 , c 2 ) are defined as The plot of (TCF 1 , TCF 2 , TCF 3 ) by varying the pair (c 1 , c 2 ) produces the ROC surface of T in the unit cube. Scurfield [16] and Nakas and Yiannoutsos [12] mentioned that a ROC surface is well defined as a generalization of the ROC curve. Indeed, the projection of ROC surface to the plane defined by TCF 2 versus TCF 1 yields the ROC curve between classes 1 and 2. Similarly, on projecting the ROC surface to the plane defined by the axes TCF 2 and TCF 3 , the ROC curve between classes 2 and 3 is produced (see also [13]). The ROC surface will be the triangular plane with vertices (0, 0, 1), (0, 1, 0), and (1, 0, 0) if all of three TCF's are equal for every pair (c 1 , c 2 ). In this case, we say that the diagnostic test is the random guess, again. In practice, one can imagine that the graph of ROC surface lies in the unit cube and above the plane of the triangle with three vertices (0, 0, 1), (0, 1, 0), and (1, 0, 0). A summary of the overall diagnostic accuracy of the test under consideration is the volume under the ROC surface (VUS) which can be seen as a generalization of the AUC. Reasonable values of VUS vary from 1/6 to 1, ranging from bad to perfect diagnostic tests. Nakas and Yiannoutsos [12] and Nakas [13] gave some interesting results about ROC surface analysis in absence of verification bias. In particular, the authors formularized the ROC surface by a functional form and proposed a nonparametric approach for VUS estimation. Again without verification bias, parametric estimation of VUS is supplied in the work of Xiong et al. [17], where the assumption of normality distribution was used, whereas Li and Zhou [10] tackled the nonparametric and semi-parametric estimation of the ROC surface. Li et al. [9] proposed a regression approach to ROC surface, and in Kang and Tian [8] a kernel smoothing based approach for estimation of VUS is employed.
The issue of correcting for the verification bias in ROC surface analysis is very scarcely considered in the statistical literature. Until now, only Chi and Zhou [4] discussed about the issue. The authors proposed maximum likelihood estimates for ROC surface and VUS. However, these results only concern ordinal diagnostic tests. This motivated us to develop bias-corrected methods for continuous diagnostic tests with three-class disease status.
In this paper, we propose several verification bias-corrected estimators of TCF 1 , TCF 2 and TCF 3 for continuous diagnostic tests. The proposed estimators are the extension of FI, MSI, IPW and SPE estimators for the ROC curves in [3]. The new estimators allow to obtain bias-corrected ROC surfaces. Corresponding estimators of the VUS are also presented. Consistency and asymptotic normality of the proposed estimators are established under the MAR assumption.
The rest of paper is organized as follows. In Section 2, we review the estimators of ROC curves discussed in [3]. The proposed extension, giving biascorrected estimators of the ROC surface and of VUS, is presented in Section 3, along with the relevant asymptotic results. In Section 4, some simulation results are produced and two applications of the methods are contained in Section 5. Finally, conclusions are drawn in Section 6.

Background
In this section, we review the approaches presented in [3] for bias-corrected ROC analysis in two-class problems.

Notation and assumption
Let us consider a study with n subjects, for whom the result of a continuous test T is available. The patient's true condition (or disease status), D, is defined by a GS test. D is a binary variable, that is 0 if the subject is healthy and 1 in case of disease. Further, let V be a binary verification status of a patient, such that V = 1 if he/she is underwent the GS test, and V = 0 otherwise. In practice, some information, other than the test results, can be obtained for each patient. Let A be a covariate vector for a patient, that may be associated with both D and V . For the sake of reader's convenience, without loss of generality, in what follows we will consider A to be univariate. We assume that the verification status V and the response D are mutually independent given the test result T and covariate A, i.e., Pr(V |T, A) = Pr(V |D, T, A) or, equivalently, Pr(D|T, A) = Pr(D|V, T, A). This assumption corresponds to the MAR assumption.

Bias correction for ROC curve
Let FPR = 1-TNR. When all subjects are verified by GS, we have a full (or complete) data set. Then, one can employ the empirical estimatorsβ 0 ,β 1 andθ to obtain the nonparametric estimators of TPR and FPR TPR(c) =β , (2.2) where I(·) is the indicator function.
If not all patients have their disease status verified, the nonparametric estimators (2.2) can not be computed. If one computes the Naïve estimators, i.e., estimators (2.2) based only on verified subjects, typically gets estimates that are biased and inconsistent.
Alonzo and Pepe [3] proposed four partially parametric estimators to assess continuous diagnostic (or screening) tests under the MAR assumption. In particular, FI estimators of TPR(c) and FPR(c) are Here, the estimatesρ i of ρ i = Pr(D i = 1|T i , A i ) are obtained by using some suitable parametric model (e.g., logistic regression model) computed from verified subjects. MSI estimators only imputes the disease status for subjects who did not undergo the GS, resulting to be IPW method weights each verified subject by the inverse of the conditional verification probability π i = Pr(V i = 1|T i , A i ) (i.e. the probability that the subject is selected for verification). Therefore, the estimators are Again, the estimatesπ i need to be obtained by using parametric regression models such as logistic or probit models. Finally, SPE estimators are defined as: Alonzo and Pepe [3] showed that SPE estimators are doubly robust because they are consistent if either the π i 's or the ρ i 's are consistently estimated. However, it is worth noting that SPE estimates may not be range-respecting, i.e., they could fall outside the interval (0, 1). This happens because the quantities For each of the above methods, an estimated bias-corrected ROC curve can be obtained by plotting TPR(c) versus FPR(c) for all cut points c.

Proposal
Consider now a three-class problem. We model the disease status by a trinomial random vector D = (D 1 , D 2 , D 3 ), such that D k is a Bernoulli random variable having mean θ k = Pr(D k = 1), with θ 1 +θ 2 +θ 3 = 1. Let β jk = Pr(T ≥ c j , D k = 1) with j = 1, 2 and k = 1, 2, 3. In this notation, When all subjects are verified, the nonparametric estimators of TCF 1 , TCF 2 and TCF 3 are given by In the presence of verification bias, we propose four estimators for TCF 1 (c 1 ), TCF 2 (c 1 , c 2 ) and TCF 3 (c 2 ). The proposed estimators work under the MAR assumption and are based on the estimation of the quantities θ 1 , θ 2 , β 11 , β 12 , β 22 and β 23 . They can be seen as an extension of estimators reviewed in Subsection 2.2. In expressions (2.1) and (3.1), we note that parameters θ and θ k , so as β 1 and β jk , play, in essence, a similar role. Therefore, estimates of θ k and β jk can be obtained by mimicking what was done in the two-class problem.

Full imputation
For each j = 1, 2 and k = 1, 2, 3, the FI estimators of θ k and β jk are obtained asθ whereρ ki is an estimate of ρ ki = Pr(D ki = 1|T i , A i ) given by some suitable model, such as the multinomial logistic or probit regression model, applied to the verified sample units. Therefore, the FI estimator TCF 1,FI (c 1 ), TCF 2,FI (c 1 , c 2 ) and TCF 3,FI (c 2 ) are .
It is worth noting that estimatesθ k,FI andβ jk,FI in (3.2) and (3.3) are the solutions of the estimating equations
Again, we can obtainθ k andβ jk as solution of the estimating equations

Inverse probability weighted
From the IPW estimators of β 1 and θ in (2.4), we derive, by analogy, The estimatesπ i are obtained in the same way as in the two-class case. Then, the IPW estimates TCF 1,IPW (c 1 ), TCF 2,IPW (c 1 , c 2 ) and TCF 3,IPW (c 2 ) are , and the estimating equations corresponding toθ k,IPW andβ jk,IPW are Note that the IPW estimators only use verified subjects.

Semiparametric efficient
Similarly to three previous cases, the SPE estimators of β jk and θ k are derived in analogy toβ 1,SPE andθ SPE in (2.5), i.e, Therefore, we obtain , , .
The estimatesθ k,SPE andβ jk,SPE solve the estimating equations
We consider also the following standard regularity conditions.
The above theorem gives a general result for all estimates, i.e., FI, MSI, IPW and SPE. In Appendix, the explicit form of the asymptotic variance-covariance matrix is obtained. In practice, the variance-covariance matrix Σ is replaced by a consistent estimateΣ It is worth noting that SPE estimators of θ k and β jk in (3.9) and (3.10), will inherit the double robustness property ofθ SPE andβ 1,SPE in (2.5). That is, θ k,SPE andβ jk,SPE remain consistent if only one of the disease model P (D k = 1|T, A) or the verification model P (V = 1|T, A) is correctly specified in the estimation process; they are inconsistent if both models are misspecified. Clearly, this property holds also for the estimators TCF 1,SPE (c 1 ), TCF 2,SPE (c 1 , c 2 ) and TCF 3,SPE (c 2 ).

VUS estimation
Let µ be the volume under the ROC surface (VUS) of T . A straightforward calculation (Nakas and Yiannoutsos [12]) shows that µ = Pr (T i < T < T r |D 1i = 1, D 2 = 1, D 3r = 1) or, equivalently, where I i r = I(T i < T < T r ) + 1 2 I(T i < T = T r ) + 1 2 I(T i = T < T r ) + 1 6 I(T i = T = T r ). Then, in the absence of missing data, a natural nonparametric estimatorμ of µ is given bŷ . (3.14) When the disease status is missing for some of the subjects, verification biascorrected estimators of VUS, can be obtained by using suitable estimates of quantities D 1i , D 2i and D 3i in (3.14). More precisely, FI, MSI, IPW and SPE estimators of VUS take the form As for the estimators of the TCFs, under the MAR assumption and certain suitable regularity conditions, we can establish consistency and asymptotic normality of the above given bias-corrected VUS estimators (proof available from the authors). Moreover, the asymptotic variance ofμ * , i.e. the variance of √ n (μ * − µ 0 ) , can be consistently estimated by with G i r,FI (µ, τ ρ , τ π ) = ρ 1i (τ ρ )ρ 2 (τ ρ )ρ 3r (τ ρ ) (I i r − µ) , for k = 1, 2, 3. In (3.15), the functions g τρ i (·) and g τπ i (·) are the elements of the functions g τ i (·) in the estimating function G τ (·) for the parameters of the models adopted for the disease and the verification processes. See the Appendix A for their specification when the models chosen are, the multinomial logistic regression and the logistic (or probit) model, respectively.

Simulation studies
In this section, the ability of FI, MSI, IPW and SPE methods to estimate TCF 1 , TCF 2 and TCF 3 are evaluated by using Monte Carlo experiments. Also, the square root of the estimates of the variances are compared with Monte Carlo and bootstrap standard deviations. Some simulation results concerning the behaviour of the estimators for the VUS are given in Appendix C.
Note that, the bias-corrected estimators of TCF 1 , TCF 2 and TCF 3 require a parametric regression model to estimate ρ ki = Pr(D ki = 1|T i , A i ), or π i = Pr(V i = 1|T i , A i ), or both. A wrong specification of such models may affect the estimation. Therefore, in the simulation study we consider four scenarios: (i) the disease model and the verification model are both correctly specified; (ii) the verification model is misspecified; (iii) the disease model is misspecified; (iv) the disease model and the verification model are both misspecified.
All scenarios allow to evaluate the behavior of the proposed estimators in finite samples. In practice, we consider 5000 Monte Carlo replications, and three sample sizes, i.e., 250, 500 and 1000 in scenario (i) and a sample size equal to 1000 in scenarios (ii)-(iv). The choice of such sample size in scenarios (ii)-(iv) allows to dig up expected bad behaviors of the estimators under misspecification, when a great amount of information is available, i.e., in large samples.

Study 1
The true disease D is generated by a trinomial random vector (D 1 , D 2 , D 3 ), such that D k is a Bernoulli random variable with mean θ k , k = 1, 2, 3. We set θ 1 = 0.4, θ 2 = 0.35 and θ 3 = 0.25. The continuous test results T and A are generated from the following conditional models where µ k = (2k, k) . We consider three different values for Λ, specifically giving rise to a correlation between T and A equal to 0.36, 0.69 and 0.84, respectively.
In this scenario -and also in the next one-we consider six pairs for cut points (c 1 , c 2 ), i.e., (2,4), (2,5), (2,7), (4,5), (4,7) and (5,7). Since the conditional distribution of T given D k is the normal distribution, the true values of TCF's are obtained as where σ T |D denotes the entry in the 1-st row and 1-st column of Λ and φ(·) and Φ(·) are the density function and the cumulative distribution function of the standard normal random variable, respectively. Under our data-generating process, the true conditional disease model is a multinomial logistic model The verification status V is generated by the following model This choice corresponds to a verification rate of about 0.65. In this study, the FI, MSI, IPW and SPE estimators are computed under correct working models for both the disease and the verification status. Therefore, in particular, the conditional verification probabilities π i are estimated from a logistic model for V given T and A. Tables 1-3 and Tables 9-14 in Appendix B show Monte Carlo means, Monte Carlo standard deviations (MC.sd), the square roots of the variance estimated via asymptotic results (asy.sd) and bootstrap standard deviations (boot.sd) of TCF 1 , TCF 2 and TCF 3 . Here and in the following bootstrap estimates are obtained from 250 bootstrap replications. Overall, the estimators FI, MSI, IPW and SPE behave similarly in this scenario, with the IPW estimator showing a slightly bigger standard deviation. Simulation results, in this and in the following scenarios, also show that, excluding the SPE approach, bootstrap estimates of standard deviations are generally more accurate than estimates obtained via asymptotic theory.

Study 2
In this study, the true disease status D and the test results T and A are generated in the same way as in the first scenario. The true conditional verification process π, instead, is chosen to be the following function of T and A where t (0.8) and a (0.8) correspond to the 80-th percentile of distribution of T and A, respectively. In this case, the verification probabilities are 1 for subjects with T > t (0.8) and A > a (0.8) ; 0.7 for subjects with T ≤ t (0.8) and A > a (0.8) ; 0.65 for subjects with T > t (0.8) and A ≤ a (0.8) ; 0.35 otherwise. In our setting, the verification rate is approximately 0.48.
The aim in this scenario is to evaluate the behavior of the estimators, in particular that of IPW and SPE, under misspecification of the verification process. Therefore,π i is estimated from a logistic regression model with V as the Table 1. Simulation results from 5000 replications when both models for ρ k and π are correctly specified (Study 1) and the first value of Λ is considered. "True" denotes the true parameter value. Sample size = 250.  Table 2. Simulation results from 5000 replications when both models for ρ k and π are correctly specified (Study 1) and the second value of Λ is considered. "True" denotes the true parameter value. Sample size = 250.  Table 3. Simulation results from 5000 replications when both models for ρ k and π are correctly specified (Study 1) and the third value of Λ is considered. "True" denotes the true parameter value. Sample size = 250. response and T as predictor, whileρ ki is still obtained from the multinomial logistic model (similarly to the first scenario). Clearly, the model used for verification status is misspecified. Table 4 and Tables 15-16 in Appendix B, show Monte Carlo means and standard deviations for the estimators of the true class fractions TCF 1 , TCF 2 and TCF 3 . Moreover, estimated standard deviations (via asymptotic theory) and bootstrap standard deviations are also presented. The results clearly show the effect of misspecification on IPW estimates, despite the high sample size. In particular, in terms of bias, the IPW method performs almost alway poorly, with high distortion in some cases (values highlighted in bold). On the other hand, the SPE estimator behaves well, due to its doubly robustness property.

Study 3
Starting from two independent random variables Z 1 ∼ N (0, 0.5) and Z 2 ∼ N (0, 0.5), the true conditional disease D is generated by a trinomial random vector (D 1 , D 2 , D 3 ) such that Here, h 1 and h 2 are two thresholds. We choose h 1 and h 2 to make θ 1 = 0.4 and θ 3 = 0.25. The continuous test results T and the covariate A are generated to be related to D through Z 1 and Z 2 . More precisely, where ε 1 and ε 2 are two independent normal random variables with mean 0 and the common variance 0.25, independent also from Z 1 and Z 2 . The verification status V is simulated by the following logistic model   The aim in this scenario is to evaluate the behavior of the estimators, in particular that of FI, MSI and SPE, when the estimatorsρ ki are inconsistent, whereasπ i are consistent. Therefore,ρ ki are obtained from a multinomial logistic regression model with D = (D 1 , D 2 , D 3 ) as the response and T as predictor. As the correct process is a multinomial probit process, the chosen model is clearly misspecified. To estimate the conditional verification process π, we use a generalized linear model for V given T and A with logit link. Clearly, this model is correctly specified. Table 5 shows Monte Carlo means and standard deviations for the estimators of the true class fractions TCF 1 , TCF 2 and TCF 3 . Moreover, estimated standard deviations (via asymptotic theory) and bootstrap standard deviations are also presented. The results clearly show the effect of misspecification on FI and MSI estimates, despite the high sample size. In particular, in terms of bias, the two methods performs almost alway poorly, with high distortion in some cases (values highlighted in bold). Again, the SPE estimator behaves well due to its doubly robustness property.

Study 4
We generate data exactly as in Study 3. The aim in this scenario is to evaluate the behavior of FI, MSI, IPW and SPE estimators when the estimatesρ ki and π i are inconsistent. Therefore,ρ ki are obtained from a multinomial logistic regression model with D = (D 1 , D 2 , D 3 ) as the response and T as predictor. This model is misspecified. To estimate the conditional verification disease π, we use a generalized linear model for V given T and A 2/3 with logit link. Clearly, this model is misspecified. Table 6 shows Monte Carlo means and standard deviations for the estimators of the true class fractions TCF 1 , TCF 2 and TCF 3 . Moreover, estimated standard deviations (via asymptotic theory) and bootstrap standard deviations are also presented. The results clearly show that when both the disease and verification models are misspecified, all estimators may behave poorly, with high distortion in some cases (values highlighted in bold).

Two illustrations
To illustrate the application of the proposed methods, in this section we consider two quite distinct real data examples, both dealing with epithelial ovarian cancer (EOC). In the first illustration, we consider diagnosis of EOC in one of three classes i.e., benign disease, early stage and late stage cancer on the basis of a well known tumor marker, i.e., CA125. We make use of a publicly available dataset in which the disease status is known for all subjects. Then, we simulate a verification process and apply our estimators. This allows to compare results obtained in the complete data case with those obtained in the incomplete data case after correcting for verification bias. In the second illustration, we focus on prediction of patients' response to chemotherapy, classified as sensitive, partially Table 5. Simulation results from 5000 replications when only model for ρ k is misspecified (Study 3). "True" indicates the true parameter value.
Sample size = 1000.  Table 6. Simulation results from 5000 replications when both models for ρ k and π are misspecified (Study 4). "True" indicates the true parameter value. Sample size = 1000. sensitive and resistant. Data are available for late stage EOC patients. In this second example, the response is missing for about 25% of the subjects involved in the study.

Diagnosis of EOC
We use data from the Pre-PLCO Phase II Dataset from the SPORE/Early Detection Network/Prostate, Lung, Colon, and Ovarian Cancer Ovarian Validation Study. The study protocol and data are publicly available at the address 1 , along with descriptions of the study aims and analytic methods. In particular, we consider the following three classes of EOC, i.e., benign disease, early stage (I and II) and late stage (III and IV) cancer, and 2 of the 59 available biomarkers, i.e. CA125 and CA153, measured at Harvard laboratories. In detail, we use CA125 as the test T s and CA153 as a covariate. Reasons for using CA153 as a covariate come from the medical literature that suggests that the concomitant measurement of CA153 with CA125 could be advantageous in the pre-operative discrimination of benign and malignant ovarian tumors. In addition, age of patients is also considered. Here, we have 134 patients with benign disease, 67 early stage samples and 77 late stage samples.
To mimic verification bias, a subset of the complete dataset is constructed using the test T and the vector A = (A 1 , A 2 ) of the two covariates, namely the marker CA153 (A 1 ) and age (A 2 ). In this subset, T and A are known for all samples, but the true status (benign, early stage or late stage) is available only for some samples, that we select according to the following mechanism. We select all samples having a value for both T and A above their respective medians, i.e. To apply FI, MSI and SPE estimators, we employ a multinomial logistic model to estimate ρ ki = Pr(D ki = 1|T i , A 1i , A 2i ), where D k = 1, k = 1, 2, 3 refers to benign, early and late, respectively. On the other hand, SPE and IPW methods require estimates of π i = Pr(V i = 1|T i , A 1i , A 2i ). For estimating such quantities, we make use, firstly, of a correctly specified model, i.e., a linear threshold regression model and, then, of a misspecified model, i.e., a logistic model.
The estimated ROC surfaces for the test T (CA125) obtained by applying the proposed methods are shown in Figure 1. For the sake of comparison, we also produced the estimate of the ROC surface with full data (Full estimate), reported in Appendix D, Figure 3. In Appendix D, Figure 4 and Figure 5, we also give the projections of the bias-corrected estimated ROC surfaces to the planes defined by TCF 1 versus TCF 2 , TCF 1 versus TCF 3 and TCF 2 versus TCF 3 , i.e., the ROC curves between classes 1 and 2, classes 1 and 3, classes 2 and 3. Such plots are obtained by setting TCF 3 = 0, TCF 2 = 0 and TCF 1 = 0, respectively. For example, the estimated ROC curves between classes 1 and 2 are defined as the set of points ( TCF 1, * (c 1 ), TCF 2, * (c 1 , +∞)), c 1 ∈ R , that is, we ignore the cut point c 2 . This is a construction equivalent to the most popular representation of an estimated ROC curve, which usually depicts TCF 2 versus 1 − TCF 1 .
Compared with the Full estimate, all the bias-corrected methods discussed in the paper seem to behave well, yielding reasonable estimates of the ROC surface and the ROC curves. Moreover, Table 7 shows the VUS estimates obtained with the FI, MSI, IPW and SPE estimators (both for the correctly specified and misspecified model for the verification process), along with approximated 95% confidence intervals. Inspection of the table highlights that estimators with better performance are, overall, IPW and SPE. This might be an indication that the multinomial logistic model chosen for the disease process might not be fully adequate in this case. Table 7 Bias-corrected (and Full) estimated VUS for the marker CA125, assessing the classification into three classes of EOC: benign disease, early stage (I and II) and late stage (III and IV). For the SPE and IPW approaches, results for the correctly specified and misspecified model for the verification process are given.

Prediction of response to chemotherapy
A major challenge in advanced-stage EOC is prediction of response to platinum chemotherapy on the basis of markers measured at molecular level. Indeed, several genomic profiling studies have shown that gene expressions relate with different aspects of ovarian cancer (tumor subtype, stage, grade, prognosis, and therapy resistance), although the measured association is usually very low. Here, we consider a cohort of 99 snap-frozen tumor biopsies taken from a frozen tissue bank, located at the Department of Oncology, IRCCS-Mario Negri Institute, Milano, Italy. Biopsies were collected from late stage (III and IV) cancer patients who underwent surgery at the Obstetrics and Gynaecology Department, San Gerardo Hospital, Monza, Italy between September 1992 and March 2010. For 75 of the 99 subjects, the three-class response to platinum therapy is available, being 31 patients sensitive, 11 partially sensitive and 33 resistant. For all the subjects, we consider as test predictive of the response to therapy the marker (T ) resulting as a given linear combination of the logarithm of the expression levels of six genes, i.e., Entrez Gene ID: 23513, 7284, 128408, 56996, 2969, 6170. As a covariate, we consider age at onset of patients.
The estimated ROC surfaces for T obtained by applying the proposed methods are shown in Figure 2. FI, MSI, IPW and SPE estimators are based on the multinomial logistic model for the disease process and/or the logistic model for the verification process. Table 8 shows the corresponding VUS estimates, along with the naïve estimate. The table also gives the estimated standard deviations (via asymptotic theory), bootstrap standard deviations and approximated 95% confidence intervals. Despite the limited sample size, the results show that T has some ability to predict response to therapy for late stage EOC patients.

Conclusions
This paper proposed several verification bias-corrected estimators of the ROC surface (and the VUS) of a continuous diagnostic test. These estimators, which can be considered an extension to the three-class case of estimators in [3], are partially parametric in that they require the choice of a parametric model for the estimation of the disease process, or of the verification process, or of both processes. In some cases, wrong specifications of such models can visibly affect the produced estimates, as highlighted also by our results in the simulation studies. To avoid misspecification problems, one possibility could be to resort on fully nonparametric estimators. This topic will be the focus of our future work.

Appendix A: Asymptotic distribution results
In this section, we discuss validity of conditions (C1), (C2) and (C3) for the proposed estimators. The discussion covers first the elements of the estimating functions corresponding to the parameter τ . Then, we pass to the elements of the estimating functions corresponding to the parameters θ 1 , θ 2 , θ 11 , β 12 , β 22 , β 23 , specializing the discussion to the various methods. Finally, we give the explicit form of the variance-covariance matrix in Theorem 1. Recall that α 0 denotes the true value of α. Parameter τ . We noted in Section 3 that estimators FI, MSI and SPE require a multinomial logistic or probit regression model to estimate the disease probabilities ρ ki = Pr(D ki = 1|T i , A i ) with k = 1, 2, 3. In the following, we adopt the multinomial logistic model, but arguments similar to those given below hold also for the multinomial probit model, despite the rather more complex algebra (see Daganzo [5], Chapter 3, as a general reference). The estimating function for the nuisance parameter τ ≡ τ ρ = (τ ρ1 , τ ρ2 ) , is obtained as the first derivative of the log likelihood function. With the multi-nomial logistic model, we get Under assumption (A2), condition (C1) trivially holds. Moreover, we get and ∂ ∂θs g τρ i (α) = 0, ∂ ∂β jk g τρ i (α) = 0 for each s, j, k. The second-order partial derivatives can be easily derived. Hence, for G τρ (α), condition (C2) holds and, by assumption (A3)-(A5) condition (C3) also holds.
The IPW and SPE estimators require estimates of π i = Pr(V i = 1|T i , A i ). With T and A as covariates, we can use the logistic or probit models to this end. In these cases, conditions (C1)-(C3) are satisfied by the estimating functions where φ(·) and Φ(·) are the density function and the cumulative distribution function of the standard normal random variable, respectively. Recall that τ π is the component of nuisance parameter τ corresponding the model for estimating π. The first-order derivatives are FI and MSI estimators. According to equations (3.4), (3.5) and (3.6), the estimating functions G θs * (α) for FI and MSI estimators can be presented in the form for j = 1, 2 and k = 1, 2, 3. Here, the notation IE means "imputation estimator". The estimating function corresponds to the FI estimator if m = 0, to the MSI estimator if m = 1. Using the conditional expectation and the assumption (A1), Similarly, we compute the expected value of the estimating function components g β jk i,IE (α 0 ) as follows Hence, under assumption (A2), condition (C1) holds for G θs IE (α) and G β jk IE (α). We now verify conditions (C2) and (C3). The partial derivative of G θs IE (α) with respect to β jk equals 0 for all j, k. Moreover, For each l = 1, 2 and s = 1, 2, we have Recall that, under the multinomial logistic model, and for the parameter β jk is We show that these estimating functions are unbiased under assumptions (A1) and (A2). In fact, we get Therefore, condition (C1) holds for G θs IPW (α) and G β jk IPW (α), all s, j, k. Next, we obtain the partial derivatives and, for the logistic model (used to estimate the verification process) The computation of the second-order derivatives is similar and the results imply that the conditions (C2) and (C3) hold.

SPE estimator. Recall that
for j = 1, 2 and k = 1, 2, 3. Under assumption (A1), E g θ k i,SPE (α 0 ) equals Therefore, condition (C1) holds for G θs SPE (α) and G β jk SPE (α), all s, j, k. Next, we obtain the partial derivatives and the partial derivative with respect to τ ρ ≡ (τ ρ1 , τ ρ2 ) where ∂ ∂τρ l ρ si is given in (A.2). The partial derivative with respect to τ π , are when the logistic model is used for the verification process. If the probit model is used, we have Also in this case, computation of the second-order partial derivatives develops similarly and the results imply that the conditions (C2) and (C3) hold. Asymptotic covariance matrix. Recall that the asymptotic covariance matrix of TCF estimators is obtained as It is easy to derive that The elements g i, * (α) of the estimating functions G * (α) are given in the previous paragraphs. Now, we derive the explicit form for ∂ ∂α g i, * (α).
First, we consider the class of imputation estimators. We get with j, l, s = 1, 2, k = 1, 2, 3 and i = 1, . . . , n (see (A.1) and (A.2) for the multinomial logistic modeling of the disease process). Thus, Then, we consider the inverse probability weighted estimators. Let Note that these quantities change according to the model, logit or probit, chosen for the verification process. We obtain Finally, we consider the SPE estimators. We have and C lsi and C i are defined above. Therefore

Appendix B: Simulation results of Study 1 and Study 2
In this section, we present results of simulations in Study 1 and Study 2. Tables 9-14 show simulation results of Study 1 for sample sizes equal to 500 and 1000, respectively. The results of Study 2 are presented in Tables 15 and 16, corresponding to the first and third value of Λ, respectively.

Appendix C: Simulation results of VUS estimators
In this section, we give some simulation results concerning the estimators of the VUS presented in Subsection 3.6. The disease D is generated by a trinomial random vector (D 1 , D 2 , D 3 ), such that D k is a Bernoulli random variable with mean θ k , k = 1, 2, 3. We set θ 1 = 0.4, θ 2 = 0.35 and θ 3 = 0.25. The pairs T, A are generated from the following conditional models where µ k = k(µ T , µ A ) . We consider three values of Λ, The true VUS value is equal to 0.9472 for the first value of Λ and (µ T , µ A ) = (3, 2); is equal to 0.7175 for the second value of Λ and (µ T , µ A ) = (2, 1); is equal to 0.4778 for the third value of Λ and (µ T , µ A ) = (2, 1). We simulate the verification status V by using the following model The parameters (δ 0 , δ 1 , δ 2 ) are fixed equal to (1, −2.87, 4.06) when the first value of Λ is considered, and equal to (1, −2.2, 4) otherwise. These choices give rise to a verification rate of about 0.52. Under our data-generating setting, the disease process follows a multinomial logistic model. We consider two sample Table 9. Simulation results from 5000 replications when both models for ρ k and π are correctly specified (Study 1) and the first value of Λ is considered. "True" denotes the true parameter value. Sample size = 500.  Table 10. Simulation results from 5000 replications when both models for ρ k and π are correctly specified (Study 1) and the first value of Λ is considered. "True" denotes the true parameter value. Sample size = 1000.  Table 11. Simulation results from 5000 replications when both models for ρ k and π are correctly specified (Study 1) and the second value of Λ is considered. "True" denotes the true parameter value. Sample size = 500.  Table 12. Simulation results from 5000 replications when both models for ρ k and π are correctly specified (Study 1) and the second value of Λ is considered. "True" denotes the true parameter value. Sample size = 1000.  Table 13. Simulation results from 5000 replications when both models for ρ k and π are correctly specified (Study 1) and the third value of Λ is considered. "True" denotes the true parameter value. Sample size = 500.   Table 16. Simulation results from 5000 replications when the model for the verification process is misspecified (Study 2) and the third value of Λ is used. "True" indicates the true parameter value. Sample size = 1000.  Tables 17-19 show Monte Carlo means, Monte Carlo standard deviations (MC.sd), the square roots of the variances estimated via asymptotic results (Asy.sd) and bootstrap standard deviations (Boot.sd) ofμ. In this section, we provide some extra plots related to the analysis of the first dataset used in the main paper. In particular, in Figure 3 we present the estimate of the ROC surface for the test CA125 based on the full data set. Figure 4 and Figure 5 present the projections of the estimated ROC surfaces to the planes defined by TCF 1 versus TCF 2 , TCF 1 versus TCF 3 and TCF 2 versus TCF 3 , i.e., the ROC curves between classes 1 and 2, classes 1 and 3, classes 2 and 3. For the IPW and SPE methods, to estimate the verification process, we make use, firstly, of a correctly specified model, i.e., a linear threshold regression model ( Figure 4) and, then, of a misspecified model, i.e., a logistic model  ( Figure 5). Finally, as an example, Figure 6 plots confidence regions for the pair (TCF 1 (c 1 ), TCF 2 (c 1 , +∞)) at three values of c 1 , when the MSI approach is used. An approximated 95% elliptical confidence region is obtained in a standard way as the set of points  the quantityΣ 12 is the estimated asymptotic covariance matrix of TCF 1,MSI (c 1 ), TCF 2,MSI (c 1 , +∞) and χ 2 0.95,2 is the 95-th quantile of a Chi-square distribution with 2 degree of freedom. In the plot, the black solid line represents the full data estimated ROC curve, whereas the blue dashed line is the bias-corrected estimated ROC curve.