Estimating the Area under the ROC Curve with Modified Profile Likelihoods

Receiver operating characteristic (ROC) curves are frequently used to study the discriminating ability of a given characteristic. The area under the ROC curve (AUC) is a widely used measure of the statistical accuracy of continuous markers for diagnostic tests, and has the advantage of providing a single summary index of the overall performance of the test. Recent studies have highlighted critical issues with traditional point and interval estimates of the AUC, especially for small samples, more complex models, unbalanced samples, or values near the boundary of the parameter space, i.e., when the AUC approaches 0.5 or 1. Parametric models for the AUC have been shown to be powerful when the underlying distributional assumptions are correctly specified. However, in the above circumstances parametric inference may not be accurate, sometimes yielding misleading conclusions. The objective of the paper is to propose an alternative inferential approach based on modified profile likelihoods, which provides more accurate statistical results in any parametric setting, including the above circumstances. The proposed method is illustrated for the binormal model, but can potentially be used in any other complex model and for any other parametric distribution. We report simulation studies to show the improved performance of the proposed approach when compared to classical first-order likelihood theory. An application to real-life data in a small-sample setting is also discussed, to provide practical guidelines.


Introduction
Receiver operating characteristic (ROC) curves are frequently used to study the ability of a given characteristic to discriminate between and classify the units under study. One of the most popular summary measures based on the ROC curve is the area under the curve (AUC) (Krzanowski & Hand, 2009), which was originally developed in radar signal detection (Bamber, 1975) and has since been used in a broad range of applied contexts such as radiology, psychiatry, reliability theory, industrial inspection systems and earthquake resistance.
The AUC is also widely applied in medicine as a measure of the statistical accuracy of continuous markers for diagnostic tests (Faraggi & Reiser, 2002; Pepe, 2003; Zhou, McClish, & Obuchowski, 2009). A diagnostic test based on a continuous marker usually provides a response about the possible clinical status of subjects, identifying them as diseased (test positive) or non-diseased (test negative) patients. Such a test requires that a certain cut-off point $t$ is chosen. The probabilities that the test correctly classifies subjects as diseased and non-diseased are called, respectively, the sensitivity and specificity of the test associated with $t$.
To formalize the problem more generally, denote with $\bar{D}$ and $D$, respectively, the true negative and true positive status of units in the population of interest (e.g., the real condition of being non-diseased or diseased). Let us define two continuous random variables $Y$ and $X$ that describe a continuous characteristic of interest in the two distinct groups $\bar{D}$ and $D$, respectively. Let $F_Y(\cdot)$ and $F_X(\cdot)$ be the corresponding cumulative distribution functions, and $f_Y(t)$ and $f_X(t)$ the associated probability density functions. Consider a classification rule based on a certain cut-off point $t$ (e.g., a diagnostic test that classifies subjects as 'non-diseased' if the observed value of the characteristic is below $t$, and as 'diseased' if the observed value is above $t$). The probability that a unit with true status $\bar{D}$ is correctly classified by the diagnostic test (test negative) is called 'specificity' and defined as $p(t) = F_Y(t)$, while the probability that a unit with true status $D$ is correctly classified by the test (test positive) is called 'sensitivity' and defined as $q(t) = 1 - F_X(t)$. Sensitivity and specificity vary when different choices of $t$ are made over the continuous scale of the characteristic. The ROC curve is then obtained by plotting the sensitivity $q(t)$ against $1 - p(t)$ for all possible values of $t$.
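The definitions of specificity, sensitivity and the ROC curve can be illustrated with a minimal empirical sketch. The function names and the two small samples below are hypothetical and used only for illustration; the empirical proportions stand in for the unknown $F_Y$ and $F_X$.

```python
def sens_spec(y_nondiseased, x_diseased, t):
    """Empirical specificity p(t) and sensitivity q(t) at cut-off t.

    A unit tests negative if its value is <= t and positive if it is > t.
    """
    p = sum(v <= t for v in y_nondiseased) / len(y_nondiseased)  # p(t) = F_Y(t)
    q = sum(v > t for v in x_diseased) / len(x_diseased)         # q(t) = 1 - F_X(t)
    return p, q


def roc_points(y_nondiseased, x_diseased):
    """Empirical ROC points (1 - p(t), q(t)) over all observed cut-offs."""
    cutoffs = sorted(set(y_nondiseased) | set(x_diseased))
    points = {(0.0, 0.0), (1.0, 1.0)}  # end points of every ROC curve
    for t in cutoffs:
        p, q = sens_spec(y_nondiseased, x_diseased, t)
        points.add((1.0 - p, q))
    return sorted(points)


# Hypothetical marker values (not data from the paper)
y = [1.2, 2.0, 2.7, 3.1, 3.9]  # group with true status "non-diseased"
x = [2.5, 3.4, 4.0, 4.8, 5.5]  # group with true status "diseased"
print(sens_spec(y, x, 3.0))
print(roc_points(y, x))
```

Moving the cut-off $t$ upwards raises the specificity and lowers the sensitivity, which is the trade-off traced out by the curve.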
The AUC has the advantage of providing a single index that summarizes the overall performance of the test (or rule) based on the continuous characteristic, rather than an entire curve, and it is particularly useful for comparisons across different populations or different tests. The aim is often to minimize the error $1 - q(t)$ committed by the test, and simultaneously increase the efficacy in discovering units from the $D$ population. Therefore, values of the AUC close to 1 indicate very high accuracy of the test, while very low accuracy corresponds to values closer to 0.5. Bamber (1975) showed that the AUC based on continuous distributions is a probabilistic measure that is equal to
$$A = P(Y < X) = \int_{-\infty}^{\infty} q(t)\, f_Y(t)\, dt = \int_0^1 q \, d\bar{p}, \qquad (1)$$
where $\bar{p}(t) = 1 - p(t)$. The quantity $A$ can also be interpreted as the probability that, in a randomly selected pair of $\bar{D}$ and $D$ subjects, the test value is higher for the subject from the $D$ population. In more general contexts, the AUC is used as a measure of difference between distributions (Wolfe & Hogg, 1971). It is often used in engineering and reliability theory under the name of stress-strength model (Johnson, 1988; Kotz, Lumelskii, & Pensky, 2003). When $X$ represents the strength of a certain component and $Y$ is the applied stress, then $A$ measures the probability that a component would not fail if it is put under a systematic stress.
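The probabilistic interpretation $A = P(Y < X)$ can be checked on data by counting, over all pairs, how often the value from the $D$ group exceeds the value from the other group. The sketch below is a simple nonparametric proportion, not the parametric estimator studied in this paper; the function name and data are hypothetical, and ties are counted as one half by assumption.

```python
def empirical_auc(y, x):
    """Proportion of (y_i, x_j) pairs with y_i < x_j; ties count 1/2."""
    wins = sum((xj > yi) + 0.5 * (xj == yi) for yi in y for xj in x)
    return wins / (len(y) * len(x))


# Hypothetical marker values (not data from the paper)
y = [1.2, 2.0, 2.7, 3.1, 3.9]  # Y: non-diseased group
x = [2.5, 3.4, 4.0, 4.8, 5.5]  # X: diseased group
print(empirical_auc(y, x))
```

A value of 1 corresponds to perfectly separated samples and a value of 0.5 to no discrimination.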
Inference for the AUC has been studied under different modeling assumptions, following mainly a nonparametric, a parametric or a Bayesian approach. In practical applications, it has been suggested that all these approaches are useful and that comparing their results may provide additional information on the consistency among them. Moreover, the AUC has also been investigated under various relevant settings, such as the presence of explanatory variables, measurement errors and clustered data (Pardo-Fernández, Rodríguez-Álvarez, Van Keilegom, et al., 2013; Reiser, 2000; Zou, Carlsson, & Yu, 2012).
Recently, special attention has been devoted to interval estimation of $A$, and some related critical issues have been widely discussed in the literature (Feng, Cortese, & Baumgartner, 2015). Some of these issues concern the poor performance of confidence intervals for the AUC, especially for small samples, more complex models, unbalanced samples or values near the boundary of the parameter space (i.e., $A$ approaching 0.5 or 1). In particular, classical parametric approaches have the general problem that the smaller the sample size and the higher the number of parameters, the less accurate they are in interval and point estimation. On the other hand, nonparametric methods also tend to perform poorly when the sample size is small. Moreover, in general the parametric methods seem to outperform the nonparametric ones when the underlying distributional assumptions are correctly specified, and in the presence of samples that show a nearly perfect separation between subjects in the two groups $\bar{D}$ and $D$ (Obuchowski & Lieber, 2002).
In the current paper we restrict our attention to the parametric framework for inference on the AUC. For the binormal model, where $Y$ and $X$ are assumed to follow normal distributions with different means and variances, Reiser and Guttman (1986) and Reiser and Faraggi (1997) proposed a method for the construction of confidence intervals based on a standard approximate Student's $t$ solution. Although their procedure appears to work well also for unbalanced or small samples, it cannot be extended to different parametric models for $Y$ and $X$, such as Weibull, Gamma or any other more general parametric distributions not in the location-scale family, or to, e.g., mixture models in the presence of bimodal distributions. Moreover, it is not clear how to handle the presence of explanatory variables or clustered data. Classical asymptotic methods based on parametric likelihood theory can easily be applied for constructing confidence intervals or tests of hypothesis for the AUC under any type of assumed parametric model. However, it is well known from general likelihood theory that the resulting Wald-type statistic and likelihood ratio statistic do not perform well in all situations, especially in terms of the coverage probability of 95% confidence intervals (Severini, 2000). A recent parametric approach was based on higher-order asymptotic likelihood theory (Cortese & Ventura, 2013). However, it has been shown that this method has some limitations: it may easily fail in the presence of very small or unbalanced samples or when the samples produce a nearly perfect observed discrimination, and it is computationally unstable near the maximum likelihood estimate of $A$. Some of these problems have been underlined in Feng et al. (2015).
To overcome these drawbacks, the current paper addresses the problem of inaccurate parametric inference in the case of small or unbalanced sample sizes, with special attention to confidence intervals and tests of hypothesis for the AUC. The problem of correct inference near the limit values 0.5 and 1, which represent the situations of, respectively, lowest and maximal accuracy of the continuous characteristic under study, is also investigated. With these objectives in mind, we present inference for the AUC based on a modified version of the profile likelihood function, denoted in the literature as 'modified profile likelihood' (Cox & Reid, 1992). In this setting, the parameter identifying the AUC is treated as the parameter of interest, whereas the remaining parameters related to the underlying parametric distributions of $Y$ and $X$ are treated as nuisance parameters. The proposed approach is very general, applicable to any type of parametric distributional assumption and to any data setting, such as clustered data or additional data on explanatory variables (Sartori, 2003).
It is well documented that standard likelihood inference for a parameter of interest can be misleading in the presence of many nuisance parameters relative to the sample size, or for small samples. The classical approach for making inference on a parameter of interest in the presence of nuisance parameters is based on profile likelihoods. The profile likelihood function is the likelihood in which the nuisance parameters are maximized out, for every fixed value of the parameter of interest. This likelihood is not a proper likelihood and, therefore, the derived score function is biased (Severini, 2000). Consequently, this bias may increase with the dimension of the nuisance parameter and produce inaccurate estimation. The modified profile likelihoods are an interesting alternative to the profile likelihoods, since they correct for the presence of nuisance parameters (Cox & Barndorff-Nielsen, 1994; Cox & Reid, 1992), showing an improved performance.
The aim of the paper is to investigate the performance of modified profile likelihoods for inference on the AUC based on a general parametric model. The inferential procedure is presented in the general setting. Then, the methodological aspects are illustrated for the binormal model. In order to show how to obtain point estimates, confidence intervals and tests of hypothesis based on the modified profile likelihood, we consider an application to real data in a small-sample setting.
The paper is organized as follows. Section 2 provides the general notation and introduces the inferential problem in parametric models for the AUC. Here the classical approach and the proposed approach based on modified profile likelihoods are described. In Section 3, the theory is applied to the specific case of a binormal model and computations are illustrated. Section 4 reports simulation studies comparing the different methods and Section 5 shows the application to real-life data on imaging for detecting brain tumor. Finally, conclusions and future directions are given in Section 6.

Notation and the Inferential Problem
In this section we consider a generic parametric model for the AUC, where the $Y$ and $X$ components are assumed to follow the parametric distributions $F_Y(t; \theta_Y)$ and $F_X(t; \theta_X)$, identified by the finite-dimensional parameter vectors $\theta_Y$ and $\theta_X$, respectively. Let $\theta = (\theta_Y, \theta_X)$ be the entire parameter vector of the model, of dimension $p$, with $\theta \in \Theta \subseteq \mathbb{R}^p$. The AUC is then obtained as
$$A = A(\theta) = \int_{-\infty}^{\infty} F_Y(t; \theta_Y)\, f_X(t; \theta_X)\, dt = g(\theta), \qquad (2)$$
where the functional relation between $A$ and $(F_Y(\cdot), F_X(\cdot))$ is denoted by $g(\cdot)$, for ease of notation.
With the aim of making inference on the AUC, let $y = (y_1, \ldots, y_{n_1})$ be a random sample of size $n_1$ of i.i.d. observations drawn from $Y$, and $x = (x_1, \ldots, x_{n_2})$ be a random sample of size $n_2$ of i.i.d. observations drawn from $X$. Assume also that $Y$ and $X$ are independent. Let $f_Y(y; \theta_Y)$ and $f_X(x; \theta_X)$ be the probability density functions associated with $Y$ and $X$, respectively. The log-likelihood function for $\theta$ is defined as
$$\ell(\theta) = \sum_{i=1}^{n_1} \log f_Y(y_i; \theta_Y) + \sum_{j=1}^{n_2} \log f_X(x_j; \theta_X),$$
and, under broad conditions, $\hat{\theta}$ is the maximum likelihood estimator (MLE) obtained as the unique solution to the score equation $\ell_\theta(\theta) = \partial \ell(\theta)/\partial \theta = 0$. The MLE of the AUC can be directly obtained as $\hat{A} = g(\hat{\theta})$, due to the invariance property of the likelihood.
In the proposed approach, we intend to treat the parameter $A$ as a scalar parameter of interest, while the remaining parameters that identify the parametric distributions of $Y$ and $X$ are considered as nuisance parameters. Then, the original model needs to be reparameterized so that $\psi = \psi(\theta) = A$ is the parameter of interest, as defined in (2), and $\lambda = \lambda(\theta)$ is a nuisance parameter vector of length $(p - 1)$, obtained by a transformation of the original parameter $\theta$. Therefore, we can write the log-likelihood function for the new parameters $(\psi, \lambda)$ as $\ell(\psi, \lambda) = \ell(\theta(\psi, \lambda))$. The MLEs $\hat{A} = \hat{\psi}$ and $\hat{\lambda}$ are the unique solutions to, respectively, the score equations $\ell_\psi(\psi, \lambda) = \partial \ell(\psi, \lambda)/\partial \psi = 0$ and $\ell_\lambda(\psi, \lambda) = \partial \ell(\psi, \lambda)/\partial \lambda = 0$.

Inference Based on the Profile Likelihood
From $\ell(\psi, \lambda)$, classical likelihood inference for the parameter of interest $\psi = A$ in the presence of nuisance parameters can be based on profile likelihood procedures, which eliminate the nuisance parameter $\lambda$ by replacing it with the constrained MLE, $\hat{\lambda}_\psi$, obtained by maximizing $\ell(\psi, \lambda)$ with respect to $\lambda$ for fixed $\psi$. This method is based on the profile log-likelihood $\ell_p(\psi) = \ell(\psi, \hat{\lambda}_\psi)$, which can then be easily maximized to get the estimated AUC, $\hat{\psi} = \hat{A}$. The related standard error can be computed as $(J_p(\hat{\psi}))^{-1/2}$, where $J_p(\psi) = -\partial^2 \ell_p(\psi)/\partial \psi^2$ is the corresponding profile observed Fisher information.
Confidence intervals and tests of hypothesis can rely on first-order approximations. Specifically, inference on $A$ can be based on the Wald statistic
$$W_p(\psi) = (\hat{\psi} - \psi)\, (J_p(\hat{\psi}))^{1/2}, \qquad (3)$$
or on the signed log-likelihood ratio statistic
$$R_p(\psi) = \mathrm{sign}(\hat{\psi} - \psi)\, \sqrt{2\,\{\ell_p(\hat{\psi}) - \ell_p(\psi)\}}, \qquad (4)$$
which have asymptotic standard normal distributions.
A $100(1 - \alpha)\%$ confidence interval for $\psi$ based on the Wald statistic is given as
$$\left[\hat{\psi} - z_{1-\alpha/2}\, (J_p(\hat{\psi}))^{-1/2},\ \hat{\psi} + z_{1-\alpha/2}\, (J_p(\hat{\psi}))^{-1/2}\right],$$
where $z_{1-\alpha/2}$ is the $(1 - \alpha/2)$-quantile of the standard normal distribution. Alternatively, a $100(1 - \alpha)\%$ confidence interval for $\psi$ can be constructed from the $R_p(\psi)$ statistic, and can be written as $\{\psi : |R_p(\psi)| \le z_{1-\alpha/2}\}$. The Wald-type confidence interval is often preferred because it is very simple and immediate to compute, as compared to the likelihood ratio confidence interval, which typically requires a numerical solution. However, it is well known that inferential procedures based on the Wald statistic generally perform poorly and are less accurate than procedures based on the signed log-likelihood ratio statistic, especially at the boundaries of the parameter space (Severini, 2000).
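The mechanics of the two intervals can be sketched on a deliberately simple toy model, chosen here only because its profile log-likelihood has a closed form: a single normal sample with the mean as parameter of interest and the variance as nuisance parameter. This is an illustrative assumption and not the AUC model of the paper; all function names are hypothetical. The likelihood ratio limits are found by bisection, which mirrors the numerical solution mentioned above.

```python
import math
from statistics import NormalDist


def profile_loglik(mu, y):
    """Profile log-likelihood of the mean: the variance is maximised out
    at mean((y - mu)^2), up to an additive constant."""
    n = len(y)
    s2_mu = sum((v - mu) ** 2 for v in y) / n
    return -0.5 * n * math.log(s2_mu)


def signed_lr(mu, y):
    """Signed log-likelihood ratio statistic R_p(mu)."""
    mu_hat = sum(y) / len(y)
    r2 = 2.0 * (profile_loglik(mu_hat, y) - profile_loglik(mu, y))
    return math.copysign(math.sqrt(max(r2, 0.0)), mu_hat - mu)


def wald_ci(y, alpha=0.05):
    """Wald interval mu_hat -/+ z * J_p(mu_hat)^(-1/2), with J_p = n / s2_hat."""
    n = len(y)
    mu_hat = sum(y) / n
    s2_hat = sum((v - mu_hat) ** 2 for v in y) / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = math.sqrt(s2_hat / n)
    return mu_hat - z * se, mu_hat + z * se


def lr_ci(y, alpha=0.05, tol=1e-10):
    """Interval {mu : |R_p(mu)| <= z}, each limit found by bisection."""
    mu_hat = sum(y) / len(y)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)

    def limit(direction):
        step = 1.0
        while abs(signed_lr(mu_hat + direction * step, y)) < z:
            step *= 2.0  # widen until the limit is bracketed
        lo, hi = 0.0, step
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if abs(signed_lr(mu_hat + direction * mid, y)) < z:
                lo = mid
            else:
                hi = mid
        return mu_hat + direction * 0.5 * (lo + hi)

    return limit(-1.0), limit(+1.0)


y = [4.1, 5.3, 3.8, 6.0, 4.7]  # hypothetical data
print(wald_ci(y))
print(lr_ci(y))
```

In this toy model both intervals are symmetric around the estimate, so the sketch shows only how they are computed; the asymmetry discussed for the AUC arises when the profile log-likelihood is skewed.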

Inference Based on the Modified Profile Likelihood
The profile likelihood is a standard method for inference in large-sample situations, but does not always perform well in small-sample problems. When the focus of the inferential interest is a parameter $\psi$, while the remaining parameters are not of central concern (nuisance), an interesting alternative approach is based on the modified profile likelihoods.
With the aim of improving inference, these likelihoods adjust the classical profile likelihoods by including a penalization term for the possible presence of nuisance parameters. The amount of the penalization depends on the information available for $\lambda$, and increases when this information is large. Modified profile likelihoods also have the appealing property of being invariant to interest-preserving reparametrizations. This last property means that inferential results obtained for $(\psi, \lambda)$ are also valid for $(\eta(\psi), \xi(\psi, \lambda))$, where $\eta$ and $\xi$ are one-to-one transformations.
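The general expression for a modified profile log-likelihood (Severini, 2000) is
$$\ell_{mp}(\psi) = \ell_p(\psi) + M(\psi), \qquad (5)$$
where $\ell_p(\psi)$ is the profile log-likelihood and $M(\psi)$ is the modification term. For this term, a high degree of accuracy is obtained when it has the expression
$$M(\psi) = \frac{1}{2} \log \left| J_{\lambda\lambda}(\psi, \hat{\lambda}_\psi) \right| - \log \left| \ell_{\lambda;\hat{\lambda}}(\psi, \hat{\lambda}_\psi; \hat{\psi}, \hat{\lambda}) \right|, \qquad (6)$$
where $J_{\lambda\lambda}(\psi, \lambda) = -\partial^2 \ell(\psi, \lambda)/\partial \lambda\, \partial \lambda^\top$ is the block of the observed information matrix for the nuisance parameter, $\ell_{\lambda;\hat{\lambda}}(\psi, \lambda; \hat{\psi}, \hat{\lambda}) = \partial^2 \ell(\psi, \lambda; \hat{\psi}, \hat{\lambda}, a)/\partial \lambda\, \partial \hat{\lambda}^\top$ is the matrix of sample space derivatives, and $|\cdot|$ denotes the determinant. In practice, the first term $J_{\lambda\lambda}(\psi, \hat{\lambda}_\psi)$ is easily computed numerically or analytically by second-order differentiation of $\ell(\psi, \lambda)$ with respect to $\lambda$. When the log-likelihood can be written in terms of the MLE, $\hat{\psi}$ and $\hat{\lambda}$, and an ancillary statistic $a$, i.e., as $\ell(\psi, \lambda; y, x) = \ell(\psi, \lambda; \hat{\psi}, \hat{\lambda}, a)$, computation of the term $\ell_{\lambda;\hat{\lambda}}(\psi, \lambda; \hat{\psi}, \hat{\lambda})$ is also straightforward. When differentiating with respect to $\hat{\lambda}$, the quantities $\psi$, $\hat{\psi}$ and $a$ need to be held fixed. However, we have here omitted the conditioning on the ancillary $a$ because it is not needed explicitly for computations and, in our context of parametric models for the AUC, in most cases the modification term in (6) can be obtained without specifying $a$.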
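To make the two ingredients of the modification term concrete, the following worked derivation applies them to a toy model that is not part of the paper: a single normal sample $y_1, \ldots, y_n$ with interest parameter $\psi = \mu$ and nuisance parameter $\lambda = \sigma^2$. The model is chosen only because every quantity is available in closed form and no ancillary statistic is needed.

```latex
% Toy model (an illustrative assumption): y_i ~ N(mu, sigma^2), psi = mu, lambda = sigma^2.
% The log-likelihood depends on the data only through the MLEs (\hat\mu, \hat\sigma^2):
\ell(\mu, \sigma^2; \hat\mu, \hat\sigma^2)
  = -\frac{n}{2}\log\sigma^2
    - \frac{n}{2\sigma^2}\left\{\hat\sigma^2 + (\hat\mu - \mu)^2\right\}.

% Constrained MLE of the nuisance parameter and profile log-likelihood (up to a constant):
\hat\sigma^2_\mu = \hat\sigma^2 + (\hat\mu - \mu)^2,
\qquad
\ell_p(\mu) = -\frac{n}{2}\log\hat\sigma^2_\mu .

% Observed information block and sample space derivative, both evaluated at \hat\sigma^2_\mu:
J_{\lambda\lambda}(\mu, \hat\sigma^2_\mu) = \frac{n}{2\,\hat\sigma^4_\mu},
\qquad
\ell_{\lambda;\hat\lambda}(\mu, \hat\sigma^2_\mu)
  = \left.\frac{\partial^2 \ell}{\partial\sigma^2\,\partial\hat\sigma^2}\right|_{\sigma^2 = \hat\sigma^2_\mu}
  = \frac{n}{2\,\hat\sigma^4_\mu}.

% Modification term and modified profile log-likelihood (constants dropped):
M(\mu) = \frac{1}{2}\log\frac{n}{2\,\hat\sigma^4_\mu} - \log\frac{n}{2\,\hat\sigma^4_\mu}
       = \log\hat\sigma^2_\mu + \text{const},
\qquad
\ell_{mp}(\mu) = -\frac{n-2}{2}\log\hat\sigma^2_\mu .
```

In this toy case the adjustment leaves the maximizer unchanged but flattens the log-likelihood, replacing $n$ by $n - 2$, which widens the resulting confidence intervals.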
Inference for $\psi$ can be easily performed by treating (5) as a standard log-likelihood for $\psi$, without the burden of dealing with nuisance parameters. Maximizing $\ell_{mp}(\psi)$ provides the maximum modified profile likelihood estimate (MMLE), denoted by $\hat{\psi}_{mp}$. In particular, the standard error associated with $\hat{\psi}_{mp}$ is computed as $(J_{mp}(\hat{\psi}_{mp}))^{-1/2}$, where $J_{mp}(\psi) = -\partial^2 \ell_{mp}(\psi)/\partial \psi^2$. Therefore, using the normal approximation, it is possible to use a Wald-type confidence interval, e.g., $[\hat{\psi}_{mp} - z_{1-\alpha/2}\, (J_{mp}(\hat{\psi}_{mp}))^{-1/2},\ \hat{\psi}_{mp} + z_{1-\alpha/2}\, (J_{mp}(\hat{\psi}_{mp}))^{-1/2}]$. Moreover, the resulting signed modified log-likelihood ratio statistic, defined as
$$R_{mp}(\psi) = \mathrm{sign}(\hat{\psi}_{mp} - \psi)\, \sqrt{2\,\{\ell_{mp}(\hat{\psi}_{mp}) - \ell_{mp}(\psi)\}}, \qquad (7)$$
has an asymptotic standard normal distribution, and has properties that are superior to those of the usual signed likelihood ratio statistic (Sartori, 2003). The statistic $R_{mp}(\psi)$ is therefore preferred over the Wald-type statistic for the construction of confidence intervals and tests of hypothesis. In practice, a $100(1 - \alpha)\%$ confidence interval based on $R_{mp}(\psi)$ is given as $\{\psi : |R_{mp}(\psi)| \le z_{1-\alpha/2}\}$. A one-sided statistical test with null hypothesis $H_0 : \psi = \psi_0$ can be performed using the test statistic $R_{mp}(\psi_0)$.
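A minimal sketch of interval construction and one-sided testing from a signed root statistic is given below. It again uses the toy normal-mean model (interest parameter: the mean; nuisance parameter: the variance), for which the modification term reduces, up to a constant, to the logarithm of the constrained variance estimate. This closed form holds for the toy model only and is not the binormal AUC model; function names and data are hypothetical, and the sample size must exceed 2.

```python
import math
from statistics import NormalDist


def sigma2_hat_mu(mu, y):
    """Constrained MLE of the variance for a fixed mean mu."""
    return sum((v - mu) ** 2 for v in y) / len(y)


def profile_loglik(mu, y):
    return -0.5 * len(y) * math.log(sigma2_hat_mu(mu, y))


def modified_profile_loglik(mu, y):
    # l_mp = l_p + M, where M(mu) = log(sigma2_hat_mu) up to a constant
    return profile_loglik(mu, y) + math.log(sigma2_hat_mu(mu, y))


def signed_root(loglik, mu, y):
    """Signed log-likelihood ratio statistic built from `loglik`."""
    mu_hat = sum(y) / len(y)  # maximiser of both l_p and l_mp in this toy model
    r2 = 2.0 * (loglik(mu_hat, y) - loglik(mu, y))
    return math.copysign(math.sqrt(max(r2, 0.0)), mu_hat - mu)


def interval(loglik, y, alpha=0.05, tol=1e-10):
    """Confidence interval {mu : |R(mu)| <= z_(1 - alpha/2)} by bisection."""
    mu_hat = sum(y) / len(y)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)

    def limit(direction):
        step = 1.0
        while abs(signed_root(loglik, mu_hat + direction * step, y)) < z:
            step *= 2.0
        lo, hi = 0.0, step
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if abs(signed_root(loglik, mu_hat + direction * mid, y)) < z:
                lo = mid
            else:
                hi = mid
        return mu_hat + direction * 0.5 * (lo + hi)

    return limit(-1.0), limit(+1.0)


def one_sided_p_value(loglik, mu0, y):
    """p-value for H0: mu = mu0 against H1: mu > mu0."""
    return 1.0 - NormalDist().cdf(signed_root(loglik, mu0, y))


y = [4.1, 5.3, 3.8, 6.0, 4.7]  # hypothetical data
print(interval(profile_loglik, y), interval(modified_profile_loglik, y))
print(one_sided_p_value(profile_loglik, 4.0, y),
      one_sided_p_value(modified_profile_loglik, 4.0, y))
```

In this toy example the modified version yields the wider interval and the larger p-value.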

An Important Example: the Binormal Model
The main example of a possible application of the theory described in Section 2 is the popular binormal model, where $Y$ and $X$ are normally distributed with different means and variances, e.g., $Y \sim N(\mu_Y, \sigma^2_Y)$ and $X \sim N(\mu_X, \sigma^2_X)$. Under this assumption, it is known (Kotz et al., 2003) that the AUC can be written as
$$A = \Phi\left(\frac{\mu_X - \mu_Y}{\sqrt{\sigma^2_X + \sigma^2_Y}}\right),$$
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. Denote with $\delta = (\mu_X - \mu_Y)/\sqrt{\sigma^2_X + \sigma^2_Y}$ the quantile of the standard normal which provides an area equal to $A$. Here, there are two possible interesting choices for the parameter of interest $\psi$. We may have either $\psi = A$ or $\psi = \delta$. These two choices are equivalent in terms of inferential results because both the profile likelihood and the modified profile likelihood are invariant under interest-preserving reparameterizations, and thus under the transformation $A = \Phi(\delta)$. In the current paper, for practical reasons, we illustrate the procedures for the second choice $\psi = \delta$, since this case is relatively simpler to implement. Moreover, in this case, convergence in the corresponding parameter space $\Psi = \mathbb{R}$ is always obtained, whereas the choice $\psi = A$ with parameter space $\Psi = [0, 1]$ may yield computational problems on the boundaries.
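The mapping between the binormal parameters, the quantile $\delta$ and the AUC can be sketched with the standard library alone. The function names are hypothetical; `NormalDist` supplies $\Phi$ and its inverse.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf is its inverse


def binormal_delta(mu_y, var_y, mu_x, var_x):
    """delta = (mu_X - mu_Y) / sqrt(var_X + var_Y)."""
    return (mu_x - mu_y) / math.sqrt(var_x + var_y)


def binormal_auc(mu_y, var_y, mu_x, var_x):
    """AUC under the binormal model, A = Phi(delta)."""
    return _STD_NORMAL.cdf(binormal_delta(mu_y, var_y, mu_x, var_x))


def delta_from_auc(a):
    """Inverse map delta = Phi^{-1}(A); only defined for 0 < A < 1."""
    return _STD_NORMAL.inv_cdf(a)


print(binormal_auc(0.0, 1.0, 1.0, 1.0))  # one unit of separation, unit variances
print(delta_from_auc(0.95))
```

The inverse map is undefined at $A = 0$ and $A = 1$ (`inv_cdf` raises an error there), which is one way to see why working on the unbounded $\delta$ scale avoids boundary problems.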
We study the parameter of interest $\psi = \delta$, while the nuisance parameter can be chosen to be a three-dimensional vector $\lambda = (\lambda_1, \lambda_2, \lambda_3)$ of functions of the original parameters. Other choices are also possible, as long as the parameter space is $\Psi \times \Lambda$, so that the range of $\lambda$ is independent of the range of $\psi$.
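Given the MLE $\hat{\theta}$ computed from the original likelihood $\ell(\theta)$, by the invariance property, the MLE for the AUC is
$$\hat{A} = \Phi(\hat{\delta}) = \Phi\left(\frac{\hat{\mu}_X - \hat{\mu}_Y}{\sqrt{\hat{\sigma}^2_X + \hat{\sigma}^2_Y}}\right).$$
Consider now the likelihood function for the new parameters $(\psi, \lambda)$, and observe that it is a function only of the unknown parameters and the minimal sufficient statistic $(\hat{\psi}, \hat{\lambda})$, where $\hat{\psi} = \hat{\delta}$ and $\hat{\lambda}$ is the corresponding transformation of $\hat{\theta}$; it thus depends on the data only through the MLEs.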
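The plug-in estimate can be computed directly from the two samples. The sketch below uses the normal-model MLEs (sample means, and variances with divisor $n$ rather than $n - 1$); the function name and the data are hypothetical.

```python
import math
from statistics import NormalDist


def binormal_mle_auc(y, x):
    """Return (delta_hat, A_hat) under the binormal model.

    y: sample from the non-diseased group, x: sample from the diseased group.
    """
    n1, n2 = len(y), len(x)
    mu_y, mu_x = sum(y) / n1, sum(x) / n2
    # Maximum likelihood variances divide by n, not by n - 1
    var_y = sum((v - mu_y) ** 2 for v in y) / n1
    var_x = sum((v - mu_x) ** 2 for v in x) / n2
    delta_hat = (mu_x - mu_y) / math.sqrt(var_x + var_y)
    return delta_hat, NormalDist().cdf(delta_hat)


# Hypothetical marker values (not data from the paper)
y = [1.2, 2.0, 2.7, 3.1, 3.9]
x = [2.5, 3.4, 4.0, 4.8, 5.5]
delta_hat, a_hat = binormal_mle_auc(y, x)
print(delta_hat, a_hat)
```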
For the binormal model, computation of the signed log-likelihood ratio statistic $R_p(\psi)$ given in (4) is then straightforward. The Wald statistic $W_p(\psi)$ in (3) requires the observed information $J_p(\hat{\psi})$, which can be computed analytically or by a numerical procedure, for example by using the function hessian of the package numDeriv in the R software.
The key parameter of interest is the AUC; we can therefore easily obtain inferential conclusions on $A$ from those obtained for $\psi$. For example, the Delta method can be applied to find the standard error of $\hat{A} = \Phi(\hat{\psi})$, which is then equal to $\hat{s}_A = \Phi'(\hat{\psi})\, (J_p(\hat{\psi}))^{-1/2} = f_Z(\hat{\psi})\, (J_p(\hat{\psi}))^{-1/2}$, with $f_Z(\cdot)$ being the p.d.f. of the standard normal. Therefore, a Wald-type confidence interval for $A$ is given as $[\hat{A} - z_{1-\alpha/2}\, \hat{s}_A,\ \hat{A} + z_{1-\alpha/2}\, \hat{s}_A]$, and a test of hypothesis concerning $A$ can be based on the test statistic $(\hat{A} - A_0)/\hat{s}_A$.
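The Delta-method step can be sketched as follows, taking the estimate $\hat{\psi}$ and its standard error $(J_p(\hat{\psi}))^{-1/2}$ as given inputs. The function name and the numerical inputs are hypothetical.

```python
from statistics import NormalDist


def wald_ci_auc(psi_hat, se_psi, alpha=0.05):
    """Delta-method Wald interval for A = Phi(psi).

    psi_hat: estimate of psi = delta; se_psi: its standard error.
    Returns (A_hat, se_A, (lower, upper)); the limits are not truncated.
    """
    nd = NormalDist()
    a_hat = nd.cdf(psi_hat)
    se_a = nd.pdf(psi_hat) * se_psi  # Phi'(psi) = standard normal density
    z = nd.inv_cdf(1.0 - alpha / 2.0)
    return a_hat, se_a, (a_hat - z * se_a, a_hat + z * se_a)


# Hypothetical inputs: a large estimate with a large standard error
a_hat, se_a, (lower, upper) = wald_ci_auc(psi_hat=1.3, se_psi=0.6)
print(a_hat, se_a, lower, upper)
```

With these inputs the untruncated upper limit exceeds 1, which illustrates how a symmetric Wald interval on the $A$ scale can leave the parameter space when $\hat{A}$ is close to the boundary.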
In addition, to specify the modified profile log-likelihood in (5) and (6), we need to compute the modification term $M(\psi)$.
In doing so, the block of the observed information matrix, $J_{\lambda\lambda}(\psi, \hat{\lambda}_\psi)$, is equal to minus the Hessian matrix with respect to $\lambda$, which can be easily obtained by numerical procedures in the R software, as above. The analytic expressions of the sample space derivatives $\ell_{\lambda;\hat{\lambda}}(\psi, \lambda; \hat{\psi}, \hat{\lambda})$ for the binormal model are provided in the Appendix. The signed modified log-likelihood ratio statistic $R_{mp}(\psi)$ given in (7) can then be constructed to carry out tests of hypothesis concerning key values of the AUC, such as $A = 0.5$ or $A = 1$. For example, the one-sided test with hypotheses $H_0 : \psi = \psi_0 = 0$ versus $H_1 : \psi > 0$ is equivalent to testing whether the AUC is significantly higher than 0.5, and can be performed using the test statistic $R_{mp}(\psi_0) = \mathrm{sign}(\hat{\psi}^0_{mp} - \psi_0)\, \sqrt{2\,\{\ell_{mp}(\hat{\psi}^0_{mp}) - \ell_{mp}(\psi_0)\}}$, where $\hat{\psi}^0_{mp}$ denotes the maximum modified likelihood estimate of $\psi$ in the parameter space $\Psi_0 = \{\psi \in \Psi : \psi > \psi_0\}$.
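Two small steps of this testing procedure can be sketched in code: translating a null value $A_0$ for the AUC into the corresponding $\psi_0 = \Phi^{-1}(A_0)$, and turning an observed value of the signed statistic into a one-sided p-value through the standard normal approximation. The function names are hypothetical, and the value of the statistic is assumed to have been computed elsewhere.

```python
from statistics import NormalDist


def psi0_from_auc(a0):
    """Null value on the psi = delta scale for H0: A = a0 (0 < a0 < 1)."""
    return NormalDist().inv_cdf(a0)


def one_sided_p_value(r_observed):
    """p-value for H1: psi > psi0, using the standard normal approximation
    to the signed (modified) log-likelihood ratio statistic."""
    return 1.0 - NormalDist().cdf(r_observed)


print(psi0_from_auc(0.5))  # H0: A = 0.5 corresponds to psi0 = 0
print(psi0_from_auc(0.6))
print(one_sided_p_value(1.645))
```

Because $\Phi^{-1}(A_0)$ is undefined at $A_0 = 1$, this sketch covers only null values strictly inside $(0, 1)$.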

Simulation Studies
First, the simulation studies investigated the coverage probabilities of 95% confidence intervals based on the signed profile log-likelihood ratio statistic $R_p(\psi)$, the signed modified profile log-likelihood ratio statistic $R_{mp}(\psi)$, and the Wald statistic $W_p(\psi)$. All these statistics are asymptotically distributed as standard normal, and the approximation is often more accurate for $R_{mp}(\psi)$. Results in Table 1 show that $R_{mp}(\psi)$ is more accurate than $R_p(\psi)$ and $W_p(\psi)$, in terms of both central coverage probability and symmetry of the error rates, for all the considered AUC values and sample sizes. Of course, for all methods, we observe a less accurate coverage when the sample sizes are very small ($(n_1, n_2) = (5, 5), (10, 10)$), which then improves for larger sample sizes. However, the $R_{mp}(\psi)$ coverage is observed to nearly reach the 95% nominal level, being only slightly affected by small sample sizes (see, e.g., $A = 0.95, 0.99$), in contrast to $W_p(\psi)$ and $R_p(\psi)$, which perform very poorly for small samples. Very interestingly, this poor performance becomes even worse for higher values of the AUC, such as $A = 0.95, 0.99$. On the contrary, the good performance of $R_{mp}(\psi)$ seems to be very stable for all values of the AUC. An important result is observed for unbalanced samples: $W_p(\psi)$ and $R_p(\psi)$ seem to be negatively affected by the sample unbalance, since their coverage falls even further below the nominal level, whereas the $R_{mp}(\psi)$ coverage remains stable and sufficiently accurate in all the unbalanced settings. In particular, we note that the coverages are lower for samples with high $n_1$ and low $n_2$ (e.g., $(n_1, n_2) = (30, 5), (80, 10)$) as compared to the inverse case of low $n_1$ and high $n_2$. This fact may depend on the reparameterization chosen for the nuisance parameters $\lambda$, since the MLEs $\hat{\lambda}_2$ and $\hat{\lambda}_3$ are both affected by the small sample size $n_2$ and would then be poorly estimated.
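The idea of an empirical coverage study can be sketched for the Wald interval alone, since for the binormal model its standard error is available in closed form from the inverse observed information at the MLE combined with the gradient of $\delta$. Everything below is an illustrative assumption rather than the design used in the paper: unit variances in both groups, the number of replications, the seed and all function names are hypothetical. The $R_p$ and $R_{mp}$ intervals are not reproduced here, because they require constrained maximization and the sample space derivatives given in the Appendix.

```python
import math
import random
from statistics import NormalDist


def wald_interval_delta(y, x, alpha=0.05):
    """Wald interval for delta = (mu_X - mu_Y) / sqrt(var_X + var_Y)."""
    n1, n2 = len(y), len(x)
    my, mx = sum(y) / n1, sum(x) / n2
    vy = sum((v - my) ** 2 for v in y) / n1  # MLE variances (divisor n)
    vx = sum((v - mx) ** 2 for v in x) / n2
    s2 = vx + vy
    d = (mx - my) / math.sqrt(s2)
    # Delta method: the means have variances vy/n1 and vx/n2, the variance
    # MLEs have variances 2*vy^2/n1 and 2*vx^2/n2, and
    # d(delta)/d(mu) = +-1/sqrt(s2), d(delta)/d(var) = -delta/(2*s2)
    var_d = (vx / n2 + vy / n1) / s2 \
        + (d * d / (4.0 * s2 * s2)) * (2.0 * vx * vx / n2 + 2.0 * vy * vy / n1)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = math.sqrt(var_d)
    return d - z * se, d + z * se


def wald_coverage(auc, n1, n2, reps=2000, seed=1):
    """Monte Carlo coverage of the nominal 95% Wald interval."""
    rng = random.Random(seed)
    delta = NormalDist().inv_cdf(auc)
    mu_x = delta * math.sqrt(2.0)  # Y ~ N(0, 1) and X ~ N(mu_x, 1) give this delta
    hits = 0
    for _ in range(reps):
        y = [rng.gauss(0.0, 1.0) for _ in range(n1)]
        x = [rng.gauss(mu_x, 1.0) for _ in range(n2)]
        lo, hi = wald_interval_delta(y, x)
        hits += lo <= delta <= hi
    return hits / reps


print(wald_coverage(0.80, 10, 10))
print(wald_coverage(0.95, 5, 5))
```

Coverage of an interval for $\delta$ and of its image under $\Phi$ coincide, so this loop measures the coverage of the transformed interval for $A$; it is not the coverage of the Delta-method interval built directly on the $A$ scale.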
The very poor performance shown by the Wald statistic for high values of the AUC is expected. It is well known that when the profile log-likelihood is not quadratic around the MLE, as happens in our AUC study (see Figure 1 of the data example in Section 5), the Wald statistic may lead to very asymmetric error rates. In fact, in Table 1 we observe a nearly null empirical lower error and a higher empirical upper error than expected. Asymmetric errors are also seen for the $R_p(\psi)$ statistic, although the discrepancy from the expected errors is negligible.
Simulation studies were also used to evaluate the properties of the $R_{mp}(\psi)$-based estimator of $A$, in comparison with the MLE $\hat{\psi}$. The two estimators are compared in terms of median bias and results are shown in Table 2, where estimated standard errors and the simulation-based (empirical) median absolute deviation (MAD) are also reported. The choice of a median-bias criterion is due to the median unbiasedness property of the $\hat{\psi}_{mp}$-based estimator, and it is more robust under model misspecification. It can be noted that the estimator based on the modified likelihood, $\hat{\psi}_{mp}$, is preferable to the MLE in terms of the considered criteria, since it is less median-biased than the MLE, in particular for small sample sizes and unbalanced samples. Estimates seem to be more biased for unbalanced samples with high $n_1$ and low $n_2$. However, this problem is attenuated when the AUC value increases, and the bias of $\hat{\psi}_{mp}$ reduces to about half of the bias of the MLE.
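The two summary criteria can be computed from a vector of simulated estimates as sketched below. The function names and the numbers are hypothetical, and the MAD is left unscaled, which is an assumption since the scaling convention is not stated here.

```python
from statistics import median


def median_bias(estimates, true_value):
    """Median of the simulated estimates minus the true parameter value."""
    return median(estimates) - true_value


def median_absolute_deviation(estimates):
    """Unscaled MAD: median of absolute deviations from the median."""
    m = median(estimates)
    return median(abs(e - m) for e in estimates)


# Hypothetical simulated AUC estimates for a true value of 0.80
estimates = [0.78, 0.82, 0.85, 0.80, 0.90]
print(median_bias(estimates, 0.80))
print(median_absolute_deviation(estimates))
```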

A Worked Data Example
In this section, an application of the inferential approaches discussed in the current paper to real-life data is presented. We consider data from imaging studies used for brain tumor grading. The data were originally collected by Tsuchida, Takeuchi, Okazawa, Tsujikawa, and Fujibayashi (2008), and were also discussed in the paper by Feng et al. (2015). These data are also available in the R package auRoc (Feng, 2015). The objective of the study was to evaluate the clinical significance of 1-11C-acetate (ACE) positron emission tomography (PET) in 10 patients with brain glioma, in comparison with 18F-fluorodeoxyglucose (FDG) PET. FDG and ACE are two different imaging techniques for detecting brain glioma. The aim of this section is to re-examine the diagnostic accuracy of both techniques in discriminating between patients with low grade (grades I or II) and patients with high grade (grades III or IV). Patient grading was previously determined by magnetic resonance imaging, a gold-standard method used to classify patients with brain glioma into low and high grade classes. Five patients were characterized as low grade and the other five patients as high grade. All patients underwent FDG and ACE diagnostic measurements and the standard uptake value (SUV) was calculated for the same regions of interest in the brain. These SUV values were compared between low grade and high grade patients. The diagnostic accuracy of FDG and ACE was investigated by estimating the area under the ROC curve. Point estimates, confidence intervals and tests of hypothesis were obtained following the three approaches presented in the paper.
Table 1. Two-sided empirical coverage of confidence intervals with 95% nominal levels for $A$ based on the Wald statistic $W_p(\psi)$, the profile log-likelihood ratio statistic $R_p(\psi)$ and the modified profile log-likelihood ratio statistic $R_{mp}(\psi)$, under the binormal model. The central coverage probabilities and the non-coverage probabilities on the left and right tails, which represent, respectively, the lower and upper errors, are reported.
The assumption of normality for the distributions of SUVs from FDG and ACE in the low-grade and high-grade patients has been shown not to be violated (Feng et al., 2015). For FDG, the average SUV values in the low and high groups were, respectively, 4.714 and 7.124, while for ACE, the average SUV values in the low and high groups were, respectively, 1.850 and 2.626. Lower SUV values are associated with the low grade patients. Therefore, here the random variable $Y$ represents the FDG SUV values in the low grade population, while the random variable $X$ represents the FDG SUV values in the high grade population.
Table 3 summarizes the main inferential results for the AUC computed, separately, for the FDG SUV values and the ACE SUV values. For the FDG, the different statistical methods gave very similar estimates of the area under the ROC curve, equal to ∼0.7, showing that the FDG has poor discrimination accuracy between the low and high grade populations.
The standard errors are also very similar, whereas the Wald confidence interval, equal to (0.413, 1), is right-shifted as compared to the confidence intervals based on the $R_p(\psi)$ and $R_{mp}(\psi)$ statistics, which are virtually identical. In addition, we performed a test of the null hypothesis $H_0 : A = 0.5$ versus the alternative $H_1 : A > 0.5$, and found that the $R_p(\psi)$ and $R_{mp}(\psi)$ statistics produce similar non-significant p-values (p = 0.105 and p = 0.119, respectively), thus suggesting that there is no evidence of any discriminatory power in the FDG technique.
From Table 3, we observe that also for the ACE the different statistical methods gave similar estimates of the area under the ROC curve, equal to ∼0.9. Thus, the ACE was found to be much more accurate at discriminating. The Wald confidence interval, equal to (0.709, 1), is extremely and erroneously right-shifted, and thus it deviates from the other confidence intervals based on the $R_p(\psi)$ and $R_{mp}(\psi)$ statistics. These latter two differ in particular at the lower limit, as also illustrated in Figure 1. This fact is due to the skewed shape of the profile log-likelihood. In the case of ACE, the test of hypothesis with null $H_0 : A = 0.5$ gave significant results, which differ between the $R_p(\psi)$- and $R_{mp}(\psi)$-based approaches (p = 0.009 and p = 0.015, respectively). This result indicates that the ACE technique has the ability to discriminate. When testing the null $H_0 : A = 0.6$, the resulting p-values (p = 0.030 and p = 0.043, respectively) provide evidence of a discrimination accuracy above 0.6, but the more accurate $R_{mp}(\psi)$-based approach gives less evidence for this conclusion. Note that here the power of the tests is low due to the very small sample sizes.
Figure 1 reports the relative log-likelihoods, defined as $\ell(\theta) - \ell(\hat{\theta})$, for both the parameters $A$ and $\psi$. It is noted that the relative modified profile log-likelihoods are shifted to the left with respect to the relative profile log-likelihoods, due to the adjustment term $M(\psi)$. Moreover, we observe that the $\Phi(\cdot)$ reparameterization of the parameter of interest has the natural effect of making the quadratic functions for $\psi = \delta$ become skewed to the left.

Discussion
The paper has presented the performance of a new inferential approach in parametric models for the AUC, which was shown to be useful and easy to implement. The proposed method was applied to make inference for the binormal model, and can immediately be adapted to any other parametric distribution. Alternatively, when the normality assumption is violated, a Box-Cox type power transformation of the original data can also be applied (Box & Cox, 1964; Faraggi & Reiser, 2002). The additional unknown parameters of the Box-Cox transformation may either be treated within the entire model as nuisance parameters, or one may first apply the appropriate transformation to the original data and then use the normal-theory inference presented in this paper. We note that the presence of additional nuisance parameters in the model is not expected to affect the accuracy of the inferential results when a modified profile likelihood approach is adopted.
Profile likelihoods have a score function with bias of order $O(1)$, which does not typically disappear asymptotically. Modified profile likelihoods have properties very similar to those of usual full likelihoods, and their adjustment term reduces the bias to order $O(n^{-1})$. Consequently, the signed likelihood ratio statistic based on the modified profile likelihood has properties that are superior to those of the usual signed likelihood ratio statistic (Cox & Barndorff-Nielsen, 1994).
The results from the simulation studies show that inference based on the modified profile log-likelihood approach has superior performance compared to the classical profile log-likelihood approach, in terms of central coverage probability, symmetry of error rates, and median bias. Wald statistics can lead to seriously misleading inferential conclusions for small or unbalanced samples, especially at the boundaries of the parameter space (Molenberghs & Verbeke, 2007). Moreover, Wald-type tests of hypothesis may lead to erroneously significant results. For example, in the real data application for the ACE technique, it was found that $A_0 = 0.7$ falls outside the Wald confidence interval, and a test of the null hypothesis $H_0 : A = 0.7$ yields a significant p-value of 0.02, in contrast to the profile likelihood approach, which shows no evidence for a discrimination ability above 0.7.
The proposed approach has the potential to be applicable to any general parametric setting, for example, in settings where $Y$ and $X$ follow two different parametric distributions, or in models with mixture distributions, which are often used when the empirical distribution shows a bimodal behaviour. Moreover, future developments concerning the modified likelihood approach could be very relevant in the context of AUC estimation, especially when the data are stratified, or when the interest is in modeling different stratum-specific AUCs, for example by including stratum-specific fixed effects as nuisance parameters. Another important application of the proposed approach may be when the two random variables $X$ and $Y$ depend on covariates, or in general when the AUC models rely on many nuisance parameters (Sartori, 2003).

Figure 1. Plot of $r_p$ (thick solid line) and $r^*_p$ (thick dashed line) for a range of values of the parameter $R$. Vertical lines are drawn to identify confidence intervals for $R$ based on $r_p$ (thin solid line) and $r^*_p$ (thin dashed line).

Table 3. Point estimates (estimated $A$) and 95% confidence intervals (95% CI) based on the Wald, profile log-likelihood and modified profile log-likelihood statistics, for the FDG and ACE imaging techniques.