Analysis of AneuRisk65 data: warped logistic discrimination

We analyze the AneuRisk65 curvature functions using a likelihood-based warping method for sparsely sampled curves, and combine it with logistic regression in order to discriminate subjects with aneurysms at or after the terminal bifurcation of the internal carotid artery (the most life-threatening) from subjects with no aneurysms or aneurysms along the carotid artery (the less serious). Significantly lower misclassification rates are obtained when the warping functions are included in the logistic discrimination model, rather than being treated as mere nuisance parameters.


Introduction
The organizers of this section of the workshop are to be congratulated for their choice of data. Without being overly complicated, the AneuRisk65 data (Sangalli et al., 2013) presents many non-trivial challenges for analysis. For example: the 65 angiographic images are misaligned due to the different placement of the patients with respect to the image-capturing device; the images have different lengths, with the origin corresponding to a physiologically recognizable landmark but the endpoints being arbitrary; and the main feature of interest, the syphon (Piccinelli et al., 2011), varies in shape and location from person to person.
My analysis of the data was done on the curvature functions, not on the 3D images themselves; this avoids the problem of rotating and translating the 3D curves to remove subject-placement artifacts, but does not remove the inherent variability in shape and location of the artery syphon, corresponding to the peaks around t = −40 and t = −20 in Figure 1 (the variable t is negative arc length in this parametrization, so the curves run "backwards"). The problem of unequal endpoints is also present whether we analyze the original 3D images or the one-dimensional curvature functions. My approach here is to treat the shorter curves as incomplete curves (which they are). Since the curves for patients with aneurysms at or after the terminal bifurcation of the internal carotid artery (the "upper" group) rarely extend beyond t = −80 (i.e. the data is not missing at random), we truncated the curves at t = −80 in order to avoid artifacts. But many curves were shorter than this, so the problem of unequal endpoints persists; we deal with this by introducing a model that can handle missing data, as explained below.

The model
Let f 1 , . . . , f n be the (complete, unobserved) curvature functions, f i : 80,0]. The data actually observed is of the form Figure 1: Curvature functions, down-sampled to 30 measurements per curve, for (a) the "upper" group of patients and (c) the "lower" and no-aneurysm groups of patients. The corresponding warped curves are shown in (b) for the "upper" group and in (d) for the "lower" and no-aneurysm groups.
for different grids {t i1 , . . . , t im i } and random errors {ε ij } (the errors could be assumed to be zero because the curves were pre-smoothed, but the εs are still a useful slack variable to capture the random variation not explained by model (3) below). The variability in location of the syphon will be accounted for by the warping functions h i : I → I. We assume, then, that wheref 1 , . . . ,f n are functions that, loosely speaking, possess only amplitude variability and can therefore be modeled with a parsimonious principal-component decomposition, where the ξ k s are orthonormal functions in L 2 (I) and the z ik s are uncorrelated with decreasing variances. In fact, we will assume z i = (z i1 , . . . , z ip ) ∼ N p (0, Λ) with Λ = diag(λ 1 , . . . , λ p ) and λ 1 ≥ · · · ≥ λ p > 0. We will denote by F the family of functions spanned by (3), generally referred to as "the template" in the warping literature. The ξ k s, λ k s and µ will be estimated from the data; we will assume µ and the ξ k s are spline functions, thus reducing the estimation problem to a common multivariate problem: given e.g. a B-spline basis {φ 1 , . . . , φ q }, we assume µ(t) = q k=1 a l φ l (t) and ξ k (t) = q l=1 c kl φ l (t) for parameters a = (a 1 , . . . , a q ) and c k = (c k1 , . . . , c kq ) to be estimated from the data.
For the warping functions h i we also specify a family of functions H that is parsimonious but flexible enough to accommodate phase variability at the salient features of the curves. The family of monotone interpolating Hermite splines (Fritsch and Carlson, 1980) is very convenient to work with. Given a knot vector τ 0 of "locations on interest" (for example, τ 0 = (−60, −40, −20) in our case) and any τ i with monotone increasing coordinates, there exists an h i ∈ H such that h i (τ 0 ) = τ i ; this interpolating property provides all the warping flexibility we want at the features of interest, without increasing the dimension of H unnecessarily. The monotonicity of Hermite splines is very easy to enforce for any τ i s; see Fritsch and Carlson (1980). The individual τ i s could be either specified by the researcher (as in landmark registration) or treated as unobservable random effects, as we will do here. Since the coordinates of the τ i s must be strictly increasing and fall within the range I, it is more convenient to transform them into unconstrained vectors θ i using e.g. the Jupp transform, and assume θ i ∼ N r (θ 0 , Σ) with θ 0 the Jupp transform of τ 0 and Σ a covariance matrix to be estimated from the data. Therefore, our warping functions will be parameterized as h i (t) = g(t, θ i ) for a fixed function g that depends only on τ 0 (its exact form does not matter here).
A brief digression: the decomposition (2) is clearly not unique; given any f i and any arbitrary monotone function h i , one can always definef i = f i • h i and then the decomposition f i (t) =f i {h −1 i (t)} trivially follows. So it does not make sense to talk about "the" warping component h i and "the" amplitude componentf i for a given f i . Nevertheless, for a given template F and a given warping family H, the decomposition (2) is identifiable (except for the usual indeterminacy on the sign of the ξ k s). But different combinations of templates and warping models can give rise to essentially equivalent fits. The usual example is the random shift , so the f i s could be modeled by a one-amplitude-component model without warping just as well. Therefore, when we talk about "the" amplitude component and "the" warping component in this paper, it is always in the context of a specific pair (F , H).
Going back to the original problem: putting together (1), (2), F and H, and assuming the ε ij s are i.i.d. N(0, σ 2 ), we obtain the following random-effects model for the raw data y i = (y i1 , . . . , y im i ): (the inverse of g is taken with respect to the variable t for each θ i .) The model parameters a, C, σ 2 , Σ and Λ are estimated by maximum likelihood using the EM algorithm. A drawback of this approach is that it was developed for sparse and irregular time grids, and it becomes infeasible for large m i s; therefore we down-sampled the curves so that m i = 30 for all i. Some high-definition features were lost, but the main peaks are still clearly visible in Figure 1. The random-effect approach to warping described in this section is still unpublished for univariate samples, but a similar approach in the context of functional regression is described in Gervini (2012), where the interested reader may find more technical details.

Results
We fitted several models with warping knots τ 0 = (−60, −40, −20) and different numbers of amplitude components p ranging from 0 (mean-only model) to 5. We used cubic B-splines with 10 equispaced knots for µ and the ξ k s. The warped functions for p = 2 are shown in Figures 1(b) and 1(d). Plots ofμ plus/minusξ 1 andξ 2 are shown in Figure 2. The first principal component is mostly associated with amplitude variation at the syphon peaks, while the second component is mostly associated with amplitude variation at the origin. Can they be used to discriminate patients with aneurysms at or after the terminal bifurcation of the internal carotid artery (the "upper" group) from patients with no-aneurysms or with aneurysms along the carotid artery (the "lower" group)?
To answer this question we first tried logistic discrimination based on the registered curvesf 1 , . . . ,f n . Introducing a binary variable y, with y i = 1 indicating the "upper" group and y i = 0 the rest of the patients, the logistic model assumes that for parameters α ∈ R and β ∈ L 2 (I). Without loss of generality we can assume β ∈ span{ξ 1 , . . . , ξ p }, since in view of (3) the part of β orthogonal to span{ξ 1 , . . . , ξ p } will also be orthogonal tof i − µ. Then we have β(t) = p k=1 b k ξ k (t) and we can re-write (5) as which is just a common multivariate logistic model. The parameters α and b were estimated by conditional maximum likelihood, as usual. The crossvalidated misclassification rates for each p are given in Table 1 (first column). The lowest one is attained at p = 4, but in the interest of parsimony we choose the second-best, the two-component model, for which the misclassification rate is only slightly larger at 38.5%. This high misclassification rate is disappointing, and we wonder if the warping process may not contain additional information that could be useful for discrimination. An easy way to answer this question is to augment model (6) with the τ i s and assume that Estimating the parameters by conditional maximum likelihood as before, the crossvalidated misclassification rates we now obtain (Table 1, second column) are considerably lower, in particular for the optimal two-component model, which is 24.6%. The parameter estimators areb = (−8.12, −6.43) andd = (−.15, .22, .27). The sign ofb indicates that the probability of being in the "upper" group decreases as the height of the peaks at t = −40, t = −20 and t = 0 increases (this is somewhat visible to the naked eye in Figure   1(b) and 1(d).) The signs of the last two coefficients ofd also indicate that for patients in the "upper" group the peaks at t = −40 and t = −20 tend to occur closer to the origin; a caveat is that this could be an artifact of the image-capturing process and not a feature of artery shape, although the negative sign ofd 1 seems to rule this out (because, if the whole curve had been shifted,d 1 would also be positive). Either way, this example shows that the warping process sometimes does contain useful information for classification and There are a number of ways in which this analysis could be refined. For example, instead of the two-step process followed above, where estimation of amplitude principal components and warping functions is done separately from discrimination, both steps could be brought together by maximizing the likelihood of model (7) instead of (4). The principal components and warping functions thus obtained would have been optimized for discrimination and may yield lower misclassification rates than the two-step process; the author is currently investigating this approach. The other important issue is the handling of incomplete curves. The approach in this analysis was to down-sample the curves and apply a likelihood-based method originally developed for sparsely sampled curves, but in doing so, the sharpest peaks of the curves are dulled or lost entirely; that did not matter much for these data, but in other situations the impact may be significant. The existing registration methods that handle densely sampled curves usually involve functional inner products and norms that require computation of integrals over the whole range I, which cannot be done with incomplete curves (not in an elegant way at least, i.e. avoiding artificial truncations or extrapolations). Finding a way around this problem would be an interesting contribution to the registration literature.