Statismo - A framework for PCA based statistical models.

Latest version available at the Insight Journal [http://hdl.handle.net/10380/3371]
Distributed under Creative Commons Attribution License


Introduction
Statistical models have become a firmly established tool in computer vision and medical image analysis. They model the variability of a class of objects by means of a normal distribution, learned from example data. In particular in the form of shape models, they have been employed in a wide range of applications such as object recognition [10,20], image manipulation [8], surgery planning [24,12] or implant design [9,15]. Furthermore, they have been used as a prior for different algorithms, such as segmentation [13] or registration [2]. In spite of their success, there is still no established toolkit for the creation and application of such models. This requires the same methods to be implemented over and over again. Moreover, it makes the exchange of models difficult, and thus hinders the collaboration between different researchers as well as the reproducibility of experiments. In this paper we introduce Statismo, a C++ framework for the creation and use of different kinds of statistical models. In contrast to other libraries (such as the statistical models currently implemented in ITK), Statismo provides a high-level interface: statistical models are interpreted as a probability distribution over the modeled objects, from which samples (i.e. shapes, images, etc.) can be drawn and probabilities of object instances can be computed. The current implementation focuses on the most commonly used type of statistical model, i.e. PCA-based shape models, and its associated statistical interpretation as a multivariate normal distribution.
The standard example of a statistical PCA model is the statistical shape model. It represents a class of shapes in terms of deformations of a triangulated reference surface. Nevertheless, the concept can be applied to various other types of data, such as images, displacement fields, volumetric meshes, etc. Independent of the concrete type of model, the statistical analysis requires each dataset to be converted into a vectorial representation. Statismo provides a unified treatment of these models by letting the user provide special template classes, so-called Representer classes, which define this conversion. Depending on the type of data, this may include preprocessing steps, such as establishing correspondence or alignment of the datasets. A Representer also provides the reverse functionality, to convert the vectorial representation back to its original representation. The abstraction introduced by the Representer not only allows for a unified treatment of different kinds of PCA based models, but also makes Statismo independent of the library used to represent the data. Thus, the exact same code that is used for creating a shape model from shapes represented in VTK can also be used for creating deformation models from ITK displacement fields.
One of the main design goals of Statismo is to facilitate the exchange of statistical models among different research groups and end users. We therefore defined a platform independent storage format based on HDF5 [1]. Statismo stores all the information needed to capture the complete probabilistic interpretation of the model. In particular, this requires that all the pre-processing steps, such as alignment or normalization, can be reproduced. Statismo enforces this by requiring that the same Representer is used when applying the model as was used for model building. Moreover, Statismo stores all the parameters and additional metadata necessary to make the models easily reproducible. Another goal of Statismo is to allow for the easy integration of different model building algorithms. This is achieved by the modular design of the data management, model building and model representation components. Thus, new algorithms for model building can easily be integrated into Statismo and make use of all the functionality already implemented.
In addition to standard PCA models, Statismo currently supports the building of conditional models based on surrogate variables [4,7], as well as models incorporating geometric constraints [16].
Statismo includes standard Representers for the most commonly used data representations in VTK and ITK, making it possible to build and exchange PCA based shape models, image models, and deformation models using both ITK and VTK. Statismo also provides wrapper classes that allow for the seamless integration of Statismo into ITK. Furthermore, a special itkTransform class is provided, which makes it possible to exploit the power of the ITK registration framework for statistical model fitting.
2 Background: PCA based statistical models

1) Discretization: This step defines a discrete domain Ω = {p_1, ..., p_N} on which we approximate all the objects, i.e. we define an object as

O_i = (ϕ_i(p_1), ..., ϕ_i(p_N)).

For shape models, the domain Ω ⊂ R^3 is usually defined by the points of a 3D reference mesh, and ϕ_i : R^3 → R^3 is a function that maps each point p_j ∈ Ω to the corresponding point of an example shape Ô_i. For images, Ω is an image domain and ϕ_i : Ω → R assigns to each point its image intensity. Similarly, for deformation models, ϕ_i : Ω → R^d assigns to each point a deformation that is defined by the deformation field Ô_i.
Note that this definition implies correspondence between all the examples. The exact notion of correspondence depends on both the type of the objects and the application. The correspondence between the examples is usually established through registration.
2) Normalization: Depending on the type of the objects, further preprocessing steps may follow. Shapes, for example, are usually rigidly aligned using Procrustes registration before building the model. For images, a histogram equalization or a low pass filtering may be performed to reduce the excess variability of the samples.
3) Model building: Once the shapes are processed, each object O_i can be represented as a high-dimensional vector

v_i = (ϕ_i(p_1), ..., ϕ_i(p_N)) ∈ R^{Nd}.

PCA estimates an orthonormal basis of this high-dimensional space, defined iteratively as the directions onto which the projection of the data has maximum variance. This offers the opportunity to reduce the dimensionality of the model, by keeping only the principal directions that cover a prescribed proportion of the total data variability. Mathematically, these are obtained from the sample mean and sample covariance matrix of the n training examples,

m = (1/n) Σ_{i=1}^n v_i,    S = (1/(n-1)) Σ_{i=1}^n (v_i − m)(v_i − m)^T.

Using singular value decomposition, the sample covariance matrix is decomposed as S = U D² U^T. The column u_i of the matrix U ∈ R^{Nd×n} is referred to as the i-th principal component. The principal components form a basis for the vector space in which all the examples are defined. An entry λ_i of the diagonal matrix D² = diag(λ_1, ..., λ_n) denotes how much variance is represented by the i-th principal component [18]. A statistical model v is then defined as the low-dimensional generative model

v = m + Σ_{i=1}^{m} α_i √λ_i u_i,    (1)

where m < n is the number of principal components that span the model space. Each vector of coefficients α = (α_1, ..., α_m) defines a unique shape. Usually it is assumed that α follows a standard normal distribution,

α ∼ N(0, I_m).    (2)

This in turn induces a probability distribution over the discretized vectors representing the objects, namely

v ∼ N(m, U D² U^T) ≈ N(m, S).    (3)

Probabilistic PCA
In Section 2.1 we have sketched the fundamentals of PCA based statistical modeling as it is commonly presented and used in the literature. There is, however, a flaw in this simple probabilistic interpretation (Cf. Equations (2) and (3)). The model assumes that all the probability mass is concentrated on the subspace spanned by the principal components. Thus the probability of an object that does not strictly lie in this subspace is zero. In practice, however, an object might be a valid and likely instance of the class and still not lie exactly on that subspace. This may happen because the subspace is only approximated from a finite number of examples, but also because the data itself is usually noisy. To alleviate this problem, Probabilistic PCA (PPCA) [23] is implemented in Statismo. PPCA provides a well defined probabilistic interpretation by assuming a small amount of noise on the examples. The new generative model is of the form

v = m + W α + ε,    ε ∼ N(0, σ² I),

where the columns of W span the principal subspace (in PPCA, W = U (D² − σ² I)^{1/2} up to rotation [23]). By virtue of the added noise term, this model defines a valid probability distribution on the whole of R^{Nd}. The probability of an instance now depends on its distance from the subspace, i.e. objects that are far away from it are less likely than instances that are closer to it. It can be shown that a standard PCA model is obtained as σ² goes to 0 [22].
3 Design / Architecture

Statismo has been designed to achieve the following goals:

High level: Most applications using statistical models exploit the fact that these models define a probability distribution over the objects of interest. The interface of Statismo is designed to reflect this fact, and provides high-level methods to work with the represented distribution rather than exposing the eigenvalues and eigenvectors of the model.

Independence of data representation:
Statistical models are used to model the variation in various types of objects, using a variety of different toolkits to represent the data. Independently of the concrete representation of the data, the ideas and interpretation behind PCA-based statistical models are always the same. Statismo is designed to work for any kind of PCA model, independently of the data representation or the toolkit used.
Reproducibility and easy data exchange: It is often fruitful to exchange data and models with colleagues or other researchers. An important design goal is to facilitate this exchange by providing a platform independent data format, which contains all the information needed to use the model from an application. Furthermore, to help reproduce an experiment, the stored model should contain all the parameters that were used to build it.

Flexibility and Extensibility:
Statismo aims to simplify the integration of new algorithms for model building. This will not only increase the usefulness of Statismo itself, but will also allow such algorithms to benefit from the functionality implemented in Statismo.
Most terminology used in Statismo is directly motivated by the probabilistic interpretation and does not require further clarification. However, while in statistics the terms sample and dataset are usually used interchangeably, in Statismo these terms have precise meanings. A dataset is a representation of an object used as an input to build the model. A sample represents the same object, but after discretization and normalization have been applied to it (Cf. Section 2.1). It is an instance of the object as it is represented by the model. In other words, the probability distribution defined by the statistical model represents the probability of the samples, but not of the datasets. Finally, a sample vector is the representation of a sample as a vector.

Representers
The key concept that allows Statismo to be independent of a specific data representation, and to build all types of PCA models, is that of a Representer. A Representer provides three main methods: the first takes care of the discretization and normalization of a dataset, the second converts a sample to its vectorial representation, and the last converts the vectorial representation back to an object of the modeled type. Figure 2 illustrates this concept. The input to Statismo is always a dataset. This is converted into a sample, and then into its vectorial representation, from which all the statistics are computed. Note that the vectorial representation is never exposed to the user; it is always converted back to the sample representation. For shape models, for instance, the output sample is a mesh, in the specific data representation used by the Representer.

ITK Support
Statismo was designed to work together with different toolkits and libraries. It was therefore not possible to follow the conventions of a specific toolkit. To facilitate the integration with other libraries, we have tried to make as few assumptions as possible, and to limit the interface to standard C++. The main difference between the two interfaces is that the generic Statismo interface requires all arguments when the object is created (here with the Load function), and thus ensures that the objects are well defined after construction. The memory management is done manually, whereas with the ITK wrapper the memory is managed using ITK smart pointers.
Specifically for ITK, Statismo also provides a class itk::StatisticalModelTransform, which enables the use of statistical models as ITK transforms. This has great practical value, since these transforms can then be used in the ITK registration framework for statistical model fitting.

Using Statismo
In this section we illustrate how Statismo can be used to perform typical tasks related to statistical models. Most examples below use the generic interface with the vtkPolyDataRepresenter. This representer is created by providing it with a reference dataset that defines the discretization and mesh topology. Before converting a dataset to its vectorial representation, it performs a Procrustes alignment [14] of the dataset to the reference, and thus normalizes pose variations automatically. Note that this representer assumes that the datasets are already in correspondence. The last example illustrates the use of Statismo with the ITK registration framework and makes use of the ITK wrappers for Statismo. For this we use the itkMeshRepresenter, which represents datasets of type itk::Mesh.
The illustrations shown in this section have been obtained from a model built from 6 femur bones, which are distributed along with this article. We create a representer and provide it with a reference. As we are using the vtkPolyDataRepresenter, the reference is in this case a vtkPolyData object. Furthermore, we load all the datasets that we wish to use in our model.
DataManagerType* dataManager = DataManagerType::Create(representer);
dataManager->AddDataset(dataset_1, path_to_polydata_1);
...
dataManager->AddDataset(dataset_n, path_to_polydata_n);

In the next step we build a model and save it as an HDF5 file. Loading such a file restores the full model, including the representer. We stress, however, that it is important that the same type of representer is used when loading the model as the one that was used to build it. Statismo will throw an exception if this is not the case.

Visualizing the variability
Probably the first and simplest application that is performed with a new model is to visualize the variability of the modeled objects. This can easily be achieved by drawing random samples and visualizing them using standard tools (such as e.g. ParaView).

vtkPolyData* sample = model->DrawSample();

We can also explore the shape space more systematically by specifying the PCA coefficients for a given sample. In this example, we obtain three samples: the mean, a sample corresponding to 3 standard deviations in the direction of the first principal component, and one corresponding to 3 standard deviations in the opposite direction. Note that each call to DrawSample creates a new sample, which needs to be deleted by the user.

Using statistical models as a prior
In many algorithms, statistical models are used as a shape prior. A common approach is to take an instance from an algorithm and project it back into the shape space, to obtain an approximation of the given shape. In Statismo, such a projection can be obtained by computing the PCA coefficients for a given dataset and then restoring it using the DrawSample method:

vtkPolyData* dataset = someAlgorithm();
VectorType coeffs = model->ComputeCoefficientsForDataset(dataset);
vtkPolyData* projection = model->DrawSample(coeffs);

Often, an algorithm is regularized by penalizing unlikely solutions. For this purpose, Statismo provides a method to compute the (log) probability of a given dataset.

Cross validation
An often neglected but important task is the evaluation of statistical models. One particular focus of interest is the generalization ability of a model: the accuracy with which it is able to represent new instances of the object class. This ability can be estimated through cross-validation studies. Cross-validation has also recently been employed to derive confidence bounds quantifying the reliability of statistical-model-based predictions from partial observations [5,6]. Statismo provides convenience functionality to facilitate such cross-validation studies.
Cross-validation is done using the DataManager.

Model fitting using the ITK registration framework
One of the most common applications of statistical models is to fit them to a patient dataset. In Statismo this is done using the ITK registration framework. In the example we show how a shape model is fitted to an image using the itkPointSetToImageMetric. In this example, we use the ITK wrapper of Statismo, which allows us to use Statismo objects as normal ITK objects.
In the next step we load the model and obtain, via the representer, the reference mesh that was used to build the model. This mesh is used by the PointSetToImageMetric and determines the points at which the metric is evaluated. The registration is then set up in the usual way; we omit the details here.
MeshType::Pointer finalMesh = model->DrawSample(transform->GetCoefficients());

5 Adding more prior information: Constrained PCA models

One goal of Statismo is to allow the integration of different algorithms for building statistical models. Currently, Statismo comes with three different methods to build statistical models (Cf. Figure 1). Besides the standard PCAModelBuilder that was used in all the examples presented so far, there is a PartiallyFixedModelBuilder [16] and a ConditionalModelBuilder [4]. Both model builders incorporate additional prior information into the model building, and thus reduce the variability of the resulting model. The resulting models are standard PCA models themselves, and can be used and interpreted in exactly the same way as the models discussed so far.

Partially fixed models
Partially fixed models allow for introducing geometric constraints into the model. This allows us to model the remaining variability when the shape is known for a part of the domain. A typical application of this model is the reconstruction of a fractured or traumatized bone. Another scenario where these models are useful is model fitting. Consider the shape model fitting example from Section 4.6. Assume that for some points of the model we know the corresponding points in the target image. Using the PartiallyFixedModelBuilder, we can incorporate these constraints into the model building, to obtain a model that represents only shapes that match the target points. The following (incomplete) code illustrates how such models can be built. Figure 4 shows the first main variation for the same shape model as in Figure 3, but this time with 2 points (in red) fixed. We see that the landmark points remain fixed in the first main variation. By using this constrained model for model fitting, we can thus guarantee that the landmark points are matched, without having to change our fitting algorithm. An additional advantage is that the search space becomes smaller, and thus the optimization problem easier. For more details on the theoretical background and the applications of these models, we refer to the papers [16,17].

Conditional models
The second algorithm for constrained model building, implemented in the ConditionalModelBuilder, is used when additional surrogate information about the data is available. The underlying idea is to exploit such information to generate a statistical model of the object variability given prescribed constraints. Statismo handles both continuous and categorical types of surrogates (e.g. the height, age and gender of the patient). To achieve this, the potentially useful surrogate variables have to be associated with the training samples.
A special DataManagerWithSurrogates is provided for this purpose.When creating such a data manager, a file describing the number and types of variables is passed as an argument.The class has a special method to add datasets together with the associated surrogate data: AddDatasetWithSurrogates(dataset, surrogateFilename) (the exact format of these input files is described in the internal documentation).
In order to learn a conditional model, the user then specifies which of the surrogate variables are used for conditioning, and the corresponding values against which the model should be conditioned, by filling a structure of type ConditionalModelBuilder<RepresenterType>::CondVariableValueVectorType.
In this implementation, only the training samples that respect the requested categories are selected. Continuous variables are used to learn a conditional distribution of the variability, under the assumption of joint multivariate normality. Practically speaking, an unconditional statistical model is first learned from the selected training samples using the PCAModelBuilder class. The PCA coefficients of each of these training samples are computed and pooled with the requested continuous surrogates. The conditional distribution of the model coefficients given the prescribed constraints is then derived. The assumption of multivariate normality allows this conditional model in the reduced space to be reformulated back in the original high-dimensional space. Therefore, the obtained conditional model is expressed in the exact same mathematical formulation as standard PCA models. The implementation is currently limited to cases where the number of retained training samples (after selection based on categories) is larger than the number of retained PCA coefficients plus the number of continuous surrogate variables used as conditions.
More details, and possible extensions to more complex distribution models, can be found in [4].

Conclusion

A Appendix
A.1 The file format

Statismo uses the platform independent HDF5 format [1] to save the statistical models. The HDF5 file contains all the information that is needed to use the model from an application, as well as information about the data and parameters used to create the model. HDF5 provides a hierarchical, file-system like data organization.
For Statismo, we have organized the data into 3 main groups:

Model Contains the vectors and matrices that define the model, i.e. the mean, the pcaBasis, the pcaVariance, as well as the noiseVariance assumed in the model.

Modelinfo Holds the metadata for the model. The metadata consists of the buildtime of the model, the scores (the PCA coefficients of the original example data), a section builderInfo that contains all the parameters of the specific model builder used to build the model, as well as a section dataInfo that contains the URIs of the datasets that have been used to build the model.

Representer This group is used to store all the information that is needed by the representer.
The data stored in the HDF5 file can either be explored using the API provided by Statismo (Cf. the class StatisticalModel), or directly from one of the many software packages that support HDF5, such as e.g. Matlab [19], R [21] or PyTables [3].

A.2 Currently available Representers
Statismo comes with several Representer classes for building various types of statistical models using VTK and ITK. Currently, all the representers require that the input datasets are discretized in the same way and are in correspondence.
In the following we briefly describe the different Representers:

TrivialVectorialRepresenter This is the most basic representer. It can be used to build statistical models in the case where the input is already available in a vector representation.
vtkPolyDataRepresenter Used to build shape models from VTK, in the case where the data is represented as vtkPolyData. This representer performs a Procrustes alignment [14] of the datasets before the model is built.
vtkStructuredPointsRepresenter This representer is used to build image (intensity) models and deformation models from data that is represented as vtkStructuredPoints.
itkMeshRepresenter This representer is used to build shape models from ITK meshes (itk::Mesh). In contrast to the vtkPolyDataRepresenter, no alignment is performed.
itkImageRepresenter Used to build 2D and 3D image (intensity) models from ITK images.
itkVectorImageRepresenter Used to build 2D and 3D deformation models from itk displacement fields.
itkVectorImageLMAlignRepresenter This representer is similar to the itkVectorImageRepresenter, but in addition the user can provide a number of landmark points. The input displacement fields are evaluated at these landmark points to obtain a set of corresponding points. From these points, a rigid transformation is computed, which is used to resample the displacement fields.

A.3 Third Party libraries used in Statismo
Statismo depends on a number of open source libraries: HDF5 [1] is used to store the statistical models. For linear algebra we use the Eigen template library [11]. Furthermore, Boost.TR1 is used to make TR1 available on all platforms. All the libraries are distributed along with Statismo. Detailed information regarding the license terms of the different libraries can be found in the source tree.

Figure 1: The core classes in Statismo.

Figure 2: The dataflow in Statismo. Inputs to Statismo are datasets, in a user defined data representation. The output is always a sample from the probability distribution over the objects that are modeled.

1. Generation and storage of a statistical model
2. Loading a model from disk
3. Visualizing the variability in a model
4. Computing the likelihood of a dataset
5. Performing a cross-validation experiment
6. Using the ITK registration framework for model fitting

Figure 3: The first mode of variation (±3 standard deviations) for a shape model of the femur. In this case the first variation mainly captures the size of the bone.

Figure 4: A shape model of the femur, with two landmark points (in red) fixed. As the landmark points determine the size of the femur, the first component no longer captures the size (as in Figure 3), but mainly shows the variations of the femoral head and the condyles.
4.1 Building statistical models in Statismo

The first step in using Statismo is to define which representer is used and to parametrize all classes with that representer.
At the end we need to delete the created Statismo objects (an alternative would, of course, be to use smart pointers). In most applications, the first step is not to build a model, but to load an existing model. A model is loaded using the Load function. After we have set up the DataManager and populated it with the training data, we can obtain a list of cross-validation folds. Each entry of the cvFoldList contains a list of training samples and a list of test samples. We iterate over all the folds and build a new model for each fold. A typical validation of the model consists of computing an approximation error between each test sample and its projection into the model.
for (CVFoldListType::const_iterator it = cvFoldList.begin(); it != cvFoldList.end(); ++it) {
    ...
}

In this paper we have presented Statismo, which is a framework for PCA based statistical models. Statismo implements a high-level view on PCA models by interpreting these models as a normal distribution over the modeled objects. The underlying low-level vectorial representation of an object is never exposed to the user. The conversion between the data representation of the user and the vectorial representation used by Statismo is done via special Representer classes, which are defined for each data representation. This allows Statismo to ensure that the same discretization and normalization steps are applied when using the model as those that were used during model building. A statistical model in Statismo thus always has a well-defined interpretation and semantics. This, together with the fact that Statismo writes all its information into a single, portable HDF5 file, greatly simplifies the exchange of statistical models built with Statismo. To use the model, the user only needs to have access to the same Representer that was used to build the model.

Statismo comes with ready-made Representer classes for the most important object representations used in VTK and ITK. Currently all these representers require that the data is already in correspondence. It is, however, straightforward to write representers that also establish correspondence, and thus to cover the full model creation pipeline with Statismo. By writing custom representers, it also becomes possible to use Statismo with other toolkits than ITK or VTK. Statismo also simplifies the integration of new methods for building models. As it is a powerful toolkit for building statistical models, we hope that the number of different representers and model builders for Statismo will grow quickly, enabling its use in several applications across various fields.