An automated pattern recognition system for classifying indirect immunofluorescence images of HEp-2 cells and specimens

Immunofluorescence antinuclear antibody tests are important for diagnosis and management of autoimmune conditions; a key step that would benefit from reliable automation is the recognition of subcellular patterns suggestive of different diseases. We present a system to recognize such patterns, at cellular and specimen levels, in images of HEp-2 cells. Ensembles of SVMs were trained to classify cells into six classes based on sparse encoding of texture features with cell pyramids, capturing spatial, multi-scale structure. A similar approach was used to classify specimens into seven classes. Software implementations were submitted to an international contest hosted by ICPR 2014 (Performance Evaluation of Indirect Immunofluorescence Image Analysis Systems). Mean class accuracies obtained on held-out test data sets were 87.1% and 88.5% for cell and specimen classification respectively. These were the highest achieved in the competition, suggesting that our methods are state-of-the-art. We provide detailed descriptions and extensive experiments with various features and encoding methods.


Introduction
Antinuclear antibody (ANA) tests are important in the diagnosis and management of autoimmune diseases. These include systemic lupus erythematosus, Sjögren's syndrome, rheumatoid arthritis, polymyositis, scleroderma, Hashimoto's thyroiditis, juvenile diabetes mellitus, Addison's disease, vitiligo, pernicious anemia, glomerulonephritis, and pulmonary fibrosis. Immunofluorescence ANA tests have been recommended as the gold standard for ANA testing due to their relatively high sensitivity [1]. Specifically, human epithelial (HEp-2) cell specimens are examined using Indirect ImmunoFluorescence (IIF) imaging [2]. The IIF procedure includes the following steps: image acquisition, mitosis detection, fluorescence intensity classification and staining pattern recognition. The pattern recognition step is an important one as different patterns are suggestive of different autoimmune diseases. A nucleolar pattern is often associated with scleroderma, for example. Manual analysis of IIF images is laborious and time-consuming. Furthermore, two or more experts can be required to examine each ANA sample due to inter-observer variability.
Aiming to standardize the procedure compared to current manual practice and to reduce workloads, computer aided diagnosis (CAD) systems have been proposed for the analysis of IIF images [3,4].
This paper describes a system to classify pre-segmented immunofluorescence images of individual HEp-2 cells into six classes (homogeneous, speckled, nucleolar, centromere, Golgi, and nuclear membrane) as well as a system to classify HEp-2 specimen images into seven classes (homogeneous, speckled, nucleolar, centromere, Golgi, nuclear membrane and mitotic spindle). Instances from these classes are shown in Figs. 1 and 2. These two classification tasks correspond to those used in the contest on Performance Evaluation of Indirect Immunofluorescence Image Analysis Systems (I3A) held in conjunction with the 22nd International Conference on Pattern Recognition (ICPR 2014). Our approach to classifying cell and specimen images is based on sets of local features which describe texture and intensity properties of local image regions. Extracted features are encoded via sparse coding and classification is performed using support vector machine (SVM) ensembles.
After reviewing competing methods (Section 2) and describing our proposed method in detail (Sections 3-7), results on the I3A contest datasets are presented (Sections 8-10). This paper builds on our earlier I3A workshop papers [5,6]. It sets the proposed method in the context of related literature, describes it in more detail, presents more extensive experiments to investigate the effect of various components on performance, and summarizes performance in direct comparison with other methods. We present systems tailored to capture cell class-specific properties, leveraging state-of-the-art computer vision techniques. The experiments we report should help guide future developments in this area by providing evidence on the relative contributions of, e.g., feature combinations, data augmentation, cell pyramids, and sparse coding. The systems we proposed and describe in this paper won both tasks at the I3A contest in controlled tests. As such, they can serve as benchmarks for researchers developing pattern recognition systems for this important application.

Related work
This section concisely reviews previous work related to HEp-2 cell image classification in the context of ANA testing. There exists a wider literature on the recognition of fluorescence image patterns characteristic of subcellular structures more generally [7]. However, its review is beyond the scope of this paper.
Perner et al. [8] presented an early attempt at developing an automated HEp-2 cell classification system. Cell regions were represented by a set of basic features extracted from binary images obtained at multiple grey-level thresholds. Those features were then classified into six categories by a decision tree algorithm. This approach was further employed and integrated by Sack et al. [9] for identification of positive fluorescence and a set of immunofluorescence patterns. Hsieh et al. [10] performed classification of immunofluorescence patterns using learning vector quantization (LVQ) and various texture features that included grey-level histogram statistics, co-occurrence matrix features, and an estimate of fractal dimension.
The problem of HEp-2 cell classification attracted increased attention among researchers with the benchmarking contests held in conjunction with the International Conference on Pattern Recognition (ICPR) in 2012 [11] (HEp-2 Cell Classification) and the International Conference on Image Processing (ICIP) in 2013 [12] (Competition on Cells Classification by Fluorescent Image Analysis). A special issue following the ICPR 2012 contest was also organized in the journal Pattern Recognition [4]. Various image descriptors were adopted in those contests, including popular local texture features such as local binary patterns (LBP) and its variants (e.g., multi-resolution LBP), SIFT, summative intensity statistics (e.g., mean and standard deviation) of local or whole image regions, and morphological properties (e.g., eccentricity) [12]. Standard feature encoding methods, in particular bag of words (BoW), were often applied to represent the statistics of local features in the feature space. The majority of the classifiers used were support vector machines, although k-nearest neighbour classifiers, boosting classifiers, random forest classifiers, and neural networks were also used by some groups [4,12]. For more detailed descriptions of the methods submitted to those contests, the reader is referred to the associated reports [4,11,12].
The I3A contest held in conjunction with ICPR 2014 was the most recent in this series of contests. It received 11 submissions to Task 1 (cell classification) and 7 submissions to Task 2 (specimen classification) [3]. Methods submitted for Task 1 can be classified broadly into two categories: those with feature coding and those without. The feature coding-based methods basically adopted the popular bag of words framework, using either hard coding or sparse coding [13,14] and either SVM or boosting classifiers. They differed in their choice of local features. For instance, Ensafi et al. [15] used SIFT and SURF features with sparse coding and max pooling. Theodorakopoulos et al. [16] combined a set of local features, including LBP and rotation-invariant SIFT, with vectors of locally aggregated descriptors (VLAD) [17]. Paisitkriangkrai et al.'s method (as described in [3]) combined features including region covariance and co-occurrence of adjacent LBPs. Gragnaniello et al. [18] used a recently developed local feature called dense scale-invariant descriptors [19] to characterize cell images. The feature coding methods were among the top performers in the contest [3]. The methods without feature coding basically followed a paradigm of global feature representation with a popular machine learning classifier. For example, Roberts' method (as described in [3]) used a set of wavelet-based features with an SVM classifier. Codrescu [20] trained a neural network with finite impulse response filters applied to cell images. Ponomarev et al.'s method (as described in [3]) utilized a set of morphological features (such as the number of isolated regions in each cell) with an SVM classifier. Taormina et al.'s method (as described in [3]) used an ensemble of nearest neighbour classifiers on 108 features. Those methods generally performed less well in the contest. It is worth noting that Gao et al.
[21] used a deep convolutional neural network (CNN) with 8 layers. The last layer was in principle a logistic regression classifier with a soft-max activation function. This method performed reasonably well in the contest. Although deep CNNs have proven very successful on various large-scale image classification benchmarks, they have very large numbers of parameters and it can be difficult to tune their structure to a specific task such as I3A.
Most of the methods used for Task 2 (specimen classification) first performed classification at the individual cell (or sub-region) level using approaches similar to Task 1 and then aggregated the results by majority voting. For instance, Ensafi et al. [15] performed specimen classification by applying voting to the cell classification results for a small number of extracted cells. Similarly, Liu et al. and Paisitkriangkrai et al. (as described in [3]) classified specimen images based on majority voting over regions, where only features within a cell were considered when describing a region. Ponomarev et al. [3] applied an unweighted voting scheme for a final classification of the specimen image based on morphological features.
Notable trends when comparing the I3A contest entries to previous work are the use of more advanced hand-crafted local features (e.g., multi-resolution local patterns with cell pyramids as in our entries [5,6], and dense scale-invariant descriptors [19]), dataset augmentation, and the deployment of deep learning for automatic feature learning. Our proposed method benefits from a combination of the following factors which we believe contribute to its state-of-the-art performance: complementary multi-resolution features, the use of a specifically designed spatial structure for cell images, sparse coding, and data augmentation.

System overview
Fig. 3 gives an overview of the system used for generating a feature representation from an image of a cell for input to a classifier. Firstly, each cell image was intensity-normalized. Sets of local features were then extracted and a feature encoding method (e.g., sparse coding) was employed to aggregate the local features into a cell image representation. A two-level cell pyramid was used to capture the spatial structure of cell images. Support vector machines were then used to classify images of cells or to classify specimen images containing multiple cells. The following sections describe these system components in detail. Experiments investigating the effect of different system components, feature representations and encodings are then reported.

Local feature extraction
Prior to feature extraction, each cell's image was intensity-normalized; specifically, the segmentation mask was dilated (using a 5 × 5 structuring element) and the intensity values within each cell's dilated mask region were then linearly rescaled so that 2% of pixels in each cell became saturated at low and high intensities (Fig. 4).
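As a concrete illustration, the preprocessing step can be sketched as follows. This is a minimal sketch, not the authors' implementation: the split of the 2% saturation between the two intensity tails and the handling of image borders are assumptions, and `normalize_cell` is a hypothetical helper name.

```python
import numpy as np

def normalize_cell(image, mask, saturate=0.02):
    """Intensity-normalize one cell image (sketch): dilate the segmentation
    mask with a 5x5 structuring element, then linearly rescale intensities
    inside the dilated region so that roughly 2% of its pixels saturate at
    the low and high ends (split 1% per tail here, an assumption)."""
    m = mask.astype(bool)
    # Naive 5x5 binary dilation via shifted ORs (np.roll wraps at image
    # borders, which is acceptable for cells away from the edge).
    dilated = np.zeros_like(m)
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            dilated |= np.roll(np.roll(m, dy, axis=0), dx, axis=1)
    vals = image[dilated].astype(np.float64)
    lo, hi = np.percentile(vals, [50 * saturate, 100 - 50 * saturate])
    out = np.clip((image.astype(np.float64) - lo) / max(hi - lo, 1e-12), 0.0, 1.0)
    out[~dilated] = 0.0  # keep only the dilated cell region
    return out
```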
Local features were extracted densely from each pre-processed cell image. Four types of local feature were considered, namely, Multi-resolution Local Patterns (mLP), Root-SIFT (rSIFT), Random Projections (RP) and Intensity Histograms (IH).

Multi-resolution Local Patterns
Multi-resolution Local Patterns (mLP) are a multi-resolution adaptation of the local higher-order statistical (LHS) patterns proposed by Sharma et al. [22] for texture classification. LHS is a non-binarized version of the well-known Local Binary Patterns (LBP) descriptor. It operates on a small image neighbourhood of size 3 × 3. To capture information from a larger neighbourhood and reduce noise effects, we extended LHS from a multi-resolution perspective by employing the sampling patterns described by Mäenpää [23]. This sampling pattern is inspired by the spatial structure of receptive fields in the human retina and has been widely adopted in recently developed visual features in computer vision, e.g., BRISK [24]. Fig. 5 shows an example sampling pattern and the generation of the mLP descriptor. In Fig. 5 the local neighbourhood is quantized radially into three resolutions (radii), and at each resolution a set of N = 8 sampling regions (indicated as circles) is considered. At each sampling point a Gaussian filter is applied, integrating information from the filter's region of support. We call the combination of LHS and these sampling patterns multi-resolution local patterns.

Root-SIFT
Root-SIFT (rSIFT) is a variant of the widely used SIFT descriptor that produces better performance than SIFT on some image matching and retrieval tasks [25]. The standard SIFT descriptor is a histogram representation of local image derivatives and was originally designed to be used with Euclidean distance. Using Euclidean distance to compare histograms often yields inferior performance compared to other measures such as χ² or Hellinger for texture classification and image categorization [25]. Therefore, standard SIFT was modified in [25] to create rSIFT such that comparing rSIFT descriptors using Euclidean distance is equivalent to using the Hellinger kernel to compare SIFT vectors.
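The transformation from SIFT to rSIFT described in [25] is a two-step operation: L1-normalize each descriptor, then take the element-wise square root. A minimal sketch (the function name is ours):

```python
import numpy as np

def root_sift(descriptors, eps=1e-12):
    """Map SIFT descriptors (rows, non-negative entries) to Root-SIFT:
    L1-normalize each row, then take the element-wise square root.
    Euclidean distance between the outputs then corresponds to the
    Hellinger kernel on the L1-normalized inputs."""
    d = np.array(descriptors, dtype=np.float64)  # copy, do not modify input
    d /= d.sum(axis=1, keepdims=True) + eps
    return np.sqrt(d)
```

Since each output row has unit L2 norm, the squared Euclidean distance between two rSIFT vectors equals 2 minus twice the Hellinger kernel of the underlying SIFT histograms.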

Random projection
Random projection (RP) is a simple yet powerful method for dimensionality reduction [26]. It projects patch intensity vectors from the original patch-vector space ℝ^{D_0} to a compressed space ℝ^D using randomly chosen projection vectors. Such a scheme has been successfully applied to texture image classification [27]. Let x be a D_0-dimensional patch vector and x̂ be its D-dimensional representation in the compressed space. The RP method simply maps these vectors using a D × D_0 random projection matrix R, such that x̂ = Rx. Each element in matrix R is sampled from a Gaussian distribution with zero mean and unit variance. The key point of RP is that when projecting the patch-vectors from the original space to the compressed space their relative distances are approximately preserved.
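The mapping x̂ = Rx can be sketched in a few lines. The 1/√D scaling applied below is a common convention for making expected squared distances comparable before and after projection; it is not stated in the text above and is an assumption.

```python
import numpy as np

def random_projection(X, d_out, seed=0):
    """Project row vectors from R^{D0} down to R^{d_out} with a Gaussian
    random matrix whose entries are sampled from N(0, 1). Relative
    pairwise distances are approximately preserved
    (Johnson-Lindenstrauss)."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((d_out, X.shape[1]))  # D x D0, entries ~ N(0,1)
    return (X @ R.T) / np.sqrt(d_out)             # scaling is a convention
```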

Intensity histograms
Intensity histograms (IH) were computed from small image patches to represent the local intensity distribution.

Feature encoding
Feature encoding methods aggregate the local features from an image or image region and play an important role in classification. Four methods for feature encoding were compared, namely bag-of-words (BoW), a sparse coding method (SC), Fisher vectors (FV), and vectors of locally aggregated descriptors (VLAD). A pyramid was used to encode spatial structure. Each of these methods is now described briefly.

Bag-of-Words
Bag-of-Words (BoW) is widely applied as a feature encoding method for medical [2] as well as natural [13,28] image classification. In BoW, local features sampled from training images are clustered to build a dictionary (codebook). This dictionary represents a set of visual words (or clusters) which are then used to compute a BoW frequency histogram as a feature vector representation of any given image. BoW uses hard quantization, where each local image descriptor is assigned to only one visual word.
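The hard-quantization step can be sketched as follows; in practice the codebook would come from k-means clustering of training descriptors, and the normalization choice here is an assumption.

```python
import numpy as np

def bow_encode(descriptors, codebook):
    """Hard-assignment BoW: each descriptor votes for its nearest visual
    word; the image is represented by the normalized histogram of votes.
    descriptors: (N, D) local features; codebook: (M, D) visual words."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(axis=1), minlength=len(codebook))
    return hist / max(hist.sum(), 1)
```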

Sparse coding
Sparse coding (SC) has shown improved performance over BoW for image classification [28]. In SC each local image descriptor is reconstructed using a sparse weighted combination of visual words. Locality-constrained linear coding (LLC), an efficient variant of sparse coding, utilizes the local linear property of manifolds to project each descriptor into its local coordinate system [14]. Let X_i = [x_{i1}, ..., x_{iN_i}] ∈ ℝ^{D×N_i} be a matrix in which each of the N_i columns is a D-dimensional local descriptor extracted from an image I_i, and let B ∈ ℝ^{D×M} be a dictionary of M visual words. For each descriptor x_{ij}, LLC solves

min_{c_{ij}} ‖x_{ij} − B c_{ij}‖² + λ ‖d_{ij} ⊙ c_{ij}‖²   subject to   1ᵀ c_{ij} = 1,

where ⊙ denotes element-wise multiplication, d_{ij} = exp(dist(x_{ij}, B)/σ) penalizes the use of visual words far from x_{ij}, and σ is a decay parameter. A fast approximation to LLC was described in [14] to speed up the encoding process. Specifically, instead of solving the constrained problem above, the K (K < D < M) nearest neighbours of x_{ij} in B were considered as the local bases B_{ij} and a much smaller linear system was solved to get the local linear codes. The image representation z_i of an image I_i is then obtained by pooling the sparse codes associated with the local descriptors. Two kinds of pooling, max and sum, are applied in the literature for SC: max pooling is defined as z_i^k = max_j c_{ij}^k, j = 1, ..., N_i, and sum pooling as z_i^k = Σ_{j=1}^{N_i} c_{ij}^k, where z_i^k and c_{ij}^k are respectively the kth elements of z_i and c_{ij} [29].
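The approximated LLC step and the two pooling rules can be sketched as below. This is an illustrative sketch of the K-nearest-neighbour approximation from [14], not the authors' code; the regularization constant `beta` is an assumption.

```python
import numpy as np

def llc_encode(X, B, K=5, beta=1e-4):
    """Approximated LLC (sketch): reconstruct each descriptor from its K
    nearest codewords by solving a small least-squares system with a
    sum-to-one constraint. X: (N, D) descriptors; B: (M, D) dictionary.
    Returns an (N, M) matrix of sparse codes (K non-zeros per row)."""
    N = X.shape[0]
    codes = np.zeros((N, B.shape[0]))
    d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :K]
    for i in range(N):
        z = B[knn[i]] - X[i]                            # shift bases to origin
        C = z @ z.T                                     # local covariance
        C += np.eye(K) * (beta * np.trace(C) + 1e-12)   # regularize
        w = np.linalg.solve(C, np.ones(K))
        codes[i, knn[i]] = w / w.sum()                  # enforce sum-to-one
    return codes

def pool_codes(codes, method="max"):
    """Max or sum pooling of per-descriptor codes into z_i."""
    return codes.max(axis=0) if method == "max" else codes.sum(axis=0)
```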

Fisher vectors
Fisher vectors (FV) capture additional information about the distribution of the image descriptors compared to the count (0th-order) statistics in BoW. FV has shown improved performance over BoW and SC for image classification in [30]. In FV, the dictionary is first modelled as a Gaussian mixture model (GMM) p(x | Θ), where Θ = {w_m, μ_m, Σ_m}_{m=1}^M are respectively the weight, the mean and the covariance of the mth Gaussian. The GMM uses a soft descriptor-to-cluster assignment: the posterior probability q_{jm} of descriptor x_{ij} belonging to the mth Gaussian is

q_{jm} = w_m p_m(x_{ij}) / Σ_{k=1}^M w_k p_k(x_{ij}).

In FV each cluster is then represented based on the derivative of the GMM log-likelihood with respect to its parameters {μ_m} and {Σ_m} (1st- and 2nd-order statistics). With diagonal covariances Σ_m = diag(σ_m²), these are

G_{μ_m}^i = (1 / (N_i √w_m)) Σ_j q_{jm} (x_{ij} − μ_m) / σ_m,
G_{Σ_m}^i = (1 / (N_i √(2 w_m))) Σ_j q_{jm} [ (x_{ij} − μ_m)² / σ_m² − 1 ].

The final image description z_i is the concatenation of G_{μ_m}^i and G_{Σ_m}^i for all m = 1, ..., M, leading to a dimensionality of 2MD.
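A sketch of the FV computation for a diagonal-covariance GMM follows; it assumes the GMM parameters have already been fitted to training descriptors (e.g., by EM), which is outside the snippet.

```python
import numpy as np

def fisher_vector(X, w, mu, var):
    """FV sketch for a diagonal-covariance GMM. X: (N, D) descriptors;
    w: (M,) mixture weights; mu, var: (M, D) means and variances.
    Returns the 2*M*D concatenation of 1st- and 2nd-order gradients."""
    N = X.shape[0]
    # Soft assignments q[j, m]: Gaussian posteriors, computed in log space.
    log_p = (np.log(w)[None, :]
             - 0.5 * np.log(2 * np.pi * var).sum(-1)[None, :]
             - 0.5 * (((X[:, None, :] - mu[None]) ** 2) / var[None]).sum(-1))
    log_p -= log_p.max(axis=1, keepdims=True)
    q = np.exp(log_p)
    q /= q.sum(axis=1, keepdims=True)
    # Normalized residuals, then the two gradient blocks.
    diff = (X[:, None, :] - mu[None]) / np.sqrt(var)[None]
    G_mu = (q[:, :, None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    G_var = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    return np.concatenate([G_mu.ravel(), G_var.ravel()])
```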

Vector of locally aggregated descriptors
Vector of locally aggregated descriptors (VLAD) [17], a simple approximation to FV, uses k-means to learn the dictionary. VLAD uses the 1st-order statistics to represent each cluster; the representation of the mth cluster Q_m for an image I_i is the sum of residuals

v_m^i = Σ_{x_{ij} ∈ Q_m} (x_{ij} − μ_m),

where μ_m is the mth cluster centre. The VLAD image representation z_i is the concatenation of v_m^i for all m = 1, ..., M followed by L2 normalization, leading to a dimensionality of MD.
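The VLAD aggregation above can be sketched directly; the k-means centres are assumed to have been learned beforehand.

```python
import numpy as np

def vlad_encode(X, centers):
    """VLAD sketch: sum the residuals of descriptors to their nearest
    k-means centre, concatenate over centres, then L2-normalize.
    X: (N, D) descriptors; centers: (M, D); output dimensionality M*D."""
    assign = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
    v = np.zeros_like(centers, dtype=np.float64)
    for m in range(len(centers)):
        members = X[assign == m]
        if len(members):
            v[m] = (members - centers[m]).sum(axis=0)  # residual sum
    z = v.ravel()
    return z / (np.linalg.norm(z) + 1e-12)
```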

HEp-2 cell classification
From each cell image, each of the four feature types was densely extracted from patches of size 12 × 12, 16 × 16, and 20 × 20 pixels with a step-size of 2 pixels.

SVM ensemble
Augmenting a classifier's training set with rotated versions of the images may improve classification performance, but it also increases memory requirements. Instead, we used an ensemble of multi-class, one-vs-rest, linear SVMs; the ensemble consisted of four SVMs, one trained on the original training set images, and the others trained on images after they were rotated through 90°, 180°, and 270°. The overall system, which includes data augmentation as well as ensemble training, is shown in Fig. 6. At test time, each test image was rotated by 0°, 90°, 180°, and 270°, and each rotated image was then given to the ensemble. This resulted in a set of 16 classification scores for each class (4 rotations × 4 SVMs in the ensemble). Scores were treated as probabilities using Platt rescaling [31]. The final classification decision was made by averaging these probabilistic scores and selecting the highest-scoring class. Fig. 7 illustrates the process of classifying a cell image in detail.
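The decision rule just described can be sketched as follows. `predict_proba` stands in for the full feature-extraction plus Platt-calibrated SVM pipeline; it is an assumed interface, not the authors' API.

```python
import numpy as np

def ensemble_predict(models, image, predict_proba):
    """Apply each of the four SVMs (trained on 0/90/180/270-degree rotated
    data) to each of the four rotations of the test image, average the
    16 probability vectors, and choose the highest-scoring class."""
    probs = [predict_proba(model, np.rot90(image, k))
             for model in models for k in range(4)]
    mean_scores = np.mean(probs, axis=0)
    return int(mean_scores.argmax()), mean_scores
```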
Ensemble classification has previously been used for HEp-2 cell classification by Schaefer et al. [32] in the form of a trained linear fusion of classifier outputs. Here the approach we have adopted is based on simply averaging classifier outputs, avoiding the need to train a fuser. This approach has been used, for example, to combine the outputs of neural network columns [33] and was shown to perform better than a trained linear combination on handwritten digit recognition [34]. For earlier work on this issue, see e.g., Duin [35].

Cell pyramids
To capture spatial structure within a cell, a 2-level cell pyramid was used in a similar fashion to the dual-region approach used by Wiliem et al. [2,36]. Separate dictionaries of size M were learned for each feature type described above. The local features encoded using these dictionaries were pooled to get an image representation. At the first level of the cell pyramid, the encoded features from the whole cell were pooled to get a feature vector of size P (e.g., for BoW, P = M). At the second level, feature vectors were computed from the inner region and from the border region of each cell (see Fig. 3). These three feature vectors were concatenated to give a 3P-dimensional vector. Finally, encoded features from each of the four feature types were concatenated to give a 12P-dimensional vector on which classification was based.
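The pyramid concatenation can be sketched as below; the function and argument names are ours, and the pooling function and region assignment are passed in rather than fixed.

```python
import numpy as np

def cell_pyramid_features(codes, in_inner, pool):
    """Two-level cell pyramid (sketch). codes: (N, P) encoded local
    features for one cell; in_inner: boolean (N,) marking patches whose
    centres fall in the inner cell region; pool: function (n, P) -> (P,).
    Level 1 pools over the whole cell; level 2 pools inner and border
    regions separately; the three P-vectors are concatenated (3P dims)."""
    whole = pool(codes)
    inner = pool(codes[in_inner])
    border = pool(codes[~in_inner])
    return np.concatenate([whole, inner, border])
```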

HEp-2 specimen classification
Our approach for classifying specimens is similar to the approach we proposed for cell image classification, the main difference being that we extracted features at two sets of scales. Specifically, after preprocessing, two types of local feature, mLP and rSIFT, were extracted. These features were densely extracted from image patches using a patch step-size of 4 pixels. Both small patches (12 × 12 and 16 × 16 pixels) and large patches (48 × 48 and 64 × 64 pixels) were used. Intuitively, small patches can capture local properties at the cellular level while large patches can capture information about groups of cells. Features from outside the dilated cell masks were discarded.
A separate dictionary of size M was learned for each feature type with each group of patches (i.e., one for small and one for large patches). The image representations using the small-patch dictionary and the large-patch dictionary for each feature type were concatenated. Sparse coding with max-pooling was used.
The dataset provided for Task 2, described in Section 8.2, was augmented by including a 90°-rotated version of each original image. This resulted in a set of 2016 images. Five sub-images were extracted from each image based on the layout shown in Fig. 8(a), with the exception of images in the mitotic spindle class. In mitotic spindle images, metaphase cells in which a stained mitotic spindle was apparent were manually identified. Five sub-images were then extracted around those cells, with some random variation, as shown in Fig. 8(b). Finally, the 10,080 extracted sub-images were added to the 2016 images, resulting in an augmented dataset of 12,096 images.
A one-vs-rest, multi-class SVM classifier was then trained on the augmented dataset. Since each specimen was imaged at four different locations, at test time the SVM classifier was applied to each of these four images, resulting in four sets of classification scores per specimen. Scores were treated as probabilities using Platt rescaling [31]. The final classification decision was made by averaging these probabilistic scores and selecting the class with the highest average score.

Implementation details
The following parameter settings were used for the different feature types. mLP: a 3-resolution version with 8 sampling points at each resolution was used, as shown in Fig. 5. The parameters of the Gaussian filters at each sampling point were selected as in [23].
RP: the dimension D_0 of each linearized patch was reduced to D = 300 whenever D_0 > 300. IH: local intensity histograms of 256 bins were used.
The public library vlfeat [37] was used for dictionary learning and feature encoding. For SC, we used the implementation of LLC from [14] with 10 nearest neighbours (K = 10). K-means with 300,000 randomly selected instances of each type of local feature was used to build the dictionaries for the BoW, SC and VLAD methods.
In all the reported experiments we used the L2 and power normalizations [30] to normalize the final image representation z_i of an image I_i:

z_i ← sign(z_i) |z_i|^{1/2} / ‖ sign(z_i) |z_i|^{1/2} ‖_2,

where |z_i|^{1/2} applies the square root to each component of z_i. We used the LIBLINEAR [38] implementation for the SVM classifiers. The code from the authors of [39] was used for Platt scaling. We sampled equal numbers of positive and negative instances from the training set when learning the Platt calibration [31].
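The power and L2 normalization applied to each final representation is a short operation; a minimal sketch:

```python
import numpy as np

def power_l2_normalize(z):
    """Signed square-root (power) normalization followed by L2
    normalization, as in [30]: z <- sign(z) * sqrt(|z|), then divide
    by the L2 norm."""
    z = np.sign(z) * np.sqrt(np.abs(z))
    return z / (np.linalg.norm(z) + 1e-12)
```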
Mean Class Accuracy (MCA) was used as one of the evaluation metrics, as required by the I3A contest. It is defined as

MCA = (1/K) Σ_{k=1}^{K} CCR_k,

where CCR_k is the correct classification rate for class k and K is the number of classes.
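The metric can be computed as follows; unlike plain accuracy, every class contributes equally regardless of its frequency.

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred):
    """MCA: the unweighted mean of per-class correct-classification
    rates (CCR_k), so rare classes count as much as common ones."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    ccr = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(ccr))
```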

Datasets
In the reported experiments, the Task 1 and Task 2 datasets from the I3A contest were used. These were collected between 2011 and 2013 at the Sullivan Nicolaides pathology laboratory, Australia. For each task, a set of training images was provided to the contest participants. Submitted systems were then evaluated on a separate hidden test set which was privately maintained by the contest organizers and not released to the participants. The results obtained on the contest's hidden test set by our entries are reported in Sections 9.9 and 10.4. This paper also reports cross-validation results on the contest training sets.
The Task 1 dataset consists of 68,429 images of individual cells extracted from 419 positive patient sera (approximately 100-200 cell images per patient serum) along with their binary segmentation masks. 13,596 images were available for training. The remaining 54,833 images were used for the hidden test set to evaluate the performance of systems submitted to the contest. The specimens were automatically photographed using a monochrome, high dynamic range, cooled microscopy camera. Cell images are approximately 70 × 70 pixels in size. The dataset has six pattern classes: homogeneous, speckled, nucleolar, centromere, nuclear membrane, and Golgi. An example image from each of the six classes is given in Fig. 1.
The Task 2 dataset consists of uncompressed, monochromatic images of 1001 patient sera with a positive ANA test. Each specimen was photographed at four different locations (four images per specimen). A total of 1008 images from 252 specimens were made available (approximately 25% of the data) while the remaining images were retained by the organizers for testing. Each image was 1388 × 1040 pixels, and cell masks were obtained based on an automatic segmentation of each image. The dataset has seven pattern classes: homogeneous, speckled, nucleolar, centromere, nuclear membrane, Golgi and mitotic spindle. An example image from each of the seven classes is given in Fig. 2.
The distribution of classes for both tasks is shown in Table 1. The homogeneous, speckled, nucleolar and centromere classes represent common ANA patterns whilst the other three classes are less common.
Experimental results are presented in the following two sections. For cell classification (Section 9), we first used the I3A contest training set to explore the effect that different aspects of the system had on performance. These aspects included feature types, encoding methods, cell pyramids, and data augmentation. Subsequently, we used a leave-one-specimen-out setting to test the ability to classify cells from previously unseen specimens. Finally, we include the results reported on the held-out contest test data for the proposed system as well as for the other contest entries (Section 9.9). For specimen classification (Section 10), we report results comparing performance using different features. We then report results on cross-specimen and cross-image generalization. Finally, we include the results reported on the held-out contest test data (Section 10.4).

Experiment 1: comparison of different features and encoding methods
We compared the performance of different features and encoding methods on the Task 1 training dataset. Two-fold cross-validation experiments were carried out, and each was repeated 10 times. Fig. 9 reports the MCAs for different dictionary sizes. rSIFT gave a slightly better performance than the other features. IH gave the worst results. For all encoding methods, larger dictionaries gave higher MCA. SC with sum pooling always gave better performance than the other encoding and pooling methods. For all features except IH, FV performed better than VLAD, indicating that the additional (2nd-order) information it captured was useful. When the dictionary size was 64, FV obtained similar MCA to SC with sum pooling with a dictionary size of 4000, but with an increased feature dimensionality. For example, using rSIFT the dimensionality of an FV image representation was 16,384 compared to 4000 using SC with sum pooling.

Experiment 2: combinations of features
We investigated the performance of combinations of different features. We used BoW and SC encodings for this purpose as they gave better performance than VLAD and FV in Experiment 1. The dictionary size was fixed to 1500. Table 2 reports the results (see columns 2-4). Similar performance was observed using BoW and SC when combining all four types of feature. An improvement of more than 3% was obtained when combining other features with rSIFT (Fig. 9 and Table 2, columns 2-4). Table 5 reports the confusion matrix obtained when combining all the features and encoding with SC and max-pooling. The Golgi class was the least accurately classified; about 8% of Golgi images were misclassified as nucleolar.

Experiment 3: effect of cell pyramids
To improve classification accuracy, particularly of the Golgi class, we incorporated spatial structure into the feature encoding process via cell pyramids (CPM). Table 2 reports the performance of different feature combinations with and without CPM using BoW and SC approaches (see columns 5-7). When combining all the features and using CPM, the overall MCA improved by about 1%. In particular, CPM improved the classification accuracy of the Golgi images by about 3% (see Tables 5 and 6).

Experiment 4: effect of data augmentation
We investigated the effect of augmenting the training set by including rotated images as explained in Section 6. An ensemble SVM was used for classification. Augmenting the dataset improved the classification accuracy (see Table 2, columns 8-10 vs. columns 5-7, and Table 7 vs. Table 6). When combining all the features, the overall MCA was further improved by about 1%.

Table 2
Two-fold cross-validation results (MCA ± std) for different feature combinations with and without CPM and data augmentation (dictionary size of 1500).

Features | Original dataset without CPM (BoW, SC-sum, SC-max) | Original dataset with CPM (BoW, SC-sum, SC-max) | Augmented dataset with CPM (BoW, SC-sum, SC-max)

Table 3
Different performance measures for classification based on 2-fold cross-validation (all features, SC, max pooling, dictionary size of 1500). BoW and SC gave similar performance; the MCA saturated at about 95%.

Different measures for classification
In the experimental results reported above, MCA was used as the performance measure. This measure was specified for use in the I3A contest. Here we report results using various alternative measures, specifically accuracy, precision, recall, and F-score. Different measures summarize performance in different ways; e.g., accuracy ignores class imbalance and thus is biased towards classes that have more examples in the dataset. On the other hand, MCA gives equal importance to all the classes. We direct the interested reader to [40] for detailed explanation and definitions of these measures for multi-class classification. Table 3 reports performance using these measures. These results suggest consistent conclusions regardless of the measure used. Adding CPM improved performance. CPM with data augmentation gave the best performance by all measures.

Computational time for feature extraction and encoding

Table 4 reports comparisons of the computational time required for feature extraction and encoding in order to compute the cell-level representations. This was by far the most time-consuming part of the proposed system. These timings were obtained using Matlab 2014b and Windows 7 running on a machine with a Core i7 processor and 8 GB RAM. IH took more time than the other features while resulting in lower MCA (see Fig. 9). On the other hand, mLP took the least time and resulted in competitive MCA. When all feature types were used along with data augmentation and CPM, the system took approximately 21 s to compute the cell-level representation for one image.

Experiment 5: leave-one-specimen-out
The above experiments disregarded the identities of the specimens from which cells had been extracted. To test the generalization performance of our system across different specimens, we conducted an experiment in a leave-one-specimen-out setting. Specifically, we used the specimen IDs to split the data into training and validation sets. Since 83 different specimens were available, we used images from 82 specimens for training in each split, and the images from the remaining specimen for testing. In this experiment we used the combination of all feature types, the augmented dataset, CPM, SC, max pooling, and a dictionary size of 1500. Table 8 reports the confusion matrix. An MCA of 81.1% was obtained. The Golgi class had poor results (66.7%). This class exhibits high intra-class variability and was poorly represented in the available dataset; only 4 Golgi specimens were in the training set.
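The splitting scheme can be sketched as follows. This is a minimal pure-Python illustration, not the actual evaluation code; the function name and the toy specimen IDs are ours.

```python
from collections import defaultdict

def leave_one_specimen_out(specimen_ids):
    """Yield (specimen, train_indices, test_indices), one fold per
    specimen. All cells from the held-out specimen form the test set,
    so no specimen contributes to both training and testing."""
    by_specimen = defaultdict(list)
    for i, sid in enumerate(specimen_ids):
        by_specimen[sid].append(i)
    for sid, test_idx in by_specimen.items():
        train_idx = [i for i, s in enumerate(specimen_ids) if s != sid]
        yield sid, train_idx, test_idx

# Toy example: 8 cells from 3 specimens (the paper uses 83 specimens).
ids = ["s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3"]
folds = list(leave_one_specimen_out(ids))
print(len(folds))  # 3 folds, one per specimen
```

Each fold trains on all remaining specimens, mirroring the 82-train / 1-test splits described above.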

Experiment 6: performance on images extracted from Task 2 dataset
We also made use of cell images segmented from the Task 2 dataset. (We did not use the mitotic spindle images in the experiment reported in this section.) An automatic procedure was used to select cells from the Task 2 dataset given the segmentation masks provided with that dataset. Firstly, all disjoint regions were identified in the segmentation mask images using connected component analysis. Secondly, eccentricity values were calculated for each connected component. Finally, low-eccentricity components that could be bounded by an 80 × 80 square with which no other component overlapped were selected. Approximately 5000 isolated cells were selected in this way. This is illustrated in Fig. 10, where red bounding boxes denote the cell images that were extracted. We trained an ensemble classifier using all the images from the Task 1 training dataset and then tested it on the cell images extracted from the Task 2 dataset. We used the combination of all feature types with the augmented dataset, CPM, SC and max pooling (dictionary size of 1500). The results are reported in Table 9; an MCA of 86% was obtained.
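The first two steps of this selection procedure (connected-component labelling of the mask, then eccentricity from second-order moments) can be sketched as below. For brevity the 80 × 80 bounding-box overlap test is omitted, and all function names are ours.

```python
import math
from collections import deque

def connected_components(mask):
    """4-connected component labelling of a binary mask (list of lists).
    Returns {label: [(row, col), ...]}."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    comps, next_label = {}, 1
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not labels[r][c]:
                q, pix = deque([(r, c)]), []
                labels[r][c] = next_label
                while q:
                    y, x = q.popleft()
                    pix.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = next_label
                            q.append((ny, nx))
                comps[next_label] = pix
                next_label += 1
    return comps

def eccentricity(pixels):
    """Eccentricity of the ellipse with the same second moments:
    0 for a circle, approaching 1 for an elongated region."""
    n = len(pixels)
    my = sum(p[0] for p in pixels) / n
    mx = sum(p[1] for p in pixels) / n
    uyy = sum((p[0] - my) ** 2 for p in pixels) / n
    uxx = sum((p[1] - mx) ** 2 for p in pixels) / n
    uxy = sum((p[0] - my) * (p[1] - mx) for p in pixels) / n
    common = math.sqrt((uxx - uyy) ** 2 + 4 * uxy ** 2)
    l1 = (uxx + uyy + common) / 2  # major-axis eigenvalue
    l2 = (uxx + uyy - common) / 2  # minor-axis eigenvalue
    return math.sqrt(1 - l2 / l1) if l1 > 0 else 0.0

# A round-ish blob (low eccentricity) vs a thin bar (high eccentricity):
blob = [(r, c) for r in range(10) for c in range(10)]
bar = [(0, c) for c in range(30)]
print(eccentricity(blob) < 0.5 < eccentricity(bar))  # True
```

Thresholding on eccentricity in this way keeps compact, roughly circular cells and discards elongated clumps.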

Performance on the test dataset
We submitted two systems to the I3A contest for Task 1. The first system used only data made available in the Task 1 training set; the second system trained on a dataset consisting of the Task 1 training set together with the additional 5000 cell images extracted from the Task 2 training set (see Section 9.8). Both systems used all the features together with SC (max pooling, dictionary size 1500), the rotated versions of the images, and CPM. Fig. 11 reports the MCAs obtained by all of the methods submitted to the contest on the Task 1 test set. Our first submission, which made use of only the Task 1 training data, obtained an MCA of 84.2%, higher than all the other teams' entries. Our second submission, which used additional data (cells extracted from the Task 2 dataset), achieved an MCA of 87.1%. The next best entry, that of Gragnaniello et al. [18], obtained an MCA of 83.6%. Table 10 reports confusion matrices from our method and the method of Gragnaniello et al. The reader is referred to the I3A report [3] for detailed results of other entries.

Examples of correctly and incorrectly classified cells
Figs. 12–17 show examples of cells from each class that were correctly and incorrectly classified. Some of the misclassified images are particularly noisy, e.g., the misclassified Golgi images in Fig. 17. In most other cases of misclassification shown, visual inspection reveals qualitative similarity to the class whose label was assigned.

Specimen classification results (Task 2)

Experiment 1: comparison of different features
We performed five-fold cross-validation experiments to compare the performance obtained when using different features for Task 2. The dictionary size was fixed to 5000. Table 11 reports the MCA for each feature type as well as for their combination. mLP with larger patch sizes outperformed the other features. rSIFT with larger patch sizes gave the worst result. Combining the features resulted in an improved MCA of 89.9%. Table 12 shows the confusion matrix.

Table 9
Confusion matrix of the system trained on Task 1 images and tested on the cell images extracted from Task 2 (SC with max pooling, dictionary size of 1500).

Experiment 2: leave-one-specimen-out experiments
A leave-one-specimen-out experiment was carried out using the specimen IDs provided to split the data into training and validation sets. Since 252 different specimens were available, we used images and their rotated versions from 251 specimens for training in each split. Table 13 reports the confusion matrix. Accuracy of 100% was obtained for the centromere and Golgi classes. The mitotic spindle class had the lowest accuracy, being confused mostly with the homogeneous class. An overall MCA of 89.9% was obtained.

Experiment 3: classification of individual images taken at different locations
In the above two experiments (Sections 10.1 and 10.2), the classification decision for each specimen was made by averaging the classification scores (probabilities) of its four images taken at different locations, as explained in Section 7. In this experiment we considered each of the four images separately and classified them individually. A leave-one-specimen-out experiment was performed, where at each iteration a classifier was trained on the images (and their rotated versions) from 251 specimens and tested on each of the four images of the test specimen. An MCA of 87.9% was obtained, 2% lower than the best accuracy (89.9%) obtained in the experiment reported in Section 10.2.
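The specimen-level decision rule (average the four per-image probability vectors, then take the highest-scoring class) can be sketched as follows; the function name and the toy probabilities are ours.

```python
def classify_specimen(image_probs):
    """Average the per-image class-probability vectors of one specimen
    and return the index of the highest-scoring class."""
    n = len(image_probs)
    k = len(image_probs[0])
    avg = [sum(p[j] for p in image_probs) / n for j in range(k)]
    return max(range(k), key=lambda j: avg[j])

# Four images of one specimen over three hypothetical classes.
# Image 0 alone would vote class 2, but the averaged scores favour class 1.
probs = [
    [0.2, 0.3, 0.5],
    [0.1, 0.6, 0.3],
    [0.1, 0.7, 0.2],
    [0.2, 0.5, 0.3],
]
print(classify_specimen(probs))  # 1
```

Averaging before the argmax lets confident images outvote an uncertain one, which is consistent with the small accuracy drop observed when images are classified individually.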

Performance on the test dataset
Fig. 18 reports the MCAs obtained by each of the submitted methods on the Task 2 test set. Our method achieved an MCA of 88.5%, outperforming all the other methods submitted. The second-placed method, that of Liu et al. (described in [3]), achieved an MCA of 86.1%. Table 14 reports the confusion matrices of our method and that of Liu et al.

Conclusion and recommendations
In this paper we described in detail our winning entries for both Task 1 (cell image classification) and Task 2 (specimen image classification) of the I3A contest. To more fully understand the contributions of the components of our classification systems, we empirically studied the local feature extraction and encoding methods, as well as the data augmentation applied in training the systems. We found that (1) the combination of different features outperforms individual feature types; (2) for cell classification SC performs better than BOW, while FV and VLAD could achieve similar accuracies to BOW and SC but only with feature representations of much higher dimensionality; (3) adding spatial information from the cell images via the use of cell pyramids can improve classification performance (an improvement of ~3% was observed for Golgi images); and (4) augmenting the training set with rotated training images further improves classification performance. When combined, these aspects differentiated our systems from the other contest entries.

System design choices were guided by using mean class accuracy as the measure of classification performance. (This was the measure specified for use in the I3A contest.) However, experiments with alternative measures suggested similar conclusions. Future contests could be usefully enhanced by investigating the statistical and clinical significance of differences in performance between competing methods. One way to assess statistical significance is bootstrap sampling, as recently incorporated in the PASCAL Visual Object Classes Challenge [41]. It would also be informative to quantify the reliability of the ground truth labels, especially given the visual similarity of many of the misclassified examples to their assigned classes (see Figs. 12–17). The features we used were pre-defined (hand-crafted). We believe that future work could further explore learning adaptive feature extractors for cell and specimen classification.

Fig. 3 .
Fig. 3. An overview of the system for generating the image-level feature representation using only one feature type: dictionary learning from training images (first row) and feature encoding to obtain the image-level feature representation (second row). The final image representation is a concatenation of the image-level representations obtained with the different feature types.
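To make the encoding step concrete, here is a minimal hard-assignment bag-of-words encoder: each local descriptor votes for its nearest dictionary atom, and the image-level vector is the normalised vote histogram. This is a simplified stand-in for the SC-sum/SC-max sparse-coding encoders actually used; the function name and toy data are ours.

```python
import math

def encode_bow(descriptors, dictionary):
    """Hard-assignment bag-of-words encoding: each local descriptor
    votes for its nearest dictionary atom; the image-level vector is
    the L1-normalised histogram of votes."""
    hist = [0.0] * len(dictionary)
    for d in descriptors:
        dists = [math.dist(d, atom) for atom in dictionary]
        hist[dists.index(min(dists))] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

# Two-atom toy dictionary and three 2-D local descriptors.
dictionary = [[0.0, 0.0], [1.0, 1.0]]
descs = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]]
print(encode_bow(descs, dictionary))  # [0.333..., 0.666...]
```

The full system concatenates one such image-level vector per feature type, and sparse coding replaces the single nearest-atom vote with a sparse linear combination of atoms followed by sum or max pooling.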

Fig. 4 .
Fig. 4. Image preprocessing: (a) an example cell image and its mask, (b) histogram of intensity values inside cell region in (a), (c) normalized histogram, and (d) preprocessed image.
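The mask-restricted normalisation of Fig. 4 can be approximated by a simple linear contrast stretch over the pixels inside the cell mask. This is an illustrative simplification (the actual preprocessing is histogram-based), and the function name is ours.

```python
def normalize_in_mask(img, mask):
    """Linearly stretch the intensities inside the cell mask to [0, 1],
    leaving background pixels at 0. Only pixels with mask == 1
    contribute to the minimum and maximum used for the stretch."""
    vals = [img[r][c] for r in range(len(img))
            for c in range(len(img[0])) if mask[r][c]]
    lo, hi = min(vals), max(vals)
    span = (hi - lo) or 1  # avoid division by zero on flat regions
    return [[(img[r][c] - lo) / span if mask[r][c] else 0.0
             for c in range(len(img[0]))] for r in range(len(img))]

img = [[10, 50], [90, 30]]
mask = [[1, 1], [1, 0]]
print(normalize_in_mask(img, mask))  # [[0.0, 0.5], [1.0, 0.0]]
```

Restricting the statistics to the masked cell region prevents dark background pixels from compressing the usable intensity range.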

Fig. 6 .
Fig. 6. An overview of the system for data augmentation and SVM ensemble training. Each image can be encoded as shown in Fig. 3.
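The rotation-based augmentation of Fig. 6 can be sketched as follows. Ninety-degree steps are used here because square images can be rotated this way without interpolation; the exact rotation angles used by the system are not restated in this section, so treat them as an assumption of this sketch.

```python
def rot90(img):
    """Rotate a 2-D list-of-lists image 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]

def augment_with_rotations(images, labels):
    """Return the dataset extended with 90/180/270-degree rotations;
    each rotated copy keeps the label of its source image."""
    out_imgs, out_lbls = [], []
    for img, lbl in zip(images, labels):
        cur = img
        for _ in range(4):  # original plus three rotations
            out_imgs.append(cur)
            out_lbls.append(lbl)
            cur = rot90(cur)
    return out_imgs, out_lbls

img = [[1, 2],
       [3, 4]]
aug, lbls = augment_with_rotations([img], ["golgi"])
print(len(aug), lbls[0])  # 4 golgi
```

Because staining patterns have no preferred orientation, the rotated copies are valid training examples and quadruple the effective training set size.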

Fig. 7 .
Fig. 7. Testing an image using the SVM ensemble for single cell classification.

Fig. 8 .
Fig. 8. Sub-images extracted from specimen images for (a) the homogeneous, speckled, nucleolar, centromere, Golgi, and nuclear membrane classes, and (b) the mitotic spindle class.White blobs in the images indicate (a) the segmented regions of individual cells and (b) the manually identified metaphase cells.

Fig. 9 .
Fig. 9. Performance of various features with different encodings (size of the dictionary vs MCA).

Fig. 10 .
Fig. 10. Sample specimen images from the I3A Task 2 dataset. The bounding boxes indicate the cell images that were automatically extracted from these specimen images. (a) A centromere specimen. (b) The segmentation mask for (a). (c) A speckled specimen. (d) The segmentation mask for (c).

Fig. 11 .
Fig. 11. The MCA at cell level attained by each method on the test set of Task 1.

Table 4
Computational time (in seconds, averaged over 500 cell images) required for feature extraction and encoding with different descriptors (SC, max pooling, dictionary size of 1500).

Table 6
Confusion matrix obtained using all features combined, SC with max pooling, dictionary size of 1500, and CPM. (No data augmentation was used here.)

Table 8
Confusion matrix for the leave-one-specimen-out experiment. (All features, CPM, data augmentation, SC, max pooling, dictionary size of 1500.)