Deep CNN for IIF Images Classification in Autoimmune Diagnostics

The diagnosis and monitoring of autoimmune diseases is a very important problem in medicine. The most used test for this purpose is the antinuclear antibody (ANA) test. An indirect immunofluorescence (IIF) test performed with Human Epithelial type 2 (HEp-2) cells as the substrate antigen is the most common method for determining ANA. In this paper we present an automatic HEp-2 specimen analysis system based on a convolutional neural network that is able to classify IIF images. The system consists of a feature extraction module based on a pre-trained AlexNet network and a classification phase for the cell-pattern association using six support vector machines and a k-nearest neighbors classifier. The classification at the image-level was obtained by analyzing the pattern prevalence at cell-level. The layers of the pre-trained network and various system parameters were evaluated in order to optimize the process. The system was developed and tested on the public HEp-2 indirect immunofluorescence images analysis (I3A) database. To test the generalisation performance of the method, the leave-one-specimen-out procedure was used. The performance analysis showed, at the image-level, an accuracy of 96.4% and a mean class accuracy of 93.8%. The results were evaluated by comparing them with some of the most representative works using the same database.


Introduction
Antinuclear antibodies (ANAs) are a very large category of autoantibodies, or antibodies that the body produces against itself. They are related to many autoimmune diseases [1].
There are several laboratory tests for the detection and differentiation of antibodies to the nucleus. The most commonly used technique is indirect immunofluorescence, which uses a substrate of HEp-2 cells, that is, human epithelial type 2 cells, to detect antibodies to the nucleus. These cells show a very high nucleus/cytoplasm ratio and, by virtue of their neoplastic nature, present numerous mitotic figures, allowing the operator to identify antibodies directed against the cellular antigens expressed during the mitotic phase. ANA testing is the first-level test in the diagnosis of systemic autoimmune diseases, which are linked to an altered regulation of immune tolerance control mechanisms and characterized by the production of antibodies directed against cellular components that are no longer recognized as "self".
Indirect immunofluorescence (IIF) is the reference method for the determination of ANAs. Despite recent progress in the standardization of the indirect immunofluorescence method (automation of the analytical procedure), the technique still presents some methodological and interpretative limitations [2,3]. The importance of autoimmune diseases and the need for their early diagnosis has pushed the interest of manufacturers and the research world to develop expert fluoroscopic
In Benammar et al. [9] this difficulty of interpretation was quantified by measuring the percentage of concordance between the classifications of two senior immunologists; the analysis of 589 wells, both positive (six types of patterns) and negative, showed a concordance of around 71%.
Automatic systems to support diagnosis, and in particular computer-aided diagnosis (CAD) systems, are widely used for different tasks within medicine, such as second reading, increasing the diagnosis speed, training physicians for special tasks, etc. [10,11].
In the last few years, the interest of the scientific community in the problem of HEp-2 cells classification or staining pattern recognition has been remarkable. This was also due to various contests that were organized on the topic [12][13][14], and to the availability of the first public databases [15,16].
In the work of Manivannan et al. [17] the authors extracted sets of local features that were aggregated through sparse encoding. They used a pyramidal decomposition of the cell consisting of the central part and the crown that contains the cell membrane. Linear support vector machines (SVMs) were the classifiers used on the learned dictionary; specifically, they used four SVMs, the first trained on the original orientation of the images and the remaining three on the images rotated by 90, 180 and 270 degrees, respectively. With this technique they won the indirect immunofluorescence images analysis (I3A) competition at the International Conference on Pattern Recognition (ICPR) 2014.
Larsen et al. [18] developed a multiscale method using shape index histograms. The decomposition was spatial and presented a radially symmetric pattern. Second-order descriptors of textural type were extracted from the decompositions using a shape index. Finally, the shape index statistics were collected in histograms. The system was trained and tested on data from the I3A Task-1 public database.
Ensafi et al. [19] used a variant of the superpixel approach to extract patches from cells in combination with a sparse coding scheme. The features were extracted from both the superpixel patches and their boundaries (high-gradient zones). From the patches, features were extracted using the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) and used for dictionary learning. A multiclass linear SVM with a one-versus-all strategy was then trained on the sparse codes.
In the work of Gragnaniello et al. [20] the authors used local descriptors at different scales and invariant to rotation. The descriptors were obtained from a log-polar grid transformation, from a multiscale smoothing of directional gradients, and from the Fourier transform. The bag of words technique was used in combination with a linear SVM. The I3A Task-1 database was used by the authors.
Xu et al. [21] presented a method based on linear local distance coding in which the local features are first transformed into local distance vectors using the Euclidean distance. Linear coding and max pooling were then applied to both the local distance vectors and the local features. The concatenated results were provided as input to a linear SVM.
In recent scientific research on pattern recognition, deep learning methods and in particular the convolutional neural networks (CNNs) have been proven to be efficient and reliable models to achieve remarkable performance for image classification and object detection tasks [22]. Moreover, it has been demonstrated that pre-trained CNN architectures can play an important role as feature extractors and allow high classification performance.
Very recently, deep learning methods have been applied to IIF image classification problems. In the work of Li Y. et al. [23] the authors addressed the problem of segmentation and classification of IIF images. In particular, they used a variant of the well-known VGG-16 (the 16-layer network developed by the VGG team) called FCN, which is aimed at segmentation. They used CNNs for both segmentation and classification. For the development and the performance test, they used the I3A Task-2 database.
Gupta et al. [24] used the well-known AlexNet CNN in combination with an SVM classifier for the classification of cells in mitosis.
In our previous work [25] we addressed the problem of fluorescence intensity classification. The problem was approached by extracting the features from the whole image. To this end, several pre-trained networks, used as feature extractors, were analyzed, and the image classification was obtained by training an SVM classifier.
Oraibi et al. [26] used the well-known VGG-19 CNN to extract features and combined them with local features such as RIC-LBP (rotation invariant co-occurrence local binary pattern) and JML (joint motif labels) for an efficient cell classification. The combination of features was used to train a random forest classifier.
Li H. et al. [27] proposed a method for analyzing HEp-2 images that uses a CNN to construct a pattern histogram, on which a linear SVM was then trained. The CNN was composed of 10 layers, of which the first nine were convolutional layers while the last was a softmax layer for classification. The system was trained and tested on data from the I3A Task-2 public database.
In the present paper a system able to classify the fluorescence patterns is presented. The analysis was conducted both at the cell-level (i.e., in terms of cell images correctly classified) and at the image-level (i.e., in terms of IIF images correctly classified). The system uses a pre-trained network, AlexNet [28], as a feature extractor and is able to classify the following six fluoroscopic patterns: homogeneous, speckled, nucleolar, centromere, Golgi, and nuclear membrane. The classification phase was carried out by developing six linear SVM classifiers with a one-against-all (OAA) training scheme. The cell-pattern association was obtained by means of a k-nearest neighbors (KNN) classifier. Furthermore, the effectiveness of extracting the features from the segmented cells (internal region) was compared with that of extracting them from the bounding boxes containing the cells. Data augmentation was evaluated as a tool for improving classification performance. Finally, different layers of the pre-trained network were evaluated as feature extractors for the classification of HEp-2 image patterns. For an effective performance comparison, the method was evaluated on a public data set issued by the 2014 ICPR Competition [14].

Database and Statistics
In this study the publicly available Task-1 dataset from the I3A Contest [14] was used. The HEp-2 image database I3Asel was made public in the "Contest on Performance Evaluation on Indirect Immunofluorescence Image Analysis Systems", hosted by the 22nd International Conference on Pattern Recognition (ICPR 2014). The dataset was collected between 2011 and 2013 at the Sullivan Nicolaides Pathology laboratory, Australia. The competition was based on two tasks: Task-1 on the classification of pre-segmented HEp-2 cells, and Task-2 on the classification of well images.
In Task-1, it was necessary to identify six staining patterns (homogeneous, speckled, nucleolar, centromere, Golgi, nuclear membrane). The total number of cells was 13,596, extracted from 83 specimens. Table 1 shows the pattern distribution of the images provided for Task-1. The specimens were automatically photographed using a monochrome high dynamic range cooled microscopy camera fitted on a microscope with a plan-Apochromat 20x/0.8 objective lens and an LED illumination source.
The labelling process involved at least two scientists who read each patient specimen under a microscope. A third expert's opinion was sought to adjudicate any discrepancy between the two opinions. Each specimen label was used as the ground truth for the cells extracted from it. Furthermore, all the labels were validated by using secondary tests such as ENA (extractable nuclear antigens) and anti-dsDNA (anti-double stranded DNA antibody) in order to confirm the presence and/or absence of specific patterns.
It is known that in supervised training, in order not to invalidate the classification result, the test set must not contain examples used in training [29,30]. In our specific case, since cells belonging to the same image carry very similar information, if cells of an image are used in training, no cell of the same image, let alone examples obtained from these by data augmentation, should be present in the test set; otherwise the test performance would be distorted. For these reasons, the procedure called leave-one-specimen-out (LOSO) was used in this work [31]. In the LOSO strategy, at each fold all cell images from one of the 83 specimens (and the images obtained from them by data augmentation) are used for testing, and the rest are used for training. Since 83 different specimens were available, images from 82 specimens were used for training in each fold. In this work, the statistical analysis of system performance was based on the accuracy and on the mean class accuracy (MCA) of classification [32], the latter defined as follows:

$$\mathrm{MCA} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{CCR}_k$$

where K is the number of classes and CCR_k is the correct classification rate for class k, determined as follows:

$$\mathrm{CCR}_k = \frac{T_k}{N_k}$$

where T_k is the number of examples of class k that are correctly classified and N_k is the total number of examples of class k.

System Workflow
Figure 2 shows the flow of operations adopted in this work for cellular classification: the generic segmented region of interest (ROI) is processed by the multilayer neural network to obtain the features used as inputs of the six SVMs. The six output values obtained from the six binary classifiers represent how much the generic region "resembles" each of the analyzed pattern classes. The image classification is achieved by means of the cell classification: the classification at the image-level was obtained by analyzing the prevalence of the patterns at the cell-level and associating the generic image with the pattern with the highest rate.
The choice of the best features and parameters was performed automatically, using the mean class accuracy as a figure of merit.
To verify the classification power contained in the region immediately adjacent to the cell, in addition to the segmentation mask, the bounding box containing the cell was also analyzed and the corresponding performance results were obtained.
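The leave-one-specimen-out protocol described under Database and Statistics can be sketched as follows. This is only an illustration on made-up specimen identifiers, not the authors' implementation; the function name `loso_folds` and the toy data are assumptions. Augmented copies of a cell would simply carry the specimen identifier of the cell they were generated from, so they always end up on the same side of the split.

```python
from collections import defaultdict

def loso_folds(specimen_ids):
    """Yield (held_out_specimen, train_indices, test_indices), one fold per specimen."""
    by_specimen = defaultdict(list)
    for index, specimen in enumerate(specimen_ids):
        by_specimen[specimen].append(index)
    for held_out in sorted(by_specimen):
        # All cells of the held-out specimen form the test set.
        test_idx = by_specimen[held_out]
        # Cells of every other specimen form the training set.
        train_idx = [i for s, idxs in by_specimen.items() if s != held_out for i in idxs]
        yield held_out, train_idx, test_idx

# Toy example: 7 cell images from 3 hypothetical specimens.
specimen_ids = ["s1", "s1", "s2", "s2", "s2", "s3", "s3"]
folds = list(loso_folds(specimen_ids))
```

With the 83 specimens of the Task-1 database, the same loop would produce 83 folds, each trained on the cells of 82 specimens.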

Data Preparation
In order to reduce the intensity variability present in the database, a contrast stretching was performed, defined as follows:

$$I = 255 \cdot \frac{I_c - \min}{\max - \min}$$

where I and I_c denote the images after and before the transformation, respectively, and min and max represent the minimum and maximum intensity of the input image. Since the images are stored in 8 bits, the normalization refers to the maximum possible value, that is, 255. Furthermore, to increase the number of training examples, data augmentation was performed. In particular, the cell images were rotated in steps of 20°; overall, the data were multiplied by a factor of 18. Data augmentation is a very effective practice, especially when the training set is limited or, as in our case, when some classes are poorly represented in the set of examples. The effect of this data augmentation was evaluated quantitatively in terms of performance.
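The two preparation steps can be illustrated with a minimal sketch. It assumes an image stored as a list of rows of integer intensities; the function name `contrast_stretch` and the toy image are hypothetical. The rotation itself (which needs an interpolation routine) is not implemented here; only the set of angles is listed, to show where the factor of 18 comes from.

```python
def contrast_stretch(image, out_max=255):
    """Linearly map pixel values so that the minimum goes to 0 and the maximum to out_max."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant image: nothing to stretch
        return [[0 for _ in row] for row in image]
    return [[round((p - lo) * out_max / (hi - lo)) for p in row] for row in image]

# Rotation angles for augmentation: steps of 20 degrees over a full turn
# (0 degrees is the original image).
angles = list(range(0, 360, 20))

image = [[50, 100], [150, 200]]   # made-up 2 x 2 image
stretched = contrast_stretch(image)
```

Each cell image then contributes `len(angles)` training examples, one per rotation angle.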

Deep CNN
In recent years, deep learning networks have allowed significant improvements in classification performance for many pattern recognition problems, and in many areas they represent the state of the art of research.
The term "deep" usually refers to the number of hidden layers in the neural network. Traditional neural networks contain only one or two hidden layers, while deep networks can contain up to 150. Deep learning models are trained using large labeled data sets and neural network architectures that learn features directly from the data, without the need to extract them manually.
One of the most common types of deep neural networks is the convolutional neural network (CNN or ConvNet). A CNN convolves learned features with the input data and uses 2D convolutional layers, which make this architecture suitable for processing 2D data such as images. CNNs eliminate the need for manual feature extraction [33][34][35][36][37], so the user does not have to identify the features used for image classification. In fact, it is possible to exploit the power of pre-trained networks, without investing time and effort in training, to implement the feature extraction phase. Feature extraction can be the fastest way to use deep learning. A CNN operates by extracting the features directly from the images. This automatic feature extraction gives deep learning models high accuracy in computer vision tasks such as object classification.
In this work it was decided to use one of the best-known CNNs: AlexNet [28]. This network was trained on the ImageNet database [38], consisting of more than a million images in 1000 object categories. The architecture of the pre-trained network used is shown in Figure 3. The main characteristics of this network are briefly described below. The network architecture is eight layers deep. The first layer accepts an RGB input image of size 227 × 227. The first five layers are convolutional (some of which are followed by max-pooling layers), while the last three are fully connected layers with a final 1000-way softmax. AlexNet uses the ReLU (rectified linear unit) activation function, max(0, x), a simple thresholding at zero that is faster than the traditional sigmoid. Furthermore, to avoid overfitting, dropout is used in the fully connected layers as a regularization method.
Appl. Sci. 2019, 9
Figure 3. AlexNet architecture used in this work for pattern classification.
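To make the architecture description more concrete, the following sketch traces how the 227 × 227 input shrinks through the five convolutional stages. The kernel sizes, strides, paddings and channel counts are assumptions taken from the commonly published AlexNet configuration; they are not stated in this text, and the helper `out_size` is introduced only for illustration.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding, output channels); None means pooling,
# which keeps the number of channels unchanged.
layers = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, None),
]

size, channels = 227, 3  # RGB input of size 227 x 227
shapes = {}
for name, kernel, stride, padding, out_channels in layers:
    size = out_size(size, kernel, stride, padding)
    channels = out_channels or channels
    shapes[name] = (size, size, channels)

# Length of the vector fed to the first fully connected layer.
flattened = size * size * channels
```

Under these assumptions the last pooling stage yields a 6 × 6 × 256 volume, which is flattened and passed to the fully connected layers whose activations can serve as feature vectors.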

Since the network has been trained to classify a large variety of objects, the feature representation is very robust. In particular, the first layers of the network express lower-level features that are gradually elaborated by the deeper layers, which refine them into higher-level features. In this work, the sub-images containing the cells to be classified were rescaled to the input size required by the network (227 × 227). The feature vectors extracted from the pre-trained CNN were used to train six linear support vector machines (SVMs), as described in the Classification section. Different layers of the AlexNet network were evaluated as feature extractors for the classification of HEp-2 image patterns, and the best configuration was identified.

Classification
In order to associate the generic image with the correct pattern class, the features extracted from the CNN were used as input to the SVM classifiers.
Since the problem to be addressed required a multiclass classification, we followed the OAA strategy, which breaks down the multiclass problem into a series of binary problems. In this case, the classification of the six patterns was decomposed into six binary classifiers, each of which considers the i-th pattern (with i between one and six) as the first class and the remaining five patterns as the second class [39,40].
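The one-against-all decomposition amounts to relabelling the same training set six times. A minimal sketch, with the hypothetical helper `one_against_all_targets` and made-up labels, and with the +1/-1 encoding chosen here as an assumption:

```python
PATTERNS = ["homogeneous", "speckled", "nucleolar",
            "centromere", "Golgi", "nuclear membrane"]

def one_against_all_targets(labels, patterns=PATTERNS):
    """For each pattern, build binary targets: +1 for that pattern, -1 for the other five."""
    return {p: [1 if label == p else -1 for label in labels] for p in patterns}

labels = ["speckled", "Golgi", "speckled", "centromere"]  # made-up cell labels
targets = one_against_all_targets(labels)
```

Each entry of `targets` would be the target vector used to train the binary classifier of the corresponding pattern.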
The SVM classifier was chosen because it can be implemented, and can achieve good classification performance, even when few examples are available [41][42][43][44][45][46]; this is possible because the SVM classifier has few parameters. In particular, six SVM classifiers with a linear kernel were implemented, the simplest in terms of parameters to search. Matlab's "logspace" function was used to generate the parameter search grid for the linear kernel: 11 values equally spaced on a logarithmic scale between 10^-6 and 10^1.5 were analyzed.
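The search grid can be reproduced in a few lines. The sketch below is a plain-Python stand-in for the behaviour of a `logspace`-style function (exponents evenly spaced between the two bounds); it is not the authors' code, and the text does not say which SVM parameter the grid was applied to.

```python
def logspace(start_exp, stop_exp, count):
    """Return `count` values 10**e, with e evenly spaced from start_exp to stop_exp."""
    step = (stop_exp - start_exp) / (count - 1)
    return [10 ** (start_exp + i * step) for i in range(count)]

# 11 candidate values between 10^-6 and 10^1.5.
grid = logspace(-6, 1.5, 11)
```

Consecutive candidates differ by a constant factor of 10^0.75 (about 5.6), so the grid covers seven and a half decades with only 11 evaluations.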
Usually in the OAA scheme, each test example is assigned to the pattern whose SVM returned the highest output value. In this work the cell-pattern association was instead implemented by means of a KNN classifier that takes the outputs of the six SVMs as input; this choice, made because combining the outputs of all the SVMs contributes to improving the classification result, was analyzed in terms of performance. The classification at the image-level was obtained by analyzing the prevalence of the patterns at the cell-level and associating the generic image with the pattern with the highest rate.
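The idea of feeding the six SVM outputs to a KNN classifier can be sketched as follows. This is a toy illustration, not the authors' implementation: the score vectors are invented, the Euclidean distance and the value k = 3 are assumptions (the text does not report them), and `knn_predict` is a hypothetical helper.

```python
from collections import Counter
import math

def knn_predict(train_scores, train_labels, query, k=3):
    """Label a 6-value SVM score vector by majority vote among its k nearest training vectors."""
    order = sorted(range(len(train_scores)),
                   key=lambda i: math.dist(train_scores[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up SVM output vectors (one value per pattern) for five training cells.
train_scores = [
    [1.2, -0.8, -1.0, -0.9, -1.1, -1.0],
    [0.9, -0.2, -1.1, -1.0, -0.9, -1.2],
    [-0.7, 1.1, -0.9, -1.0, -1.2, -0.8],
    [-0.3, 0.8, -1.0, -0.9, -1.0, -1.1],
    [-1.0, -0.9, 1.3, -1.1, -0.8, -1.0],
]
train_labels = ["homogeneous", "homogeneous", "speckled", "speckled", "nucleolar"]

# A test cell whose two highest SVM outputs are close to each other.
query = [0.4, 0.5, -1.0, -1.0, -1.0, -1.0]
predicted = knn_predict(train_scores, train_labels, query, k=3)
```

Unlike the highest-output rule, the decision here depends on the whole six-value profile of the cell, so two classifiers with similar outputs are disambiguated by comparison with training cells that produced similar profiles.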

Results
The performance of the proposed method was measured both at cell-level and at image-level. The best confusion matrix, obtained with the LOSO procedure on the 13,596 cell images present in the Task-1 database, is presented in Table 2. The results contained in Table 2 are summarized in Table 3 in order to highlight the per-class accuracy. An overall accuracy of 81.93% and an MCA of 82.16% were achieved at cell-level. A further analysis was performed to evaluate the classification power contained in the region immediately adjacent to the cell, using the bounding box containing the generic ROI rather than the segmentation mask. In this configuration the performance decreased slightly: Accuracy = 80.3%, MCA = 79.9%.
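Accuracy and MCA can both be read off a confusion matrix. The sketch below uses invented counts for three classes (not the values of Table 2) and assumes that rows correspond to true classes and columns to predicted classes; the function name is hypothetical.

```python
def accuracy_and_mca(confusion):
    """confusion[i][j] = number of class-i examples predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    # Correct classification rate of each class: diagonal entry over row total.
    ccr = [row[k] / sum(row) for k, row in enumerate(confusion)]
    return correct / total, sum(ccr) / len(ccr), ccr

# Made-up, deliberately unbalanced 3-class example.
confusion = [
    [90, 5, 5],
    [10, 30, 10],
    [0, 2, 8],
]
acc, mca, ccr = accuracy_and_mca(confusion)
```

In this toy case the accuracy (0.80) is higher than the MCA (about 0.77), because the largest class is also the best classified; MCA weights every class equally regardless of its size, which is why it is informative on unbalanced data sets.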
The prevalence of patterns at the cell-level was analyzed and, for classification at the image-level, the generic image was associated with the most prevalent pattern. Table 4 shows the confusion matrix at the image-level. The mean class accuracy obtained was 93.8%, while the accuracy achieved was 96.4%. In the same configuration, the performance results without the KNN classifier were: Accuracy = 94.0%, MCA = 91.5%.
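The image-level rule is a majority vote over the cell-level predictions of one image. A minimal sketch, with the hypothetical helper `classify_image` and made-up cell predictions:

```python
from collections import Counter

def classify_image(cell_patterns):
    """Assign the image the pattern predicted for the largest share of its cells."""
    counts = Counter(cell_patterns)
    pattern, n = counts.most_common(1)[0]
    return pattern, n / len(cell_patterns)

# Cell-level predictions for the cells of one image.
cells = ["speckled", "speckled", "homogeneous", "speckled", "nucleolar"]
pattern, rate = classify_image(cells)
```

Because a few misclassified cells do not change the prevailing pattern, image-level performance can exceed cell-level performance, as the reported figures indicate.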
Regarding the analysis of the features and the use of the CNN, the best configuration of the method, which produced the confusion matrix of Table 4, made use of the 4096 features of the Fc7 layer. Table 5 shows the details of the performance obtained as the pre-trained network layer changes. The best performance at image-level obtained without data augmentation was: Accuracy = 94.0%, MCA = 91.67%. For all the analyzed layers, data augmentation always increased performance; the magnitude of the increase was between 2% and 4%.
In order to have a clear comparison of the performances, Table 6 reports the accuracy, the MCA and the training method used for various CAD systems proposed in the recent literature that use the I3A Task-1 database.
Table 6. Comparison of performance between the adopted method and previous investigations. Columns: Method, Training Method, Cell-Level, Image-Level.

Discussion and Conclusions
In this work an automatic system has been proposed that is able to characterize IIF images in terms of fluorescence pattern, which is critical for the diagnosis of autoimmune diseases.
The developed system was evaluated on a public database consisting of 83 specimens (13,596 cell images), obtaining an overall image-level pattern classification accuracy of around 96%.
The proposed system is based on the well-known pre-trained AlexNet network. This convolutional neural network was used as a feature extractor. The different layers of the network were analyzed and the best one for the classification of fluorescence patterns was identified.
The data were split into training and test sets with the leave-one-specimen-out method. Data augmentation, carried out by rotations, led to a significant increase in performance; the classes least represented in the database, such as the Golgi pattern, benefited the most. Moreover, for the feature extraction, better performance was obtained using the segmentation mask rather than the bounding box.
The results were evaluated by comparing them with some of the most representative works using the same public database. They show an excellent ability of the method to classify the patterns under analysis, despite the small number of specimens in the database. The proposed method has shown performance better than, or at least comparable with, other methods in the recent literature (including the winner of the last competition on the subject). Hence, the system presented here can be proposed as a valid solution to the problem of automating ANA testing.
Author Contributions: D.C. conceived of the study, performed the statistical analysis and drafted the manuscript. V.T. developed the software, optimized the parameters and helped to write the draft. G.R. participated in the design and coordination of the manuscript.