Characterization of coronary artery pathological formations from OCT imaging using deep learning

Coronary artery disease is the leading health hazard associated with pathological formations in coronary artery tissues; in severe cases, these formations can lead to myocardial infarction and sudden death. Optical Coherence Tomography (OCT) is an interferometric imaging modality that has recently been used in cardiology to characterize coronary artery tissues at high resolution, ranging from 10 to 20 µm. In this study, we investigate different deep learning models for robust tissue characterization, in order to learn the various intracoronary pathological formations caused by Kawasaki disease (KD) from OCT imaging. The experiments are performed on 33 retrospective cases comprising pullbacks of intracoronary cross-sectional images obtained from different pediatric patients with KD. Our approach evaluates deep features computed from three different pre-trained convolutional networks. A majority voting approach is then applied to provide the final classification result. The results demonstrate high accuracy, sensitivity, and specificity for each tissue (up to 0.99 ± 0.01). Hence, deep learning models, and especially the majority voting method, are robust for automatic interpretation of OCT images.

mediators may transform vascular smooth muscle cells into an osteoblast phenotype, resulting in intimal calcification. Calcification may extend within a fibrous cap and appears in OCT images as a signal-poor area with sharply delineated borders [5,11,12].
Kawasaki disease (KD), also known as mucocutaneous lymph node syndrome, is an acute childhood vasculitis syndrome and the leading cause of coronary artery sequelae, complicated by coronary artery aneurysms with subsequent intimal hyperplasia, disappearance of the media, neovascularization, fibrosis, calcification, and macrophage accumulation [1,13]. Pathological formations caused by coronary artery disease can progress to acute coronary syndrome (ACS). It is therefore important to develop robust coronary artery tissue characterization techniques to evaluate these pathological formations [14].
While conventional imaging techniques such as CT and MRI may be used for clinical assessment of the coronary arteries, they provide limited information about the underlying coronary artery tissue layers. They also cannot reflect the histological reality of regressed aneurysmal coronary segments, which are therefore inappropriately considered normal [1,3,4,13]. Catheter-based Intravascular Ultrasound (IVUS) has been used for many years in interventional cardiology to evaluate coronary artery tissues by providing information on the arterial wall and lumen [15]. However, its use in pediatric cardiology is limited by its suboptimal spatial resolution (100-150 µm) and low pullback speed. Arterial plaque formations are structural abnormalities that require a high-resolution imaging modality to be detected [3,7].
Cardiovascular Optical Coherence Tomography (OCT) is a catheter-based invasive imaging modality that typically employs near-infrared light and relies on low-coherence interferometry to provide cross-sectional images of the coronary artery at a depth of several millimeters. The distinctive characteristic of OCT is its high axial resolution of 10-15 µm, which is determined by the light source (its wavelength and spectral bandwidth) and is decoupled from the lens-dependent lateral resolution of 20-40 µm. The imaging wire is inserted into the coronary artery from the patient's groin using an over-the-wire balloon catheter. During each pullback, a sequence of cross-sectional images of the coronary artery segment is recorded from the light backscattered by the arterial wall. Because blood attenuates the light before it reaches the vessel wall, blood must be cleared before image acquisition starts [16][17][18].
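The dependence of axial resolution on the light source can be made concrete with the standard coherence-length formula for a source with a Gaussian spectrum, Δz = (2 ln 2 / π) · λ0² / Δλ, divided by the refractive index n of the medium. The sketch below evaluates it; the 1310 nm center wavelength, the bandwidths, and n ≈ 1.35 are illustrative assumptions, not specifications of the system used in this study.

```python
import math

def axial_resolution_um(center_wavelength_nm: float, bandwidth_nm: float,
                        refractive_index: float = 1.0) -> float:
    """Axial (depth) resolution in micrometres for a Gaussian-spectrum source.

    Uses dz = (2 ln 2 / pi) * lambda0**2 / dlambda, divided by the
    refractive index of the medium (1.0 corresponds to air).
    """
    dz_nm = (2 * math.log(2) / math.pi) * center_wavelength_nm ** 2 / bandwidth_nm
    return dz_nm / refractive_index / 1000.0  # nm -> um

# Illustrative values only: an assumed 1310 nm source imaging tissue with n = 1.35.
for bandwidth in (50.0, 75.0, 100.0):
    res = axial_resolution_um(1310.0, bandwidth, refractive_index=1.35)
    print(f"bandwidth {bandwidth:.0f} nm -> axial resolution {res:.1f} um")
```

A wider spectral bandwidth gives a finer axial resolution; the lens, which sets the lateral resolution, does not appear in the formula at all.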

Related works
Previous work on automated tissue analysis and plaque detection has focused on 2D intracoronary OCT images of adult patients [19][20][21][22][23][24][25]. A combination of light backscattering and attenuation coefficients was estimated from intracoronary time-domain OCT for three atherosclerotic tissues: calcification, lipid pool, and fibrosis [19]. Fibrosis and calcification in coronary atherosclerosis were detected by estimating the optical attenuation coefficient, and the estimated values were compared with the histopathological features of each tissue to determine the corresponding optical properties [20]. Another study proposed a tissue classification method using a Support Vector Machine (SVM) trained on a combination of texture features and the optical attenuation coefficient extracted from atherosclerotic tissues [21]. Another study focused on volumetric estimation of the backscattered intensity and attenuation coefficient [22]. An SVM-based classification approach was also used to discriminate between fibrosis, calcification, and lipid. Another group focused on the identification and quantification of fibrous tissue in OCT images based on the Short-Time Fourier Transform (STFT) [23]. A classification framework was developed to detect normal myocardium, loose collagen, adipose tissue, fibrotic myocardium, and dense collagen: a graph search method segments the tissue layers of the coronary artery, and a combination of texture features and optical properties of the tissues is used to train a Relevance Vector Machine (RVM) to perform the classification [24]. Finally, a plaque tissue characterization technique based on intrinsic morphological characteristics of the OCT A-lines was proposed to classify superficial lipid, fibrotic lipid, fibrosis, and intimal thickening using Linear Discriminant Analysis (LDA) [25].
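Attenuation coefficients such as those used in [19,20] are commonly estimated by fitting a single-scattering model, I(z) = I0 · exp(−2 μt z), to the intensity along an A-line, so that ln I falls linearly with depth at slope −2 μt. The sketch below shows that least-squares fit on synthetic data; it is an illustration of the general technique under that model assumption, not necessarily the exact estimator of the cited studies, and the depth window, units, and noise level are hypothetical.

```python
import numpy as np

def attenuation_coefficient(intensity, depth_mm):
    """Estimate mu_t (1/mm) from an A-line segment, assuming I(z) = I0 * exp(-2 * mu_t * z)."""
    intensity = np.asarray(intensity, dtype=float)
    depth_mm = np.asarray(depth_mm, dtype=float)
    # Linear least-squares fit of log-intensity against depth: ln I = ln I0 - 2 * mu_t * z
    slope, _intercept = np.polyfit(depth_mm, np.log(intensity), 1)
    return -slope / 2.0

# Synthetic A-line with mu_t = 6 /mm and mild multiplicative noise (illustrative only).
rng = np.random.default_rng(0)
z = np.linspace(0.1, 0.8, 70)  # depth window inside the tissue, in mm
signal = 1000.0 * np.exp(-2 * 6.0 * z) * rng.lognormal(0.0, 0.05, z.size)
print(round(attenuation_coefficient(signal, z), 2))
```

In practice the fit window matters: it should start below the lumen border and stop before the signal reaches the noise floor, where the single-scattering model no longer holds.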
None of these studies characterized all intracoronary tissues, including both the arterial wall layers and the pathological formations. Although texture features and optical properties provide a fair representation of intracoronary tissues, a tissue characterization model with high precision and low computational complexity is required, and recent computer vision models may yield better results.
Convolutional Neural Networks (CNNs) have gained wide popularity in medical image analysis. Their application in this field was first demonstrated for lung nodule detection [26], and the idea was later extended to various medical imaging applications [27][28][29][30][31][32][33][34].
Transferability refers to transferring the knowledge embedded in pre-trained CNNs to other applications, which can be done in two ways: using a pre-trained network as a feature generator, or fine-tuning a pre-trained network to classify medical images. Networks commonly used as pre-trained models in medical image analysis fall into three groups. Simple networks with few convolutional layers use kernels with large receptive fields in the early layers close to the input and smaller kernels in deeper layers. The most popular network in this group, widely applied in medical image analysis, is AlexNet [35,36]. The second group consists of deep networks such as the VGG models, which share the configuration of simple networks but have more convolutional layers and kernels with smaller receptive fields [30,36]. The third group comprises networks built from complex building blocks, which train more efficiently than the other groups. GoogleNet was the first network in this category [37]; ResNet and the Inception models also belong to it. Inception-v3, an improved version of GoogleNet, has recently been used in medical image analysis [36][37][38]. VGG-16, VGG-M-128, and the BVLC reference CaffeNet have been used as feature extractors to classify knee osteoarthritis (OA) images by training an SVM on deep features [39]. A fine-tuned network was applied to retinal fundus photographs of adults to detect referable diabetic retinopathy [40]. These studies demonstrated that classification with fine-tuned networks is competitive with human expert performance [40,41]. Very recent studies have focused on deep learning approaches for segmenting retinal OCT images; for example, retinal OCT images have been segmented using a combination of CNN and graph search models.
Layer segmentation by graph search is performed on probability maps obtained by classifying layer boundaries with a Cifar-CNN architecture [42]. A fully convolutional network was proposed for semantic segmentation of retinal OCT B-scans into seven layers and fluid masses [43]. A deep learning algorithm based on an FCNN was proposed to segment and quantify intraretinal cystoid fluid in SD-OCT images [44]. Another study proposed a deep network for segmenting Geographic Atrophy (GA) [45]. Automatic detection and quantification of intraretinal cystoid fluid (IRC) and subretinal fluid (SRF) was proposed using a CNN with an encoder-decoder architecture [46]. Another study identified retinal pathologies in OCT images by fine-tuning GoogleNet [47].
However, most studies focus on fine-tuning networks and comparing the results of the fine-tuned networks with those of other studies, while some design architectures from scratch. Given the limited number of annotated images in the medical imaging domain, networks pre-trained on millions of images, which have demonstrated very high performance, can be applied efficiently to medical image analysis.
A recent study performed binary classification of intracoronary OCT images, discriminating between plaque and non-plaque images of the coronary artery using transfer learning and fine-tuning [48]. In contrast, we aim to develop a model that distinguishes among various pathological formations as well as normal tissues (intima and media), not only by fine-tuning pre-trained networks but also by designing a tissue characterization model that is computationally less expensive than fine-tuning while characterizing various intracoronary tissues with high precision.
We recently proposed a tissue characterization model to characterize the coronary artery layers, intima and media, in intracoronary OCT images. In that work, the performance of different state-of-the-art classifiers (SVM, RF, and CNN) was compared, with all classifiers trained on deep features extracted from a convolutional neural network. Our aim was to find prominent features that describe each tissue properly, and a classifier with high performance, low computational complexity, and low risk of overfitting [49]. The experiments were performed on normal intracoronary OCT images, since these are less challenging than diseased coronary arteries, in order to design a foundational tissue characterization model that can be extended to characterize all intracoronary pathologies caused by disease.
In this study, we focus on designing a tissue characterization model to detect pathological formations and normal coronary artery tissues in OCT images. The model must characterize both the pathological formations and the normal tissues, intima and media, since intimal hyperplasia is one of the most common intracoronary complications of KD and may occur with either preservation or disappearance of the second layer, the media. We also consider that pathological formations may grow only partially in the coronary artery tissues, so in some cases the coronary artery remains partially normal, with its three-layered structure. Characterizing pathological formations is challenging because of their similar structures and the artifacts of the imaging system. The small size of the arteries in infants and children, and the small population of children and infants with coronary artery disease, make tissue characterization even more challenging in KD patients. We therefore need detailed information about each tissue to make the model robust across different pathologies. For this reason, we extract features from three pre-trained networks representing the three state-of-the-art architecture categories widely used in medical image analysis. The contributions of this study are:
• Characterization of complex pathological formations in KD from OCT imaging: neovascularization, fibrosis, calcification, and macrophage accumulation, as well as the normal tissues, intima and media.
• Evaluation of different pre-trained CNN models for OCT image analysis with a limited labeled dataset.
• Assessment of the clinical usefulness of deep feature learning for OCT imaging in pediatric cardiology.
This work is organized as follows. Data collection and pre-processing are explained in Section 2.1. Convolutional Neural Networks (CNNs) and the pre-trained network architectures are described in Section 2.2, and the training and validation process is presented in Section 2.3. The results of the experiments are reported in Section 3 and discussed in Section 4. Finally, Section 5 concludes the study.

Data collection and pre-processing
The experiments are performed on 33 pullbacks of intracoronary cross-sectional OCT images of patients affected by KD. This study was approved by our institutional review board. The images are acquired using the ILUMIEN OCT system (St. Jude Medical Inc., St. Paul, Minnesota, USA).
Images are acquired in vivo using intravascular OCT with an axial resolution of 10-15 µm and a lateral resolution of 20-40 µm. FD-OCT with a pullback speed of 20 mm/s and a frame rate of 100 frames/s was used for image acquisition. Each pullback contains 270 frames, and the original RGB images measure 704 × 704 pixels before any pre-processing. Approximately 120 frames per pullback were used for the experiments. All 33 pullbacks used in this study were obtained from patients with Kawasaki disease, so all frames of each sequence are affected by the disease. Intimal hyperplasia is the most common complication of KD; it can appear as intimal thickening with preserved media or as intimal thickening with media destruction. Accordingly, in most cases the intima and media layers are detectable. Other pathological formations develop in the intima layer when the disease is not diagnosed and treated in the acute phase. Hence, in KD patients, pathological formations occur considerably less often than the intima and media layers (Table 1). As the first pre-processing step, the approximate regions of interest, including the lumen, normal intima and media, calcification, neovascularization, macrophage, fibrosis, and surrounding tissues, are detected automatically in every frame of each pullback using an active contour (Fig. 1(b)). The catheter and unwanted red blood cells are then removed using a smallest-connected-components approach (Fig. 1(c)). Finally, to simplify the calculations, the images are converted from Cartesian coordinates to a planar representation in polar coordinates.
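The Cartesian-to-polar unwrapping step can be sketched as follows. This minimal version assumes the catheter center coincides with the image center and uses nearest-neighbour resampling; the function name, the 360-angle grid, and the interpolation choice are illustrative assumptions, not the implementation used in this study.

```python
import numpy as np

def cartesian_to_polar(image, n_angles=360, n_radii=None, center=None):
    """Resample a cross-sectional frame onto an (angle, radius) grid.

    Nearest-neighbour sampling: each row of the output is one angular
    direction (an A-line), each column one radial distance from the center.
    """
    image = np.asarray(image)
    h, w = image.shape[:2]
    if center is None:
        center = ((h - 1) / 2.0, (w - 1) / 2.0)  # assume the catheter sits at the image center
    if n_radii is None:
        n_radii = min(h, w) // 2
    cy, cx = center
    theta = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
    radius = np.arange(n_radii, dtype=float)
    rr, tt = np.meshgrid(radius, theta)  # both have shape (n_angles, n_radii)
    ys = np.clip(np.rint(cy + rr * np.sin(tt)).astype(int), 0, h - 1)
    xs = np.clip(np.rint(cx + rr * np.cos(tt)).astype(int), 0, w - 1)
    return image[ys, xs]  # colour channels, if present, are carried along

frame = np.zeros((704, 704, 3), dtype=np.uint8)  # placeholder for a 704 x 704 RGB frame
print(cartesian_to_polar(frame).shape)  # (360, 352, 3)
```

In the polar image, concentric tissue layers become roughly horizontal bands, which is what makes subsequent per-A-line calculations simpler.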

Learning model architecture
CNNs are built on convolutional layers, which extract features from local receptive fields of the input image. Each convolutional layer consists of n sets of weights shared between nodes, called convolutional kernels, which detect similar local features across the input channels. Each kernel produces a feature map as it slides over the whole input image with a defined stride, and the feature maps of one convolutional layer become the input of the next layer [36]. Traditionally, the output of a neuron is computed with a hyperbolic tangent or a logistic (sigmoid) function, both of which are saturating activation functions. When stochastic gradient descent is used to minimize the cost function with respect to the weights of each layer, training with saturating nonlinearities is slower than with non-saturating ones. A non-saturating activation function, the Rectified Linear Unit (ReLU), therefore accelerates training: it keeps non-negative values in the feature map and replaces negative values with zero [35]. CNNs alternate between convolutional and pooling layers for computational efficiency: pooling layers reduce dimensionality by aggregating the outputs of neurons in a convolutional layer, shrinking the feature maps. Pooling also makes the network invariant to small transformations, distortions, and translations of the input image, and helps control overfitting by reducing the number of parameters and computations [35].
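A single convolution, ReLU, and pooling stage can be sketched in NumPy as below. It is a deliberately minimal single-channel illustration (valid padding, one kernel, non-overlapping 2×2 max pooling) of the operations described above, not the layer implementation of any particular framework; like most CNN libraries, it computes a cross-correlation and calls it convolution.

```python
import numpy as np

def conv2d(x, kernel, stride=1):
    """Valid 2-D cross-correlation of a single-channel map with one kernel."""
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # the same shared weights at every location
    return out

def relu(x):
    """Non-saturating activation: keep non-negative values, set negative values to zero."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])  # responds to intensity rising left to right
feature_map = max_pool(relu(conv2d(image, edge_kernel)))
print(feature_map.shape)  # (2, 2)
```

The 6×6 input yields a 5×5 feature map, which pooling reduces to 2×2: the dimensionality reduction described above.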
CNNs are trained with the back-propagation algorithm, and stochastic gradient descent is commonly used to minimize the following cost function:

$$L(W) = -\frac{1}{|X|}\sum_{j=1}^{|X|} \ln p\left(y^{j} \mid X^{j}\right) \tag{1}$$

where |X| is the size of the training set and ln p(y^j | X^j) is the log-probability that the j-th image X^j is classified correctly with its corresponding label y^j. For each layer of the network, the weights are updated at each iteration i as follows:

$$V_{i+1} = \mu V_i - \gamma\,\alpha\,\nabla L(W_i) \tag{2}$$

$$W_{i+1} = W_i + V_{i+1} \tag{3}$$

where µ is the momentum, α is the learning rate, γ is the scheduling rate, which reduces the learning rate toward the end of training, V is the weight update, and W is the weight of each layer at iteration i [35,49].

Pre-trained networks are widely used both as feature extractors and as classifiers for different tasks. Among the most common architectures, we selected three pre-trained networks with different architectures. AlexNet is a simple, shallow network that is popular for clinical applications. It consists of five convolutional layers and three fully connected layers followed by a final softmax, with a GPU implementation of the convolution operation. The model was trained on 1.2 million images from the ImageNet dataset, annotated and categorized into 1000 semantic classes. It has 60 million parameters and 650,000 neurons and was trained using stochastic gradient descent with a batch size of 128, momentum of 0.9, and weight decay of 0.0005, which reduces the training error of the model [35]. The network architecture is shown in Fig. 2.
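The cost function (the mean negative log-probability of the correct labels) and the momentum update V ← µV − γα∇L, W ← W + V can be written out directly. The sketch below computes the loss from softmax outputs and applies the update to a toy quadratic objective; the toy objective, the constant schedule factor γ = 1, and the function names are assumptions for illustration only.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct labels."""
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked))

def momentum_step(weights, velocity, grad, lr, momentum=0.9, schedule=1.0):
    """One SGD-with-momentum step: V <- mu*V - gamma*alpha*grad, then W <- W + V."""
    velocity = momentum * velocity - schedule * lr * grad
    return weights + velocity, velocity

# Loss for two samples whose correct classes receive probabilities 0.5 and 0.75.
print(round(cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1]), 4))

# Toy problem (illustrative): minimise f(w) = ||w||^2 / 2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_step(w, v, grad=w, lr=0.1)
print(np.round(w, 4))
```

The velocity term accumulates past gradients, which is what speeds convergence along consistent descent directions compared with plain gradient descent.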
Deeper models were designed by stacking more convolutional layers. Instead of a large receptive field, each convolutional layer uses kernels with a very small, fixed-size receptive field. Every set of convolutional layers is followed by max pooling to reduce dimensionality, and every convolutional layer is followed by a ReLU to introduce non-linearity. The VGG networks were trained on 1.2 million ImageNet images from 1000 classes, with the batch size and momentum set to 256 and 0.9, respectively. The learning rate was initialized to 0.01 and decreased by a factor of 10 whenever the accuracy on the validation set stopped improving [30]. Among the deep VGG architectures, we selected VGG-19, which has 144 million parameters and consists of 16 convolutional layers and three fully connected layers, as shown in detail in Fig. 3.
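The benefit of small kernels can be checked with a short calculation: stacking n stride-1 3×3 convolutions gives a (2n+1)×(2n+1) receptive field, with fewer weights than a single layer of that size and a ReLU after every layer. The sketch assumes C input and C output channels per layer and ignores biases; the channel count is illustrative.

```python
def receptive_field(num_layers: int, kernel: int = 3, stride: int = 1) -> int:
    """Receptive field of a stack of identical conv layers (no pooling in between)."""
    rf = 1
    jump = 1  # spacing, in input pixels, between adjacent units of the current layer
    for _ in range(num_layers):
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

def conv_params(kernel: int, channels: int, num_layers: int = 1) -> int:
    """Weights (biases ignored) of num_layers k x k convs with C input and C output channels."""
    return num_layers * kernel * kernel * channels * channels

C = 256  # illustrative channel count
for n in (1, 2, 3):
    rf = receptive_field(n)
    print(f"{n} x (3x3): receptive field {rf}x{rf}, {conv_params(3, C, n):,} weights; "
          f"one {rf}x{rf} layer: {conv_params(rf, C):,} weights")
```

For three stacked layers this gives a 7×7 receptive field with 27C² weights instead of 49C² for a single 7×7 layer.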
Complex building blocks (inception blocks) were introduced to obtain models with fewer parameters and more efficient training by replacing fully connected architectures with sparsely connected ones. These networks are built from convolutional building blocks called inception modules, stacked on top of each other. Each inception module combines convolutional layers with kernel sizes of 1×1, 3×3, and 5×5, whose output filter banks are concatenated into a single output vector that forms the input of the next stage. The 1×1 convolutions in each inception module are used for dimensionality reduction before the computationally expensive 3×3 and 5×5 convolutions. Factorizing convolutions into smaller convolutions results in aggressive dimension reduction inside the network, which leads to fewer parameters and a low computational cost. Inception models are trained using stochastic gradient descent with a batch size of 32 for 100 epochs and momentum with a decay of 0.9. The learning rate is initialized to 0.045 and decayed every two epochs by an exponential rate of 0.94 [37,38]. A pre-trained Inception-v3 is used in our experiments; this version updates the inception modules to further boost ImageNet classification accuracy. The last part of the network, which is used for fine-tuning in our experiments, is shown in Fig. 4.
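The savings from 1×1 dimensionality reduction, and the exponential learning-rate schedule quoted above, can both be checked with a few lines of arithmetic. The feature-map size and channel counts below are illustrative assumptions chosen for a single 5×5 branch; the schedule function simply restates the 0.045 initial rate decayed by 0.94 every two epochs.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulate operations of a stride-1, 'same'-padded k x k convolution."""
    return h * w * k * k * c_in * c_out

# Illustrative sizes: 28 x 28 feature map, 192 input channels, 32 output channels.
H = W = 28
direct = conv_macs(H, W, 5, 192, 32)
bottleneck = conv_macs(H, W, 1, 192, 16) + conv_macs(H, W, 5, 16, 32)  # 1x1 reduce to 16 first
print(f"direct 5x5: {direct:,} MACs; 1x1 reduction then 5x5: {bottleneck:,} MACs")

def inception_v3_learning_rate(epoch, base_lr=0.045, decay=0.94, every=2):
    """Exponential step schedule: multiply the rate by `decay` once every `every` epochs."""
    return base_lr * decay ** (epoch // every)

print([round(inception_v3_learning_rate(e), 5) for e in range(6)])
```

With these sizes the 1×1 reduction cuts the cost of the 5×5 branch by roughly a factor of ten.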

Training and validation
In our experiments, a total of 3149 tissue regions are extracted from the OCT pullback images and manually labeled as calcification, fibrosis, normal intima, macrophage, media, or neovascularization. The annotations are validated by expert cardiologists. The ROIs are extracted from each frame of the sequence by manual segmentation and are labeled 1 to 6 for calcification, fibrosis, normal intima, macrophage, media, and neovascularization, respectively. To start the experiments, 66% of the ROIs are randomly selected as the training set. To keep the training, validation, and test sets disjoint, 50% of the remaining ROIs are randomly selected as the validation set, and the test set is built from the ROIs that are left. The experiments are performed in four steps to find the optimal tissue characterization framework.
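The split described above can be sketched as follows. The helper name, the fixed random seed, and the rule that the test set receives the extra ROI when the remainder is odd are assumptions made for illustration.

```python
import random

def split_indices(n_items, train_frac=0.66, seed=0):
    """Randomly split item indices into disjoint train / validation / test lists.

    train_frac of the items go to training; the remainder is halved into
    validation and test, with the test set taking the extra item if the count is odd.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * n_items)
    n_val = (n_items - n_train) // 2
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train, val, test = split_indices(3149)
print(len(train), len(val), len(test))  # 2078 535 536
```

This sketch splits at the ROI level, as the text describes; splitting by pullback or by patient instead would prevent ROIs from the same sequence from appearing in more than one set.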

Classification using fine-tuned networks
For each convolutional neural network, fine-tuning is performed as follows before training starts. Since the number of nodes in the last fully connected layer depends on the number of classes in each dataset, we first removed the classification layers and replaced them with layers designed for our classification task. When a CNN is trained from scratch, the iterative weight update starts from randomly initialized weights at each layer. Since the amount of labeled data in our experiments is limited, the weights can instead be initialized with those of the pre-trained networks, so that the iterative weight updates of equations 2 and 3 converge quickly to a desirable local minimum of the cost function (equation 1). Therefore, in the next step, the weights of each layer are initialized with the weights of the pre-trained network. The iterative weight update then proceeds with layer-wise fine-tuning, finding the optimal learning parameters for each convolutional and fully connected layer. Considering the complexity of the pathological formations, the pre-trained AlexNet is fine-tuned on our new dataset. The last three layers of the pre-trained network (fc8, prob, and the classification layer) are replaced by a set of layers designed for the multi-class task of classifying calcification, fibrosis, macrophage, neovascularization, and normal tissues (intima and media). The values of µ and γ are kept at 0.9 and 0.95, respectively; the learning rate of the last fully connected layers (fc6, fc7, and fc8) is set to 0.1 so that these layers learn faster, and the learning rate is decreased to 0.01 starting from the last convolutional layer (Conv5).
Adding convolutional layers and reducing the filter size gives access to more detailed image information, so increasing the depth and width of the network can improve the quality of its architecture. For a fair comparison among the pre-trained networks, we selected VGG-19, the deepest configuration of the VGG family, from the category of very deep CNN architectures. As explained in the previous section, VGG-19 has almost the same configuration as AlexNet, with more convolutional layers and smaller filters. VGG-19 is therefore fine-tuned with the same strategy applied to AlexNet. We started by removing the classification layers (fc8, prob, and output) and replacing them with a set of layers appropriate for multi-class classification of the coronary artery tissues (calcification, fibrosis, macrophage, neovascularization, normal intima, and media). Fine-tuning started from the last fully connected layer (fc8), and its depth was increased gradually while the network performance was evaluated at each level. To find the optimal parameters at each level of fine-tuning, an interval of values close to the optimal values of the fine-tuned AlexNet was chosen. For all the networks in this study, the optimal parameters are determined by grid search over the defined interval, evaluating the performance of the network at each step. The best performance was obtained with fixed values of 0.8 and 0.85 for µ and γ, respectively. The learning rate was set to 0.2 for the last fully connected layers (fc6, fc7, and fc8) and decreased to 0.01 starting from the last convolutional layer (Conv5-4).
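The grid search over learning parameters can be sketched generically. Here `evaluate` stands for "fine-tune with these parameters, then measure validation accuracy"; the stand-in scoring function and the candidate intervals are hypothetical and only make the example runnable.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustive search: return the best-scoring parameter dict and its score.

    `evaluate` maps a parameter dict to a validation accuracy (higher is better).
    """
    best_params, best_score = None, float("-inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical intervals around the values found for the fine-tuned AlexNet.
grid = {"momentum": [0.8, 0.85, 0.9], "schedule": [0.85, 0.9, 0.95], "lr_fc": [0.1, 0.2]}

def fake_validation_accuracy(p):
    """Stand-in for a real fine-tuning run followed by validation (illustrative only)."""
    return 1.0 - abs(p["momentum"] - 0.8) - abs(p["schedule"] - 0.85) - abs(p["lr_fc"] - 0.2)

print(grid_search(fake_validation_accuracy, grid))
```

Because each evaluation is a full fine-tuning run, the cost grows with the product of the interval sizes, which is why the intervals are kept narrow around previously found optima.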
Complex building blocks form very deep network architectures that use a particular configuration of inception modules to reduce the number of parameters and thereby make training more efficient. We selected Inception-v3 from the category of complex architectures because the latest inception models factorize convolutions into smaller ones: each 5×5 convolution is replaced by two 3×3 convolutions in versions such as Inception-v3. This version also reduces the grid size between the inception blocks, which lowers the computational cost and speeds up training. Given the complexity of Inception architectures, modifying the network can undermine these computational gains, so these networks are more difficult to adapt to a new classification task. To fine-tune the network, we removed its last layers (predictions, predictions-softmax, and ClassificationLayer-predictions), which aggregate the extracted features for classification, and added to the network graph a new set of classification layers adapted to our dataset. The new layers are connected to the transferred network graph, and the learning rate of the fully connected layer is set to 0.1.
At each fine-tuning step, for all networks, accuracy is computed on the validation set, and training is stopped when the highest validation accuracy is reached. Once training ends, classification is performed on the test set with each fine-tuned network separately.
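Since the highest validation accuracy can only be recognized in hindsight, a common way to implement this stopping rule is to track the best epoch and stop after a fixed number of epochs without improvement. The sketch below assumes such a patience parameter and a list of per-epoch accuracies; both are illustrative and not stated in the text.

```python
def early_stopping(val_accuracies, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation accuracies.

    Training stops once `patience` consecutive epochs fail to beat the best accuracy;
    the weights saved at best_epoch are the ones used on the test set.
    """
    best_epoch, best_acc, waited = 0, float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_epoch, best_acc, waited = epoch, acc, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_accuracies) - 1

history = [0.61, 0.74, 0.82, 0.86, 0.85, 0.86, 0.84, 0.88]  # hypothetical accuracies
print(early_stopping(history, patience=3))  # (3, 6)
```

Note that the later 0.88 is never seen: a larger patience trades extra training time for a lower chance of stopping too early.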

Training random forest using deep features generated by pre-trained networks
In this experiment, the pre-trained networks are used as feature generators: the activations of the last layer before the classification layer are used to train a Random Forest to classify the coronary artery tissues. For AlexNet and VGG-19, features are extracted from the last fully connected layer before the classification layer (fc7), so each feature vector contains 4096 attributes of the labeled tissue. For Inception-v3, features are extracted from the last depth concatenation layer (mixed10), and each feature vector contains 131072 attributes. Our previous work showed that Random Forest is a robust classifier with a fast training process and a low risk of overfitting [49]. It builds an ensemble of trees, grown with the CART methodology to maximum size without pruning. The generalization error of a Random Forest is governed by the ratio $\bar{\rho}/s^{2}$, where $s$ is the strength of the trees and $\bar{\rho}$ is the correlation between them; the smaller this ratio, the better the Random Forest performs [50,51]. To find the optimal number of trees, the performance of the Random Forest is evaluated for up to 1000 trees, trained separately on each set of features. The out-of-bag (OOB) error rate stops decreasing at 250 trees with the Inception-v3 and VGG-19 features, and at 300 trees with the AlexNet features (see Fig. 5). Fewer trees speed up training by reducing the computational complexity. The number of randomly selected predictors ($m_{try}$) is set to 7.
The training features extracted from each pre-trained network are used separately to train a Random Forest, and classification is performed on the test set using the test features extracted by the same network.
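The search for the number of trees can be sketched with scikit-learn, which is not the toolkit used in this study (the experiments ran in MATLAB). The synthetic data stands in for the deep feature vectors, `max_features=7` mirrors the m_try setting, and the tree grid stops at 300 to keep the example fast; all of these are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for deep features (the study used 4096-D fc7 vectors, among others).
X, y = make_classification(n_samples=600, n_features=64, n_informative=12,
                           n_classes=6, n_clusters_per_class=1, random_state=0)

forest = RandomForestClassifier(max_features=7, oob_score=True,
                                warm_start=True, random_state=0)
oob_errors = {}
for n_trees in range(25, 301, 25):
    forest.set_params(n_estimators=n_trees)  # with warm_start, only the new trees are grown
    forest.fit(X, y)
    oob_errors[n_trees] = 1.0 - forest.oob_score_

for n_trees, err in oob_errors.items():
    print(f"{n_trees:4d} trees: OOB error {err:.3f}")
```

The number of trees is then chosen where the OOB curve flattens; since each tree's OOB samples act as a built-in validation set, no separate data is consumed by this search.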

Classification using majority voting
Fig. 5. OOB error rate used to find the optimal number of trees for the Random Forest model. OOB errors are computed for up to 1000 trees while the Random Forest is trained separately on the features extracted from each network.

Inspired by ensemble learning approaches, we applied weighted majority voting (equation 4) to the classification results obtained in the second experiment. Classification is performed by

$$C(x) = \arg\max_{i} \sum_{j=1}^{3} w_j\, I\left(C_j(x) = i\right) \tag{4}$$
where $C(x)$ is the class label with the majority vote, $i$ is the class label (ranging from 1 to 6 for calcification, fibrosis, normal intima, macrophage, media, and neovascularization), $w_j$ is the weight assigned to the $j$-th classification result, and $I$ is the indicator function. Majority voting thus searches all the classification labels for the most frequent label assigned to each tissue, using equation 5:

$$C(x) = \operatorname{mode}\left\{C_1(x), C_2(x), C_3(x)\right\} \tag{5}$$
where $C_1(x)$, $C_2(x)$, and $C_3(x)$ are the Random Forest classification results obtained with the features extracted from AlexNet, VGG-19, and Inception-v3, respectively. The weights are set to 1/3 for all three sets of classification results, except when the three results are three different tissue labels. This is because each network provides different information about a particular tissue, which may be significant for characterizing it properly. Regardless of the overall performance of each network, AlexNet works very well for characterizing normal intima, and VGG-19 provides important information about calcification. Therefore, except when the model must choose, among three different labels, the one with the highest probability of belonging to the true class, we take the most frequent label as the majority vote. Since the mode of $C_1(x)$, $C_2(x)$, and $C_3(x)$ returns the smallest tissue label when $C_1(x) \neq C_2(x) \neq C_3(x)$, in that case we put more weight on the third set of predicted labels, given the strength of the deep Inception-v3 features. The majority vote is thus the class label with the highest probability of belonging to the true class.

Fig. 6. Confusion matrix of intracoronary tissue classification using fine-tuned AlexNet.
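The voting rule can be sketched as follows, under our reading of the description above: with equal weights the most frequent label wins, and when all three classifiers disagree the Inception-v3 label is taken. The exact weights used in the three-way disagreement case are not given in the text, so returning the Inception-v3 label outright is an assumption; the function name is hypothetical.

```python
from collections import Counter

TISSUES = {1: "calcification", 2: "fibrosis", 3: "normal intima",
           4: "macrophage", 5: "media", 6: "neovascularization"}

def fuse_labels(c_alexnet, c_vgg19, c_inception):
    """Combine three Random Forest predictions for one ROI.

    If at least two classifiers agree (equal weights of 1/3), their label wins;
    if all three disagree, the Inception-v3 prediction is returned.
    """
    votes = Counter([c_alexnet, c_vgg19, c_inception])
    label, count = votes.most_common(1)[0]
    if count == 1:  # three different labels: defer to Inception-v3
        return c_inception
    return label

print([TISSUES[fuse_labels(*p)] for p in [(3, 3, 5), (1, 2, 1), (2, 4, 6)]])
```

The explicit disagreement branch avoids the tie-breaking behavior of a plain mode, which would otherwise return the smallest label regardless of which network predicted it.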

RF classification using deep feature fusion
To consider all possible ways of finding the optimal tissue characterization framework, we combined the features obtained from AlexNet and VGG-19 to train a Random Forest. Classification is performed on the test set, and the results are compared with those of the previous experiments. The features extracted from Inception-v3 are not used in this experiment because their feature matrix is too large to be combined with the other feature matrices. MATLAB R2017b is used for all experiments in this study, on the following configuration: Intel Core i7-6700K, 16 GB of RAM, GeForce Titan X GPU (12 GB RAM), and Windows 10 (64-bit).
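Feature fusion and the size argument for excluding Inception-v3 can be illustrated together. The sketch assumes that "combined" means column-wise concatenation of per-ROI feature vectors (early fusion) and single-precision storage; the random arrays are placeholders for real activations.

```python
import numpy as np

def fuse_features(*feature_blocks):
    """Early fusion: concatenate per-ROI feature vectors from several networks column-wise."""
    return np.hstack(feature_blocks)

def matrix_gib(n_rows, n_cols, bytes_per_value=4):
    """Memory footprint of a dense feature matrix in GiB (4 bytes = single precision)."""
    return n_rows * n_cols * bytes_per_value / 2**30

n_rois = 8  # small illustrative batch
alexnet_fc7 = np.random.rand(n_rois, 4096).astype(np.float32)
vgg19_fc7 = np.random.rand(n_rois, 4096).astype(np.float32)
fused = fuse_features(alexnet_fc7, vgg19_fc7)
print(fused.shape)  # (8, 8192)

# Full dataset of 3149 ROIs: fc7 fusion alone vs. also appending 131072-D mixed10 features.
print(f"{matrix_gib(3149, 4096 + 4096):.2f} GiB vs "
      f"{matrix_gib(3149, 4096 + 4096 + 131072):.2f} GiB")
```

Appending the mixed10 features multiplies the matrix width by about 17, which also makes every Random Forest split search over candidate features correspondingly more expensive.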

Classification using fine-tuned networks
For the first experiment, fine-tuning is performed on AlexNet, VGG-19, and Inception-v3, which represent simple architectures, very deep architectures, and complex networks, respectively. The optimal fine-tuning parameters are estimated, and the networks are retrained with the new learning parameters. Each network performs the classification separately, and accuracy, sensitivity, and specificity are measured from the corresponding confusion matrix. The results are shown in Figs. 6-8 and Tables 2-4. VGG-19 and Inception-v3 outperform AlexNet, as expected given their deeper architectures. Although using pre-trained networks reduces the computational burden, shortening training time and mitigating convergence issues, a considerable amount of time is still required to find the optimal learning parameters and retrain the fine-tuned networks (approximately two hours per network). Deep fine-tuning of a network also carries a risk of overfitting. The following experiments are proposed to find an optimal tissue characterization model that overcomes these issues efficiently.
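The per-tissue measures can be computed from a multi-class confusion matrix by treating each tissue one-vs-rest. The sketch below assumes that convention, because the text does not spell out how per-tissue accuracy is defined. It also assumes that rows hold true classes and columns hold predicted classes. `per_class_metrics` is a hypothetical name.

```python
from collections.abc import Sequence


def per_class_metrics(cm: Sequence[Sequence[int]]) -> list[dict[str, float]]:
    """One-vs-rest accuracy, sensitivity and specificity for each class.

    cm[i][j] counts ROIs whose true class is i and predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # class-k ROIs predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp  # other ROIs predicted as class k
        tn = total - tp - fn - fp                  # everything else
        results.append({
            "accuracy": (tp + tn) / total,
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        })
    return results
```

For a two-class matrix `[[8, 2], [1, 9]]`, class 0 has 8 true positives, 2 false negatives, 1 false positive, and 9 true negatives. This gives a sensitivity of 0.8, a specificity of 0.9, and an accuracy of 0.85.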

Training random forest using deep features generated by pre-trained networks
In this experiment, deep features are extracted from AlexNet, VGG-19, and Inception-v3. Each network is applied separately as a feature generator, the extracted training features are used to train a Random Forest, and classification is performed on the test set. Features are extracted from the penultimate fully connected layer (fc7) in the AlexNet and VGG-19 architectures, and from the last depth concatenation layer (mixed10) in the Inception-v3 architecture. Accuracy, sensitivity, and specificity are measured from the corresponding confusion matrix for each classification result; these are shown in Figs. 9-11 and Tables 5-7. Excluding the time spent finding the optimal learning parameters, extracting features from all three networks and training a Random Forest on each feature set takes approximately half the time needed to retrain a single network. Using pre-trained networks as feature extractors avoids the problems of fine-tuning, including long training time and overfitting. However, the classification performance is lower than that obtained with the fine-tuned CNNs as classifiers (Figs. 14-16). To address this, the following two experiments are performed, and the results of all experiments are compared against each other.

Table 2. Measured sensitivity, specificity, and accuracy of tissue classification using fine-tuned AlexNet (columns: Tissue, Accuracy, Sensitivity, Specificity).
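Training a Random Forest on a fixed feature matrix follows the usual recipe of bootstrap sampling, random feature subsets, and majority voting across trees. The toy sketch below illustrates that recipe in pure Python and uses depth-one trees (decision stumps) to stay short. It is not the MATLAB implementation used in the study, whose number of trees, tree depth, and feature-subset size are not given here. In practice, each row of `X` would hold the deep features (fc7 or mixed10 activations) of one ROI, and `y` would hold its tissue label. All names are hypothetical.

```python
import random
from collections import Counter
from collections.abc import Sequence

Stump = tuple[int, float, int, int]  # (feature index, threshold, left label, right label)


def _gini(labels: Sequence[int]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def _majority(labels: Sequence[int]) -> int:
    return Counter(labels).most_common(1)[0][0]


def _fit_stump(X, y, feature_ids) -> Stump:
    """Best single split (lowest weighted Gini impurity) over the candidate features."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            impurity = (len(left) * _gini(left) + len(right) * _gini(right)) / len(y)
            if best is None or impurity < best[0]:
                best = (impurity, f, t, _majority(left), _majority(right))
    if best is None:  # no valid split: predict the bootstrap majority everywhere
        label = _majority(y)
        return (0, float("inf"), label, label)
    return best[1:]


def fit_forest(X, y, n_trees: int = 25, seed: int = 0) -> list[Stump]:
    """Bagged stumps, each trained on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))  # sqrt(d) candidate features per tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (with replacement)
        feats = rng.sample(range(d), k)             # random feature subset
        forest.append(_fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest: list[Stump], x: Sequence[float]) -> int:
    """Majority vote of all stumps for one feature vector."""
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return _majority(votes)
```

A full Random Forest grows deeper trees and redraws the candidate features at every split. For a stump, the two coincide, which keeps the example compact.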

Majority voting
In this experiment, weighted majority voting is applied to the Random Forest classification results obtained with each of the three feature sets. The results are illustrated in Fig. 12 and Table 8 and show a clear improvement in the accuracy, sensitivity, and specificity of the final classification.

The results of Random Forest classification using the combined AlexNet and VGG-19 features are shown in Fig. 13 and Table 9. The last two experiments show that the majority voting approach performs better than a Random Forest trained on the combined features. CNN features are highly descriptive of the various pathological formations and tissues, and the results of the experiments are high partly because classification is performed on ROIs. To choose the optimal tissue characterization model and to compare the experiments against each other, the mean ± standard deviation of the accuracy, sensitivity, and specificity over all tissues is calculated for each experiment; the results are shown in Figs. 14-16 and Table 10. Although combining features improves the classification results compared with using each network separately as a feature extractor, the majority voting approach yields considerably higher results than classification with the combined features (Figs. 14-16).
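Table 10 summarizes each experiment by the mean ± standard deviation of a metric over all tissues. A minimal sketch of that aggregation follows. The text does not state whether the sample or the population standard deviation was used, so the `sample_std` flag is our assumption, and the function names are hypothetical.

```python
import statistics
from collections.abc import Mapping


def summarize(per_tissue: Mapping[str, float], sample_std: bool = True) -> tuple[float, float]:
    """Mean and standard deviation of one metric over all tissue classes."""
    values = list(per_tissue.values())
    spread = statistics.stdev(values) if sample_std else statistics.pstdev(values)
    return statistics.mean(values), spread


def format_summary(per_tissue: Mapping[str, float], digits: int = 2) -> str:
    """Render the summary in the 'mean ± std' style used in the tables."""
    mean, std = summarize(per_tissue)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"
```

The input maps each tissue name (for example, "calcification" or "media") to the value of one metric for one experiment. Calling the function once per metric and experiment fills one row of such a table.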

Discussion
In this study, we evaluated the performance of pre-trained networks for intracoronary tissue characterization. Three state-of-the-art networks (AlexNet, VGG-19, and Inception-v3) are used in four experiments. The experiments start with fine-tuning the networks and using them to classify six tissue labels (calcification, fibrosis, neovascularization, macrophage, normal intima, and media), since fine-tuning is the most common way of applying pre-trained networks in medical image analysis. Each subsequent experiment is designed to address the limitations of the previous one. Together they work toward the main goal of this study: an accurate intracoronary tissue classification model based on deep feature learning, obtained through an efficient procedure. The second experiment is performed to avoid the convergence issues, the overfitting caused by deep fine-tuning, and the long training time of the first. Deep features are highly descriptive of arterial tissues, and Random Forest works efficiently on large datasets with a very low risk of overfitting. Training a Random Forest is also considerably fast. However, when pre-trained networks are used as feature generators without fine-tuning, the classification results show lower accuracy, sensitivity, and specificity than the fine-tuned networks used as classifiers. Majority voting on the Random Forest classification results considerably improves on the second experiment without adding a large computational burden. The accuracy, sensitivity, and specificity obtained in the third experiment (majority voting on Random Forest classification) are comparable to the classification performance of the fine-tuned networks.
Based on the results of all experiments, the most efficient approach is to use pre-trained networks as feature extractors and train a Random Forest on each set of generated features. Majority voting then provides the final tissue classification result. Fig. 17 shows classification results for each coronary artery tissue; as noted above, the values are high partly because classification is performed on ROIs. The optimal model therefore extracts features from pre-trained networks without any fine-tuning and uses Random Forest as the classifier. Random Forest is known to be a classifier with a low risk of overfitting. Nevertheless, to address this concern, leave-one-patient-out cross-validation is performed. At each step, the OCT images of one patient are held out as the test set, and the classifier is trained on the OCT images of the remaining patients (Fig. 18).
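Leave-one-patient-out cross-validation keeps all images of a patient on the same side of the split, so no patient contributes to both training and testing. The generator below is a minimal sketch that assumes each image (or ROI) carries a patient identifier. It only produces index folds and leaves feature extraction and classification to the pipeline described above. `leave_one_patient_out` is a hypothetical name.

```python
from collections.abc import Iterator, Sequence


def leave_one_patient_out(
    patient_ids: Sequence[str],
) -> Iterator[tuple[str, list[int], list[int]]]:
    """Yield (held-out patient, train indices, test indices), one fold per patient."""
    for held_out in sorted(set(patient_ids)):
        test = [i for i, p in enumerate(patient_ids) if p == held_out]
        train = [i for i, p in enumerate(patient_ids) if p != held_out]
        yield held_out, train, test
```

For each fold, the classifiers would be trained on the `train` indices and evaluated on the `test` indices. The per-fold metrics would then be aggregated across patients.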
To reduce the computational burden, the experiments above were performed on a single random selection of training, validation, and test sets. To evaluate the model under different randomizations of these sets, we repeated the experiment for 10 iterations using the final characterization model (feature extraction using CNNs, classification using RF, and final classification by majority voting). As shown in Table 11, the results vary somewhat between iterations because of the different selections of training, validation, and test sets. Nevertheless, the accuracy, sensitivity, and specificity demonstrate the robustness of the model in distinguishing between coronary artery tissues.

Table 11. Measured sensitivity, specificity, and accuracy of tissue classification using the final model (feature extraction using CNNs, classification using RF, and final classification by majority voting), over 10 iterations with different randomizations of the training, validation, and test sets. Values are reported as mean ± std over all iterations.
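The repeated randomization can be sketched as drawing a fresh shuffle of the data for each of the 10 iterations. The 70/15/15 proportions below are hypothetical because the text does not give the split sizes. The indices may refer to ROIs or to patients, depending on the level at which the sets are drawn. Per-iteration metrics would then be reported as mean ± std, as in Table 11.

```python
import random


def random_split(n_items: int, fractions=(0.7, 0.15, 0.15), seed: int = 0):
    """Shuffle indices 0..n_items-1 and cut them into train/validation/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)  # a different seed gives a different randomization
    n_train = int(round(fractions[0] * n_items))
    n_val = int(round(fractions[1] * n_items))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def repeated_splits(n_items: int, n_iterations: int = 10, **kwargs):
    """One independent split per iteration, each seeded by its iteration number."""
    return [random_split(n_items, seed=it, **kwargs) for it in range(n_iterations)]
```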

Conclusion
The goal of this study was to propose a new approach for OCT imaging using deep feature learning from different CNN models and to evaluate their performance on a complex multi-class classification problem: pathological formations in coronary artery tissues. The most significant outcome is the ability to automatically differentiate between intracoronary pathological formations observed in OCT imaging. This may be highly relevant for the automatic assessment of coronary artery disease in KD. Majority voting on Random Forest classifications using deep features was successful in classifying coronary artery tissues. The final tissue labels were obtained with high accuracy, sensitivity, and specificity. These results indicate that the proposed technique is robust, given the high variability of pathological formations, OCT artifacts, and the small size of the arteries in pediatric patients, whose coronary artery walls consist of very thin layers. In this work, we have outlined the relevance of deep features obtained through transfer learning for OCT imaging and the practical benefit of using RF classification to reach the final decision in a clinically acceptable computational time. In future work, we will focus on detecting intimal hyperplasia by measuring intimal thickness. We will also assess the severity of pathological formations by evaluating distensibility variations caused by calcification and fibrous scarring. Given a suitable dataset and manual annotations, this approach might be adapted to adult coronary artery disease to fully assess the structural information of the coronary artery.

Funding
Fonds de Recherche du Québec-Nature et technologies.

Disclosures
The authors declare that there are no conflicts of interest related to this article.