Characterization of Optical Coherence Tomography Images for Colon Lesion Differentiation under Deep Learning

(1) Background: Clinicians demand new tools for early diagnosis and improved detection of colon lesions that are vital for patient prognosis. Optical coherence tomography (OCT) allows microscopical inspection of tissue and might serve as an optical biopsy method that could lead to in-situ diagnosis and treatment decisions; (2) Methods: A database of murine (rat) healthy, hyperplastic and neoplastic colonic samples with more than 94,000 images was acquired. A methodology that includes a data augmentation processing strategy and a deep learning model for automatic classification (benign vs. malignant) of OCT images is presented and validated over this dataset. Comparative evaluation is performed both over individual B-scan images and C-scan volumes; (3) Results: A model was trained and evaluated with the proposed methodology using six different data splits to present statistically significant results. Considering this, 0.9695 (±0.0141) sensitivity and 0.8094 (±0.1524) specificity were obtained when diagnosis was performed over B-scan images. On the other hand, 0.9821 (±0.0197) sensitivity and 0.7865 (±0.205) specificity were achieved when diagnosis was made considering all the images in the whole C-scan volume; (4) Conclusions: The proposed methodology based on deep learning showed great potential for the automatic characterization of colon polyps and future development of the optical biopsy paradigm.


Introduction
Colon cancer is the second most common cause of cancer death in Europe both for women and men, and the third most common cancer worldwide [1]. About 1.8 million new cases of colorectal cancer were recorded globally in 2018 [2], being the third most common cancer in men and second in women. The five-year survival rate is 90 percent for colorectal cancers diagnosed at an early stage, but unfortunately only 4 out of 10 cases are found this early [3].
OCT capabilities for the diagnosis of colon polyps have been investigated in recent years, with promising results for future adoption in clinical practice. Several studies [18][19][20][21], in both murine and human models, have reported the identification of tissue layers and the capacity of the technology to differentiate types of benign (including healthy) and malignant tissue. When analyzing 44 polyps from 24 patients [18], endoscopists detected fewer subsurface structures and a lower degree of light scattering in adenomas, whereas hyperplastic polyps were closer in structure and light scattering to healthy mucosa. The scattering property was calculated by a computer program applying statistical analysis (Fisher-Freeman-Halton test and Spearman rank correlation test), which confirmed these observations. A comparison of OCT images with histopathological images was performed in [19] using previously defined criteria for OCT image interpretation based on the identification of tissue layers. According to these observations, hyperplastic polyps are characterized by a three-layer structure (with mucosa thickening), whereas adenomas are characterized by the lack of layers. Under these assumptions, measured over a group of 116 polyps from patients, lesions could be visually differentiated in OCT images with 0.92 sensitivity and 0.84 specificity. Later, a fluorescence-guided study performed on 21 mice [20] after administering a contrast agent showed the ability of OCT to differentiate healthy mucosa, early dysplasia, and adenocarcinoma. Visual analysis of normal tissue revealed that the submucosa layer is very thin in some specimens and not always clearly visible in the OCT images, although the tissue boundaries remain distinguishable.
In adenoma polyps, a thickening of the mucosa (in the first stages) or the disappearance of the boundary between layers is detected, whereas in adenocarcinoma the OCT images showed a loss of tissue texture, absence of layers, and the presence of dark spots caused by the high absorption in necrotic areas. The most recent study [21] goes further and proposes a diagnostic criterion for micro-OCT images with some similarities to the Kudo pit pattern [22], demonstrating the diagnostic capacity of OCT: clinicians reached 0.9688 sensitivity and 0.9231 specificity in the identification of adenomas over 58 polyps from patients.
Both the cross sectional and the en-face images have been shown to provide clinically relevant information in the mentioned studies, and the combination of both views for the detailed study of tissue features suggests an important advance [23][24][25]. In addition to previous studies, the calculation of the angular spectrum of the scattering coefficient map has also revealed quantifiable variances on the different tissue types [26].
The clinical characteristics of the lesions that can be observed in OCT images can be further exploited by image-based analysis. Image and signal processing methods help to deal with the noisy nature of the signal, whereas machine learning algorithms can exploit the spatial correlation of the biological structures. These algorithms can detect and quantify subtle variations in images that the naked human eye cannot, and can be applied to the automatic interpretation of images for image enhancement, lesion delimitation, or classification tasks. However, as the studies reviewed above show, few attempts to apply these methods to OCT images of colon polyps have been reported, which leaves room for research in this area.
The main limitation of traditional machine learning methods is the need to transform the original data from their natural form into a representation appropriate for the targeted problem. Image processing methods must be carefully applied to extract the most representative features of the data, aiming to resemble how experts analyze the images. The extracted features are then passed as input to the selected classifier. Unlike deep learning approaches, traditional machine learning methods require tailored feature extraction followed by a shallow machine learning method. This limits their ability to generalize and leads to lower discriminative power [27]. Under the deep learning paradigm, image feature extraction and classification are performed simultaneously through a network architecture representing all possible solution domains, which is optimized by minimizing a loss function that drives the network parameters towards a suitable solution. Convolutional neural networks (CNN) [28,29] have surpassed classical machine learning methods [30,31], and even medical expert capabilities [32][33][34]. They have also been successfully applied in colon cancer histopathological classification [35,36], MPM classification [37], polyp detection on colonoscopy [38][39][40], and histological colon tissue staining [41].
The application of deep learning methods to OCT medical images is a recent trend and only a few examples of application are available. Since ophthalmology is the oldest context of application of OCT, most examples are found in this area, with some others in cardiology and breast cancer [42][43][44][45]. In gastroenterology of the lower tract (colon), only one recent work has been identified [46]. A pattern recognition network called RetinaNet [47] was trained to distinguish normal from neoplastic tissue with 1.0 sensitivity and 0.997 specificity. The success of the model is based on a dentate structural pattern, identified in normal tissue in previous studies, which is used as a structural marker on the images used as input during training and evaluation. To this end, the B-scan images in the dataset (26,000 images acquired from 20 tumor areas) were manually inspected to identify "teeth" samples representing normal colonic mucosa and "noisy" samples representing malignant tissue. On evaluation, the network provides a list of boxes where these patterns are found along with their probability, and average scores are calculated over a sequence of N adjacent B-scan images. The drawback of this approach is that it relies on identifying the "teeth" pattern in normal tissue: no specific patterns have been identified for malignant tissue, and it is simply assumed that the "teeth" pattern does not appear in that case.
The work presented in this paper further investigates the application of deep learning methods over a collected database with more than 94,000 OCT images of murine (rat) colon polyps to study the discrimination capacity of this imaging technique for its future adoption as a real-time optical biopsy method. The aim of this proposal is to help lay the foundations for the automatic analysis of images with state-of-the-art techniques that could lead to the development of new computer-aided diagnosis (CADx) applications. Once image analysis methods demonstrate this capacity, colon polyp diagnosis with OCT can be progressively mastered by clinicians and the technology adopted naturally. With this aim, this work implements a classification (benign vs. malignant) approach based on an Xception deep learning model that is trained and tested over a large dataset of OCT images from murine (rat) samples collected for this purpose. We propose a pre-processing method that also serves as data augmentation, and we validate the application of deep learning methods for the classification of colon polyps as benign or malignant. In addition, to further investigate the diagnostic capacity of the proposed approach, evaluation is performed twice, once over individual B-scan images and once over C-scan volumes, for comparison. Finally, a strategy to maximize results when evaluating individual B-scans is applied.
In comparison with previous studies [46], this work proposes a general diagnosis strategy based on classification instead of pattern recognition, which avoids time consuming manual annotation of the database providing automatic identification of the characteristics representing polyps tissue type. The classification strategy model can generalize better upon new polyp categories than the segmentation strategy, the performance of which is biased by the available annotations of the database. A classification strategy can help in the identification of subtle characteristics present on noisy OCT images that are not easily distinguished by the naked eye, and with proper visualization of them, can help clinicians to better understand the OCT imaging technique. In the future, the combination of both approaches could be considered for maximizing automatic diagnosis results.

Animal Models
Sixty animals with colorectal cancer (CRC) from the strain PIRC (polyposis in the rat colon) rat F344/NTac-Apcam1137 model (sex ratio: 50/50) from the Rat Resource and Research Centre (RRRC) were used for the extraction of neoplastic colonic samples. This animal model was used in the study for the following main reasons: (a) it is an excellent model for studying human familial colon cancer; (b) ENU (N-ethyl-N-nitrosourea)-induced point mutation results in a truncating mutation in the APC (adenomatous polyposis coli) gene at a site corresponding to the human mutation hotspot region of the gene; (c) heterozygotes develop multiple tumors in the small intestine and colon by 2-4 months of age; (d) PIRC tumors closely resemble those in humans in terms of histopathology and morphology as well as distribution between intestine and colon; (e) provides longer lifespan compared to related mouse models (10-15 months); and (f) tumors may be visualized by CT (computerized tomography), endoscopy, or dissection. Moreover, the absolute incidence and multiplicity of colonic tumors are higher in F344-PIRC rats than in carcinogen-treated wild-type F344 rats, or in mice [48,49].
Additionally, thirty rats from the strain Fischer344-F344 wildtype model (sex ratio: 50/50) were used for the development and extraction of hyperplastic colonic samples. A rat surgical model of hyperplasia in the colon was developed de novo for endoscopic applications. It recreates important features of human hyperplasia, such as the generation of new cells in the colonic mucosa and tissue growth, as well as the corresponding angiogenesis. It consists of an extracolonic suture on which lesions are inflicted with a biopsy extraction forceps during a period established in different weekly follow-ups for the correct induction of the model [50,51].
Finally, as a control group, ten healthy tissue samples from three specimens were extracted from the colon of rats from the strain Fischer344-F344 wildtype model (sex ratio: 50/50). Uninvolved areas of the hyperplasia animals (ascending colon, transverse colon, and regions of the descending colon without lesion) were used as healthy tissue samples. This satisfied one of the three Rs of animal research, which aims to maximize the information obtained per animal, making it possible to limit or avoid further use of other animals without compromising animal welfare.

Equipment
The equipment used for imaging the murine (rat) samples was a CALLISTO from Thorlabs (CAL110C1) [52] spectral domain system with central wavelength 930 nm, field of view of 6 × 6 mm², 7 µm axial resolution, 4 µm lateral resolution, 1.7 mm measurement in depth, 107 dB sensitivity at 1.2 kHz measurement speed, and 7.5 mm working distance. Samples were scanned using the high-resolution scan lens (18 mm focal length) and a standard probe head with a rigid scanner for stable and easy-to-operate setup.

Sample Acquisition Procedure
Rats were acclimatized before surgery in individually housed cages at 22-25 °C with food and water ad libitum. All surgical procedures were performed under general inhalation anesthesia [53][54][55]: the animals were placed in an induction chamber to administer sevoflurane 6-8% in oxygen with a high flow of fresh gas (1 L/min). Then, they were connected to a face mask to continue the administration of sevoflurane (3-3.5%) in oxygen (300 mL/min) and placed in dorsal decubitus to carry out the endoscopic procedure. Atropine (0.05 mg/kg), meloxicam (1 mg/kg/24 h), and pethidine (10-20 mg/kg) were injected subcutaneously before beginning the surgical procedure. A thermal blanket was used throughout the procedure. Once the animals had reached the appropriate surgical plane, a colonoscopy was performed to rule out the presence of abnormalities that could interfere with the study. The aim was to locate all lesions visible under white light using a rigid cystourethroscope of 2.9 mm in diameter, which reached a diameter of up to 5 mm when working with an intermediate sheath and an external sheath (a size appropriate for this animal model), so as not to damage these structures at the start of the procedure. After shaving the abdomen and preparing the area with povidone-iodine and 70% ethanol, animals were covered with an open sterile cloth. Then, an average laparotomy of 4-5 cm in length was performed. A retraction device with hooks (Lonestar®) was used as a support tool to make this section circular and externalize all the necessary intestinal content outside the abdomen. Animals were kept at constant temperature by means of successive peritoneal washes with tempered serum. Then, the block of the colon was fixed with a suture to prevent the reversion of the content throughout the colon and cecum.
Three areas (ascending colon, transverse colon, and descending colon) were studied consecutively taking advantage of the anatomical division of the colon. They were divided with the help of ligatures (silk 4/0) through the mesentery of each portion and scanned in the proximal to distal direction making use of the rigid cystoscope to check the number of polyps.
At each point with lesions, a disposable bulldog clamp was used to mark the distribution of the lesions, thus avoiding cutting the lesions in the next procedure of colostomy of ascending and transverse portions. After that, the colon was extracted in block and the animals were then euthanized under general inhalation anesthesia by rapid intracardiac injection of potassium chloride (KCl) (2 mEq/kg, KCl 2M), according to the ethical committee recommendations. The colon was opened by a longitudinal colotomy with scissors to eliminate its tube shape, thus exposing the mucosa with the localized polyps to improve their visualization, handling, and analysis. At this point, magnification was provided by a STORZ VITOM® HD to better locate the lesions on the extended organ.
For each localized lesion, a sample was extracted for later ex vivo analysis with the OCT equipment. Instead of acquiring the images directly on the fresh sample after resection, samples were fixed and then preserved for several further analyses while maintaining the properties of the tissue. Based on [56], the fixation procedure for each sample consisted of the immersion of the sample in 4% formaldehyde for at least 14 h at about 4 °C. Then, after two washes with phosphate buffered saline 0.01 M (PBS) each 30 min, the sample was submerged in PBS and 0.1% of sodium azide and stored in refrigeration at 4 °C. This method was established to provide safer handling of samples, avoiding the adverse effects of manipulating formaldehyde-embedded samples in a surgical environment. Additionally, it was checked with histopathological analysis that this fixation procedure did not alter the properties of the tissue, showing no noticeable differences from fresh tissue.

Image Acquisition Protocol
First, each sample was placed on a plate, secured, and fixed for the correct exposure of the tissue. Once placed on the platform under the OCT probe, a B-scan of the sample was acquired for further calibration of the equipment. While scanning, the sample was brought into focus by moving the OCT probe closer. Fine focusing makes it possible to acquire a high-quality OCT signal with the best penetration depth. Due to the anatomical differences between samples, this step had to be repeated for each new sample. Once the sample was properly focused and the 2D signal quality optimized, the next step was the acquisition of a C-scan of the sample. In this case, the software allowed drawing a rectangle (Figure 1) indicating where to perform the 3D acquisition on the sample. Where appropriate, several 3D scans covering different parts of the lesion were recorded for the same lesion.

Dataset Summary
The database consists of healthy, hyperplastic, and neoplastic (adenomatous and adenocarcinoma) samples. Following the previously described acquisition procedure, the subsequent number of cases were included in the database for each tissue type: 10 healthy samples with 48 C-scans, 13 hyperplastic samples with 53 C-scans, and 75 neoplastic samples with 245 C-scans. As a result, the database contains a total of 94,687 B-scan images.
The database was visually inspected before training the model ignoring all C-scans or B-scan images acquired with errors, large aberrations, or artifacts to ensure the quality of the data. Note that this database is a preliminary version of an ongoing larger dataset that will be made openly available. Access to the database used in this article is possible upon request to the corresponding author.

Ethical Considerations
Ethical approvals for murine (rat) samples acquisition were obtained from the relevant Ethics Committees. In case of research with animals, it was approved by the Ethical Committee of animal experimentation of the Jesús Usón Minimally Invasive Surgery Centre (Number: ES 100370001499) and was in accordance with the welfare standards of the regional government which are based on European regulations.

Deep Learning Architecture
The proposed architecture was based on the Xception classification model [57] previously trained over the ImageNet dataset [58]. Then, a global average pooling layer and a final layer with 2 neurons and softmax activation were added, representing the classification classes: benign vs. malignant. A schematic view of the architecture, generated with a visual grammar tool [59], is provided in Figure 2.
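The head added on top of the backbone can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature-map shape (10 × 10 × 2048), the random weights, and the function name `classification_head` are assumptions chosen only to show how global average pooling, a 2-neuron dense layer, and softmax combine into benign vs. malignant probabilities.

```python
import numpy as np

def classification_head(features, weights, bias):
    """Global average pooling + dense layer (2 outputs) + softmax.

    features: (batch, height, width, channels) maps from the backbone.
    weights:  (channels, 2) dense-layer weights (placeholders here).
    bias:     (2,) dense-layer bias.
    """
    pooled = features.mean(axis=(1, 2))                   # (batch, channels)
    logits = pooled @ weights + bias                      # (batch, 2)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)           # rows sum to 1

rng = np.random.default_rng(0)
# Assumed backbone output shape; random values stand in for real features.
features = rng.normal(size=(4, 10, 10, 2048))
weights = rng.normal(scale=0.01, size=(2048, 2))
bias = np.zeros(2)

probs = classification_head(features, weights, bias)
print(probs.shape)  # one (benign, malignant) probability pair per image
```

In a deep learning framework the same head would be trained jointly with the pre-trained backbone; the sketch only makes the tensor shapes explicit.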

This pre-trained network accepts images of 299 × 299 pixels, which are randomly sampled from the original OCT images as detailed in the next section, "Data Preparation and Augmentation". The OCT images in the database (B-scan images) have variable lateral sizes in the range 512-2000 pixels due to differences in the sizes of the polyps and the scanning area selected. For this reason, B-scan images were pre-processed to extract regions of interest of smaller size (299 × 299 pixels), making the most of the images and avoiding the loss of structural lesion features that rescaling the larger images would cause. Directly rescaling the whole image would be comparable to reducing the lateral and axial resolution of the images, and hence losing information about the smaller structures. The proposed data preparation approach also serves as a data augmentation strategy. Moreover, a strategy for dealing with data imbalance in the dataset was adopted.

Data Preparation and Augmentation
As a data augmentation strategy, during the training process, the algorithm processes the dataset images in the following manner: image pre-processing; air-tissue delimitation; random selection of region of interest (ROI); ROI extraction; and ROI preparation. These steps are illustrated in Figure 3.

1. Image pre-processing
The OCT gray scale original image contains one single channel that is duplicated to generate the 3-channel image expected by the network to use the ImageNet pre-trained weights. As an additional data augmentation strategy, the image is randomly flipped horizontally to produce alternative input images. No additional geometric transformations are applied to the image, as this would alter the structural features of the lesion and lead to misclassification.

2. Air-tissue delimitation
The aim of this step is to automatically detect the boundary between the air and the tissue in the image. The final goal of this operation is to obtain ROI images adjusted to the tissue, so that the noise present in the air region and the differences in the distance from the scanning tip to the tissue across the database images do not provide ambiguous information to the network. At the same time, the shape of the lesion is preserved and flattening is discarded, as the shape could be a clinically interesting feature for differentiating the lesion's diagnostic nature.
This step was implemented with the following sub-steps: automatic calculation of the Otsu threshold [60] to differentiate between the air and the tissue regions; binary mask generation by applying the calculated Otsu threshold to the image; a morphological operation to remove small objects from the binary mask; then, for each column in the mask image, extraction of the location (row) of the first positive (true) value, if available, to obtain a 1D array containing the delimitation path; and application of a median filter (kernel size = 69) to the delimitation array to eliminate or smooth possible noise in the signal.
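The sub-steps above can be sketched with NumPy alone. This is an illustrative approximation rather than the authors' code: the helper names are hypothetical, the morphological removal of small objects is omitted for brevity (the median filter absorbs isolated speckles in this toy case), and filling columns without tissue with the median of the detected rows is an assumption, since the text does not say how such columns are handled.

```python
import numpy as np

def otsu_threshold(image, bins=256):
    """Threshold maximizing the between-class variance of the histogram."""
    hist, edges = np.histogram(image, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                  # pixels in the lower class
    w1 = w0[-1] - w0                      # pixels in the upper class
    s0 = np.cumsum(hist * centers)
    mu0 = s0 / np.maximum(w0, 1)
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return edges[np.argmax(between) + 1]  # right edge of the best split bin

def median_filter_1d(signal, kernel_size=69):
    """Sliding median with edge padding; output has the input length."""
    half = kernel_size // 2
    padded = np.pad(signal, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel_size)
    return np.median(windows, axis=1)

def air_tissue_boundary(image, kernel_size=69):
    """Row index of the air-tissue boundary for every column (A-scan)."""
    mask = image > otsu_threshold(image)
    has_tissue = mask.any(axis=0)
    rows = mask.argmax(axis=0).astype(float)   # first True row per column
    if has_tissue.any():
        # Assumption: columns with no tissue inherit the median boundary.
        rows[~has_tissue] = np.median(rows[has_tissue])
    return np.rint(median_filter_1d(rows, kernel_size)).astype(int)

# Synthetic B-scan: dark air above row 50, brighter tissue below.
rng = np.random.default_rng(0)
image = rng.uniform(0.0, 0.2, size=(200, 300))
image[50:, :] = rng.uniform(0.6, 1.0, size=(150, 300))
image[10, 5] = 0.9                             # isolated speckle in the air
boundary = air_tissue_boundary(image)
```

On real OCT data the small-object removal step described in the text would normally precede the column-wise search, because speckle clusters wider than the median kernel would otherwise distort the boundary.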

3. Random selection of region of interest
Considering that the total width of the input image (number of A-scans) is highly variable for the different images of the dataset due to the sample size and scanning conditions, a random number indicating where to start the region of interest is calculated. A preliminary sub-image (column) is obtained considering a width of 512 px for the region of interest.

4. ROI extraction
The values of the delimitation array are applied to the previously extracted sub-image to adjust the tissue at the top, generating a ROI of 512 px width and 224 px depth, which is equivalent to approximately 0.71 mm in width and 0.75 mm in depth considering the optics of the device. Preliminary experiments with smaller widths or larger depths reported worse results. Smaller ROIs reduce the retained information, worsening feature extraction and classification performance, so it is important to reach a compromise between both aspects.

5. ROI preparation (post-processing)
The extracted ROIs are resized to 299 px width and 299 px depth to match the default input size of the network (pre-trained with ImageNet).
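The random ROI selection, tissue-aligned extraction, resizing, and channel duplication can be combined in one short sketch. Several details are assumptions not stated in the text: the function name `random_roi`, nearest-neighbour resizing, zero-padding when the ROI runs past the bottom of the image, a 299 × 299 network input, and applying the horizontal flip to the ROI rather than to the full image (the two are equivalent for this purpose).

```python
import numpy as np

def random_roi(image, boundary, rng, width=512, depth=224, out_size=299):
    """Sample one tissue-aligned ROI and return it as a 3-channel image."""
    total_width = image.shape[1]
    # Random lateral start so the 512-px window fits inside the B-scan.
    start = int(rng.integers(0, total_width - width + 1))

    # Shift every column up so the air-tissue boundary sits at row 0.
    roi = np.zeros((depth, width), dtype=image.dtype)
    for j in range(width):
        top = int(boundary[start + j])
        column = image[top:top + depth, start + j]
        roi[:len(column), j] = column     # assumed zero-padding at the bottom

    # Random horizontal flip as additional augmentation.
    if rng.random() < 0.5:
        roi = roi[:, ::-1]

    # Assumed nearest-neighbour resize to the network input size.
    r = np.arange(out_size) * depth // out_size
    c = np.arange(out_size) * width // out_size
    resized = roi[np.ix_(r, c)]

    # Duplicate the single gray channel into three identical channels.
    return np.repeat(resized[..., None], 3, axis=-1)

rng = np.random.default_rng(0)
bscan = rng.random((512, 800))            # placeholder B-scan
boundary = np.full(800, 40)               # placeholder boundary path
roi = random_roi(bscan, boundary, rng)
print(roi.shape)
```

Because the start column and the flip are drawn anew on every call, repeated calls on the same B-scan yield different inputs, which is what allows this preparation step to double as data augmentation.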

Data Imbalance Management
This work aims at differentiating benign samples, including healthy tissue and hyperplastic polyps, from malignant/neoplastic samples, including adenomatous and adenocarcinomatous samples. Unfortunately, in our dataset, healthy and hyperplastic samples are underrepresented with respect to neoplastic samples. Data imbalance is a common problem and there is currently no single best strategy for dealing with it, as the choice mostly depends on the problem to solve and on the data characteristics. In this work, a resampling strategy was implemented. This strategy was preferred to class weight compensation, where weights for each class are calculated and specified when fitting the network, because in the authors' experience it provides better results.
Resampling is a classical strategy for dealing with data imbalance. Over-sampling means adding more samples to the minority class, whereas under-sampling means removing samples from the majority class. Both can be achieved following different strategies, each with its own weaknesses. The simplest way is to randomly duplicate or remove samples.
In this work, we implemented an over-sampling strategy by adding new samples for the minority class. However, these new samples were not exact copies of original data, as small variations were introduced to create a diverse set of samples. As described in the previous section, dataset images were manipulated for randomly obtaining ROIs (see Figure 3), that in addition were randomly horizontally flipped, which allowed introducing this variability in the training and validation set.
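A minimal sketch of index-based over-sampling is given here, under the assumption that balancing is done by drawing extra indices from the minority class with replacement; the function name `oversample_indices` and the level at which samples are counted are illustrative only. Each repeated index still yields a different network input once the random ROI selection and flip are applied at load time.

```python
import numpy as np

def oversample_indices(labels, rng):
    """Return shuffled sample indices with every class equally represented."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()                      # size of the majority class
    chosen = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Draw the missing samples with replacement from the same class.
        extra = rng.choice(idx, size=target - count, replace=True)
        chosen.append(np.concatenate([idx, extra]))
    balanced = np.concatenate(chosen)
    rng.shuffle(balanced)
    return balanced

# Toy labels using the C-scan counts of the dataset summary:
# 48 healthy + 53 hyperplastic = 101 benign (0), 245 neoplastic (1).
labels = np.array([0] * 101 + [1] * 245)
rng = np.random.default_rng(0)
balanced = oversample_indices(labels, rng)
print(np.bincount(labels[balanced]))
```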

Training Process
The implemented network was based on an Xception model [57], to which a global average pooling layer followed by a dense layer (with two outputs and softmax activation) was added at the end to deal with the 2-class problem (benign vs. malignant). Pre-trained ImageNet weights were used [58].
Categorical cross entropy loss was minimized with an Adam optimizer using a learning rate of 0.0001. The model was trained with a batch size of 24 for up to 100 epochs, and the validation loss was monitored for early stopping (with patience 20). The training process was repeated 6 times over different data splits to make sure that the provided results were not biased.
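The early-stopping rule can be made concrete with a small pure-Python sketch. It follows the common patience convention (stop once the validation loss has not improved for `patience` consecutive epochs), which is assumed here rather than taken from the authors' code; the configuration dictionary simply restates the hyperparameters given in the text, and the loss curve is invented for illustration.

```python
# Hyperparameters as stated in the text; the dictionary itself is illustrative.
TRAINING_CONFIG = {
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "loss": "categorical_crossentropy",
    "batch_size": 24,
    "max_epochs": 100,
    "early_stopping_patience": 20,
}

def early_stopping_epoch(val_losses, patience=20):
    """Return (last epoch run, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch       # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch  # ran through every epoch

# Invented validation curve: improves for three epochs, then plateaus.
val_losses = [1.0, 0.8, 0.6] + [0.7] * 50
stopped_at, best_at = early_stopping_epoch(
    val_losses, TRAINING_CONFIG["early_stopping_patience"]
)
print(stopped_at, best_at)
```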

Data Evaluation and Test-Time Augmentation
As described before, OCT C-scans were acquired from murine (rat) polyp samples and adjacent healthy tissue. The C-scans are 3D volumes that consist of consecutive and adjacent B-scan images. For some of the polyps, several C-scans covering different parts of the lesion (upper, center, and bottom) were obtained and included in the same data split. As one of the aims of this work was to study the diagnosis capacity and limitations of OCT in more detail, the evaluation of the model was designed with the intention of comparing the discrimination capacity of the individual B-scans classification with respect to C-scans.
A test time augmentation (TTA) strategy was applied to B-scan and C-scan evaluation. This was implemented by performing 10 augmentations over the data following the random ROI extraction strategy previously described (see Figure 3) and then calculating the mean prediction. By applying this strategy, we estimated a richer posterior probability distribution function of the prediction for the bigger (wider) B-scans. We present a comparison of the results without TTA (called standard) and with TTA to facilitate studying how this technique contributed to the proposed approach.
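Test-time augmentation as described reduces to averaging the predictions over several randomly extracted ROIs of the same B-scan. In the sketch below, `crop_roi` and `toy_model` are hypothetical stand-ins for the ROI extraction pipeline and the trained network; only the averaging over 10 augmentations comes from the text.

```python
import numpy as np

def tta_predict(predict_fn, sample_roi_fn, bscan, rng, n_augmentations=10):
    """Average class probabilities over randomly sampled ROIs of one B-scan."""
    rois = np.stack([sample_roi_fn(bscan, rng) for _ in range(n_augmentations)])
    probs = predict_fn(rois)               # (n_augmentations, 2)
    return probs.mean(axis=0)              # mean prediction, shape (2,)

def crop_roi(bscan, rng, width=512):
    """Stand-in ROI sampler: random lateral crop of fixed width."""
    start = int(rng.integers(0, bscan.shape[1] - width + 1))
    return bscan[:, start:start + width]

def toy_model(rois):
    """Stand-in classifier mapping mean intensity to two probabilities."""
    score = rois.mean(axis=(1, 2))
    p_malignant = 1.0 / (1.0 + np.exp(-(score - 0.5)))
    return np.stack([1.0 - p_malignant, p_malignant], axis=1)

rng = np.random.default_rng(0)
bscan = rng.random((224, 1500))            # placeholder wide B-scan
mean_probs = tta_predict(toy_model, crop_roi, bscan, rng)
print(mean_probs)
```

For wide B-scans a single 512-px ROI covers only part of the lesion, so averaging over several random ROIs samples more of the image before the decision is made.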

OCT and H&E Histology Comparative Analysis
Before performing the analysis, it was important to consider that some anatomical differences exist between human colon and murine colon structure. According to [61], in human and rats species, the colon maintains the same mural structure as the rest of the gastrointestinal tract: mucosa, submucosa, and inner circular and outer longitudinal tunica muscularis and serosa. The mucosa and submucosa layers in mice are relative thin in comparison with the human ones. Furthermore, human mucosa has transverse folds through the entire colon, whereas in mice it varies for each part of the colon. At the cecum and proximal colon, mouse mucosa has transverse folds, in the mid colon is flat, and in the distal colon has longitudinal folds. However, in both species the mucosa is composed of tubular glands. Taking this into account and considering that the database used in this work consists of murine (rat) samples, it was expected that the model also learn these anatomical differences present in the mucosa, especially for the healthy samples. A detailed comparison of the anatomical differences (extracted from reference [61]) is provided in Table A1.
According to previous studies analyzing features on OCT images [18][19][20][21], in normal tissue, well-defined layers can be visualized with uniform intensity. In the presence of hyperplasia, thickening of the mucosa layer occurs, but the intensity is similar to that of healthy tissue and the tissue layers are still visible. In the case of adenomatous polyps, however, both thickening of the mucosa and reduced intensity should be observed. Finally, adenocarcinomatous lesions should show blurred boundaries and non-uniform intensity. In the presence of large polyps, the disappearance of the boundaries should be clearly observed, regardless of the nature of the lesion.
Visual inspection of the dataset images was performed to look for the features previously mentioned. Figures 4 and 5 provide a detailed analysis of the visible features on the OCT images (of the Figure 1 samples) with respect to the histopathological hematoxylin-eosin (H&E) images annotated by a pathologist (scanned at 5x). Regions of interest (with the same FOV as the OCT images, in mm) were extracted from the H&E slide images and later rescaled to fit the axial and lateral resolution of the OCT images for better comparison. These figures show that the main features present in the H&E images can also be observed in the OCT images. On the one hand, Figure 4, representing healthy tissue, illustrates (as indicated by arrows and manual segmentation lines on the B-scans on the left, Figure 4A,B) that the mucosa layers can be very clearly observed, confirming what has been reported in previous studies. The muscularis mucosae and submucosa layers are also observed, although clear differentiation in all parts of the image is more difficult. On the other hand, Figure 5, containing neoplastic lesions, confirms that the boundaries of the layers have totally disappeared, making it impossible to distinguish among them. Differences in the noise pattern are also observed. In addition, as indicated using circles and arrows on the B-scans (Figure 5A,B), new underlying structures appear in the mucosa and can be identified as bright spots or dark areas in the images. These new structures (in comparison with healthy tissue) are also clearly observed in the corresponding annotated histopathology images (Figure 5C,D), where the pathologist identified cystic crypts (CC), which appear as dark spots in the B-scan, and tumoral gland (TG) clusters, which appear as bright spots.

Dataset Partitioning and Testing
The dataset was split such that 80% was dedicated to training, 10% to validation, and 10% to testing. It was ensured that images coming from the same lesion (both B-scans and C-scans) were included in only one of the sets. The animal models employed in the creation of the database were genetically modified replicas of one specimen; hence, no separation per specimen was necessary in the splitting, and lesions could be considered independently.

The model was tested on 6 different folds to ensure that the reported evaluation metrics were not biased by one random dataset split. A random state seed parameter was established for each fold to obtain different training, validation, and testing sets each time.
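A lesion-level partition of this kind can be sketched as follows. This is an illustrative assumption rather than the code used in this work: the 80/10/10 fractions are applied here to the list of lesions (so the resulting image proportions are only approximate), and the lesion identifiers, the seeds 0 to 5, and the function `split_lesions` are hypothetical.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def split_lesions(
    lesion_ids: Sequence[Hashable],
    seed: int,
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
) -> Tuple[List[Hashable], List[Hashable], List[Hashable]]:
    """Shuffle the unique lesions with a fixed seed and cut them into
    training, validation, and testing groups. Every image of a lesion
    follows its lesion, so no lesion appears in more than one subset."""
    unique_ids = sorted(set(lesion_ids))  # sorted for reproducibility
    random.Random(seed).shuffle(unique_ids)
    n_train = round(fractions[0] * len(unique_ids))
    n_val = round(fractions[1] * len(unique_ids))
    train = unique_ids[:n_train]
    val = unique_ids[n_train:n_train + n_val]
    test = unique_ids[n_train + n_val:]  # remainder goes to testing
    return train, val, test


if __name__ == "__main__":
    # Hypothetical lesion identifiers; one seed per fold (6 folds).
    lesions = [f"lesion_{i:02d}" for i in range(20)]
    for fold_seed in range(6):
        train, val, test = split_lesions(lesions, seed=fold_seed)
        print(fold_seed, len(train), len(val), len(test))
```

Fixing the seed per fold makes each split reproducible, while changing it from fold to fold yields different training, validation, and testing sets.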

Performance Metrics and Evaluation
Given that both B-scan and C-scan data were available for the murine (rat) samples in the database, the clinical discrimination capability of the model in differentiating benign from malignant polyps was calculated for both types of data. To evaluate each C-scan, the mean of the individual predictions for the B-scan images that form the volume was calculated. The performance of the model was measured using the conditions provided by the confusion matrix (see Table 1), interpreted here in the clinical context of benign versus malignant classification. The metrics employed to measure the model performance based on these conditions are described below. The desired value for these metrics is as close as possible to 1, with 1 meaning a perfect test. Additionally, as accuracy (the fraction of samples correctly classified in the expected class) is a misleading metric in imbalanced datasets, the balanced accuracy was calculated. This metric normalizes true positive and true negative predictions by the number of positive and negative samples, respectively, and then divides the sum by two, providing an accuracy value in which both classes carry the same weight.

• Balanced accuracy (BAC). Measures the proportion of samples that were correctly classified in the expected class, taking class frequencies into account. BAC = (TPR + TNR)/2 = (Sensitivity + Specificity)/2.
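For illustration, the metrics named in this work (sensitivity, specificity, PPV, NPV, and BAC) can be computed from the confusion matrix counts with the standard definitions, as in the minimal Python sketch below. The function names are hypothetical, and encoding the positive class as label 1 is an assumption made only for this example.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN); label 1 is the positive class (assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def _ratio(numerator: int, denominator: int) -> float:
    """Guarded division; undefined ratios are reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def clinical_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    sensitivity = _ratio(tp, tp + fn)  # true positive rate (TPR)
    specificity = _ratio(tn, tn + fp)  # true negative rate (TNR)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": _ratio(tp, tp + fp),  # positive predictive value
        "npv": _ratio(tn, tn + fn),  # negative predictive value
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "bac": (sensitivity + specificity) / 2,
    }


if __name__ == "__main__":
    # Toy imbalanced example: 4 positive and 2 negative samples.
    y_true = [1, 1, 1, 1, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 1]
    print(clinical_metrics(*confusion_counts(y_true, y_pred)))
```

In this toy example the plain accuracy (4/6) is higher than the BAC (0.625), because the majority class dominates the former while both classes weigh equally in the latter.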

Thresholds
Considering the prediction values provided by the model, the threshold that maximizes the BAC (in the range 0-1) was calculated over the validation subset of each fold split both for the B-scan and C-scan data. Then, this threshold was applied over the test subset of each fold split to calculate the metrics of the model (BAC, sensitivity, specificity, PPV, and NPV).
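The threshold selection can be sketched as a simple grid search, shown below under stated assumptions: a grid of 101 candidate thresholds in the range 0-1, ties resolved in favor of the lowest threshold, and a score greater than or equal to the threshold counted as a positive prediction. None of these details are specified in this work, and the function names and toy scores are hypothetical.

```python
from typing import List, Sequence


def balanced_accuracy(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """BAC = (sensitivity + specificity) / 2 for a given decision threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are needed to compute the BAC")
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
    return (tp / n_pos + tn / n_neg) / 2


def best_threshold(y_true: Sequence[int], scores: Sequence[float], n_steps: int = 101) -> float:
    """Return the grid threshold in [0, 1] that maximizes the BAC.
    max() keeps the first maximum, i.e. the lowest such threshold."""
    candidates: List[float] = [i / (n_steps - 1) for i in range(n_steps)]
    return max(candidates, key=lambda t: balanced_accuracy(y_true, scores, t))


if __name__ == "__main__":
    # Toy validation and test subsets of one fold (hypothetical values).
    val_true, val_scores = [0, 0, 1, 1], [0.1, 0.4, 0.6, 0.9]
    test_true, test_scores = [0, 1, 1], [0.3, 0.7, 0.45]

    threshold = best_threshold(val_true, val_scores)  # chosen on validation only
    print(threshold, balanced_accuracy(test_true, test_scores, threshold))
```

Choosing the threshold on the validation subset and only then applying it to the test subset keeps the test metrics free of any tuning on test data.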

Classification Results
The evaluation of the model was performed on 6 folds, over different training, validation, and testing splits of the dataset each time, with the aim of obtaining a model ensemble. As a result, the mean and standard deviation (std) were calculated for each of the selected metrics. Table 2 provides a summary of the results, where the first number reports the mean and the second the std (mean ± std). In this summary, the results obtained with B-scan and C-scan images, with standard and TTA test split evaluation, are included for comparison. The complete list of results for each fold is included in Table A2 at the end of the document. Additionally, a graph illustrating a fair comparison of the fold results following the sum of ranking differences (SRD) method [62] is provided in Figure A1. After calculating the SRD coefficients for each of the options on the different folds, a graph comparing the performance of the different options can be generated. The smaller the SRD value, the closer to the reference value, meaning better performance.

Discussion and Conclusions
In general terms, and considering the mean results reported in Table 2, when using the standard evaluation technique, the prediction over C-scan volumes was slightly better than the prediction over individual B-scan images. This impression is confirmed by the SRD analysis (Figure A1), where smaller values were obtained for the C-scan analysis. This result makes sense: when the lesion is evaluated volumetrically (C-scan), considering the mean prediction of all the B-scan images contained in the C-scan, a bad prediction is less likely. If the volume contains some individual B-scans with poor information representing the class sample, the (expected) bad predictions do not have great influence on the final diagnosis. In any case, the small differences in the prediction metrics suggest the high quality of the database used in this study, as shown in the detailed results for each fold provided in Table A2.
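The averaging argument can be made concrete with a small numerical sketch. The B-scan scores and the 0.5 decision threshold below are hypothetical values chosen only for illustration; they are not results of this work.

```python
from statistics import mean
from typing import Sequence


def cscan_prediction(bscan_scores: Sequence[float]) -> float:
    """C-scan score: mean of the predictions of its B-scan images."""
    return mean(bscan_scores)


if __name__ == "__main__":
    threshold = 0.5  # hypothetical decision threshold
    # Hypothetical volume of a positive sample: 8 informative B-scans
    # and 2 B-scans with poor information that receive a low score.
    scores = [0.9] * 8 + [0.2] * 2

    missed_bscans = sum(1 for s in scores if s < threshold)
    volume_score = cscan_prediction(scores)

    print(missed_bscans)              # B-scans misclassified individually
    print(volume_score >= threshold)  # decision for the whole volume
```

Here two of the ten B-scans would be misclassified on their own, while the mean score of the volume (about 0.76) remains above the threshold.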
It can also be observed that the TTA evaluation technique slightly benefitted the prediction over individual B-scan images in terms of sensitivity and specificity, but not the C-scan volume prediction. These results make sense for two reasons: the data preparation strategy and the volumetric evaluation of the lesion. On the one hand, due to the nature of the images, no geometrical transformations were applied for data augmentation, as described in the data preparation section; instead, ROIs at different locations of the image were extracted. Depending on the location of the extracted ROIs, the clinical features can be more or less representative of the lesion, affecting the corresponding prediction. When TTA was applied, different ROIs from the B-scan were extracted, allowing analysis of the overall sample in width, and hence a better prediction was obtained. This is particularly beneficial in the case of large, wide B-scan images, as it allows the different parts of the tissue or lesion to be analyzed in detail. Considering this, and although no improvement was observed in the C-scan evaluation, the TTA strategy was preferred during the evaluation, since in this way the intrinsic clinical variability of the lesions was captured and hence the model prediction was more robust.
Interpretation of new imaging techniques, such as OCT, can be complicated at the beginning, which can prevent their adoption in clinical practice. However, advanced image processing techniques, such as deep learning, can be used to facilitate automatic image analysis or diagnosis and the development of optical biopsy. A previous work [46] proposed using a pattern recognition network that requires prior manual annotation of the dataset and in which the diagnosis depends on whether the expected pattern is found on the image. Alternatively, this work proposes using a classification strategy, which can help in the identification of subtle clinical characteristics on the images and is not biased by dataset annotations. This work investigates the application of an Xception deep learning model for the automatic classification of colon polyps from murine (rat) samples acquired with OCT imaging. The developed database is accessible upon request and is part of a bigger database in the process of being published. A strategy for processing B-scan images and extracting regions of interest was proposed as a data augmentation strategy. A test time augmentation strategy, implemented with the aim of improving model prediction, was evaluated. In addition, this work also aims to compare the differences in the diagnostic capacity of the proposed method when evaluated using B-scan images and C-scan volumes, and for this purpose different clinical metrics were compared. The trained model was evaluated 6 times using different training, validation, and testing sets to provide an unbiased assessment of the results. In this sense, we obtained a model with mean 0.9695 (±0.0141) sensitivity and mean 0.8094 (±0.1524) specificity when diagnosis was performed over individual B-scans, and mean 0.9821 (±0.0197) sensitivity and mean 0.7865 (±0.205) specificity when diagnosis was performed over the whole C-scan volume.
Considering the future application of a deep learning method to assist clinical diagnosis with OCT, and in view of the results of this work, successful diagnosis can be achieved both on B-scan images and C-scan volumes. The evaluation of the lesion over a C-scan volume was preferred over the evaluation of an individual B-scan image, so that the prediction could be more robust. However, this will not be possible most of the time in the daily clinical routine, for example during patient colonoscopy examination, where in vivo real-time information is necessary for diagnosis and in-situ treatment decisions. In this sense, clinical procedures based on the accumulated predictions of various B-scan images could be defined to facilitate clinicians' decision-making during examination. The promising results obtained with the proposed approach suggest that the implemented deep learning-based method can identify on the OCT images the clinical features reported in previous clinical studies and, more importantly, that the amount of data and features present in the image database are enough to allow automatic classification. These results are part of ongoing work that will be further extended; however, they indicate that deep learning-based strategies seem to be the path to achieving the "optical biopsy" paradigm. Raw interpretation of new imaging modalities is difficult for clinicians, but, assisted by an image analysis method, the interpretation can be eased, and a reliable diagnosis suggestion can facilitate the adoption of the technology. Consequently, the CADx market can benefit from this progress in the short term, as the latest market forecast studies suggest. This work will be further extended and tested with a larger and more balanced version of the collected murine dataset. More sophisticated models accepting larger image sizes will be tested to check whether classification is improved.
Optical properties of the different lesions will be studied in detail with the aim of finding scattering patterns for each type of lesion. OCT volumetric (C-scan) information will also be studied in further detail to make the most of it, analyzing both the cross-sectional and en-face views.
Funding: This work was partially supported by the PICCOLO project. This project has received funding from the European Union's Horizon2020 Research and Innovation Programme under grant agreement No. 732111. The sole responsibility of this publication lies with the authors. The European Union is not responsible for any use that may be made of the information contained therein. This research has also received funding from the Basque Government's Industry Department under the ELKARTEK program's project ONKOTOOLS under agreement KK-2020/00069 and the industrial doctorate program UC-DI14 of the University of Cantabria.
Institutional Review Board Statement: Ethical approvals for murine (rat) samples acquisition were obtained from the relevant Ethics Committees. In case of research with animals, it was approved by the Ethical Committee of animal experimentation of the Jesús Usón Minimally Invasive Surgery Centre (Number: ES 100370001499) and was in accordance with the welfare standards of the regional government which are based on European regulations.

Informed Consent Statement: Not applicable.
Data Availability Statement: The dataset used in this study is available upon request. This dataset is part of a more extensive dataset that is under collection and will be made publicly available in the future.

Acknowledgments: The authors would also like to thank Ainara Egia Bizkarralegorra from Basurto University hospital (Spain) for the processing of the samples.

Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Appendix A
Table A1. Comparison of anatomical differences of human and murine colon (adapted from reference [61]).

Feature Human Rats
Anatomy of the large intestine compared macroscopically