Automatic detection of colorectal polyps using artificial intelligence techniques

Colorectal cancer (CRC) is one of the most prevalent malignant tumors in Colombia and the world. These neoplasms originate in adenomatous lesions or polyps that must be resected to prevent the disease, which can be done with a colonoscopy. It has been reported that during colonoscopy polyps are detected in 40% of men and 30% of women (hyperplastic, adenomatous, serrated, among others) and that, on average, adenomatous polyps are found in 25% of cases (the main quality indicator in colonoscopy). However, these lesions are not easy to observe due to the multiplicity of blind spots in the colon and the human error associated with the examination. Objective: to create a computational method for the automatic detection of colorectal polyps using artificial intelligence in recorded videos of real colonoscopy procedures. Methodology: public databases with colorectal polyps and a data collection built in a University Hospital were used. Initially, all the frames of the videos were normalized to reduce the high variability between databases. Subsequently, the polyp detection task was performed with a deep learning method using a convolutional neural network. This network was initialized with weights learned on millions of natural images from the ImageNet database. The weights of the network were then updated using colonoscopy images, following the fine-tuning technique.


Introduction
Colorectal cancer (CRC) is the third most frequent cancer in the world and the second leading cause of cancer death. In Colombia it is the fourth most frequent neoplasm in men and women, with incidence rates increasing every year [1,2]. Many studies conclude that CRC screening is cost-effective in a medium-risk population (a population without family history and without a medical history showing predisposition). It is known that age (≥ 50 years), dietary habits and smoking are risk factors that increase the incidence of this disease. In the general population, the risk is 5% to 6%, and this incidence increases substantially after the age of 50 years, for which reason persons 50 years of age or older are considered a medium-risk population. The degree of survival in CRC patients is directly related to the extent of the disease at the time of diagnosis. Individuals diagnosed in an advanced stage have a survival rate of 7% at 5 years, while for subjects with CRC detected in an early stage a rate of 92% has been reported [5]; for this reason, it is of great importance to detect the tumor in early stages or, even better, to detect the polyp in an adenomatous (premalignant) stage, thus preventing the disease. It is known that, with the available screening techniques (occult blood testing, colonoscopy), CRC is highly preventable in more than 90% of cases.
Multiple studies have shown that colonoscopy is the test of choice for the prevention and early detection of CRC because, as previously mentioned, it is capable of detecting the main origin of CRC such as adenomatous polyps [6][7][8][9] .
In addition to detecting cancer in its early stages, which if treated in time is completely curable, the detection of polyps is an indicator of quality in colonoscopy: it is considered that during the examination adenomatous polyps (which have a high risk of cancer) are found in 20% of women and 30% of men; that is to say, on average, adenomatous polyps should be found in 25% of all colonoscopies performed. Unfortunately, different studies have reported that around 26% of the polyps present during a colonoscopy are not detected, a very high error rate basically explained by two factors: the number of blind spots during a colonoscopy (polyps located behind the folds, loops of the colon, the quality of the preparation, among others) and the human error (lesions overlooked) associated with the procedure [10][11][12]. Multiple studies have sought to attack these two factors in order to reduce this rate of missed polyps as much as possible. Thus, accessories have been designed to find the polyps hidden behind the folds, such as the Cap, the Endocuff or even a mini-endoscope called the third eye, which seek to flatten the folds or see behind them. Additionally, it has recently been considered that the factor associated with human error is at least mitigable with the introduction of second readers (computers), a scenario in which technology and artificial intelligence are beginning to show results that can drastically improve the polyp detection rate and lower the number of undetected polyps in a gastroenterology unit.
The development of computational strategies for pattern extraction and automatic detection of colorectal polyps in colonoscopy videos is a very complex problem. Colonoscopy videos are recorded amidst a large number of noise sources that easily obscure lesions; for example, glistening on the intestinal wall produced by the light source or specular reflection, organ motility and intestinal secretion that occlude the field of view of the colonoscope, and the expertise of the specialist that influences the smoothness of the colon examination. Currently, several strategies have addressed this challenge as a classification task, using automatic machine learning techniques.
On the one hand, some authors have attempted low-level feature selection to obtain candidate polyp boundaries. Bernal et al. [13] presented a polyp appearance model that characterizes polyp valleys as concave and continuous boundaries. This characterization is used to train a classifier that obtained 0.89 sensitivity in the polyp detection task on a test set. Shin and coworkers [14] presented a strategy based on patch-based classification, using a combination of shape and color features, and obtained a sensitivity of 0.86. On the other hand, several works have used deep convolutional neural networks (CNN), a set of algorithms grouped under the term deep learning. Urban and coworkers [15] presented a convolutional network that detects polyps of different sizes in real time with a sensitivity of 0.95. However, Taha and colleagues [16] discussed some of the limitations of these works, one of them being the fact that these methods require a large amount of data to be trained. In addition, these databases are acquired under specific clinical conditions; in particular, the capture device, the scanning protocol performed by the expert and the extraction of sequences with easily visualized lesions. Although several advances have been made, there is still the challenge of formulating generalizable models to detect lesions accurately, regardless of the type of lesion, the expert's scanning method or the colonoscopy unit used.
The main objective of the present work is to create an automatic colorectal polyp detection strategy with the purpose of building a second reader to support the colon exploration process and to decrease the number of undetected lesions during a colonoscopy. In this paper, an automatic polyp classification strategy for colonoscopy video sequences is presented. This research relies on a deep learning algorithm and evaluates different convolutional network architectures. This paper is organized as follows: initially, the methodology for automatic polyp detection is presented; then, the ethical considerations surrounding this work are described; next, the experimental setup is shown along with the results of the method detecting polyps compared with the annotations of an expert; then, the discussion of this work is presented; and, finally, conclusions and future work are found.

Methodology
This paper presents a deep learning methodology to model the high variability of a colonoscopy procedure, with the purpose of performing automatic polyp detection. This task is divided into two stages: training and classification. First, a frame-by-frame preprocessing, common to both stages, is performed. Then, a convolutional neural network is trained using a large number of colonoscopy images annotated by an expert gastroenterologist (with about 20 years of experience and more than 50 thousand colonoscopies performed) into two classes: the negative class (does not contain a polyp) and the positive class (contains a polyp). The model obtained from the learning process is used to classify new images (images not used in the training process) as belonging to one of the two classes. The workflow is visualized in Figure 1 and explained below.

Acquisition and preprocessing protocol
To diminish the effect of the numerous noise sources in the acquisition process of different colonoscopes and of the physiological conditions of the colon and rectum, it is necessary to perform a frame-by-frame preprocessing of the video. First, each frame is normalized to mean 0 and standard deviation (SD) 1, in order to make the features extracted from different frames comparable. Then, since the frames have different spatial resolutions depending on the capture device, each frame is scaled down to 300 × 300 pixels, so that they all share the same capture grid.
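As a sketch, the two preprocessing steps described above (per-frame standardization and rescaling to a common 300 × 300 grid) could look as follows; the function name is illustrative and the nearest-neighbour resize is a dependency-free stand-in for whatever interpolation the actual pipeline uses (an assumption, not the paper's implementation):

```python
import numpy as np

def preprocess_frame(frame, size=300):
    """Standardize a frame to mean 0 / SD 1 and rescale it to size x size.

    Nearest-neighbour subsampling keeps this sketch dependency-free; a real
    pipeline would likely use cv2.resize or a torchvision transform.
    """
    frame = frame.astype(np.float64)
    frame = (frame - frame.mean()) / (frame.std() + 1e-8)  # mean 0, SD 1
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size  # source row for each target row
    cols = np.arange(size) * w // size  # source column for each target column
    return frame[np.ix_(rows, cols)]
```

Standardizing before resizing keeps the per-frame statistics independent of the capture resolution, so frames from different colonoscopes become comparable.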

CNN architecture
The main unit of these architectures is the neuron, which produces an output as a function of its inputs. An array of neurons forms a layer or block, and a network is composed of several elementary blocks arranged as follows: several pairs of convolutional (Figure 1C, blue box) and pooling (Figure 1C, yellow box) layers that deliver a vector of image features, followed by a set of fully connected layers (Figure 1C, green circles) that are responsible for calculating the probability that a set of features belongs to a certain class, ending with an activation layer (Figure 1C, red circles), in which the obtained probabilities are normalized and the desired binary classification is achieved. The function of these blocks is:
 Convolutional layers: Identify local features throughout the image such as shape, edge and texture patterns, vital in the description of polyps. Each layer connects a subset of neighboring image pixels, or neurons, with all nodes of the first convolutional layer. Each convolutional kernel is distinguished by the specific weights of its nodes; when operated on a specific region of the image, it provides a feature map of that region.
 Pooling layers: Reduce the computational complexity by reducing the size of the feature maps.
 Fully connected layers: Connect each of the neurons in the previous layer to each of the neurons in the next layer. The previous layer is a flat or vector representation of the obtained feature maps. The number of neurons in the last layer is determined by the number of classes to be classified. Finally, the fully connected layer provides a vote to determine whether an image belongs to a specific class.
 Activation function: Normalizes the probabilities obtained from the fully connected layers according to a specific function, yielding a probability from 0 to 1.
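The activation described above, typically the normalized exponential (softmax), can be written in a few lines (a generic sketch, not the exact implementation used in this work):

```python
import numpy as np

def softmax(logits):
    """Normalized exponential: maps raw class scores to probabilities
    in [0, 1] that sum to 1."""
    z = logits - np.max(logits)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

For the binary polyp / non-polyp case, the two output probabilities always sum to one, so the positive-class probability alone is enough for thresholding.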
A particular architecture is composed of an array of modules containing different configurations and orders of fundamental blocks explained above, and the result obtained by each neuron is known as the gradient. In this work, three highly evaluated and validated state-of-the-art architectures were used: InceptionV3, Vgg16 and ResNet50. Each of them is described below.
 InceptionV3: Consists of 48 layers with 24 million parameters. These layers are largely grouped into 11 modules, in which features are extracted at multiple levels. Each module is composed of a given configuration of convolutional and pooling layers, rectified by the rectified linear unit (ReLU) function. It ends with an activation function called normalized exponential (softmax) [17].


 Vgg16: It is organized in 16 layers for a total of 138 million parameters: 13 of the layers are convolutional, some followed by a pooling layer, plus 3 fully connected layers, ending with a normalized exponential activation function. This architecture is notable for using small 3 × 3 filters in the convolutional layers. Compared to most architectures, the computational cost of this architecture is lower [18].
 ResNet50: Consists of 50 layers with 26 million parameters. This architecture is built under the concept of residual networks. It is common in very deep architectures such as this one for the propagated gradient to vanish in the last layers. To avoid this, certain layers are trained with the residual between the gradient obtained in that layer and the gradient of a layer two positions earlier. This architecture ends with a normalized exponential activation function.

Fine-tuning training
High classification performance depends largely on the number of annotated images and the way the weights are initialized to train the CNNs. A colonoscopy video has approximately 12,000 frames, and the availability of annotated image databases is limited. Training with limited data while initializing the network weights randomly, as is generally done, results in a failed training process. To avoid this drawback, we use weights transferred (transfer learning) from networks of the same type that have been previously trained for another classification problem on natural images, with databases containing large numbers of annotated images. The reason this works is that, even though natural and colonoscopy images are different, their statistical structure is similar, as is the construction of the primitives representing the objects. Under these circumstances, networks trained to recognize objects in natural images are used as an initial condition to train these networks in the task of recognizing polyps.
The use of these weights is done through a process called fine-tuning, in which the entire pre-trained network is taken and the last fully connected layer is removed. This layer is replaced by a new one with the same number of neurons as the number of classes in the classification task (polyp / non-polyp), while the remaining layers are initialized with the weights of the pretrained network. Then, the last layer is trained first and, subsequently, the weights of the remaining layers of the network are updated in an iterative process; this methodology is known as backpropagation. Each iteration of this training is performed using a certain number of samples, or batch, of the training images. When the network has been trained with all the samples in the set, an epoch of training is completed. The number of epochs is determined by the complexity of the samples to be classified. Finally, training ends when the probability assigned to a training image is high and matches the annotated label.
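The core idea of fine-tuning, keeping pretrained feature layers frozen while first training only the new final layer, can be illustrated with a toy numerical sketch. Everything here is illustrative: a random frozen projection stands in for the ImageNet-initialized convolutional trunk, and the sizes, labels and learning rate are arbitrary assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pretrained" feature extractor: stands in for the convolutional
# trunk whose weights come from ImageNet (illustrative: 8-D inputs, 2-D features).
W_frozen = rng.normal(size=(8, 2))

def features(x):
    return np.tanh(x @ W_frozen)

# New fully connected layer for the binary polyp / non-polyp task,
# trained first while the trunk stays fixed.
w = np.zeros(2)
b = 0.0

X = rng.normal(size=(64, 8))                 # toy "images"
y = (features(X)[:, 0] > 0).astype(float)    # toy labels

def loss():
    p = 1 / (1 + np.exp(-(features(X) @ w + b)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

initial_loss = loss()
for epoch in range(200):                     # one pass over all samples = one epoch
    f = features(X)
    p = 1 / (1 + np.exp(-(f @ w + b)))       # sigmoid output of the new layer
    grad = p - y                             # cross-entropy gradient w.r.t. logits
    w -= 0.5 * f.T @ grad / len(y)           # update only the new layer's weights
    b -= 0.5 * grad.mean()
final_loss = loss()
```

In the real method, after this warm-up of the last layer, backpropagation also updates the remaining layers of the network at a small learning rate.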

Polyp detection
The trained network model is applied to a set of evaluation videos, in which each frame is classified and assigned a label: (1) for frames with a polyp present and (0) for frames without. However, there are frames with structures that resemble the appearance of a polyp, such as bubbles produced by intestinal fluids. On these frames the model presents a classification error, treating the frame as if a lesion were present. Analyzing these errors temporally, it is notable that they appear as outliers (3 to 10 frames) within a small time window (60 frames, or 2 seconds). Therefore, the classification produced by the network is temporally filtered: if at least 50% of 60 contiguous frames are classified as having no polyp present, the remaining frames in the window are filtered and assigned a new label as frames containing no polyp. Finally, a polyp is detected when the proposed method classifies an image as a frame with a polyp present, i.e., the positive class.
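A minimal sketch of this temporal filtering rule (the exact windowing used in the original implementation may differ; the function name is illustrative):

```python
def temporal_filter(labels, window=60, threshold=0.5):
    """Relabel isolated positive detections as negative.

    Within each block of `window` contiguous frames (60 frames = 2 s at
    30 fps), if at least `threshold` of the frames are negative (no polyp),
    the few positive outliers in that block are reset to the negative class.
    """
    filtered = list(labels)
    for start in range(0, len(labels), window):
        chunk = labels[start:start + window]
        if chunk.count(0) >= threshold * len(chunk):
            for i in range(start, start + len(chunk)):
                filtered[i] = 0
    return filtered
```

Windows that are mostly positive are left untouched, so sustained detections (true lesions) survive the filter while short bursts caused by bubbles or reflections are suppressed.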

Data base
The construction of the database in this work was intended to capture the highest variability of a colonoscopy procedure. To train and evaluate the proposed approach, sequences from different gastroenterology centers containing polypoid and nonpolypoid lesions of varying sizes (morphology and location in the colon), scans performed by different experts and capture equipment were collected. These databases are listed below.

ASU-Mayo Clinic Colonoscopy Video Database
This set was built in the Department of Gastroenterology at the Mayo Clinic in Arizona, USA. It consists of 20 colonoscopy sequences, divided into 10 with polyps and 10 without. The annotations were made by gastroenterology students and validated by an expert specialist. This collection has been used frequently in the state of the art and stands out as the database for the event "2015 ISBI Grand Challenge on Automatic Polyp Detection in Colonoscopy Videos" [20].

CVC-ColonDB
It is composed of 15 short sequences of different lesions, accumulating a total of 300 frames. The lesions in this collection present a high variability and difficulty of detection, as they are quite similar to healthy regions. Each picture was annotated by an expert gastroenterologist. This collection was built at the Hospital Clinic of Barcelona, Spain [13] .

CVC-ClinicDB
It consists of 29 short sequences with different lesions, gathering 612 frames annotated by an expert. This database was used as the training set of the MICCAI 2015 Sub-Challenge on Automatic Polyp Detection in Colonoscopy Videos event. This collection was built at the Hospital Clinic of Barcelona, Spain [21].

ETISLarib Polyp DB
It contains 196 images with polyps, each annotated by an expert. This database was used as the test set for the MICCAI 2015 Sub-Challenge on Automatic Polyp Detection in Colonoscopy Videos event [22].

The Kvasir Dataset
This database was collected using endoscopic equipment at Vestre Viken Health Trust (VV) in Norway. The images were annotated by one or more medical experts from VV and the Cancer Registry of Norway (CRN). The dataset consists of images with resolutions ranging from 720 × 576 up to 1,920 × 1,072 pixels [20].

HU-DB
This collection was built at the University Hospital in Bogota, containing 253 colonoscopy videos with a total of 233 lesions. Each frame of the videos was annotated by a colonoscopy expert with about 20 years of experience and more than 50,000 colonoscopies performed.
Each of these videos was captured at 30 frames per second and at a spatial resolution of 895 × 718, 574 × 480 and 583 × 457. In total, a database of 1,875 cases and a total of 48,573 frames with polyps and 74,548 frames without polyps was consolidated. Each of the frames in these videos was scored by an expert as positive if a polyp was present, or negative when no polyp was present. Table 1 summarizes the number of videos and frames per database used in this work.

Ethical considerations
The present work is in accordance with Resolution No. 008430 of 1993, which establishes the scientific, technical and administrative norms for research on humans (article 11). This project is classified as minimal risk research, since it only requires the use of digital images, which are generated from anonymized colonoscopy videos; that is, there is no way of knowing the name or identification of the subjects included in the study.

Results
The CNNs used in this work are InceptionV3, Resnet50 and Vgg16. The labels assigned by each of these networks were compared with the annotations made by the specialists in each of the databases. The following experimental setup and evaluation methodology were applied to each of the architectures.

Experimental setup
The CNNs were previously trained with images from the public ImageNet database, which contains approximately 14 million natural images. The resulting weights are used to initialize a new training process on colonoscopy frames via the fine-tuning methodology. This method updates the weights by training the network with the colonoscopy database. The update of the weights was performed over 120 epochs on the entire training set. Each epoch trained the model by taking batches of 32 frames until all frames were covered. For each of the networks, the decision threshold was adjusted manually, aiming to maintain a balance in the classification performance for both classes. The training scheme was 70% of the database for training and 30% for validation with respect to the number of cases; i.e., the data are separated from the beginning, and training, validation and test data are never mixed. In total, the networks were trained and validated with 213 cases (24,668 frames) with polyps and 36 videos (27,534 frames) without polyps. The evaluation was performed with 103 videos (23,831 frames) with polyps and 25 videos (47,013 frames) without polyps. The details of this collection are presented in Table 2.
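Separating the data "from the beginning" with respect to the number of cases means splitting at the case (video) level rather than the frame level, so frames from the same video never leak across sets. A sketch of such a split (function name and seed are illustrative assumptions):

```python
import random

def split_by_case(case_ids, train_frac=0.7, seed=42):
    """Split unique case/video identifiers into train and validation sets,
    so every frame of a given video ends up on exactly one side."""
    ids = sorted(set(case_ids))
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    cut = int(train_frac * len(ids))
    return set(ids[:cut]), set(ids[cut:])
```

Frames are then routed to a set according to their video's membership, which prevents near-duplicate frames of the same lesion from appearing in both training and validation.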

Quantitative evaluation
The proposed approach automatically detects polyps in colonoscopy videos; this task is framed as a binary classification problem. The method assigns a label to each frame: negative class (frame containing no polyp) or positive class (frame containing a polyp). To evaluate the performance of this work, the estimated or predicted label is compared with the label annotated by the expert. This comparison allows calculating the confusion matrix, which accounts for the following:
 True positives (TP): frames with a polyp correctly classified as positive class.
 True negatives (TN): frames without a polyp correctly classified as negative class.
 False positives (FP): frames without a polyp incorrectly classified as positive class.
 False negatives (FN): frames with a polyp incorrectly classified as negative class.
Using the confusion matrix, 4 classification metrics were selected and calculated that assess the performance of the method for classifying pictures with (positive class) and without (negative class) polyps independently, as well as the predictive power in both classes overall:
 Sensitivity measures the proportion of correctly classified pictures containing polyps.
 Specificity calculates the proportion of correctly classified pictures that do not contain polyps.
 Precision indicates the proportion of pictures classified as containing polyps that actually contain them, i.e., the predictive power of the method for the positive class.
 Accuracy is the rate of correctly classified pictures as a proportion of the total number of pictures.
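The four metrics above follow directly from the confusion-matrix counts; a compact sketch:

```python
def metrics(y_true, y_pred):
    """Compute sensitivity, specificity, precision and accuracy from
    binary frame labels (1 = polyp present, 0 = no polyp)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn),   # recall on frames with polyps
        "specificity": tn / (tn + fp),   # recall on frames without polyps
        "precision": tp / (tp + fp),     # reliability of positive predictions
        "accuracy": (tp + tn) / len(y_true),
    }
```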
The results obtained are presented for each of the deep learning architectures explained in the methodology section. Table 3 shows the results obtained for each architecture. On the one hand, although most of these architectures show outstanding performance in the classification task, the Resnet50 architecture presents the best metrics for detecting the positive class, or frames with polyps, obtaining a sensitivity of 0.89. On the other hand, the InceptionV3 architecture was the best at detecting the negative class, or frames without polyps, obtaining a specificity of 0.81. To evaluate the performance of these architectures in more detail, ROC (receiver operating characteristic) curves were constructed for each architecture. This representation analyzes how the models classify the images in terms of specificity and sensitivity as the decision threshold on the probabilities provided by the model is varied. As can be seen in Figure 2, the Resnet50 architecture separates the classes better regardless of the decision threshold. This indicates that this architecture was better able to generalize intra- and interclass variability.
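Each point on such a ROC curve comes from sweeping the decision threshold over the model's output probabilities; a sketch of how the points are computed (names are illustrative):

```python
def roc_points(pos_scores, neg_scores, thresholds):
    """For each threshold, compute one ROC point (1 - specificity, sensitivity).

    pos_scores / neg_scores are the model's probabilities for frames
    with / without polyps, respectively.
    """
    points = []
    for t in thresholds:
        sensitivity = sum(s >= t for s in pos_scores) / len(pos_scores)
        specificity = sum(s < t for s in neg_scores) / len(neg_scores)
        points.append((1 - specificity, sensitivity))
    return points
```

A model whose curve stays closer to the top-left corner separates the classes well at every threshold, which is the behavior reported above for Resnet50.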

Discussion
The detection of adenomatous polyps is the main quality indicator in colonoscopy, since it is a fundamental marker for CRC detection and prevention. In many countries, the quality of a gastroenterologist is measured by the proportion of colonoscopies in which these polyps are detected, which averages around 25% for experts but can be as low as 10% for inexperienced gastroenterologists, leading the latter to miss more adenomas.
Thus, several studies [10][11][12] report that 26% of polyps are not detected during colonoscopies, which may contribute to more cases of CRC; indeed, 1.8 million new cases were reported worldwide in 2018 (International Agency for Research on Cancer) [1]. This miss rate is due to several factors that affect an adequate exploration of the colon: the experience and concentration level (associated with fatigue) of the expert over a whole working day; the physiological conditions of the colon, such as blind spots in the haustra and the difficulty of steering the colonoscope due to the organ's own motility; and the patient's prior preparation of the colon, which determines how observable the colon walls are according to their level of cleanliness [23]. Most of these factors show that colonoscopy is highly dependent on the human factor, exhibiting a need for second readers that are not affected by these factors. The use of computational tools for polyp detection in clinical practice would help to corroborate the findings made by the expert and, more importantly, alert to possible lesions that the expert missed. In this way, these tools would help to decrease the rates of undetected polyps and thus decrease the incidence of CRC.
To support CRC diagnosis using computer vision tools, this challenge has been addressed as follows:  Detection, referring to the frame-by-frame binary classification of a video into positive class (with polyp) and negative class (without polyp);  Localization, as the coarse demarcation (by means of a box) of the lesion on an image containing polyp;  Segmentation, such as a fine delineation of the lesion (delineating the edge of the polyp).
Polyp detection is the first and foremost task facing the gastroenterologist. The post-detection tasks (localization and segmentation) are useful processes for the expert when he has already detected the lesion and needs to describe it morphologically, taking as a reference medical guides such as the Paris Classification [6] . This classification allows him to decide the surgical management of the disease in the short and long term. Consequently, these tasks depend entirely on how accurate the previous detection is; therefore, the proposed methodology focuses exclusively on the main task required by the expert: Obtaining colonoscopy pictures with the presence of lesions. Furthermore, in the state of the art, the works that have addressed these tasks [13][14][15] describe limitations to present a single flow covering at least two of these tasks. These papers use different methodologies for each task, as each has its own level of complexity. In general, to detect frames with polyps, contextual or global relationships are measured at the image level; while localization and segmentation analyze at the pixel level by measuring local relationships.
This paper presents a robust strategy for polyp detection, solved as a classification problem. Deep networks for classification tasks are methods that were formulated decades ago, but had not been exploited as the computational power and availability of annotated databases was limited. In the last 5 years, the use of these models has increased dramatically due to technological developments that allow a large amount of parallel processing and the publication of databases with millions of images such as ImageNet. This made it possible to design highly complex networks and train them exhaustively, so that high performance was obtained in classification tasks, since it is capable of modeling a high variability of shapes, colors and textures. However, in the medical field, a large amount of annotated public data is not available, so applying these models to disease detection or classification problems was not contemplated.
The development of transfer learning techniques provided a solution to the shortage of medical data. The weights of networks trained with millions of natural images were used to initialize a new network and train it with a much smaller amount of different data, such as colonoscopy images. State-of-the-art work using this flow demonstrates that it has the ability to adequately generalize the high variability of pictures with and without polypoid lesions in colonoscopy images extracted from a particular database. However, the different types of lesions and the typical physiological conditions of the large bowel are not the only sources of variability. The lower the expertise of the specialist, the more prone the videos are to contain noisy frames produced by occlusions and abrupt movements of the colonoscope. Additionally, capture devices vary in their light sources and camera viewing angles. Therefore, training and validating with databases obtained from a single gastroenterology service, as done in the state-of-the-art works [13][14][15] that have presented excellent results, does not cover all the variability of the colonoscopy image classification task.
Due to the above, in this work we consolidated a set of training videos with a variability not previously presented in the state of the art, by gathering sequences from different databases. The set used to train and evaluate this approach contains lesions of different sizes, positions and shapes; colonoscopy procedures and annotations performed by different expert gastroenterologists; and videos captured using different colonoscopy units. Despite such variability, this work obtains a sensitivity of 0.89 and a specificity of 0.71 in the task of detecting polyps in colonoscopy sequences.

Conclusions
Deep learning methodologies are currently a promising option for medical classification tasks. Advances in technology, together with the constant design and evaluation of networks, have allowed the consolidation of a set of high-performing methods and workflows. The results obtained with the networks evaluated in this work show that they could be routinely used as second readers in a colonoscopy service.
It is notable that these networks adequately generalize the high variability of colonoscopy videos. The results obtained demonstrate that the proposed method differentiates images with and without the presence of polyps to an outstanding degree, independently of the particular clinical protocol with which the video was recorded, that is, the expert performing the procedure and the capture device. This method could be useful to narrow the gap in adenoma detection rate between the expert gastroenterologist and the novice.
As future work, the proposed approach should be tested on full colonoscopy procedures to evaluate whether it can be implemented in real time, and a strategy should be developed that allows not only detecting the lesion but also delimiting it within the frame.