Automatic classification of esophageal disease in gastroscopic images using an efficient channel attention deep dense convolutional neural network

Abstract: The accurate diagnosis of various esophageal diseases at different stages is crucial for precise therapy planning and for improving the 5-year survival rate of esophageal cancer patients. Automatic classification of various esophageal diseases in gastroscopic images can help doctors improve diagnostic efficiency and accuracy. Existing deep learning-based classification methods can classify only a few categories of esophageal diseases at the same time. Hence, we propose a novel efficient channel attention deep dense convolutional neural network (ECA-DDCNN), which classifies esophageal gastroscopic images into four main categories: normal esophagus (NE), precancerous esophageal diseases (PEDs), early esophageal cancer (EEC) and advanced esophageal cancer (AEC), covering six common sub-categories of esophageal diseases and one normal esophagus (seven sub-categories). In total, 20,965 gastroscopic images were collected from 4,077 patients and used to train and test the proposed method. Extensive experimental results demonstrate that the proposed ECA-DDCNN outperforms the other state-of-the-art methods. The classification accuracy (Acc) of our method is 90.63% and the averaged area under the curve (AUC) is 0.9877. Compared with other state-of-the-art methods, our method performs better in classifying various esophageal diseases. In particular, for esophageal diseases with similar mucosal features, our method achieves higher true positive (TP) rates. In conclusion, the proposed classification method shows potential for diagnosing a wide variety of esophageal diseases.


Introduction
The gastroscope is an advanced diagnostic imaging tool that provides high-resolution visualization of living esophageal tissues [1,2]. It first uses a flexible optical fiber to guide light into the esophageal cavity and an image sensor, such as a charge-coupled device (CCD) or complementary metal-oxide semiconductor (CMOS), to receive the reflections from the mucous membrane in the cavity [3,4]. It then converts the light signals into electronic signals. After a series of electrical signal processing steps, gastroscopic images of the esophageal mucosa are generated [5,6]. Currently, the examination and diagnosis of upper gastrointestinal (esophageal and gastric) diseases rely mainly on images generated by the gastroscope [7].
In clinical practice, gastroscopic examination of the esophagus encounters a variety of esophageal diseases, which can be roughly summarized into four main categories: normal esophagus (NE), precancerous esophageal diseases (PEDs), early esophageal cancer (EEC) and advanced esophageal cancer (AEC). The division between EEC and AEC is based on the depth to which cancer cells invade beneath the mucous membrane [8]. The 5-year survival rate of patients with AEC is only 15-25% [9,10], whereas that of patients with EEC can be as high as 92-93% [9,11]. Therefore, the accurate classification of esophageal diseases is crucial for precise therapy planning, and especially for improving the 5-year survival rate of esophageal cancer patients [12]. However, gastroscopic diagnosis is susceptible to a variety of negative factors, such as clinicians' fatigue and lack of experience and the diverse appearance of lesions, which can lead to misdiagnosis and missed diagnosis [2,7,8,13,14]. Moreover, the mucosal characteristics in the lesion areas of some esophageal diseases are very complex. For example, the apparent mucosal characteristics of some EEC lesions are similar to those of PEDs and even AEC lesions, and are difficult to distinguish even for experienced clinicians. Computer-aided diagnosis (CAD) has been shown to help prevent many of these problems and to improve the accuracy and efficiency of the diagnosis of esophageal diseases [2,15,16].
CAD methods involve three main kinds of gastroscopic image processing tasks: classification, segmentation and object detection. The classification task gives an image-wise prediction of the lesion type and needs only image-wise labelled images for training [17-19]. The segmentation task produces a pixel-wise classification of the lesion type [20,21]. The object detection task predicts both the lesion type and its location [15,22]. Automatic classification can help doctors quickly screen out images with lesions from a huge number of gastroscopic images, discriminate different types of diseases and save a great deal of labor time, which is important in clinical practice. Traditional classification methods usually used hand-designed algorithms to extract image features and then applied a classifier such as a support vector machine (SVM) to the extracted features [23,24]. Deep learning-based classification methods have shown better classification ability than the traditional methods [2,18,25]. For example, Kumagai et al. [17] developed a classification method based on GoogLeNet, which can classify malignant and non-cancerous lesions of esophageal squamous cell carcinoma in endocytoscopic system images. Liu et al. [18] fine-tuned a pre-trained CNN to classify three gastric diseases (chronic gastritis, low grade neoplasia and early gastric cancer) on magnification narrow-band imaging images. Zhu et al. [19] also constructed a CNN-CAD system based on the same CNN framework as Liu et al. to determine the invasion depth of gastric cancer using gastroscopic images. In recent years, deep learning-based methods have performed well on the classification of multiple categories of diseases [26,27]. However, a deep learning-based classification method covering a wide variety of diseases in gastroscopic images has not previously been reported.
Long-range dependencies may easily be ignored by traditional CNNs if there is no smart mechanism to guide feature selection [28-30]. More recently, the attention mechanism has been demonstrated to offer great potential for improving the performance of deep CNNs. Attention modules have subsequently developed in two directions: enhancing feature aggregation and combining channel and spatial attention. At present, the attention mechanism has been applied to many natural image analysis tasks, such as image captioning [31], image recognition [32] and image classification [33]. There are also many successful applications of the attention mechanism in medical image analysis, such as thorax disease classification [34,35], pulmonary lesion detection [36] and skin lesion segmentation [37,38]. Therefore, the attention mechanism has the potential to improve the classification of gastroscopic images.
To better classify the four main categories (NE, PEDs, EEC and AEC), and inspired by the works in [39] and [40], this study introduces the efficient channel attention mechanism into the dense block of a densely connected CNN to construct a novel network, which we name the efficient channel attention deep dense convolutional neural network (ECA-DDCNN). The purpose of building ECA-DDCNN is to enhance the interdependencies between channels and strengthen feature propagation in the network, in order to highlight and extract the features related to the subtle differences among the various types of lesions. On this basis, an ECA-DDCNN-based method was developed to classify a wide variety of esophageal diseases. Imbalanced sample sizes among categories are a common problem in collected medical images, and the problem also occurs in our esophageal image dataset. To maintain efficient training, we also propose a novel dynamic random weighted sampling (RWS) method to balance the sample numbers among different categories.
The main contributions of this paper are as follows:
a) The ECA-DDCNN network is proposed, which enhances the interdependencies between channels and feature propagation in the network.
b) The RWS method is presented for balancing the number of gastroscopic images among different categories.
c) A classification method based on ECA-DDCNN is developed for classifying the four main categories of esophagus (seven sub-categories) from gastroscopic images, and it exhibits state-of-the-art performance. The categories of esophageal diseases that we classify range from NE to AEC, which is the most extensive coverage among the existing related methods.
The rest of this paper is organized as follows. The experimental datasets, data pre-processing method and proposed ECA-DDCNN are introduced and detailed in Section 2. The experimental results are reported in Section 3. Further discussion and a summary of the experimental results are presented in Section 4. The conclusions of this work are presented in Section 5.

Materials
The esophageal gastroscopic images used in our experiments were collected by gastroenterologists from the digestive gastroscope center of the West China Hospital in Chengdu, China. The gastroscopic images were captured using OLYMPUS GIF-Q260 and GIF-Q290 gastroscopes and saved as JPEG (Joint Photographic Experts Group) files at four resolutions: 1920×1080, 1916×1076, 768×576 and 764×572 pixels. In total, 20,965 conventional non-magnified white-light imaging gastroscopic images from 4077 patients were collected, including 1471 normal esophagus (NE) images from 296 patients, 2183 surgical scar (SC) images from 485 patients, 3377 oesophagitis (O) images from 598 patients, 5921 esophageal varices (EV) images from 1300 patients, 1945 esophageal submucous eminence (ESE) images from 368 patients, 2806 early esophageal cancer (EEC) images from 484 patients, and 3262 advanced esophageal cancer (AEC) images from 546 patients. Among these esophageal diseases, SC, O, EV and ESE belong to PEDs. The collected esophageal gastroscopic images thus cover the four main categories of esophagus: NE, PEDs, EEC and AEC. Owing to the complex mucosal surface and individual differences among patients, the collected images show high interclass similarity and high intraclass variation. Permission from the medical ethical review committees of West China Hospital and the University of Electronic Science and Technology of China, and informed consent from the patients, were obtained.
Three gastroenterologists with 5, 10 and 15 years of experience, respectively, arrived at a consensus regarding the labels of the images used in this study. All the cancerous lesions were confirmed through biopsies. Because the number of images for some categories is very small (for example, there are only 1471 images of NE and 1945 images of ESE), we divided the dataset into two groups, training and test. The gastroscopic images were randomly assigned to the two groups. Approximately 80% of the images of each esophageal disease were selected for the training group and the remainder were included in the test group. Note that the disease distributions (i.e., the ratio of each sub-category in each subset) in the training and test sets are the same. The gastroscopic images recorded from one patient appear in only one group. The statistics of the training and test groups are listed in Table 1; the training and test groups consist of 16771 and 4194 gastroscopic images in total, respectively. The total numbers of patients in the training and test groups are 3253 and 824, respectively. The overall median age of the test dataset is 56, with a wide range of 20-88, and the overall male-to-female ratio is 569/295.
Figure 1 illustrates the proposed method of classifying esophageal diseases, including data preprocessing, RWS and ECA-DDCNN. The details of each part are stated below.
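The patient-level split described in this section (roughly 80% of each disease for training, with all images of one patient kept in a single group) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `split_by_patient`, the record layout `(image_id, patient_id, label)` and the choice to split patients rather than images within each category are all hypothetical.

```python
import random
from collections import defaultdict

def split_by_patient(records, train_fraction=0.8, seed=0):
    """Split a list of (image_id, patient_id, label) tuples into train/test.

    Assumption: each patient carries a single label, so splitting patients
    per label keeps the category ratios approximately equal in both groups.
    """
    rng = random.Random(seed)
    patients_by_label = defaultdict(set)
    for _, patient_id, label in records:
        patients_by_label[label].add(patient_id)

    train_patients = set()
    for label in sorted(patients_by_label):
        patients = sorted(patients_by_label[label])
        rng.shuffle(patients)
        cut = round(train_fraction * len(patients))
        train_patients.update(patients[:cut])

    # All images of a patient follow that patient into one group only.
    train = [r for r in records if r[1] in train_patients]
    test = [r for r in records if r[1] not in train_patients]
    return train, test
```

Because the split is made over patients, the image-level ratio is only approximately 80/20 when patients contribute different numbers of images.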

Data preprocessing and random weighted sampling (RWS)
The collected raw gastroscopic images usually contain a large area of black background with some text that generally comprises the patient information, as shown in C1 of Fig. 1. This content makes no contribution to the classification task; thus, we cropped out these black background areas using a rectangular box of suitable size. As the esophageal images may have different sizes, the cropped images were uniformly resized to 224×224, as shown in C2 of Fig. 1.
Our esophageal dataset is imbalanced in the number of images among categories because the case numbers of different disease types differ in clinical diagnosis. For instance, there are only 1471 images of NE from 296 patients, whereas there are 5921 images of EV from 1300 patients. Hence, we designed a novel random weighted sampling (RWS) method to eliminate the imbalance and improve the overall classification accuracy. For each category, RWS sets a random sampling weight W_s, which is inversely proportional to the sample size, as shown in Eq. (1).
Here, W_s(i) and N_s(i) denote the random sampling weight and the sample size of the i-th category, respectively, and N refers to the number of categories. After RWS, the numbers of samples among different categories remain balanced. Figure 1 illustrates the process of RWS, in which the thickness of each colored rectangle represents the sample size of each category.
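A minimal sketch of inverse-frequency weighted sampling may help make the idea concrete. It assumes only what the text states, namely that the weight of a category is inversely proportional to its sample size; the normalisation over the N categories, the function names `rws_weights` and `rws_sample`, and drawing with replacement are assumptions made for illustration and are not necessarily the exact form of Eq. (1).

```python
import random
from collections import Counter

def rws_weights(labels):
    """Per-category weights proportional to 1 / N_s(i), normalised to sum to 1."""
    counts = Counter(labels)                      # N_s(i) for each category i
    inverse = {c: 1.0 / n for c, n in counts.items()}
    total = sum(inverse.values())                 # assumed normalisation term
    return {c: v / total for c, v in inverse.items()}

def rws_sample(labels, num_samples, seed=0):
    """Draw sample indices with replacement so that categories are balanced in expectation."""
    class_weight = rws_weights(labels)
    per_sample = [class_weight[y] for y in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=per_sample, k=num_samples)

# Toy usage with two of the category sizes reported in this paper.
labels = ["NE"] * 1471 + ["EV"] * 5921
indices = rws_sample(labels, num_samples=20000, seed=0)
drawn = Counter(labels[i] for i in indices)
```

Redrawing the indices with a new seed at every epoch would give a dynamic variant of the same scheme. In PyTorch, `torch.utils.data.WeightedRandomSampler` offers a comparable per-sample weighted sampler.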

Efficient channel attention deep dense convolutional neural network (ECA-DDCNN)
As previously mentioned, the existing deep learning networks used in the classification of upper gastrointestinal diseases lack a smart mechanism to guide feature selection. This makes it difficult for them to identify subtle differences among a variety of esophageal diseases (especially PEDs, EEC and AEC). The ECA-module [39] can adaptively recalibrate the channel-wise feature responses by explicitly modelling interdependencies between channels, while DenseNet [40] can strengthen feature propagation and encourage feature reuse. Inspired by [39] and [40], we propose a novel efficient channel attention based dense layer (ECA-DL) and densely connect several ECA-DLs into an ECA-Dense block (ECA-DB). Equipped with ECA-DBs, ECA-DDCNN classifies the four main categories of esophagus (NE, PEDs, EEC and AEC) with improved accuracy and efficiency. As shown in Fig. 1, the backbone of ECA-DDCNN consists of four ECA-DBs, three transition layers and one fully connected layer. The transition layer between each two ECA-DBs has the same structure as that of DenseNet [40]. The fully connected layer is placed at the end of ECA-DDCNN, and the number of output categories is set to N (N=7 in this study). Figure 2 shows the diagram of the proposed ECA-DL and ECA-DB. In ECA-DL (Fig. 2(a)), W is the channel-wise attention coefficient computed by the ECA-module and U refers to the feature maps extracted by the dense layer. A channel-wise multiplication of W and U weights the feature maps U by W, and the weighted feature maps are the output of ECA-DL. The dimension (number of channels) of the feature maps output by ECA-DL is called the growth rate k. In this study, the growth rate of ECA-DDCNN was set to k=48. Taking advantage of the ECA-module [39], ECA-DL avoids dimensionality reduction and captures cross-channel interaction in an efficient way.
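The channel-wise weighting of U by W in ECA-DL can be illustrated with a small pure-Python sketch. It assumes the ECA-module of [39] in its usual form (global average pooling, a 1-D convolution across neighbouring channels, then a sigmoid); the kernel values, the function names `eca_weights` and `eca_scale`, and the toy input are illustrative assumptions rather than the trained network.

```python
import math

def eca_weights(feature_maps, kernel):
    """Compute channel attention coefficients W for a list of C feature maps.

    feature_maps: list of C channels, each an H x W nested list of floats.
    kernel: odd-length list of 1-D convolution weights shared by all channels.
    """
    # Global average pooling: one scalar descriptor per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature_maps]
    half = len(kernel) // 2
    weights = []
    for c in range(len(pooled)):
        acc = 0.0
        for offset, k_w in enumerate(kernel):
            idx = c + offset - half
            if 0 <= idx < len(pooled):      # zero padding at the channel borders
                acc += k_w * pooled[idx]
        weights.append(1.0 / (1.0 + math.exp(-acc)))   # sigmoid gate in (0, 1)
    return weights

def eca_scale(feature_maps, kernel):
    """Return the feature maps U re-weighted channel by channel with W."""
    w = eca_weights(feature_maps, kernel)
    return [[[w_c * v for v in row] for row in ch]
            for w_c, ch in zip(w, feature_maps)]
```

No channel reduction occurs anywhere in this computation: the number of learnable values equals the kernel length, which is what makes this form of attention lightweight.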
Multiple ECA-DLs are densely connected to each other [40] to form an ECA-DB (Fig. 2(b)). In ECA-DDCNN (Fig. 1), ECA-DB-1 to ECA-DB-4 contain 6, 12, 36 and 24 ECA-DLs, respectively.
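As a worked example of how dense connectivity accumulates channels, the sketch below tracks the channel count through the four blocks with 6, 12, 36 and 24 layers and growth rate k=48. The 96 initial channels and the halving of channels in each transition layer are assumptions taken from the standard DenseNet-161 configuration, which shares this block layout; the text does not state them for ECA-DDCNN, and the function name is hypothetical.

```python
def dense_channel_plan(block_layers, growth_rate, init_channels, compression=0.5):
    """Return the number of output channels of each dense block.

    Every layer in a block appends `growth_rate` new feature maps to its input;
    each transition layer (between blocks) keeps `compression` of the channels.
    """
    plan = []
    channels = init_channels
    for i, n_layers in enumerate(block_layers):
        channels += n_layers * growth_rate
        plan.append(channels)
        if i < len(block_layers) - 1:          # no transition after the last block
            channels = int(channels * compression)
    return plan

# Block layout from the text; init_channels and compression are assumed.
plan = dense_channel_plan([6, 12, 36, 24], growth_rate=48, init_channels=96)
```

Under these assumptions, the final block would hand 2208 channels to the fully connected layer.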

Training details
To speed up the training process, we partly load the DenseNet model pre-trained on ImageNet. Stochastic gradient descent is selected as the optimizer of our network, with an initial learning rate of 5e-3, a momentum of 0.9 and a weight decay of 5e-3. The learning rate is halved if the averaged training loss stops decreasing for three epochs, to ensure that the network is trained efficiently at an appropriate learning rate and to speed up the training process. The input image size is 224×224, the number of training epochs is 100 and the batch size is 32. The cross-entropy loss function is used to measure the distance between the predicted and target labels during the training process.
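The learning-rate rule can be sketched as a small stand-alone class. The initial rate 5e-3 and the halving factor come from the text; interpreting "stops decreasing in three epochs" as three consecutive epochs without a new best averaged training loss is an assumption, as is the class name `HalveOnPlateau`.

```python
class HalveOnPlateau:
    """Halve the learning rate after `patience` epochs without a new best loss."""

    def __init__(self, lr=5e-3, factor=0.5, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, mean_train_loss):
        """Update the state with this epoch's averaged loss and return the learning rate."""
        if mean_train_loss < self.best:
            self.best = mean_train_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

# Toy usage: the loss stalls for three epochs, so the rate is halved once.
scheduler = HalveOnPlateau()
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.92, 0.91, 0.8]]
```

PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` provides a comparable built-in mechanism.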

Experiments and results
In order to illustrate the effectiveness and performances of our proposed method in the classification of the four main categories (seven sub-categories), we performed extensive experiments, including ablation studies, comparison experiments and generalization ability validation.
In this work, the implementation language is Python 3.6.4 and the deep learning library is PyTorch 1.0.0 (https://pytorch.org/). All the experiments were performed on a server running Ubuntu 16.04.6 LTS (GNU/Linux 4.8.0-36-generic x86_64) and equipped with four Nvidia GeForce RTX 2080 Ti graphics processing units (11 GB each).
$$\mathrm{Rec}_j = \frac{n_{jj}}{\sum_i n_{ji}} \qquad (4)$$
where i and j are category indices and n_ij indicates the number of images of the i-th category predicted as the j-th category. Similarly, n_ii and n_jj refer to the numbers of correctly predicted images of the i-th and j-th categories, respectively. Acc and F1 evaluate the comprehensive classification ability. Pr represents the precision of disease recognition. Rec represents the sensitivity to diseases.
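For readers who want to reproduce such metrics from a confusion matrix, a minimal sketch follows. The recall matches Eq. (4); the expressions used for Acc, Pr and F1 are the standard confusion-matrix definitions and are assumed here, since only Eq. (4) survives in this text. The function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(conf):
    """Compute Acc plus per-category Pr, Rec and F1.

    conf[i][j] is the number of images of true category i predicted as category j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    acc = sum(conf[i][i] for i in range(n)) / total
    metrics = []
    for j in range(n):
        predicted_as_j = sum(conf[i][j] for i in range(n))   # column sum
        truly_j = sum(conf[j])                                # row sum, as in Eq. (4)
        pr = conf[j][j] / predicted_as_j if predicted_as_j else 0.0
        rec = conf[j][j] / truly_j if truly_j else 0.0
        f1 = 2 * pr * rec / (pr + rec) if (pr + rec) else 0.0
        metrics.append({"Pr": pr, "Rec": rec, "F1": f1})
    return acc, metrics

# Toy two-category confusion matrix (illustrative numbers, not results from the paper).
acc, metrics = per_class_metrics([[8, 2], [1, 9]])
```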
Furthermore, the mean values of Pr, Rec and F1 were calculated using Eq. (6):
$$\mathrm{Mean} = \frac{1}{N}\sum_{i=1}^{N} p_i \qquad (6)$$
where p is the currently evaluated parameter, p_i is its value for the i-th category, Mean is the mean value of p, and N refers to the number of classified categories. Bootstrapping with 1000 simulated trials was used to estimate the 95% confidence interval (CI) of the evaluation metrics.
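The category-averaged metric and its bootstrap confidence interval can be sketched as follows. The text specifies 1000 trials and a 95% CI; the percentile form of the bootstrap, resampling of test images with replacement, and the function names are assumptions made for illustration.

```python
import random

def mean_recall(y_true, y_pred):
    """Average the per-category recall over the categories present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [k for k, t in enumerate(y_true) if t == c]
        recalls.append(sum(1 for k in idx if y_pred[k] == c) / len(idx))
    return sum(recalls) / len(recalls)

def bootstrap_ci(y_true, y_pred, metric, trials=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample images with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[k] for k in idx], [y_pred[k] for k in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * trials)]
    upper = stats[int((1 - alpha / 2) * trials) - 1]
    return lower, upper

# Toy usage with synthetic labels (not data from the paper).
y_true = [0] * 50 + [1] * 50
y_pred = [0] * 45 + [1] * 5 + [1] * 40 + [0] * 10
lower, upper = bootstrap_ci(y_true, y_pred, mean_recall, trials=1000, seed=0)
```

The same `bootstrap_ci` helper accepts any metric with the signature `metric(y_true, y_pred)`, so precision or F1 could be substituted.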

Ablation studies
To demonstrate the effectiveness of the proposed ECA-DB and ECA-DL, we performed ablation studies, with DenseNet set as the benchmark. Two variants, AT-ECA-DDCNN and BT-ECA-DDCNN (Fig. S1 in Supplement 1), differ in where the ECA-module is embedded relative to the transition layer. As shown in Table 2, our method achieved the best values of Mean Pr, Rec, F1 and Acc (the best values are in bold). This indicates that the overall classification performance of our method is better than that of the other three methods, which is attributed to the proposed ECA-DL and ECA-DB. We further compared the classification performance of ECA-DDCNN under two growth rates, k=32 and k=48; the evaluation metrics are shown in Table 3. As Table 3 shows, all the values of Mean Pr, Rec, F1 and Acc under k=48 are higher than those under k=32. The growth rate k refers to the amount of new information that each layer contributes to the global state, that is, the dimension of the feature maps of each ECA-DL. A larger growth rate means that more feature maps are reused. This is the main reason why the classification performance of ECA-DDCNN under the growth rate k=48 is better than that under k=32. It shows that increasing the growth rate of ECA-DDCNN is effective rather than redundant.

Comparisons of different data sampling methods
To verify the effectiveness of the proposed RWS method, we applied physical data augmentation (Aug) and RWS to balance the distribution among different categories, with the shuffle method set as the benchmark. We computed the Pr, Rec, F1 and Acc values of the three methods on the esophageal test dataset (Table 4) to evaluate their classification results. Table 4 shows that the overall classification result using RWS (ours) is better than that of the other two data sampling methods, and that the network training time is shorter than that of the other two methods. This shows that our RWS method can effectively improve both the overall classification performance and the training efficiency of the network.

Comparisons with other related state-of-the-art methods
To validate the performance of our method in classifying the four main categories (seven sub-categories), we compared our method with five other related methods on our dataset. The comparison methods include the classification method for esophageal disease proposed by Kumagai et al. [17], the classification method for gastric disease proposed by Liu et al. [19], and three of the most advanced image classification methods: SE-ResNet152 proposed by Hu et al. [30], Efficient-Net-b5 proposed by Tan et al. [41] and ECA-ResNet152 proposed by Wang et al. [39]. To ensure the best classification performance of each method and guarantee a fair comparison, the size of the input image was set to be consistent with the original network: the input size for the method of Kumagai et al. [17] was set to 229×229, and that for the other methods was set to 224×224. The other experimental conditions were the same for all the methods.
For each method, we calculated the mean values of Pr, Rec and F1, the Acc and the averaged AUC, as well as the network computational complexity. As shown in Table 5, our method achieves the highest values of Mean Pr, Rec and F1, Acc and averaged AUC, exceeding the second-best values by 1.07%, 0.76%, 0.94%, 0.73% and 0.01%, respectively. Our method has only 26.49M parameters, which is close to the lowest value of 23.13M (Kumagai et al. [17]) and less than half of the maximum value of 67.40M (Hu et al. [30]). Thus, our method is relatively lightweight and requires fewer computing resources. We also calculated the Pr, Rec and F1 values of each method on each sub-category of the test dataset (Table S1 in Supplement 1). As can be seen from Table S1, the F1 values of our method for each esophageal disease are generally higher than those of the other methods. Because F1 is a composite metric combining Pr and Rec, the overall classification ability of our method is better than that of the other five comparison methods. Our method also obtains the best Rec for four sub-categories, the best AUC for three sub-categories and the best Pr for two sub-categories. Table 5 and Table S1 together show that our method is better than the comparison methods at identifying the seven sub-categories of esophageal diseases.
In addition to the statistical comparison of evaluation metrics, we calculated the confusion matrix of each method on the sub-categories of the test dataset. The true positive (TP) records and all the possible false positive records of all methods on the test dataset for each sub-category are shown in Fig. 3. The confusion matrices in Fig. 3 show that the worst TP rates of Kumagai et al. [17] are obtained for SC and O, both at 0.86; 5% of SC is misclassified as EEC and 4% of O is misclassified as EV. The worst TP rates of Liu et al. [19] and Hu et al. [30] are both obtained for EEC, at 0.84 and 0.83, respectively. For these two methods, 9% and 8% of the EECs are misclassified as O, respectively. The worst TP rates of Tan et al. [41] and Wang et al. [39] are both obtained for ESE, at 0.84; 4% of ESE is misclassified as EEC by the former method and 4% is misclassified as SC by the latter. For our method, the TP rate of each category is relatively high and the minimum TP rate is as high as 0.87. In the sub-categories where the other five comparison methods obtain their worst TP rates, our method still achieves high TP rates. For example, the methods of Liu et al. [19] and Hu et al. [30] both show poor classification ability for EEC, whereas our method achieves a high TP rate of 0.88. Tan et al. [41] and Wang et al. [39] both obtain their minimum TP rate for ESE, whereas our method achieves a high TP rate of 0.87. This demonstrates that our method can overcome the classification weaknesses of the other methods on the esophageal dataset and can distinguish various types of esophageal diseases with similar morphology or appearance.
Finally, to further evaluate the comprehensive classification ability of each method, we calculated the average receiver operating characteristic (ROC) curves and AUC values on the test dataset, as shown in Fig. 4. Our method outperforms the competitors, achieving a better ROC curve and a higher AUC value of 0.9877.
Figure caption (confusion matrices): (a) Kumagai et al. [17], (b) Liu et al. [19], (c) Hu et al. [30], (d) Tan et al. [41], (e) Wang et al. [39], (f) Our method. All the possible records for each disease are presented using a color gradient and numbers.

Generalization ability
To verify the generalization ability of our method, we evaluated the classification performance of our method and the other five comparison methods on the skin disease dataset ISIC2019 [42]. Dermoscope is a new biomedical photoacoustic (PA) imaging system equipped with a waterless coupling and impedance matching opto-sono probe to achieve quantitative, high-resolution and high-contrast imaging of the human skin [43]. The training dataset of ISIC2019 contains 25331 dermoscopic images of eight types of skin diseases: actinic keratosis (AK), basal cell carcinoma (BCC), benign keratosis (BKL), dermatofibroma (DF), melanoma (MEL), melanocytic nevus (NV), squamous cell carcinoma (SCC) and vascular lesion (VASC). There is a severe imbalance in the number of images among the skin diseases: NV contains 12874 images, whereas DF includes only 238 images. Because the NV category has nearly three times as many images as MEL, the category with the second largest number of images (4521), we randomly selected 4000 images from the NV category to form our generalization experiment dataset, and left the number of images of the other seven categories unchanged. The resulting experiment dataset contained a total of 16,499 dermoscopic images. We randomly selected 80% of each category as the training set and the remaining 20% as the test dataset. Table 6 shows the Mean Pr, Rec, F1 and Acc values of each method on the skin disease test dataset. Our method is better than the comparison methods, achieving the best values of Mean Rec, F1 and Acc; its Mean Pr of 84.46% is close to the highest Mean Pr value of 85.04%. Furthermore, we also computed Pr, Rec and F1 of each method on each category of the test dataset (Table S2 in Supplement 1). Our method achieves the best values on most of the eight categories. The confusion matrix of each method is shown in Fig. 5, which illustrates that our method obtains higher TP rates in most categories of skin diseases than the other five comparison methods. The smallest TP rate of our method is 0.78, which is at least 6% higher than the minimum TP rates of the comparison methods. In summary, the generalization experiment results support the generalization ability of the proposed ECA-DDCNN in the skin disease classification task.
Figure caption (confusion matrices): (a) Kumagai et al. [17], (b) Liu et al. [19], (c) Hu et al. [30], (d) Tan et al. [41], (e) Wang et al. [39], (f) Our method. All the possible records for each disease are presented using a color gradient and numbers.

Discussions
Limited by shooting conditions such as lighting, most of the images taken by the gastroscope do not have the high contrast and definition of natural images. In addition, the lesions of esophageal diseases generally show irregular shapes and blurry boundaries. Hence, the classification of esophageal disease is a very challenging task, and much harder than the classification of natural images. Nevertheless, the proposed classification method based on the ECA-DDCNN network still obtains satisfactory results on the classification of the four main categories (seven sub-categories), namely NE, PEDs, EEC and AEC.
In the ablation studies, we evaluated the contributions of the ECA-module, the embedding location of the ECA-module and the higher growth rate. The results in Table 2 show that the location of the ECA-module is also critical. Compared with the benchmark method (DenseNet), the Acc value of the BT-ECA-DDCNN method improves slightly, whereas the Acc value of the AT-ECA-DDCNN method drops by 0.2%. This phenomenon is closely related to the transition layer. In AT-ECA-DDCNN (Fig. S1(a) in Supplement 1), the ECA-module only adaptively recalibrates the responses of the feature maps after dimension reduction, whereas in BT-ECA-DDCNN (Fig. S1(b) in Supplement 1), the ECA-module adaptively recalibrates the channel-wise feature responses of the feature maps without dimension reduction. In our method (Fig. 1 and Fig. 2), since the ECA-module is embedded into the dense block and placed behind the dense layer, the channel-wise feature responses of the output of each dense layer can be adaptively recalibrated by the ECA-module. Therefore, the classification performance of our method is the best. Unlike the traditional DenseNet, the proposed ECA-DDCNN has a higher growth rate. The experimental results in Table 3 show that a higher growth rate benefits classification performance.
The experimental results in Table 4 also confirm that the proposed RWS method both improves the classification accuracy and shortens the training time. Although the Aug method can also increase the sample size of the categories with few samples, it does not improve the classification accuracy, and its accuracy is even slightly lower than that of the Shuffle method. This is because pure data augmentation only increases the sample size and does not effectively increase the diversity of the dataset. As a result, the training time of the Aug method is prolonged by the large number of redundant images.
The comparative experiments show that the overall classification performance of our method exceeds that of the state-of-the-art methods, while the proposed ECA-DDCNN is also relatively lightweight and efficient. It requires only 26.49M parameters and 7.82 GFLOPs (Table 5). This is attributed to the proposed ECA-DBs, which encourage feature reuse, strengthen feature propagation and reduce the number of parameters. Additionally, the proposed ECA-DB can guide ECA-DDCNN to highlight and extract the subtle differences among the various types of diseases. Therefore, our proposed method achieves higher TP rates in classifying esophageal and skin diseases with high interclass similarity (Fig. 3 and Fig. 5).
Although we have achieved better results than the other related methods in classifying esophageal diseases, this work still has some limitations. Firstly, our method only classifies esophageal cancer into EEC and AEC, and not further into subtypes (e.g., esophageal squamous cell carcinoma and adenocarcinoma). Secondly, our method relies on supervised learning and needs a huge number of image-wise labeled images for training, and annotating so many images places a burden on doctors. Finally, we did not compare our method with the traditional classification methods based on handcrafted features, as many studies have shown that deep learning methods are superior to the traditional methods in classification [18,44-47]. Furthermore, the majority of the traditional methods are limited to binary classification and are difficult to extend to multi-class classification.

Conclusions
In this study, we presented a novel network, ECA-DDCNN, that combines the attention mechanism with a densely connected deep CNN. The proposed ECA-DDCNN is guided by the attention mechanism to highlight and extract the subtle differences among various types of diseases, which are difficult to distinguish even for experienced clinicians. On this basis, we developed a classification method based on ECA-DDCNN to classify the four main categories of esophagus (including one normal and six esophageal diseases). The experimental results show that the proposed method is competent at classifying these esophageal categories and obtains higher TP rates on the sub-categories with similar mucosal features than the other state-of-the-art methods. Additionally, the proposed method covers the four main categories of esophagus (NE, PEDs, EEC and AEC, comprising seven sub-categories), which is the widest coverage among the existing related classification methods for esophageal diseases. Therefore, our proposed classification method is suited to clinical needs and holds great promise for clinical application.
In future work, we intend to perform clinical tests of the proposed classification method based on ECA-DDCNN. We also aim to overcome the limitations of gastroscopic image labels by designing semi-supervised or unsupervised deep learning methods to classify, detect and segment more types of gastrointestinal diseases.