Deep Transfer Learning for Biology Cross-Domain Image Classification

Automatic biology image classification is essential for biodiversity conservation and ecological study. Recently, owing to their record-shattering performance, deep convolutional neural networks (DCNNs) have been used increasingly often in biology image classification. However, training DCNNs requires a large amount of labeled data, which may be difficult to collect for some organisms. This study was carried out to exploit cross-domain transfer learning for DCNNs with limited data. According to the literature, previous studies mainly focus on transferring from ImageNet to a specific domain or transferring between two closely related domains. In contrast, this study explores deep transfer learning between species from different domains and analyzes the situation in which there is a huge difference between the source domain and the target domain. Inspired by the analysis of previous studies, the effect of cross-domain transfer learning on biology image classification is investigated. In this work, the multiple transfer learning scheme is designed to exploit deep transfer learning on several biology image datasets from different domains. There may be a huge difference between the source domain and the target domain, causing poor transfer learning performance. To address this problem, multistage transfer learning is proposed by introducing an intermediate domain. The experimental results show the effectiveness of cross-domain transfer learning and the importance of data amount and validate the potential of multistage transfer learning.


Introduction
Building accurate knowledge of the identity, taxonomy, geographic distribution, and evolution of living species is essential for the sustainable development of humanity as well as for biodiversity conservation.
In terrestrial ecosystems, plants are extremely complex and diverse, and there are millions of different plant species [1,2]. Plants must be classified into identifiable groups in order to provide a clear, organized way of identifying the diverse array of plants and to support specific applications such as weed control [3,4].
Besides, the study of marine ecosystems is vital for global climate and environment protection [5][6][7][8]. There are many kinds of marine organisms worth studying, such as fish and plankton, which play an important role in the ecosystem [9] and the marine food chain [10].
Initially, species classification was usually carried out manually, based on morphological diagnoses provided by taxonomic studies [11]. However, for some species like weed plants and plankton, only experts such as taxonomists and trained technicians can identify taxa accurately. Furthermore, one expert may only identify a limited number of species in a specific domain (such as only species of weeds or phytoplankton) because identification requires special skills acquired through extensive experience [3,12]. At the same time, there is an increasing shortage of skilled taxonomists [13]. The declining and partly nonexistent taxonomic knowledge within the general public has been termed the "taxonomic crisis" [14], which poses great challenges to the future of biological study and conservation [11].
Computer-based multimedia identification tools built on computer vision and machine learning techniques have been considered a promising solution for classifying organisms, and a lot of work has been done on this topic [15,16].

Traditional Image Classification.
The traditional image classification process can generally be divided into three steps: image preprocessing, feature extraction/description, and classification [17]. Preprocessing techniques such as image denoising, image enhancement, and image segmentation are often used to produce a suitably enhanced image for the subsequent feature extraction step [18]. Feature extraction refers to taking measurements, geometric or otherwise, of possibly segmented, meaningful regions in the image [19]. To characterize and describe properties of the organism image by a set of values, computer vision experts have handcrafted a lot of features. In previous studies, general features such as size [20], color, shape context [21][22][23][24], invariant moments, granulometric features, co-occurrence matrix, Fourier descriptor, Gabor filters, local binary pattern (LBP) [25], histograms of oriented gradients (HOG), and scale invariant feature transform (SIFT) have been commonly used.
There are also some features that have been designed for specific species [26][27][28].
However, handcrafted features usually lack robustness and cannot represent the complex biomorphic characteristics of some organisms [12]. Besides, some features are elaborately handcrafted for specific organisms [33] and often perform poorly when extended to other organisms. These traditional classifiers usually do not achieve high prediction accuracy across different datasets [12]. Especially when the datasets are big or contain more than 20 categories, these classifiers may be limited by the "curse of dimensionality" [34], so they are hard to apply directly to ecological studies.

Deep Convolutional Neural Networks.
In recent years, DCNNs [35][36][37][38][39][40][41][42] have become a mainstay of the computer vision community due to their record-shattering performance in the ImageNet large-scale visual recognition challenge (ILSVRC) [43]. ImageNet is a large-scale image dataset with 1000 classes, containing 1.3 million training images, 50,000 validation images, and 100,000 testing images. DCNNs consist of a stack of learned convolution filters that extract hierarchical contextual image features and are thus high-capacity classifiers. With this high capacity, DCNNs can find the relevant contextual image features in classification problems and are less likely to be restricted by the "curse of dimensionality." Moreover, unlike traditional methods, DCNNs do not need to divide the training process into several steps but use an end-to-end learning mechanism, which is more suitable for real applications. The outstanding performance of DCNNs in image classification and other problems has received unprecedented attention, prompting scholars to apply them to various practical problems including biology image classification [3,[44][45][46][47][48]. Nevertheless, the very large number of parameters in DCNNs requires large-scale annotated training data. For some organisms inhabiting a complex environment, such as some marine and even microscopic organisms, it is very difficult to collect images. In addition, the collected data can only be used after being precisely classified by experienced experts. Because experienced experts are often scarce and one expert can only identify a limited number of species in a specific domain (such as only species of weeds or phytoplankton) [12], the data available in practical studies may be insufficient to fully exploit the potential of DCNNs.

Transfer Learning with DCNNs.
Transfer learning aims to transfer knowledge between the source domain and the target domain [49]. In biology image classification and some other scenarios, obtaining training data might be difficult and expensive. However, transfer learning can overcome the deficit of training examples in some domains by adapting classifiers trained on another domain [50]. There are two ways to apply transfer learning with DCNNs. One is to treat the DCNN as a big feature extractor and use the pretrained network with its learned weights to extract features that are subsequently used in a new domain. The outputs of the DCNN are considered high-level features and are then fed into a subsequent classifier. The other is to fine-tune the network weights by training the network with the data from the new domain. In this case, the dimension of the output layer must be changed to match the number of classes in the new domain dataset. There are some studies about biology image classification using transfer learning. Ge et al. [51] learned a domain-generic DCNN for the task of plant classification by applying transfer learning on the parameters of the GoogLeNet [37] model (pretrained on the large-scale ImageNet dataset) using all of the training data for the plant classification task. Lee et al. [52] incorporated transfer learning by pretraining a DCNN with class-normalized data and fine-tuning with the original data.
Orenstein and Beijbom [53] built on the insights from Kaggle's National Data Science Bowl (NDSB) and investigated how DCNNs perform on several datasets of in situ plankton images; their study suggests that weights from a highly tuned network for one planktonic image set could be used effectively in another plankton domain. Ge and Yu [54] introduced a source-target selective joint fine-tuning scheme for improving the performance of deep learning tasks with insufficient training data. Their idea is to identify and use a subset of training images from the original source learning task whose low-level characteristics are similar to those from the target learning task and to jointly fine-tune shared convolutional layers for both tasks.
Previous studies about transfer learning with DCNNs mainly focused on the tasks which transfer from ImageNet to a specific domain or transfer between two closely related domains [53]. Only a few studies exploited the transfer learning between two domains that are not directly related. When applying transfer learning to biology image classification, the different distance between species in the source domain and the target domain may have different effects on the performance. Although there is a certain biological distance between the two domains, they may share some common patterns in the view of DCNNs.
In this paper, inspired by the analysis of the literature and practical applications, deep transfer learning for biology cross-domain image classification is explored. By analyzing the experimental results on image datasets in different biology domains, including flowers, plant seedlings, plankton, and fish, some interesting conclusions are drawn. The main contributions of this paper can be listed as follows:

Methods
To exploit deep transfer learning for biology cross-domain image classification, a multiple transfer learning scheme is designed and multistage transfer learning is proposed to train DCNNs with several datasets from different domains.
AlexNet [35] consists of five convolutional layers and three fully connected layers. There are three 3 × 3 max-pooling layers, after layers 1, 2, and 5. In the first layer, the 3 channels in the filters correspond to the red, green, and blue components of the input image. The local response normalization (LRN) [35], which was introduced in AlexNet but is no longer used in subsequent DCNNs since it was replaced with batch normalization [38], was dropped in our implementation.
VGG-16 [36] consists of 13 convolutional layers and 3 fully connected layers. In order to increase the depth of the network, small (3 × 3) convolution filters are used in all convolutional layers.
GoogLeNet [37] has 22 layers, which consist of three convolutional layers, nine inception layers (each of which is two convolutional layers deep), and one fully connected layer. The inception layer is composed of parallel connections with filters of different sizes, including 1 × 1, 3 × 3, and 5 × 5, along with 3 × 3 max-pooling. The outputs of each connection in the inception module are concatenated together as the inception output. Using multiple filter sizes has the effect of processing the input at multiple scales. In order to reduce the number of weights, 1 × 1 filters are applied as a "bottleneck" to reduce the number of channels for each filter. GoogLeNet has multiple versions; batch normalization was introduced in the second version, and the most popular version, known as GoogLeNet v3, is used in this paper. GoogLeNet v3 factorizes the convolutions into smaller 1-D filters, which reduces the number of weights and allows the network to go deeper.
As the error back-propagates through the network, the gradient shrinks, which affects the ability to update the parameters in the earlier layers of very deep networks. To deal with this vanishing gradient problem, ResNet uses residual connections. ResNet introduces a "shortcut" module which contains an identity connection so that the "weight" layers (the layers that contain parameters) can be skipped. Rather than learning the full mapping directly, the weight layers learn the residual mapping. The "bottleneck" approach used in GoogLeNet, which uses 1 × 1 convolutions to reduce the number of weight parameters, is also used in ResNet. ResNet can be implemented with different numbers of layers. In this paper, ResNets with 18, 34, 50, 101, and 152 layers are built.

Rectified Linear Unit.
The Rectified Linear Unit (ReLU) activation function is applied to the output of every convolutional layer in all DCNNs used in this paper. The ReLU activation function can be described by the following equation:

$$f(z) = \max(0, z),$$

where z indicates the input of the ReLU activation function. The ReLU activation function makes DCNNs sparser. For example, in a randomly initialized network, only about 50% of the hidden units (those with z > 0) are activated (having nonzero output) simultaneously. Another benefit of ReLU is that it reduces the likelihood of vanishing gradients: when z > 0, the gradient has a constant value, which results in faster learning of the DCNNs.
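The sparsity claim above can be checked with a small sketch. It is only an illustration in plain Python, not the paper's implementation: the pre-activations are assumed to be drawn from a zero-mean Gaussian (a stand-in for a randomly initialized network), and the helper names are hypothetical.

```python
import random


def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 for z > 0, otherwise 0 (0 at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0


def active_fraction(pre_activations):
    """Fraction of units whose ReLU output is nonzero."""
    outputs = [relu(z) for z in pre_activations]
    return sum(1 for a in outputs if a > 0) / len(outputs)


# Assumption: zero-mean Gaussian pre-activations stand in for a random initialization.
rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
fraction = active_fraction(zs)  # close to one half for symmetric inputs
```

For positive inputs the gradient stays at 1 regardless of the magnitude of z, which is the constant-gradient property mentioned in the text.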

Dropout.
Dropout is a technique to reduce overfitting, which sets the output of each hidden neuron to zero with a given probability. The neurons which are "dropped out" in this way do not contribute to the forward pass and do not participate in back-propagation during training. Every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. Dropout is employed in the fully connected layers of AlexNet and VGG.

Batch Normalization.
Batch Normalization [38] speeds up the training process and improves accuracy by controlling the input distribution across layers. To this end, the layer input activations x are normalized so that they have zero mean and unit standard deviation, which can be described as

$$y = \gamma \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta,$$

where μ and σ indicate the mean and standard deviation of the distribution of layer input activations, γ and β are parameters that can be learned from training, and ϵ is a small constant to avoid numerical problems.
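As a rough illustration of the normalization just described, the following plain-Python sketch applies it to a single feature over one minibatch. The function name and the default values of `gamma`, `beta`, and `eps` are assumptions made for the example; running averages used at test time are deliberately left out.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a minibatch, then scale by gamma and shift by beta."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)  # sigma squared
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]


# Toy minibatch of activations (illustrative values only).
ys = batch_norm([1.0, 2.0, 3.0, 4.0])
shifted = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=3.0)
```

With the default `gamma` and `beta` the output has approximately zero mean and unit variance; the learned parameters then let the network rescale and shift the normalized activations.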

Softmax.
A softmax function is applied after the output layer, which is a fully connected layer with K units. Here K indicates the number of classes in the image classification task, which has the same meaning in equation (3). The output of the softmax represents a probability distribution over all the predicted classes, which is computed by

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=0}^{K-1} e^{x_j}}, \tag{3}$$

where x_i represents the output of the i-th unit in the last fully connected layer and i ranges from 0 to K − 1.
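A minimal sketch of the softmax computation follows. Subtracting the maximum logit before exponentiating is a common numerical-stability choice added here as an assumption; it does not change the result and is not stated in the text.

```python
import math


def softmax(logits):
    """Map K raw outputs to a probability distribution over K classes."""
    m = max(logits)  # shift by the maximum for numerical stability (assumption)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy outputs of a last layer with K = 3 units (illustrative values only).
probs = softmax([2.0, 1.0, 0.1])
```

The returned values are positive, sum to one, and preserve the ordering of the inputs, so the predicted class is the index of the largest output.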

Data Augmentation.
Data augmentation, which artificially enlarges the dataset using label-preserving transformations [35,39], is the easiest way to reduce overfitting on image data. There are three forms of data augmentation in our classification system: feature normalization, image resizing/cropping, and horizontal image flipping. It has been shown that feature normalization can make gradient descent converge faster [38]. During both the training phase and the test phase, when image data are fed into the system, the system performs feature normalization for each channel of the image separately:

$$x_c' = \frac{x_c - \mu_c}{\sigma_c},$$

where x_c indicates the c-th channel of the input image; μ_c and σ_c indicate the mean and standard deviation of the c-th channel among all the images in the training set, respectively; and x_c′ indicates the c-th channel of the normalized input image.
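The per-channel normalization can be sketched as below. The representation of an image as a list of channels, each a flat list of pixel values, and the function names are assumptions chosen to keep the example dependency-free.

```python
import math


def channel_stats(images):
    """Per-channel (mean, std) pooled over every pixel of every training image."""
    stats = []
    for c in range(len(images[0])):
        values = [v for img in images for v in img[c]]
        mu = sum(values) / len(values)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
        stats.append((mu, sigma))
    return stats


def normalize(image, stats):
    """Apply x'_c = (x_c - mu_c) / sigma_c to each channel of one image."""
    return [[(v - mu) / sigma for v in channel]
            for channel, (mu, sigma) in zip(image, stats)]


# Two toy 3-channel images with two pixels per channel (illustrative values only).
train_images = [
    [[0.0, 0.2], [0.1, 0.3], [0.5, 0.7]],
    [[0.4, 0.6], [0.5, 0.7], [0.9, 1.1]],
]
stats = channel_stats(train_images)
normalized = [normalize(img, stats) for img in train_images]
```

The statistics are computed once on the training set and reused unchanged for test images, which matches the statement that normalization is applied in both phases.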

Pipeline and Experiment Details.
All the DCNNs in this paper are implemented with the PyTorch deep learning framework. For the GoogLeNet v3 network, the input image is first resized to 342 × 342 and then cropped to 299 × 299; for the other networks, the input image is resized to 256 × 256 and then cropped to 224 × 224. To prevent substantial overfitting [35], different cropping methods are employed during the training phase and the test phase. During the training phase, random cropping is employed by extracting random 224 × 224 patches (299 × 299 for GoogLeNet v3) from the 256 × 256 images (342 × 342 for GoogLeNet v3). These patches are then randomly flipped horizontally and fed into the network for training. During the test phase, each image in the test set needs to be predicted only once, and the foreground organisms are more likely to appear in the center of the image, so only center cropping is employed.
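The two cropping modes can be sketched without any framework as follows. This is not the paper's PyTorch pipeline: an image is assumed to be a 2-D list of pixels that has already been resized, and the flip probability of 0.5 is an assumption, since the text only says the flip is random.

```python
import random


def center_crop(image, size):
    """Test-time crop: take the central size x size patch."""
    h, w = len(image), len(image[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def random_crop(image, size, rng):
    """Training-time crop: take a size x size patch at a random position."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in image[top:top + size]]


def horizontal_flip(image):
    """Mirror every row of the image."""
    return [row[::-1] for row in image]


def train_transform(image, size, rng):
    """Random crop followed by a random horizontal flip (probability 0.5 assumed)."""
    patch = random_crop(image, size, rng)
    if rng.random() < 0.5:
        patch = horizontal_flip(patch)
    return patch


# A 256 x 256 dummy image whose pixels record their own (row, column) position.
image = [[(r, c) for c in range(256)] for r in range(256)]
test_patch = center_crop(image, 224)
train_patch = train_transform(image, 224, random.Random(0))
```

For GoogLeNet v3 the same functions would be called with a 342 × 342 image and a crop size of 299.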
All the first convolutional layers of the DCNNs in this paper have three input channels, corresponding to the three channels of an RGB image. Except for GoogLeNet v3, all the inputs to the DCNNs are fixed-size 224 × 224 × 3 images; for GoogLeNet v3, the input image size is fixed to 299 × 299 × 3. If a single-channel grayscale image is input to the DCNN, it is converted to an RGB image with three identical channels, whose values are copied from the single-channel image.
To get more details of our experiments, please visit our open-sourced repository BioTL [58] on GitHub.

Training from Scratch.
The DCNN training procedure generally follows Krizhevsky et al. [35]. The initialization of the network weights is important [36]. The biases are initialized to zero, and the weights in all convolutional layers are initialized from N(0, 2/n), where n is the product of the size and the number of channels of the filters in the layer. A weight decay of $10^{-4}$ and a minibatch size of 16 are used. The learning rates of AlexNet and VGG-16 are both initialized to $10^{-3}$, while the learning rates of all other DCNNs are initialized to $10^{-2}$. Starting from the initial learning rate, all DCNNs are trained for up to 300 epochs, and the learning rate is divided by 10 every 100 epochs.
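The schedule and the initialization scale described above can be written down as a short sketch. Two points are assumptions: the second argument of N(0, 2/n) is read as a variance, and "the number of channels of the filters" is read as the number of input channels. The function names are hypothetical.

```python
import math


def learning_rate(epoch, base_lr, step=100, factor=0.1):
    """Step decay: divide the learning rate by 10 every `step` epochs."""
    return base_lr * factor ** (epoch // step)


def init_std(kernel_size, in_channels):
    """Standard deviation of N(0, 2/n) with n = kernel_size^2 * in_channels (assumed reading)."""
    n = kernel_size * kernel_size * in_channels
    return math.sqrt(2.0 / n)


# Learning rate of a network that starts at 1e-2, sampled at a few epochs.
schedule = [learning_rate(e, 1e-2) for e in (0, 99, 100, 299)]
std_3x3_64 = init_std(3, 64)
```

Under these assumptions a network starting at $10^{-2}$ trains at $10^{-2}$, $10^{-3}$, and $10^{-4}$ over the three 100-epoch segments.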

Cross-Domain Transfer Learning
Fine-Tuning on ImageNet.
To fully utilize the potential of DCNNs with small amounts of data, we use ImageNet as the source domain and apply transfer learning to transfer the knowledge learned from ImageNet to the target domain. The data augmentation operations are the same as in training from scratch. Instead of being initialized randomly, all the weights (except those of the last fully connected layer) are initialized with the weights learned from the ImageNet dataset. Because the number of classes in the target task may differ from ImageNet's 1000 classes, which corresponds to the output dimension of the last fully connected layer, the weights of the last fully connected layer in the pretrained model are dropped.

Multiple Transfer Learning.
To exploit deep transfer learning for biology image classification, a multiple transfer learning scheme is designed. The scheme applies transfer learning several times with multiple source domains to observe the cross-domain effect. For example, a DCNN model is first trained on the Flowers17 dataset, which is considered the source domain. Secondly, all the weights of the trained model except those of the last fully connected layer are used to initialize a new model with the same architecture. The last layer is excluded because the dimension of its output corresponds to the number of classes in the classification task, and the number of classes in the source domain is often different from that in the target domain, so the last fully connected layer needs to be rebuilt to fit the new task. Finally, the new model with the initialized weights is trained on the target domain dataset, such as QUT Fish.
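The weight-reuse step described above can be sketched with plain dictionaries standing in for model parameters. The parameter names (`conv1.weight`, `fc.weight`), the `fc.` prefix for the last fully connected layer, and the toy shapes are hypothetical; a real framework would use its own state dictionaries and tensor types.

```python
import copy


def transfer_weights(source_state, target_state, head_prefix="fc."):
    """Initialize a new model from a trained one, skipping the classifier head.

    Both arguments map parameter names to values. Entries whose name starts
    with `head_prefix` keep the fresh values of the target model.
    """
    new_state = copy.deepcopy(target_state)
    for name, value in source_state.items():
        if name.startswith(head_prefix):
            continue  # the last fully connected layer is rebuilt for the new task
        if name in new_state:
            new_state[name] = copy.deepcopy(value)
    return new_state


# Source model trained on 17 classes, target model with a freshly built 482-way head.
source_state = {
    "conv1.weight": [[0.5, -0.2], [0.1, 0.3]],
    "fc.weight": [[0.0, 0.0] for _ in range(17)],
}
target_state = {
    "conv1.weight": [[0.0, 0.0], [0.0, 0.0]],
    "fc.weight": [[0.01, 0.01] for _ in range(482)],
}
initialized = transfer_weights(source_state, target_state)
```

After this step the new model would be trained on the target dataset in the usual way, with the copied layers serving only as the starting point.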
In practice, the ImageNet dataset is first used as the source domain, and the pretrained models are fine-tuned on the five target domain datasets (Flowers17, Flowers102, Plant Seedlings, PlanktonSet 1.0, and QUT Fish). After that, to examine the effect of different distances between the species in the source domain and the target domain, different combinations of the five datasets are chosen for transfer learning.

Multistage Transfer Learning.
There may be a huge difference between the source domain dataset and the target domain dataset, so that the knowledge learned from the source domain cannot be transferred well. If the data in an intermediate domain can adapt the learned features to fit the target domain, the hindering effect will be less noticeable, or the performance may even be improved. To make the knowledge learned from the source domain more transferable, multistage transfer learning is proposed. Multistage transfer learning requires adding an intermediate domain between the source domain and the target domain.
Figure 1 illustrates the multistage transfer learning framework. In Figure 1, the "CONV 1" to "CONV N" blocks indicate the N convolutional layers in the DCNN model, and the "FC" block indicates the fully connected layer. As shown in Figure 1, the proposed multistage transfer learning consists of three stages: pre-pretraining the models on ImageNet, which is considered the source domain; pretraining the models in an intermediate domain; and fine-tuning the models in the target domain. Since it is not known how to find the best intermediate domain dataset in advance, the multiple transfer learning scheme is followed with a grid search that tries different datasets as the intermediate domain.
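The grid search over intermediate domains can be sketched as an enumeration of three-stage chains. This is an illustration under assumptions: every other dataset is taken as a candidate intermediate domain, and `evaluate` is a hypothetical user-supplied function that trains along a chain and returns a score; no results are implied.

```python
from itertools import permutations

DATASETS = ["Flowers17", "Flowers102", "Plant Seedlings", "PlanktonSet 1.0", "QUT Fish"]


def multistage_chains(datasets, source="ImageNet"):
    """All (source, intermediate, target) chains with distinct intermediate and target."""
    return [(source, mid, tgt) for mid, tgt in permutations(datasets, 2)]


def best_intermediate(target, datasets, evaluate, source="ImageNet"):
    """Return the intermediate dataset whose chain scores highest for one target."""
    candidates = [c for c in multistage_chains(datasets, source) if c[2] == target]
    return max(candidates, key=evaluate)[1]


chains = multistage_chains(DATASETS)  # 5 * 4 = 20 chains
```

Each chain corresponds to one run of the three stages in Figure 1, so the cost of the search grows with the number of candidate intermediate datasets.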
In consideration of the computational cost, multistage transfer learning is explored only on three models, ResNet-18, ResNet-34, and ResNet-50, which have similar structures but different depths.

Datasets
In this paper, to exploit cross-domain transfer learning, several datasets from different domains are chosen, including Oxford Flowers, Plant Seedlings, PlanktonSet 1.0, and QUT Fish.

Oxford Flowers.
There are two versions of the Oxford Flowers dataset, Oxford Flowers 17 (Flowers17) and Oxford Flowers 102 (Flowers102). Flowers17 contains 17 classes of flowers with 80 images in each class, chosen so that they cannot be distinguished by color alone. Flowers102 consists of 102 classes represented by 40 to 258 images per class and 8189 images in total. About 45% of the Flowers17 images are also part of Flowers102, so Flowers17 is not simply a subset of Flowers102. Image examples of these two datasets are shown in Figures 2 and 3, in which images in the same row come from the same class and images in different rows come from different classes. Following the recommendation in the official dataset documentation, the datasets are split into a training set, a validation set, and a test set.

PlanktonSet 1.0.
The image segments extracted from the raw data, 60,736 images in total, are sorted into 121 plankton classes and split into a training set and a test set with a ratio of 1 : 1. The images obtained using the camera were already processed by a segmentation algorithm to classify and isolate individual organisms and then cropped accordingly, as can be seen in Figure 5. The image samples demonstrate that there is high intraclass variance and small interclass variance among some plankton species.

QUT Fish.
QUT Fish [60] consists of 3960 images collected from 482 fish species. The data contain real-world images of fish captured in conditions defined as "controlled," "in situ," and "out-of-the-water," as shown in Figure 6. Since "controlled" images are captured with a controlled background and high quality, "controlled" images are preferentially assigned to the training set when splitting the dataset, while "in situ" and "out-of-the-water" images, which have low quality and pose variations, are preferentially assigned to the test set. Finally, the QUT Fish dataset is split into a training set and a test set with a ratio of 1 : 1.
Because some classes in this dataset contain only two image examples, only 2-fold cross-validation can be applied to it.
Because the amount of training data plays a crucial role in training DCNNs, the total number of training examples and the average number of training samples per class are listed for the above datasets in Table 2. Table 2 shows that all the datasets are small compared to ImageNet, which contains more than one million training images. For QUT Fish, the data are extremely scarce, since on average there are only 4 training samples in each class. Table 1 shows that, as the number of layers (depth) increases, the performance of the DCNN on ImageNet improves. At the same time, the number of parameters also increases with the number of layers.

Evaluation
In this paper, accuracy and F measure are used as the evaluation metrics.
Accuracy is the most intuitive and frequently used performance measure for classification tasks. Accuracy is simply the ratio of correctly predicted samples to the total number of samples, so it can be easily calculated. Accuracy is a good measure if the datasets are balanced; however, for imbalanced datasets, accuracy may not reflect the real performance of the classifier. Most of the datasets used in this paper are imbalanced, such as Flowers102, Plant Seedlings, PlanktonSet 1.0, and QUT Fish. The distributions of these datasets can be seen in Figure 7. To evaluate the classification performance on imbalanced datasets, F measure is used as another metric.
Both accuracy and F measure can be calculated from the confusion matrix, which is a table containing information about actual and predicted classifications. As shown in Table 3 (refer to Table 1 in Ref. [12]), each row of the confusion matrix represents the instances in a predicted class, while each column represents the instances in an actual class. For a binary classifier, according to the true condition and the predicted condition, the confusion matrix consists of four parts: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Several measures can be derived from the confusion matrix:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

F measure is the harmonic mean of precision and recall:

$$F_{\mathrm{measure}} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

Therefore, F measure takes both FP and FN into account and is more useful than accuracy when the class distribution is uneven.
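The difference between the two metrics on an imbalanced dataset can be seen in a small sketch. The text does not state how the per-class F measures are combined in the multi-class case; the unweighted (macro) average used here is an assumption, as are the function names and the toy labels.

```python
def per_class_counts(y_true, y_pred, label):
    """TP, FP, FN for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Ratio of correctly predicted samples to all samples."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def macro_f(y_true, y_pred):
    """Unweighted mean of per-class F measures (averaging scheme is an assumption)."""
    labels = sorted(set(y_true))
    return sum(f_measure(*per_class_counts(y_true, y_pred, l)) for l in labels) / len(labels)


# Imbalanced toy example: a classifier that always predicts the majority class.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
acc = accuracy(y_true, y_pred)
f = macro_f(y_true, y_pred)
```

In this toy case the accuracy looks acceptable because the majority class dominates, while the averaged F measure is much lower because the minority class is never predicted.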

Results
The multiple transfer learning scheme is designed to exploit deep transfer learning on the Flowers17, Flowers102, Plant Seedlings, PlanktonSet 1.0, and QUT Fish datasets. For comparison, DCNN models are also pretrained on ImageNet and then fine-tuned on the five datasets. For multiple transfer learning, one of the five datasets is chosen as the source domain and another as the target domain. PlanktonSet 1.0 contains relatively more data, so the DCNNs that achieve better results on PlanktonSet 1.0 tend to be deeper. On PlanktonSet 1.0, the best accuracy of 77.40% is achieved by ResNet-152, and the best F measure of 0.6593 is achieved by ResNet-101.

Cross-Domain Transfer Learning.
For comparison, experiments on fine-tuning the models pretrained on ImageNet are performed. The multiple transfer learning scheme is designed to apply transfer learning on several cross-domain datasets. Similar to fine-tuning on ImageNet, the weights of the model pretrained on the source domain are used to initialize a new DCNN, which is then fine-tuned on the target domain dataset. To adapt the features of the models pretrained on ImageNet to the target domain, multistage transfer learning is proposed by adding an intermediate domain. Table 8 lists the gains of the cross-domain transfer learning methods compared with training from scratch.
In the "Transfer process" column, the entries in boldface indicate the source domain dataset or intermediate dataset with the best classification results; in the "Accuracy (%)" column, the entries in boldface indicate the best performance with the highest accuracy; in the "F measure" column, the entries in boldface indicate the best performance with the highest F measure.

Fine-Tuning on ImageNet.
The results of fine-tuning on ImageNet are shown in Tables 9 and 10. Compared with the results of training from scratch in Tables 4 and 5, every single model achieves better performance on the Flowers17, Flowers102, and QUT Fish datasets after fine-tuning on ImageNet. For Flowers17, fine-tuning on ImageNet gains 9.52% accuracy and 0.0911 F measure on average among all models; for Flowers102, the improvement is much larger, with gains of 35.65% accuracy and 0.3695 F measure on average; for QUT Fish, the results also improve, with gains of 17.96% accuracy and 0.1746 F measure on average.
For the Flowers17 dataset, using Flowers102 as the source domain dataset yields a gain of 0.66% accuracy and 0.0070 F measure on average, which is much poorer than using PlanktonSet 1.0 as the source domain dataset; for Flowers102, using Flowers17 as the source domain yields a gain of 8.44% accuracy and 0.0871 F measure on average, which is also poorer than using PlanktonSet 1.0 as the source domain dataset.
For Plant Seedlings, Table 12 shows that using Flowers17 as the source domain dataset yields a gain of 0.21% accuracy and 0.0017 F measure on average. Using Flowers102, PlanktonSet 1.0, or QUT Fish as the source domain dataset decreases the results.
For PlanktonSet 1.0, Table 13 shows no clear evidence that multiple transfer learning consistently improves the results. On average, using Flowers17 as the source domain dataset leads to a decrease of 0.06% accuracy and 0.0006 F measure; using Flowers102 as the source domain dataset leads to a decrease of 0.09% accuracy and 0.0023 F measure; using Plant Seedlings as the source domain dataset yields a gain of 0.08% accuracy and 0.0002 F measure. In fact, even using ImageNet as the source domain dataset decreases the results by 0.17% accuracy and 0.0005 F measure on average.
Table 8, the gains of cross-domain transfer learning results compared with

Discussion
In this paper, the multiple transfer learning scheme and the multistage transfer learning method are introduced to exploit cross-domain transfer learning on biology image classification. Our aim is to address the problem that limited labeled data may not fully utilize the feature representation power of DCNNs. In order to achieve this, multiple transfer learning scheme is designed to explore cross-domain transfer learning and proposed multistage transfer learning to learn high-level patterns from different domains to get the learned features fitting the target domain. Table 1 shows that, with the increase of the DCNN's depth, the performance on ImageNet can get better and better. But meanwhile, the parameters of the network also increase dramatically, which makes training the network more difficult especially when the amount of data is scarce. In order to compare the performances of different models on different datasets and observe their changes trend intuitively, the performances of different models in Figure 8 are normalized and translated. Added the depth, the number of parameters and the performance on ImageNet for each model to Figure 8, which have also been normalized and translated.
Table 2 shows that the datasets used in this paper are very small compared with ImageNet. It can be seen that once the depth of the network reaches a certain level, performance no longer improves as the depth increases further. Most of the best results on these datasets are achieved with ResNet-18 or ResNet-34.
DCNNs can learn some general high-level patterns, so transfer learning can be used to carry these learned patterns over to a target domain with limited data. When the amount of data in the target domain is small, the amount of data in the source domain plays an important role in transfer learning performance. For example, PlanktonSet 1.0 contains more data, so when PlanktonSet 1.0 is used as the source domain dataset, the multiple transfer learning results tend to be better (Tables 6, 7, and 11). In Table 11, for instance, although the biological distance between Flowers17 and Flowers102 is smaller, using Flowers17 as the source domain dataset performs worse than using PlanktonSet 1.0. When the amount of data in the target domain is large, the effect of the biological distance between the species in the source domain and those in the target domain becomes visible (Table 12). In Table 12, although PlanktonSet 1.0 contains more data than any other dataset, using PlanktonSet 1.0 as the source domain dataset did not give the best result. Multistage transfer learning is proposed to address the problem caused by a large gap between the source domain and the target domain. Table 8 shows that, because there is a huge difference between ImageNet and PlanktonSet 1.0, multistage transfer learning with cross-domain datasets can improve on the performance of fine-tuning from ImageNet directly. But when performing multistage transfer

Conclusions
In this paper, the multiple transfer learning scheme is designed to exploit deep transfer learning for biology cross-domain image classification. By pretraining the DCNN model in different source domains, the results on the target domain dataset can be improved significantly. The experimental results show that even out-of-domain data are effective when the target domain data are insufficient. A multistage transfer learning method is also proposed, which can improve the performance of DCNNs when there is a huge difference between the source domain and the target domain. A limitation of multistage transfer learning is that the datasets in the intermediate domain must be carefully selected; otherwise, the final performance may be hindered. However, it is difficult to find the best way to search for the optimal intermediate-domain dataset, and this needs further study. In our view, searching for datasets whose low-level characteristics are similar to those of the target domain may be a good choice. Since DCNNs can learn some high-level domain-independent features, the ideas of multiple transfer learning and multistage transfer learning can be widely applied to biology image classification and to other fields.
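The multistage idea can be summarized as a chain of fine-tuning stages in which the weights learned in one domain initialize training in the next. The sketch below is framework-agnostic and illustrates only the control flow. `run_transfer_chain`, the `fine_tune` callback, and the stub used in the usage lines are hypothetical names, and the dataset ordering is just an example; none of this is the authors' implementation.

```python
from typing import Any, Callable, Sequence

def run_transfer_chain(
    stages: Sequence[str],
    fine_tune: Callable[[Any, str], Any],
    initial_weights: Any = None,
) -> Any:
    """Fine-tune on each dataset in order, carrying the weights forward.

    stages: dataset names, e.g. source, intermediate, then target.
    fine_tune: user-supplied function (weights, dataset) -> new weights.
    """
    if not stages:
        raise ValueError("at least one stage (the target) is required")
    weights = initial_weights
    for dataset in stages:
        # The output of one stage is the starting point of the next.
        weights = fine_tune(weights, dataset)
    return weights

# Stub standing in for real training: it only records the lineage.
def stub_fine_tune(weights, dataset):
    return (weights or []) + [dataset]

# Example ordering only: source -> intermediate -> target.
lineage = run_transfer_chain(
    ["ImageNet", "Flowers102", "PlanktonSet 1.0"], stub_fine_tune
)
```

In a real setting the callback would typically replace the classification layer to match the class count of each dataset before training; dropping the middle entry from `stages` reduces the chain to ordinary single-stage fine-tuning.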