Sequential semi-supervised segmentation for serial electron microscopy images with a small number of labels

Background: Segmentation of serial-section electron microscopy images by deep learning has attracted attention as a technique to reduce the annotation cost for researchers making observations with 3D reconstruction methods. However, when the observed samples are rare or scanning conditions are unstable, pursuing generalization performance for newly obtained samples is not appropriate.


Introduction
Comprehensive research on the structures and connections of the nervous system in the brain and sensory organs of an organism leads to understanding of their functional meaning. Such observations are carried out in a research field called "connectomics" (Chklovskii et al., 2010; Chen et al., 2006; Bock et al., 2011). Various methods, such as diffusion tensor imaging (Alexander et al., 2007), Brainbow (Livet et al., 2007), and 3D reconstruction (Denk and Horstmann, 2004), have been used to investigate neural structures. When the observation target is a sensory organ of a small organism such as an ant or other insect, several hundred serial images are taken with a scanning electron microscope and reconstructed in 3D space for detailed observation. For example, Takeichi et al. partially clarified the mechanism of the neural circuit for identifying nestmates by observing the olfactory sensory unit of the Japanese carpenter ant (Takeichi et al., 2018). As this example shows, the 3D reconstruction of serial images makes it possible to observe even the delicate neural structures of small insects.
Although the 3D reconstruction of serial images is an essential technology in various academic fields such as biology and neuroscience, it requires careful annotation by experts, which is significantly time-consuming. In the example of the research on the Japanese carpenter ant mentioned above, several experts spent a few months on the annotation work to reconstruct a set of serial images in three dimensions.
In recent years, some attempts have been made to construct automatic annotation systems to reduce the burden on experts. As can be seen from the pipeline method proposed in a previous study (Kaynig et al., 2015), automatic annotation is divided into two phases: a region extraction phase and a grouping phase. In the extraction phase, neural contours and their internal regions are extracted. In the grouping phase, the nerve regions extracted in each slice are combined into continuous regions. In this paper, we refer to the former phase as "segmentation". Since the segmentation phase's performance greatly affects the overall processing, various methods based on machine learning have been proposed. The development of applications to facilitate the implementation of these methods is also being actively pursued (Berg et al., 2019). In particular, since the neural membrane segmentation method using deep neural networks (DNNs) (Ciresan et al., 2012) was proposed, many DNN-based methods exemplified by U-Net (Ronneberger et al., 2015) have been devised. These networks enable segmentation with significantly high performance.
However, we consider there to be two issues in the practical use of deep learning-based methods: 1. preparing the labeled data used for learning requires a high level of expertise and a significant amount of time; 2. it is difficult even for experts to obtain a precise sequence of images with an electron microscope, so the quality of the captured images differs from session to session and is not stable.
For these reasons, the problem setting of existing studies, namely obtaining a model that performs highly accurate inference on newly obtained data, is difficult to apply.
In this study, we address these practical problems and perform serial-image segmentation in the framework of transductive learning. More specifically, we aim to predict the labels of an entire series of images from the labels given to a part of it. In addition, we propose a semi-automatic method for the segmentation of serial images.
Our method leverages a semi-supervised learning algorithm that uses only a small number of samples (labeled by experts) in a dataset for training and automatically segments all remaining unlabeled images in the dataset. In other words, our method makes it possible to complete the annotation of a set of serial images by self-training. We conduct evaluation experiments using two datasets and discuss the properties of the proposed method.
Following this section, we describe related studies on electron microscopy image segmentation and semi-supervised learning in Section 2. In Section 3, we define the tasks in this study and describe the proposed method. In Section 4, we describe experiments using two types of datasets and their results. After discussing the performance and properties of our 4S method in Section 5, we conclude and discuss future work.

Related work
With the development of microscope technology and the field of connectomics, interest in cell region detection has increased, and automation by machine learning has been attempted. The approaches can be roughly classified into unsupervised learning and supervised learning (because semi-supervised learning substantially requires ground-truth data, we classify it as supervised learning). The former approach detects the region of interest on the basis of edge extraction using traditional image processing techniques. Typical methods include binarization, represented by the Otsu method (Otsu, 1979), and the graph cut algorithm (Boykov et al., 2001). Since unsupervised learning does not require teacher data, there is almost no burden on the expert at the data collection stage. However, its performance is limited, and the burden of post-processing correction is significant. Supervised learning is expected to have extremely high performance owing to the rise of deep learning in recent years. In the following, we systematically summarize the various methods that have been proposed for the task of cellular region extraction, with particular attention to supervised learning approaches. Furthermore, we refer to the semi-supervised learning approach.

Pixel-wise classification with hand-crafted features
Cell region extraction using supervised learning is mainly composed of two stages: manually designing feature vectors and learning from them. For example, Mishchenko leveraged the Gaussian-smoothed Hessian and the eigenvalues computed on a small region around the pixel of interest as features, and classified whether each pixel belongs to a membrane or not (Mishchenko, 2009). Similarly, Andres et al. performed classification by using local statistical features and a random forest (Andres et al., 2020). This kind of pipeline method had some success in the early 2010s, but due to the trend of deep learning, it was gradually replaced by end-to-end methods.

Deep neural networks for pixel-wise classification
The DNN is a machine learning model that has come to be used for various tasks, owing to the large datasets available on the Internet and the increased capacity of GPU parallel computing. In particular, a general object recognition model using a convolutional neural network (Krizhevsky et al., 2017) surpassed conventional methods in the ILSVRC2012 competition and received a great deal of attention.
Ciresan et al. pioneered the use of DNNs in the task of cell region extraction (Ciresan et al., 2012). They defined this task as a classification problem that determines whether each pixel in an image belongs to a neural cell membrane. Then, using a square patch centered on each pixel as input, the DNN was used to infer each pixel. In the ISBI 2012 competition (Arganda-Carreras et al., 2015), their method achieved results far exceeding those of the conventional methods mentioned above. Following this achievement, various methods based on DNNs have also been proposed in the field of biomedical image processing.
Among the DNN-based methods, U-Net (Ronneberger et al., 2015) and the deep contextual network (Chen et al., 2016b) are particularly successful examples. Both networks are based on fully convolutional networks (Long et al., 2015), and the feature maps obtained in the middle layers are propagated directly to deep layers by skip connections. Drozdzal et al. have also experimentally shown that the skip connection is essentially crucial in nerve cell segmentation (Drozdzal et al., 2016). In more advanced research, deeper networks (Fakhry et al., 2017; Xiao et al., 2018) and mechanisms that take into account correlations in the 3D direction (Chen et al., 2016a, 2018) have been devised for greater performance. However, these previous studies presuppose that a large amount of training data is available. For researchers of minor organisms and sense organs who cannot easily prepare such data, DNN-based methods are challenging. When applying DNNs to situations where there is insufficient data for training, such as in the biomedical field, various approaches, such as transfer learning (Tan et al., 2018), semi-supervised learning (van Engelen and Hoos, 2020), active learning (Budd et al., 2019), weakly supervised learning (Gondal et al., 2017), and self-supervised learning (Chen et al., 2019), have been taken depending on the situation. In this study, we focus on the semi-supervised learning approach, assuming that a small number of labels is available.

Semi-supervised segmentation algorithm
In the application of machine learning to the analysis of biomedical images, including electron microscopy images, the cost of creating training data is high. However, the acquisition of raw data itself has become easier with the development of imaging equipment. Therefore, many studies have applied semi-supervised learning, which is an approach using a small number of labeled data and a large number of unlabeled data (Cheplygina et al., 2019). Semi-supervised learning frameworks can be classified into self-training (Yarowsky, 1995), co-training (Blum and Mitchell, 1998), and graph-based approaches (Zhu et al., 2003).
The method proposed in this study, which was inspired by expert annotation, is based on the concept of self-training. Self-training uses a model trained on labeled data to make inferences on unlabeled data. Then, among the prediction results, the predicted maps with high confidence are regarded as pseudo-labels, and the model is retrained using them. By repeating this procedure, the performance of the model is improved. There are few examples of applying self-training to electron microscope images and nerve cell region extraction, but there are some examples of applying it to medical image processing. For example, Iglesias et al. (2010) applied self-training in the task of skull stripping, and Dittrich et al. (2014) and Bai et al. (2017) leveraged it for fetal brain and cardiac segmentation, respectively. However, these studies assume an inductive setting and seek accuracy in inference for unknown cases. Our proposed method differs from them in that the transductive setting described in Section 3.1 is assumed and that pseudo-labeling is performed sequentially using local slices.

Problem definition
We define the task of detecting cell regions from serial images in the framework of transductive learning (Zhu and Goldberg, 2009). It is assumed that annotations are given only to several consecutive slices in a certain sample's serial slice images. This is represented as D = {(X_1, T_1), ..., (X_l, T_l), X_{l+1}, ..., X_{l+u}}, where X_i is a slice image, T_i is its label map, and l and u are the total numbers of labeled and unlabeled slices, respectively. The goal of transductive learning is to train a model that makes good predictions for the unlabeled slices X_{l+1}, ..., X_{l+u}. In other words, we aim to predict labels for all data that has already been obtained. Unlike general supervised learning, the aim is not to generalize the model's predictive performance to newly obtained data.

Sequential semi-supervised segmentation framework
When experts label serial images for 3D reconstruction, they do so one by one, checking consistency with the preceding and following slices. That is because there is a priori knowledge that consecutive images are strongly correlated. Considering this fact, we devised the approach shown in Fig. 1. An expert labels only the first few images of a series, and the rest are labeled automatically. Self-training is performed by using prediction results as pseudo-labels and using them as teacher data for re-training. It is particularly noteworthy that only local slices are used as input data for re-training. Repeated re-training increases the calculation time, but as long as a small number of teacher labels is given, inference is automatically performed on all subsequent data, so the experts' labor can be significantly reduced. We named our method "sequential semi-supervised segmentation" (4S).
Our proposed method is based on a transductive setting and assumes that labels are given to the first M consecutive slices of the N slices. The process proceeds mainly by repeating two steps: 1. a machine learning model designed for cell region detection is trained using the M consecutive slices; 2. inference is performed on the next slice using the trained model, and the set of pixels that exceed a certain threshold is taken as a pseudo-label.
In the second and subsequent rounds, the window of M slices used for training is slid forward, and re-training is performed with M slices including pseudo-labels. Eq. (1) determines the number of epochs in each training step. We designed this equation to reduce the number of training epochs in proportion to the progress of processing, in order to prevent excessive learning from classification errors included in the pseudo-labels. We give details on the above process in Algorithm 1.

Algorithm 1. Sequential semi-supervised segmentation (4S).
for i = 1 to N − M do
    Train the model on (X_{i:i+M−1}, T_{i:i+M−1}) for Epochs(i) epochs
    Predict the label from X_{i+M}
    T_{i+M} ← predicted label
end for
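The loop of Algorithm 1 can be sketched in Python. The `fit`/`predict_proba` model interface and the `epoch_schedule` function are hypothetical stand-ins: the paper's Eq. (1) only specifies that the number of epochs decreases as processing progresses, so a simple decay is used here for illustration.

```python
import numpy as np

def epoch_schedule(step, initial_epochs=600):
    # Hypothetical stand-in for Eq. (1): epochs shrink as pseudo-labeling
    # progresses, limiting how strongly the model fits label errors.
    return max(1, initial_epochs // (step + 1))

def sequential_pseudo_label(model, images, labels, M=3, threshold=0.5):
    """4S sketch: `images` is a list of N slices; `labels` holds the M
    expert-annotated maps for images[0:M]. `model` is any segmenter
    exposing fit(x, t, epochs) and predict_proba(x) -> probability map."""
    labels = list(labels)                      # grows as pseudo-labels are added
    for step in range(len(images) - M):
        x_window = images[step:step + M]       # local window of M slices
        t_window = labels[step:step + M]       # true or pseudo labels
        model.fit(x_window, t_window, epochs=epoch_schedule(step))
        prob = model.predict_proba(images[step + M])
        labels.append((prob > threshold).astype(np.uint8))  # pseudo-label
    return labels
```

Note that the window always contains the M most recent label maps, so after M steps the model is trained purely on pseudo-labels.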

FCN-based neural network architecture
We use an FCN-based network architecture as the machine learning model in the proposed method. The FCN is an architecture that has no fully connected layers but consists only of convolution layers. It was proposed to solve the task of semantic segmentation (Long et al., 2015). FCN-based neural networks effectively perform pixel-wise classification, so they are capable of learning even with relatively small amounts of training data. In this study, we use the U-Net architecture (Ronneberger et al., 2015), one of the most commonly used networks for biomedical image segmentation tasks. The feature maps obtained in the convolution layers of the encoder part are directly concatenated with the decoder part through skip connections. By explicitly propagating the precise information obtained in the shallow layers to the decoder, elements such as positional information lost by the pooling layers can be restored, and more accurate segmentation can be performed. Note that the network used for segmentation can be arbitrarily selected because 4S is a framework. In Section 4.2, we give details on the U-Net architecture used in the experiment.
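The role of a skip connection can be illustrated at the shape level with a toy numpy sketch (this is not the actual U-Net, which uses learned convolutions; `unet_skip_sketch` is a hypothetical illustration): pooling discards spatial detail, and concatenating the encoder map with the upsampled decoder map makes that detail available to the decoder again.

```python
import numpy as np

def max_pool2x(x):
    # 2x2 max pooling on a single-channel feature map.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2x(x):
    # Nearest-neighbor upsampling back to the original resolution.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_skip_sketch(image):
    """One U-Net level at the shape level: the encoder map is stacked
    (channel-concatenated) with the upsampled decoder map, so detail
    lost by pooling is still visible to the decoder."""
    enc = image                           # encoder feature map (pre-pooling)
    bottleneck = max_pool2x(enc)          # coarse representation
    dec = upsample2x(bottleneck)          # decoder path restores resolution
    return np.stack([enc, dec], axis=0)   # skip connection: channel concat
```

The first channel of the output retains the full-resolution input exactly, which is precisely the information the plain decoder path has lost.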

Experiment
In this section, to verify the effectiveness and properties of the proposed method, we conducted experiments with multiple settings (Section 4.2) using the datasets described in Section 4.1. The evaluation metrics and detailed implementation are described in Sections 4.3 and 4.4, respectively.

ISBI 2012 dataset
The first dataset is a set of 30 sections from a serial-section transmission electron microscopy dataset of the Drosophila first instar larva ventral nerve cord, which was used in a competition held at ISBI 2012 (Arganda-Carreras et al., 2015). The task to be solved is to detect neural cell membranes. The size of each image is 512 × 512 pixels, and ground truth segmentation maps for cells and membranes are annotated. In addition to the training images, 30 test images are provided, and the participants of the competition can submit segmentation results for them online. However, since this study assumes a transductive setting, we used only the 30 training images, supposing that we have a small number of labels among them.

Japanese carpenter ant dataset
We used an unpublished dataset as the second one. This is a stack of serial block-face scanning electron microscopy images of the nestmate discriminant sensory elements in the Japanese carpenter ant, provided by the Ozaki Laboratory at the Department of Science, Kobe University (Takeichi et al., 2018). A total of 377 images have corresponding ground truth segmentation maps for cells made by experts. The size of the images is 2048 × 2048 pixels, and in our experiment, we cropped them to 512 × 512 pixels so that a sensory element is included. We sampled 100 consecutive slices as an experimental dataset and assumed that a small number of labels is included. Since this study focuses on the phase of cell segmentation, we treated it as a task of classifying cell regions versus others. Therefore, we set the teacher label as binary data, with pixels belonging to the cell regions as 1 and the rest as 0. In the following, we call this the "ANT dataset".

Cell tracking dataset
In addition to the above two datasets, we used a fluorescence microscopy image dataset of mouse stem cells (Bártová et al., 2011). It consists of 92 consecutive images. The task is to extract the region of each cell in order to track the multiple cells found in the images. The size of each image is 1024 × 1024 pixels, but in our experiments, we resized them to 512 × 512 pixels. This dataset differs from the above two in that the background of the cells of interest contains almost no noise. Therefore, it was used only to verify the basic performance of the proposed method. In the following, we call this the "CELL dataset".
The significant difference between the ANT and ISBI datasets is the change in appearance from the first slice to the last. The first dataset does not change much even by the last slice, but the second has very different first and last slices.

Experimental settings
To verify the effectiveness of the proposed method and analyze its properties, we conducted experiments from the aspects listed below.

Basic performance
To show the effectiveness of our proposed method, we compared it with supervised learning without pseudo-labeling as a baseline method.
To compare with the ideal situation where the accuracy of the pseudo-labels is perfect, we also performed supervised learning in which the labels used in each training step were the ground truth. In the following, we call these two comparison methods SL_init and SL_ideal, respectively. We used three initial labels in 4S. For SL_init, we used only the initial labels to infer all slices, and the number of epochs was made equal to the total number of epochs in the proposed method. For SL_ideal, we performed each training using the same method as the proposed method [the number of epochs at step t is defined by Eq. (1)]. To estimate the approximate computational cost, we also measured the computation time per epoch.

Effect of number and position of initial labels on performance
Our 4S assumes that the number of available teacher labels is small, but the value of M is arbitrary, and it is expected that the performance will change depending on the value.In addition, the position of the sequence of images chosen as the initial label will also have a significant impact on performance.To investigate these effects, we verified the performance when the value of M was changed from 1 to 5 and when the last three slices were used as initial labels.

Effect of transfer learning
The proposed method overwrites the model from the previous step while using only M local slices of the series for each training run. Therefore, 4S can be construed as effectively performing transfer learning of past knowledge. However, if the pseudo-labels generated at each step contain fatal errors, that information would also be transferred, which could seriously affect performance. To confirm the effect of transfer learning on 4S, we verified the performance when the parameters of the model were initialized before each training run.

Evaluation metrics
We evaluated the proposed method by calculating the precision, recall, and Dice similarity coefficient (DSC) from the ground truth and the inference result for each slice image. Each index is calculated as follows:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), DSC = 2TP / (2TP + FP + FN). (2)

Note that true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) are counted pixel-wise from the prediction result and the ground truth.
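Eq. (2) translates directly into code; a minimal numpy implementation for binary masks might look as follows (the guard clauses for empty masks are a common convention, not specified in the paper):

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise precision, recall, and DSC for binary masks (values 0/1)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()       # true positives
    fp = np.logical_and(pred, ~truth).sum()      # false positives
    fn = np.logical_and(~pred, truth).sum()      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    dsc = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return precision, recall, dsc
```

The DSC is the harmonic mean of precision and recall, which is why a large precision gain can outweigh a small recall loss in the reported results.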

Implementation details
Our learning model in 4S can be arbitrarily selected as long as it has the function of performing image segmentation. In this experiment, we used U-Net, which is widely used in many segmentation tasks (details are in Section 3.3). The details of the network are shown in Table 1. Note that, in addition to the structure proposed in the original paper (Ronneberger et al., 2015), we introduce batch normalization (Ioffe and Szegedy, 2015). The batch size was set to be equal to the value of M, and the number of initial epochs was set to 600 in all experiments. Given this number of epochs, the model was trained for 1309 epochs in total for the ISBI dataset and 1729 epochs for the ANT dataset [these numbers can be calculated by Eq. (1)]. Similarly, it was trained for 1681 epochs for the CELL dataset. We used these epoch counts to train SL_init. As an optimization method, we used Adam (Kingma and Ba, 2015) with alpha = 10^-4. Considering the binary classification of each pixel, we used the cross-entropy loss as the error function for training. The threshold of the probability value when generating the pseudo-labels was 0.5. All experiments were implemented with Python 3.6.8 and the Chainer library (Tokui et al., 2019), and processing was performed on an NVIDIA DGX Station. The source code used in our experiments will be made available at https://github.com/eichitakaya/Sequential-Semi-supervised-Segmentation in the future.
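The loss function and the pseudo-label thresholding described above can be sketched as follows. These are the standard formulations of binary cross-entropy and probability thresholding, not the authors' exact Chainer code:

```python
import numpy as np

def binary_cross_entropy(prob, target, eps=1e-7):
    # Mean pixel-wise cross-entropy for the binary membrane/cell task.
    # eps clipping avoids log(0); the exact clipping value is our choice.
    prob = np.clip(prob, eps, 1 - eps)
    return float(-np.mean(target * np.log(prob)
                          + (1 - target) * np.log(1 - prob)))

def make_pseudo_label(prob_map, threshold=0.5):
    # Pixels whose predicted probability exceeds the threshold become
    # positive pixels of the pseudo-label for the next training window.
    return (prob_map > threshold).astype(np.uint8)
```

With the paper's threshold of 0.5, this simply takes the most likely class per pixel; a higher threshold would make pseudo-labels sparser but more confident.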

Results
Table 2 shows the results regarding basic performance. Precision, recall, and DSC are the average values of the inference results over all slices. Each value is the average of 10 trials, and the corresponding standard deviation is also shown. On the ISBI and ANT datasets, 4S outperformed SL_init in precision but fell short in recall. Nevertheless, owing to the large improvement in precision, the DSC, which is the harmonic mean of both, improved. On the CELL dataset, 4S outperformed SL_init in both precision and recall, and the DSC improved. Regarding SL_ideal, the results exceeded those of 4S and SL_init for all metrics. Note that SL_ideal is set up under the assumption that pseudo-labels are obtained ideally. The computation time was 0.54 s per epoch using a single NVIDIA Tesla V100.
The per-slice inference results (DSC) for each method are shown in Fig. 2 and Fig. 3. These boxplots were drawn on the basis of the results of 10 trials. The performance of 4S and SL_init tended to decrease as the pseudo-labeling progressed, while for the ANT dataset the decrease was suppressed with 4S.
Qualitative results are shown in Figs. 4 and 5. As examples of actual output images, we show the inference results for the 30th slice of the ISBI dataset and the 100th slice of the ANT dataset. These are the final slices of each dataset. For the ISBI dataset, there were some points where neural regions were misrecognized as boundaries in the SL_init results. In contrast, no such false detections were observed with the proposed method, and the boundary lines appeared sharper. Regarding the ANT dataset, there were many areas where multiple nerves were connected due to false detection in the result of SL_init. Meanwhile, with the proposed method, the total neural area was estimated to be smaller than with the other methods, and the number of false detections was reduced. The red arrows indicate points where the processing performance of 4S was better than that of SL_init. For both datasets, some detection omissions that were not seen in the result of SL_init occurred. The blue arrows indicate these omissions.
Table 3 shows the results for the effect of the number and position of initial labels. Note that M = 3 is the same as 4S in Table 2. The DSC values were highest when M = 2 for the ISBI dataset and M = 4 for the ANT dataset (emphasized by boldface). In addition, 4S's high precision and low recall compared with SL_init in Table 2 appear to hold for any M.
Reverse (4S) and Reverse (SL_init) are the results of processing in the opposite direction, with the last three labels as the initial labels. For both datasets, Reverse (4S) outperformed Reverse (SL_init). Also, the performance of Reverse (SL_init) was lower than that of normal SL_init for both datasets (emphasized by boldface).
When we compare the 4S results, Reverse (4S) showed a slightly higher DSC for the ISBI dataset, and the normal 4S had a higher DSC for the ANT dataset.
Table 4 shows the results for the effect of transfer learning. When the model parameters were initialized at each step, the performance was obviously worse than when the parameters of the previous step were carried over. In particular, the decline was remarkable for the ANT dataset. The efficacy of transfer learning is emphasized in Table 4 using bold values.

Discussion
In the experiments done to verify the basic performance, the proposed method tended to improve precision significantly and decrease recall slightly. We observed this tendency regardless of the value and position of M. Therefore, we can conclude that 4S reduces the number of false detections. This can be attributed to the fact that supervised learning with local information allows inference to weigh the most recent slices heavily. To pursue higher performance, it may be useful to use models such as RNNs and 3D-CNNs that can capture changes along the Z-axis.
The decline in performance as processing progresses can be attributed to insufficient label propagation accuracy. It is worth noting that the performance does not necessarily decrease with distance from the initial slice; rather, there is some variation. It is also essential to clarify the reasons for this variation. Since SL_ideal assumes that the propagation of labels is perfect, the challenge for performance is how close we can get to its result.
The natural expectation is that a higher value of M should improve the performance, but it is worth noting that the DSC was not necessarily best when M = 5. The appropriate value of M varied from dataset to dataset, indicating that there is an appropriate setting for each. Meanwhile, changes in the position of the initial labels seem to have a more significant effect on performance than changes in their number. In an extreme case, it is possible to find the appropriate position in an exploratory manner, but the computational cost increases in proportion to the total number of slices. It is desirable to establish a method for identifying initial positions that maximize performance prior to training. For example, methods such as those used for query selection in active learning (Budd et al., 2019) would be helpful for identifying subsets that better represent the overall distribution of a dataset.
The results on transfer learning were contrary to our initial expectation that the propagation of errors due to transfer learning would negatively impact performance. Initializing the parameters after each pseudo-labeling step resulted in a significant decrease in performance. In our 4S, training with the true labels for the initial epochs seems to have been a good prior for the subsequent training with pseudo-labels. As models pretrained on ImageNet have been used in various tasks (Tan et al., 2018), transferring knowledge from different domains could be useful for 4S, but this remains a challenge for the future.
The most important objective of this study was to reduce the annotation burden on experts. The processing time of 4S is approximately 0.5 s per epoch, which is reasonable even when the number of slices is large, as in the ANT dataset. Considering that this work took human experts several months, the total cost of 4S, namely annotating a few slices plus the processing time, is far less. Although there is still room for improvement in terms of segmentation performance, further pursuit in the direction of this study would be worthwhile.

Conclusion
In this study, we proposed sequential semi-supervised segmentation (4S) to enable learning from small amounts of teacher data in deep learning-based serial electron microscopy image segmentation. 4S applies pseudo-labeling to all slices in a target series of images, taking a small number of consecutive labeled slices as input. We conducted experiments with two different datasets, assuming that only a small amount of teacher data was available.
The proposed method performed better than supervised learning-based segmentation. In particular, the propagation of true labels and pseudo-labels by transfer learning prevented false detections. However, we found that the performance of 4S is highly dependent on the number and position of the initial true labels.
Our future work will include the pursuit of higher accuracy through the selection of appropriate networks and the search for a method to determine the position and number of true labels. We will examine these issues and eventually develop the method into a general-purpose annotation tool for biomedical imaging.

Fig. 1 .
Fig. 1. Overview of the proposed method. A few labeled samples are used for the first training, and the trained model predicts the label of the next sample. Then, the model is retrained with the next three samples to make the next pseudo-label.

Fig. 2 .
Fig. 2. Comparison with the baseline method per slice (ISBI dataset). The boxplot was created from the results of 10 trials. The orange line represents the mean DSC of each slice for SL_ideal.

Fig. 3 .
Fig. 3. Comparison with the baseline method per slice (ANT dataset). The boxplot was created from the results of 10 trials. The orange line represents the mean DSC of each slice for SL_ideal.

Fig. 4 .
Fig. 4. Qualitative comparison of different learning methods for the ISBI dataset. (a) Raw image of the 30th slice, (b) label corresponding to the raw image, (c) result of 4S, (d) result of SL_init. Red and blue arrows indicate points where the processing performance of 4S is better or worse, respectively.

Fig. 5 .
Fig. 5. Qualitative comparison of different learning methods for the ANT dataset. (a) Raw image of the 100th slice, (b) label corresponding to the raw image, (c) result of 4S, (d) result of SL_init. Red and blue arrows indicate points where the processing performance of 4S is better or worse, respectively.

Table 1
U-Net architecture we used as the training model.

Table 2
Basic performance. Notable values are in bold.

Table 3
Effect of number and position of initial labels.

Table 4
Effect of transfer learning. The method with sequential initialization of parameters is denoted as "scratch."