A 3D Instance Segmentation Network for Synapse Reconstruction from Serial Electron Microscopy Images

Background: The synapse is the key site at which neurons communicate with each other. Synaptic plasticity plays a vital role in learning and memory. Thanks to the rapid development of electron microscopy (EM) technology, imaging synapses at nanometer scale has become possible. However, the automation and effectiveness of synapse detection algorithms are not yet satisfactory. The most commonly used method is a two-step solution, where binary segmentation masks are obtained first and reconstruction results are then generated by finding connected components. Results: In this paper, a novel 3D instance segmentation network that predicts synapses end to end is proposed. It is also shown, by visualizing the network layers, that the network exploits features consistent with the biological structures of synapses. Furthermore, our method was evaluated on two public datasets, and the experimental results demonstrate its effectiveness. Conclusion: The proposed method provides a fast and accurate solution for detecting synapses from serial-section EM images. Besides, a block-wise inference strategy that adapts well to large-scale EM images is introduced, which can help neuroscientists achieve labor-free analysis and quantification of synapses.


Background
As an important ultrastructure in the brain, the synapse has the function of information transmission and storage. Statistics on the size, morphology, distribution, number and other properties of synapses contribute to understanding their structural information and analyzing their plasticity. Synaptic abnormalities are associated with many diseases, such as autism and Alzheimer's disease [1]. Besides, synapses and the neurons they connect are key components of neural circuits in connectomics. Generally speaking, neurons with different properties and functions are considered as vertices, while the synapses connecting two neurons act as edges. Research shows that the ratio of the number of neurons to synapses is about one to one thousand [2]. In the face of such large amounts of data, fully automated synapse detection and segmentation algorithms are important and necessary.
It is widely known that a synapse is composed of a presynaptic structure, a postsynaptic structure and the synaptic cleft with a thickness of 30-60 nm. Because synapses can only be resolved at nanometer resolution, neuroscientists usually observe and image their ultrastructures with serial-section electron microscopy (ssEM). Hence, developing reliable and precise algorithms to detect synapses from serial EM images has attracted widespread attention in the fields of computer vision and neuroscience.
Substantial efforts have recently been put into developing specialized algorithms for accurate reconstruction of synapses. Typically, Morales et al. proposed a semiautomated method, which relied on pattern matching and required substantial user interaction [3]. Mishchenko et al. developed a synaptic cleft recognition algorithm to detect postsynaptic densities (PSDs) in serial block-face scanning electron microscopy (SBEM) image stacks [4]. Jagadeesh et al. put forward attribute-based descriptors for characterizing synaptic junctions in rabbit retinal tissue. Three attribute-based descriptors were designed to represent vesicles, synaptic clefts and ribbons, respectively. They then utilized maximally stable extremal regions (MSER) to locate synapses and trained a support vector machine (SVM) to classify them [5]. Navlakha et al. proposed a special tissue staining technique that highlights synapses only. Besides, they obtained synaptic candidate windows from a pre-segmentation result, designed a three-part feature for each candidate window, and finally built a synaptic classifier using a traditional machine learning framework [6].
As the above methods were tailored to specific imaging settings, Kreshuk et al. presented a general method for detecting and segmenting synapses in focused ion beam scanning electron microscope (FIB-SEM) image stacks. In this approach, 38 local texture features were first designed for each voxel, and a random forest classifier was then used to predict whether each voxel was synaptic or non-synaptic [7]. Inspired by Kreshuk's method, Becker et al. further developed this approach by designing local context features, which account for the asymmetric context information generated by the pre- and postsynaptic regions [8]. As the above approaches were only suitable for isotropic data, Kreshuk et al. subsequently proposed an automated approach for synapse segmentation in ssEM images with highly anisotropic axial and lateral resolution. Their pipeline consisted of three steps: classification of pixels based on 3D features, segmentation with an Ising-model Markov random field (MRF), and classification based on object-level features [9].
Recent years have witnessed the great success of convolutional neural networks (CNNs) in the domain of biocomputing, and there has been much representative work on synapse detection in EM data. Typically, Roncal et al. designed VESICLE-CNN to segment the synaptic cleft by judging the category of each pixel based on the patch centered on it. The introduction of the CNN greatly improved the performance at the cost of increased time consumption [10]. Dorkenwald et al. created the SyConn framework, which segments synaptic connections through a recursive 3D CNN model [11]. Staffler et al. proposed SynEM, a new approach for automated detection of synapses from conventionally en-bloc stained EM image stacks; SynEM classifies borders between neuronal processes as synaptic or nonsynaptic based on a dense segmentation [12]. Xiao et al. put forward a fully automated method that relies on deep learning to reconstruct synapses in EM images. The main idea is to first detect synapses with the Faster R-CNN [13] and then use GrabCut to segment the synaptic clefts [14].
Although the detection and segmentation of synapses in mammals have achieved satisfactory results, the performance of these methods may not be as good when applied to insect brains. For instance, the synapses in the drosophila brain are hard to identify due to their polyadic nature [15]. It is therefore essential to propose a more general pipeline with stronger characterization ability to detect and segment synapses from ssEM images.
In this paper, a novel 3D instance segmentation network for synaptic cleft detection and segmentation is introduced. It can be trained end to end, producing 3D detection and segmentation results in parallel. A block-wise algorithm is developed to achieve accurate detection on large-scale datasets. The visualized feature maps of the learned network are consistent with the biological structure of synapses. Besides, the experimental results on two public datasets show promising performance.

Motivation
There are currently two fundamental ways to reconstruct synapses from EM stacks using machine learning algorithms. Specifically, one way identifies 3D synapses by classifying neuronal interfaces as synaptic or non-synaptic [12,16], while the other is a two-step method, which first obtains binary synaptic masks with a fully convolutional network (FCN) or an instance segmentation network, and then detects 3D synapses by finding connected components [11,14,17]. It should be noted that the former method relies heavily on precise neuronal pre-segmentation results. Although it makes it easy to obtain neural circuits by combining synapses with volume segmentation results, this method is unsuitable for synapse analysis tasks due to its huge computational cost. The two-step method directly predicts synapses by applying CNNs, thus enhancing the detection performance to a great extent. However, the disconnect between the mask prediction step and the connection step is its primary disadvantage.
In contrast to the existing methods, a novel 3D instance segmentation network that simultaneously explores inter- and intra-slice features and outputs 3D synapses in one step is proposed. The differences between the traditional method and our proposed method are illustrated in Figure 1, where Figure 1A depicts the two separate steps used to reconstruct synapses in previous methods, while Figure 1B presents the 3D synapses that our proposed network directly predicts.

Network architecture
The architecture of our proposed network is illustrated in Figure 2, and the concrete parameters of all layers are shown in Table 5. The input of the 3D instance segmentation network is a 3D patch with a size of 512 × 512 × 32, and the outputs are composed of the detection and segmentation results, i.e., 3D bounding boxes and 3D binary masks of the detected synapses. With this scheme, the 3D instance masks can be predicted directly from raw image stacks, which means the whole process can be trained end to end.
Specifically, the proposed network contains two detectors (applied in two stages) and one mask predictor, all of which receive shared feature maps from the backbone network, which is modified from ResNet50 [18]. Due to the high anisotropy of our test datasets (the anisotropy factor is 25 or 10), some parameters of the original ResNet50 were revised to better fit the data. The kernel size and stride were set to 7 × 7 × 5 and 2 × 2 × 1 in the first convolution layer to preserve information along the z-direction. This layer is followed by a 3D max-pooling layer with a 3 × 3 × 3 pool size and a 2 × 2 × 2 stride. All parameters of the other layers are the same as the original ones, just changed from the 2D version to the 3D version.
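For illustration, the modified stem can be sketched in Keras as below. Only the kernel and stride shapes come from the text; the channel count, layer names and axis ordering (lateral axes first, anisotropic z-axis last) are assumptions of this sketch.

```python
# Minimal sketch of the anisotropy-adapted backbone stem (assumed names).
import tensorflow as tf
from tensorflow.keras import layers

def anisotropic_stem(input_shape=(512, 512, 32, 1)):
    inp = layers.Input(shape=input_shape)
    # Larger kernel and unit stride along z preserve information in the
    # low-resolution axial direction, per the text.
    x = layers.Conv3D(64, kernel_size=(7, 7, 5), strides=(2, 2, 1),
                      padding="same", activation="relu")(inp)
    x = layers.MaxPool3D(pool_size=(3, 3, 3), strides=(2, 2, 2),
                         padding="same")(x)
    return tf.keras.Model(inp, x)

model = anisotropic_stem()
# 512/2/2 = 128 laterally, 32/1/2 = 16 axially
print(model.output_shape)  # (None, 128, 128, 16, 64)
```

Downsampling the z-axis only once in the stem keeps the axial extent usable for the later RoI features.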
The first-stage detector is designed to generate candidate bounding boxes and ensure a detection recall rate that is as high as possible. To locate 3D synapses, two output branches are designed for bounding boxes: a regression branch and a classification branch. At each location of the shared feature map, two anchors at different scales, 64 × 64 × 6 and 128 × 128 × 6, are predefined, and each anchor scale has three aspect ratios, corresponding to 1:1, 2:1 and 1:2. Hence, there are 2 × 3 anchors at each point of the feature maps. The classification branch categorizes each predefined anchor as synaptic or non-synaptic, and the regression branch regresses the deviation relative to the ground truth. To further optimize the detection precision rate, the top n outputs of the first detector are passed to the second detector. An RoI (region of interest) Align layer [19] is utilized to produce fixed-size (7 × 7 × 3) features from the shared backbone features.
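The anchor grid described above can be sketched as follows. The two scales and three aspect ratios come from the text; the feature-map stride, the (cy, cx, cz, h, w, c) box parameterization and the area-preserving ratio adjustment are assumptions of this sketch.

```python
# Sketch of the two-scale, three-ratio anchor grid (assumed parameterization).
import numpy as np

def generate_anchors(feat_h, feat_w, feat_z, stride=(16, 16, 1)):
    """Return (feat_h*feat_w*feat_z*6, 6) anchors as (cy, cx, cz, h, w, c)."""
    scales = [(64, 64, 6), (128, 128, 6)]   # two predefined scales
    ratios = [1.0, 2.0, 0.5]                # h:w aspect ratios 1:1, 2:1, 1:2
    shapes = []
    for (h, w, c) in scales:
        for r in ratios:
            # keep the lateral area roughly constant across ratios
            shapes.append((h * np.sqrt(r), w / np.sqrt(r), c))
    shapes = np.array(shapes)               # (6, 3)

    ys, xs, zs = np.meshgrid(np.arange(feat_h), np.arange(feat_w),
                             np.arange(feat_z), indexing="ij")
    centers = np.stack([ys * stride[0], xs * stride[1], zs * stride[2]],
                       axis=-1).reshape(-1, 1, 3)                 # (L, 1, 3)
    sizes = np.broadcast_to(shapes, (centers.shape[0],) + shapes.shape)
    anchors = np.concatenate(
        [np.broadcast_to(centers, sizes.shape), sizes], axis=-1)
    return anchors.reshape(-1, 6)

a = generate_anchors(4, 4, 2)
print(a.shape)  # (192, 6): 4*4*2 locations x 6 anchors each
```

Note that, as in the text, the third (z) dimension of every anchor stays fixed at 6.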
Similar to the first detector, the second detector performs classification and regression tasks. In order to feed features to fully connected layers, a 7 × 7 × 3 convolutional layer is first employed to obtain a 1024-D feature. It is followed by two fully connected layers, which map the 1024-D feature to the number of classes (2 in our case) and to a 2 × 6-D output (the corner coordinates of the 3D bounding boxes), respectively.
Simultaneously, the mask predictor takes the RoI features (14 × 14 × 6) as input and predicts the segmentation result for every detected RoI. Concretely, this subnetwork contains two 3 × 3 × 3 convolutional layers with 'same' padding and one 2 × 2 × 1 transposed convolution layer with a stride of 2 × 2 × 1, which upsamples the mask to a shape of 28 × 28 × 6. At last, a 1 × 1 × 1 convolutional layer is applied to map the number of feature channels to the number of classes (2 in our case). During training, for the sake of simplicity, the mask predictor is connected to the RoI Align layer to learn the morphological features of synapses. In the inference phase, it is connected to the second detector, whose detection results are more precise.
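The mask predictor above can be sketched in Keras as below. The kernel and stride shapes follow the text; the channel counts, input feature depth and activations are assumptions of this sketch.

```python
# Sketch of the mask predictor: 14x14x6 RoI features -> 28x28x6 class masks.
import tensorflow as tf
from tensorflow.keras import layers

def mask_head(num_classes=2, roi_shape=(14, 14, 6, 256)):
    inp = layers.Input(shape=roi_shape)
    x = layers.Conv3D(256, (3, 3, 3), padding="same", activation="relu")(inp)
    x = layers.Conv3D(256, (3, 3, 3), padding="same", activation="relu")(x)
    # upsample laterally only, keeping the 6-slice depth unchanged
    x = layers.Conv3DTranspose(256, (2, 2, 1), strides=(2, 2, 1),
                               activation="relu")(x)
    out = layers.Conv3D(num_classes, (1, 1, 1), activation="sigmoid")(x)
    return tf.keras.Model(inp, out)

head = mask_head()
print(head.output_shape)  # (None, 28, 28, 6, 2)
```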

Loss function
The loss function of the 3D instance segmentation network consists of three parts, optimizing the first detector, the second detector and the mask predictor, respectively. Letting W and B denote the convolutional kernel set and the bias set, the total loss function for an image stack can be defined as follows:

L(W, B) = (1/N) Σᵢ [L_cls(y¹_p, y¹_t) + smooth_L1(b_p − b_t)] + (1/n) Σⱼ [L_cls(y²_p, y²_t) + smooth_L1(b_p − b_t)] + (1/M) Σₘ L_mask(k_p, k_t)

where N, n and M are the numbers of total anchors, RoIs and pixels of the predicted masks, respectively. y¹_t and y¹_p represent the ground-truth label and the predicted label of classification in the first detector; y²_t and y²_p refer to the second detector. Similarly, b_t and b_p are the ground-truth and predicted 3D bounding boxes (y, x, z, h, w, c), while k_t and k_p indicate the ground-truth and predicted 3D masks. Besides, smooth_L1 denotes the smooth L1 loss, which can be formulated as:

smooth_L1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise.

Block-wise inference strategy

Unlike the training process, which runs on small patches, the inference process needs to proceed on large-scale data. Due to limited GPU memory, inference runs in a block-wise way. Specifically, the original EM stacks are first decomposed into small blocks (512 × 512 × 32) with overlaps of 128 × 128 × 6 to reduce the boundary effect; this overlap size can cover most of the synapses in our datasets. Then, the trained network predicts the instance segmentation masks of synapses for each block. Next, the small blocks are fused back into a stack of the original size. The information in the overlapping areas is used to match neighboring pairs along the three directions: for any two segments from neighboring blocks, if the overlap ratio for both exceeds 0.5, they are merged into one. Algorithm 2 sketches this pair-wise block matching method. As shown in Figure 4, first, all z-blocks, which recover the scale of the z-direction only, are obtained; then, all z-blocks are merged into xz-blocks along the x-direction. At last, all xz-blocks are matched successively to produce the final results.
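As a standalone sketch, the standard smooth L1 loss (quadratic below 1, linear above) can be written as:

```python
# The smooth L1 regression term of the loss, as a standalone numpy function.
import numpy as np

def smooth_l1(x):
    x = np.asarray(x, dtype=float)
    # 0.5 * x^2 inside [-1, 1], |x| - 0.5 outside
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

print(smooth_l1([-2.0, -0.5, 0.0, 0.5, 2.0]))
# [1.5, 0.125, 0., 0.125, 1.5]
```

The two branches join smoothly at |x| = 1, which keeps gradients bounded for large regression errors.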
The overall block-wise inference strategy on large-scale data is described in Algorithm 2.
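The pair-matching step of the fusion can be sketched as follows. Following the algorithm, two segments from neighboring blocks are merged when the overlap ratio on each side exceeds 0.5; the function and variable names here are assumptions of this sketch.

```python
# Minimal sketch of overlap-based label matching between neighboring blocks.
import numpy as np

def match_labels(overlap1, overlap2, thresh=0.5):
    """overlap1/overlap2: integer label volumes of the shared overlap region
    from two neighboring blocks (0 = background).
    Returns {label_in_block2: label_in_block1} for segments to be merged."""
    mapping = {}
    for i in np.unique(overlap1):
        if i == 0:
            continue
        mask_i = overlap1 == i
        for j in np.unique(overlap2[mask_i]):
            if j == 0:
                continue
            mask_j = overlap2 == j
            inter = np.logical_and(mask_i, mask_j).sum()
            # overlap ratio on each side (Ratio_i, Ratio_j in Algorithm 2)
            if inter / mask_i.sum() > thresh and inter / mask_j.sum() > thresh:
                mapping[int(j)] = int(i)   # assign label i to segment j
    return mapping

a = np.zeros((4, 4, 2), int); a[:2, :2] = 1
b = np.zeros((4, 4, 2), int); b[:2, :2] = 7
print(match_labels(a, b))  # {7: 1}
```

Requiring the threshold on both sides prevents a large segment from absorbing a small one that only grazes it.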

Results
In this section, our proposed network is compared with other state-of-the-art networks on two public datasets to demonstrate its effectiveness. Besides, by visualizing the feature maps of the trained network, we found that the network accurately exploits the biological structural characteristics of synapses.
[Algorithm 2, pair block matching (excerpt): extract the overlapping regions (overlap1, overlap2) according to the overlap size p and direction d; compute the volume of each segmentation i in overlap1 and each segmentation j in overlap2, together with their IoU; if both overlap ratios Ratio_i and Ratio_j exceed 0.5, assign the label of i to segmentation j in block2.]

Experimental data and setup

Our proposed method was evaluated on two different datasets (the SEM and SNEMI datasets) to prove its generality. The SEM dataset is from the mouse auditory cortex imaged by scanning electron microscopy with a voxel resolution of 2 nm × 2 nm × 50 nm. Two annotators annotated all the synaptic clefts in 178 slices with cross validation, where each slice has a size of 8576 × 7616; the data were then split into training and testing sets. The SNEMI dataset is from a public challenge aiming at the automatic segmentation of neurites in 3D. It also comes from the mouse cortex imaged by scanning electron microscopy, with a voxel resolution of 6 nm × 6 nm × 30 nm, and consists of a training volume and a testing volume of equal size, 1024 × 1024 × 100. The EM images and corresponding ground truth are presented in Figure 3.
The network was implemented using TensorFlow and Keras. All training and testing were performed on a server with one NVIDIA Tesla V100 GPU. The network was optimized using stochastic gradient descent (the learning rate is 0.001 and the momentum is 0.9). To augment the training dataset and avoid over-fitting, random rotation (90°, 180°, 270°), random flipping (up/down, left/right) and Gaussian noise were employed during training.
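The augmentations above can be sketched with numpy as below; the noise standard deviation and flip probabilities are assumptions of this sketch.

```python
# Sketch of the training augmentations applied to an (H, W, Z) patch.
import numpy as np

def augment(patch, rng):
    k = rng.integers(0, 4)                    # rotation by 0/90/180/270 deg
    patch = np.rot90(patch, k, axes=(0, 1))   # rotate in the lateral plane
    if rng.random() < 0.5:
        patch = patch[::-1, :, :]             # up/down flip
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]             # left/right flip
    # additive Gaussian noise (sigma is an assumed value)
    patch = patch + rng.normal(0.0, 0.01, patch.shape)
    return patch

rng = np.random.default_rng(0)
out = augment(np.zeros((512, 512, 32)), rng)
print(out.shape)  # (512, 512, 32)
```

Rotations and flips are restricted to the lateral plane, consistent with the anisotropic z-axis of the data.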

Biology structure consistent features
To identify a synapse in EM images, three co-occurring biological features must be present: the synaptic vesicles at the presynaptic site, the synaptic cleft, and the 20-30 nm PSD at the postsynaptic site. Researchers have designed hand-crafted features according to these unique biological structures to learn an interface classifier [12], which yields promising performance. Specifically, for each interface between neurites, 7 subvolumes are defined to capture the different features at the border and on the two sides (Figure 5).
Can our proposed 3D instance segmentation network learn the coincident occurrence of these regional features? To verify this hypothesis, the CNN layers were visualized using Gradient-weighted Class Activation Mapping (Grad-CAM) [20]. Figure 6 depicts the heat maps of four sequential layers generated by Grad-CAM. The results demonstrate that the model identifies synapses by analysing the synaptic cleft and the presynaptic and postsynaptic voxels; the coincident occurrence of the three unique structures makes a detection successful. Besides, the most important region is the cleft, with importance decreasing gradually from the center outward.
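A minimal Grad-CAM sketch for a 3D Keras classifier is shown below, following [20]: gradients of the class score with respect to a chosen convolutional layer's activations are pooled into per-channel weights, whose weighted sum gives the heat map. The model, layer name and class index are assumptions of this sketch, not the paper's exact setup.

```python
# Sketch of Grad-CAM for a 3D Keras model (assumed model/layer names).
import tensorflow as tf

def grad_cam_3d(model, volume, layer_name, class_index=1):
    sub = tf.keras.Model(model.inputs,
                         [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        feats, preds = sub(volume[tf.newaxis])
        score = preds[0, class_index]
    grads = tape.gradient(score, feats)
    # global-average-pool the gradients -> one importance weight per channel
    weights = tf.reduce_mean(grads, axis=(1, 2, 3), keepdims=True)
    cam = tf.nn.relu(tf.reduce_sum(weights * feats, axis=-1))[0]
    return cam / (tf.reduce_max(cam) + 1e-8)   # normalised heat map
```

The resulting volume has the spatial shape of the chosen layer and can be upsampled onto the input stack for visualization.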

State-of-the-art methods
The U-Net [21] is a commonly used baseline in biomedical image segmentation, and many improved versions based on the original U-Net also achieve good results. The 3D U-Net [22] introduces 3D convolution operations into the original 2D U-Net, which makes it possible to explore the feature space in the x-y, x-z and y-z directions, leading to a significant improvement in the learning and fitting ability of the network. Mask R-CNN [19] is the baseline instance segmentation network, which determines the locations, classes and segmentation masks of all instances in one step. It is easy to extend to other tasks and has been utilized in synapse segmentation tasks.

Detection performance
To compare with other methods quantitatively, the detection results were evaluated in terms of three fundamental performance indicators: precision, recall and F1-score. They can be formulated as follows:

precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2 × precision × recall / (precision + recall),

where TP, FP and FN are the numbers of true positives, false positives and false negatives. When evaluating detection performance, a detection box is considered as one sample. If the IoU (intersection over union) between a predicted box and a ground-truth box is greater than 23%, the predicted box is counted as positive. These metrics measure the performance at the object level.
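These metrics can be computed directly from the TP/FP/FN counts (F1, reported in the tables, is the harmonic mean of precision and recall):

```python
# Precision, recall and F1 from detection counts.
def detection_metrics(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(detection_metrics(tp=40, fp=10, fn=20))
# approximately (0.8, 0.667, 0.727)
```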
As presented in Table 3 and Table 4, our method achieves the highest precision and F1-score, showing robust detection performance. The relatively low recall rate may be caused by the insufficient search space of the anchors: to reduce the complexity of the network, only two anchor scales are defined, and the scale of the third dimension is fixed.

Segmentation performance
For segmentation performance, we do not adopt the common metrics used in semantic segmentation, which measure whether each voxel is classified correctly: synapses are small, and the manual annotations of synapse boundaries are not always accurate. Instead, inspired by the COCO instance segmentation challenge, we compute precision and recall at the instance level. An instance is considered positive when its segmentation-level IoU exceeds 30%. The evaluation is performed in 3D.
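The instance-level criterion can be sketched as a voxel-wise IoU over 3D masks:

```python
# Voxel-wise IoU between a predicted and a ground-truth 3D instance mask.
import numpy as np

def instance_iou(pred_mask, gt_mask):
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 0.0

pred = np.zeros((8, 8, 4), bool); pred[:4, :4, :] = True
gt = np.zeros((8, 8, 4), bool); gt[2:6, :4, :] = True
iou = instance_iou(pred, gt)
print(iou, iou > 0.3)  # 0.333..., True -> counted as a positive instance
```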
As presented in Table 1 and Table 2, our method yields an F1-score of 0.5525 on the SEM dataset and 0.3745 on the SNEMI dataset, both of which are the highest scores among all methods. Notably, no post-processing is applied for any method.

Qualitative comparisons
To qualitatively illustrate the effectiveness of the proposed method, the results on the SEM and SNEMI datasets are shown in Figure 7 and Figure 8, respectively. The ground-truth or detected pixels are marked with colors, and the same synapse is labeled with the same color. Due to the imbalance between positive and negative samples, the U-Net predicts few synapses in 2D; the accuracy of the 3D connected components is affected accordingly, which causes poor detection and segmentation performance. The 3D U-Net exploits 3D features, which improves the learning ability dramatically; however, false detections cannot be avoided because of the lack of a detection module. Mask R-CNN achieves a low missed-detection rate but a high false-detection rate, because it relies only on the information of one layer rather than sequential information. Our proposed method combines 3D convolutions with the instance segmentation network, yielding state-of-the-art performance: 3D synapses are detected and segmented accurately.

Discussion
Though our method yields competitive results on the task of locating synapses in ssEM images, there is still much room for improvement. In this section, the main advantages and disadvantages are discussed, as well as future directions in this field.

Advantages and disadvantages
In contrast to the two-step methods, our proposed network outputs the 3D instance segmentation results of synapses directly, without the post-processing step of computing connected components. As the process of finding connected components may introduce errors, our end-to-end approach reduces this possibility as much as possible. Besides, our network can learn in-plane and cross-section features jointly by applying 3D operations, including 3D convolution, 3D max-pooling and 3D upsampling. Given that the serial sections are aligned, this 3D contextual information is crucial for detecting ultrastructures in ssEM images.
The main disadvantage is the high complexity of the neural network due to the use of 3D convolutions and the second detector: the training and testing time increases significantly. Correspondingly, if the GPU memory budget remains the same, the spatial size of the inputs generally shrinks to half that of a 2D network, which means the boundary effect becomes more severe owing to the lack of surrounding information. Another problem is the shift error in the segmentation mask caused by the RoI Align layer, which adopts bilinear interpolation instead of the original quantization operations in order to address the misalignment problem. However, synapses are very small, and minor adjustments such as resampling have a large effect on the final result.

Future directions
To address the above problems, we consider three future directions that could further improve the performance. Firstly, a lightweight network could be designed by borrowing the principle of depthwise separable convolution. Secondly, the current fusion strategy for overlapping boundaries is simple summation; in the future, more sophisticated and effective algorithms, such as Gaussian-weighted fusion, could be designed. Thirdly, panoptic segmentation [23], a hot topic in the field of computer vision, achieves instance segmentation and semantic segmentation in one framework; to overcome the misalignment problem, the idea of panoptic segmentation could be introduced to avoid the crop operation by incorporating a fully convolutional network.

Conclusions
In this paper, a novel 3D instance segmentation network is put forward to detect synaptic clefts in ssEM images. The proposed network contains two detectors and one mask predictor, which are adopted to identify the locations and segmentation masks of synapses. The end-to-end training fashion facilitates the fitting capability of the network and significantly enhances its performance. The visualization of the network layers demonstrates that the network learns features consistent with the unique structure of synapses. Compared with the common two-step methods, our approach generates all 3D synapses in one step. Besides, the experimental results on the public datasets show that our proposed method achieves state-of-the-art performance.

Availability of data and materials
The data and source code in this paper are available upon request.

Figure 1
The differences between existing methods (A) and our proposed method (B). In previous methods, binary masks or 2D instances on each layer are predicted first, and the connected components are then found to generate 3D synapses. Our proposed 3D instance segmentation network directly produces 3D instances by exploring the 3D feature space.

Figure 2
The architecture of the proposed network. The orange and green blocks represent the feature maps of the contracting path and the expansive path in the backbone network, respectively. Blue blocks indicate the feature maps of the detectors and the mask predictor, while the numbers below the blocks are the shapes of the corresponding feature maps.

Figure 4
The overall diagram of the block-wise fusion strategy. Firstly, the EM image stack is decomposed into small blocks with overlaps; different block colors indicate different cropping directions. Then, the CNN runs forward to predict 3D synapses for each block. To recover the original scale, the fusion is performed in three steps corresponding to the three axes, in the order axis-z, axis-x and axis-y. The arrows represent merge directions.

Figure 5
Subvolumes of hand-crafted features defined in [12]. The regions correspond exactly to the presynaptic region, cleft and postsynaptic region of a synapse. Different colors represent different regions, which are used to compute features and related statistics.

[Algorithm 2, block-wise inference (excerpt): crop the image block b_ijt from the original stack according to the coordinates (start_i, start_j, start_t, end_i, end_j, end_t) of its 3D bounding box, then obtain the segmentation block o_ijt by feeding b_ijt into the 3D instance segmentation network.]