Deep learning based domain adaptation for mitochondria segmentation on EM volumes

Abstract
Background and Objective: Accurate segmentation of electron microscopy (EM) volumes of the brain is essential to characterize neuronal structures at a cell or organelle level. While supervised deep learning methods have led to major breakthroughs in that direction during the past years, they usually require large amounts of annotated data to be trained, and perform poorly on other data acquired under similar experimental and imaging conditions. This is a problem known as domain adaptation, since models that learned from a sample distribution (or source domain) struggle to maintain their performance on samples extracted from a different distribution or target domain. In this work, we address the complex case of deep learning based domain adaptation for mitochondria segmentation across EM datasets from different tissues and species.
Methods: We present three unsupervised domain adaptation strategies to improve mitochondria segmentation in the target domain based on (1) state-of-the-art style transfer between images of both domains; (2) self-supervised learning to pre-train a model using unlabeled source and target images, and then fine-tune it only with the source labels; and (3) multi-task neural network architectures trained end-to-end with both labeled and unlabeled images. Additionally, to ensure good generalization in our models, we propose a new training stopping criterion based on morphological priors obtained exclusively in the source domain. The code and its documentation are publicly available at https://github.com/danifranco/EM_domain_adaptation.
Results: We carried out all possible cross-dataset experiments using three publicly available EM datasets. We evaluated our proposed strategies and those of others based on the mitochondria semantic labels predicted on the target datasets.
Conclusions: The methods introduced here outperform the baseline methods and compare favorably to the state of the art. In the absence of validation labels, monitoring our proposed morphology-based metric is an intuitive and effective way to stop the training process and select, on average, optimal models.

Introduction
Supervised learning has achieved great success in computer vision, leading to the development of robust algorithms that have been successfully applied in diverse research areas. The generalization capability and reliability of these algorithms are based on the assumption that the data used to train them and the data used to test them are drawn from the same distribution or domain. Thus, when the training data is not representative enough of the target population, there is a drop in the algorithm's performance [1]. This performance gap is highly significant when the data acquisition changes (e.g., protocol or instrument), even for a similar target domain. In the particular case of biomedical imaging, data distributions are highly biased due to the variety of acquisition techniques and protocols. Therefore, a significant number of annotations is usually required to ensure a good representation of the population.
Nevertheless, collecting and annotating these datasets is extremely expensive in both time and human resources [2]. For that reason, the field of domain adaptation has emerged to tackle both issues: the reduction of the domain gap difference and the generation of annotated data. The purpose of domain adaptation is to learn from labeled data in a source domain to perform well on a different, but related target domain without any annotation [3].
Aiming to reduce source and target domain dissimilarity, many methods have been proposed to create synthetic source images, and therefore, increase the heterogeneity of the data [4]. Some of these approaches generate new images from random noise without any other conditional information for Computed Tomography (CT) data [5,6], Magnetic Resonance (MR) [6,7,8] or chest X-rays [9,10]. Other methods of synthetic data generation aim to create new training samples using target domain samples and labeled source domain knowledge [3]. A large amount of this cross-modality synthesis work has been proposed for adapting MR data to CT [11,12], CT to MR [13,14] and MR to Positron Emission Tomography (PET) [15,16].
Additionally, image generation can be constrained by the appearance of the anatomical structures and segmentation maps. Many approaches have been presented in the literature that generate image-mask pairs, for instance, implementing domain adaptation from CT to MR [17], generating synthetic samples to solve a segmentation task [18,19,20,21] or for one-shot segmentation [22,23,24].
In the particular case of Electron Microscopy (EM) volumes of the brain, accurate segmentation is essential to characterize the neural structures present in the volume. Several recent works use domain adaptation to segment neuronal structures [25,26,27], vesicles [28], mitochondria [29,30,31,32] and whole-cell organelles [33]. For the specific task of mitochondria segmentation, domain adaptation methods have been introduced to handle the limited availability of labeled data [34,35,36].
In this work, we address the complex case of domain adaptation for mitochondria segmentation across EM datasets from different tissues and species. We assume the absence of target domain annotations to simulate a real scenario. More specifically, we compare three deep learning based strategies to improve mitochondria segmentation in the target dataset based on 1) style transfer between domains, 2) self-supervised learning, and 3) multi-task neural network architectures. To demonstrate the potential of these three strategies, we conducted a thorough cross-domain study on three publicly available datasets for mitochondria segmentation. The same initial conditions and basic architectural design choices are maintained across all strategies, which are also compared with the same supervised baseline methods.
In brief, our main contributions are as follows:
- We present state-of-the-art style transfer as a solution for domain adaptation for mitochondria segmentation in EM volumes.
- We introduce a self-supervised approach based on a pre-training step using both datasets without annotations and a final fine-tuning with only source annotations.
- We perform a cross-dataset analysis of state-of-the-art deep multi-task networks for EM datasets in the context of domain adaptation and propose a novel architecture based on one of them.
- As a stopping criterion, we propose a new metric to ensure a good generalization towards the target domain based on the morphology of the resulting mitochondria segmentation.

Related work
The work presented here focuses on domain adaptation and style transfer methods for EM image analysis. By domain and style, we refer to the intrinsic feature space and characteristics of a particular dataset and the distribution from where it is drawn. Domain adaptation can be seen as a particular type of transfer learning where instead of trying to transfer the knowledge from task A in domain A to task B in domain B, the tasks are kept the same while the domains are different. On the other hand, style transfer is mainly focused on adapting the domain from one dataset to another. Existing domain adaptation methods can be divided depending on the label availability during the training process. Thus, they can be supervised, if both source and target domain labels are available; semi-supervised, if source labels and some target labels are available; and unsupervised, if only source labels are available while target data is entirely unlabeled [37]. Moreover, methods can also be categorized based on the learning model used, i.e., either shallow (usually relying on predefined image features and traditional machine learning models) or deep (if they use deep learning architectures). In this paper, we focus on the strategy known as deep unsupervised domain adaptation.
One particular way of addressing this problem is by style transfer. For instance, the Cycle Generative Adversarial Networks (CycleGAN) [38] approach is becoming an effective method in medical image synthesis. Many variations have been presented addressing cross-domain style transfer problems targeting different source and target data types, such as from MR to CT [39,40,17,41], transferring the stain style for histopathological images [42,43,44] or creating target-style data pairs, image and mask, without using any annotation [45,46,47].
More recent approaches to address style transfer exploit contrastive learning [48], where models are trained without labels to learn which data samples are similar or different. Similarity is defined in an unsupervised way, by using different data augmentation techniques to create similar examples to the original image and then maximizing a similarity function (e.g., mutual information) during training. Following this idea, Contrastive Unpaired Translation (CUT) [49] compares unpaired image patches and associates similar patches to each other while disassociating them from others. This way, the model learns to pay attention to the commonalities between domains. For instance, a patch containing a mitochondrion will have a high similarity with a patch in a different tissue containing mitochondria, or at least a higher value than if it is compared with a patch showing other organelles. Thus, a generator learns to change the style of input images to match a target style.
Another way to address this domain problem is by using self-supervised learning (SSL), which consists of establishing a pretext task on related unlabeled images, which need not be annotated by an expert, to initially train the model. Then, the model is used as the starting training point for the downstream (segmentation) task. The main advantage is that the pretext examples (or pseudo-labels) are automatically generated from existing raw data, so they do not depend on the number of available expert-reviewed images. Therefore, during the pre-training step, models can leverage all available images to learn useful feature representations.
In the computer vision literature on natural images, the usefulness of this self-supervised pre-training step has been widely explored for several pretext tasks, namely the coloring of a grayscale image [48,50,51], the restoration of a distorted or deteriorated image [52,53,54,55], the prediction of the transformation applied to an image [56], or even the re-ordering of pieces or frames of images [57,58] and videos [59]. However, there is hardly any work applying this methodology to microscopy images. The published works mostly focus on reducing the number of annotated images required for training thanks to a good network initialization achieved by pre-training with denoising [60,61,62,63], jigsaw solving [64,65] and image restoration [66].
Finally, another approach is based on multi-task deep neural network architectures that receive both source and target samples as input. In this case, apart from solving the downstream task for the source (labeled) data, the model aims to exploit the features of the target domain to learn the feature shift between domains. Among these types of unsupervised and semi-supervised domain adaptation methods, we find the Y-Net [35], used for the segmentation of EM images. Its architecture consists of an encoder-decoder such as a U-Net [67], coupled with a second decoder in an autoencoder strategy. While one decoder is trained for segmentation, using the images with available labels, the second decoder is trained to reconstruct all available images, including the unlabeled ones, in an unsupervised manner. Since both decoders share the same encoder, the features learned by the autoencoder are used for segmentation too. Consequently, the model works with unlabeled (target domain) data features. Following this idea, in combination with adversarial losses, similar models such as Domain Adaptive Multi-Task Learning network (DAMT-Net) [36] have been proposed. This network builds on top of the Y-Net architecture and adds two discriminators during training, following a Generative Adversarial Network (GAN) approach. The first discriminator uses the predicted segmentation, while the second discriminator uses the final feature maps of the network.

Methods
To address the problem of domain adaptation between different EM datasets, we present different approaches that reduce the domain shift. Firstly, a cross-domain baseline is introduced using stable state-of-the-art models [32] trained only on source domains. Next, a simple histogram matching between domains is added as pre-processing prior to the use of the baseline models. Finally, more sophisticated domain adaptation approaches are presented based on (1) a modern style-transfer technique, (2) self-supervised pretext tasks, and (3) state-of-the-art domain adaptation multi-task deep neural networks.

Cross-dataset baseline
As a reference method to compare our results with, we use our recent stable 2D Attention U-Net model [32] trained on the labeled source domain and tested directly on the target domain (without any adaptation method). This network is a modified version of the U-Net [67] including attention gates [68] in the skip connections that has proven to produce consistently robust results in the segmentation of mitochondria on EM volumes [32]. Its architecture is shown in Figure 1.

Histogram matching
A straightforward approach to make the images of one domain look closer to the images of another domain is histogram matching. Most commonly, this technique is applied to one source image so that its histogram matches the histogram of a target image [69]. Here instead, we use as target histogram the mean histogram of the target domain images, so the histograms of all source images are transformed to match it.
Some images of our datasets present zero-padding surrounding the tissue, which produces an artificially high peak at the zero value in their histograms. Since we are only interested in matching the histogram of the tissue part of the images, we replaced the actual number of zeros with an estimate obtained by linear regression on the first bins of the original histogram. We set the value to zero in the absence of initial values or when the regression predicts a negative number. This process is done for both target and source histograms. Some example images processed with this histogram matching method can be seen in Figure 2.
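The histogram matching procedure described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes 8-bit grayscale images, and the function names and the number of bins used for the regression (`n_fit_bins=5`) are hypothetical choices, since the text only mentions "the first bins".

```python
import numpy as np

def tissue_histogram(image, n_fit_bins=5):
    """256-bin histogram of a uint8 image with a re-estimated zero bin.

    n_fit_bins is an assumed value; the paper only says "the first bins".
    """
    hist = np.bincount(image.ravel(), minlength=256).astype(np.float64)
    x = np.arange(1, n_fit_bins + 1)
    y = hist[1:n_fit_bins + 1]
    if np.all(y == 0):
        # No initial values to regress from: set the zero bin to zero.
        hist[0] = 0.0
    else:
        # Fit a line to bins 1..n_fit_bins and extrapolate it to bin 0;
        # negative predictions are replaced by zero.
        slope, intercept = np.polyfit(x, y, 1)
        hist[0] = max(intercept, 0.0)
    return hist

def mean_histogram(images):
    """Mean of the normalized histograms of a set of images.

    Assumes every image contains some non-padding pixels.
    """
    hists = [tissue_histogram(im) for im in images]
    return np.mean([h / h.sum() for h in hists], axis=0)

def match_to_histogram(image, target_hist):
    """Map grey levels so the image CDF follows the target CDF."""
    src_cdf = np.cumsum(tissue_histogram(image))
    src_cdf /= src_cdf[-1]
    tgt_cdf = np.cumsum(target_hist)
    tgt_cdf /= tgt_cdf[-1]
    # For each source level, take the smallest target level whose CDF
    # reaches the source CDF value (classical histogram specification).
    lut = np.searchsorted(tgt_cdf, src_cdf, side="left").astype(np.uint8)
    return lut[image]
```

In this sketch, `mean_histogram` would be computed once over the reference domain and `match_to_histogram` applied to every image that has to be adapted.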

Style transfer approach
As described in the previous section, domain adaptation can be considered a style-transfer problem. In particular, we were motivated by the success of the recent Contrastive Unpaired Translation (CUT) method [49] for the problem of unpaired image-to-image translation. Therefore, we tested it on our EM datasets for mitochondria segmentation and re-analyzed the cross-domain performance of our supervised baseline networks on the translated target datasets.
In order to learn the translation between source and target images, this method randomly crops the images to patches of 512 × 512 pixels and maximizes the mutual information between the input and output patches using a contrastive learning framework. This way, corresponding patches (positives) are mapped together in feature space and far from other patches (negatives). Results of this method are shown in Figure 3. All cross-dataset stylization results can be found in Section S1.
Following the recommendations of the original paper, we used the default hyperparameter setting as provided in their public implementation, which corresponds to training the method for 400 epochs, with Adam as optimizer and a learning rate of 2e-4.

Self-supervised approach
As an alternative approach, we propose a self-supervised framework that uses two sequential training steps: (1) an initial generative self-supervised step including both (source and target) datasets without annotations, and (2) a fully-supervised fine-tuning step using only the source images and their labels. A summary of our self-supervised workflow is depicted in Figure 4.
Super-resolution pretext task. In this pretext task, our Attention U-Net is trained to enhance the resolution of images from both the source and target datasets. This first step aims to reach a good starting point to solve the downstream task (i.e., supervised mitochondria segmentation). The input images are synthetically generated low-resolution images, while the ground truth is formed by the (high-resolution) original ones. To generate the synthetic input images, the original images are distorted with Gaussian noise with µ = 0 and σ = 0.1, expressed as a fraction of the dynamic range of the image. Next, the images are downsampled by a factor of two in both axes and then upsampled by the same factor to simulate a loss of the original resolution. For both downsampling and upsampling, bilinear interpolation is used.
Source supervised training. Once the model has been pre-trained, the encoder is frozen. Then, the rest of the network (bottleneck and decoder) is fine-tuned with the available source image annotations to perform semantic segmentation. The source images are pre-processed so their histogram matches that of the target domain. The idea behind freezing the encoder is to force the model to retain the features learnt from the target dataset during the previous super-resolution step, thus allowing for better generalization and performance on the unlabeled target dataset.
It is worth noting that during the super-resolution step, all available source and target images are used to train the model. That is because the input-label pairs are automatically generated from the raw data but no annotations are used. In the second step, only the training subset from the source dataset and its annotations are used to fine-tune the model.
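The generation of input-label pairs for the super-resolution pretext task can be illustrated as follows. This is a minimal NumPy sketch under stated assumptions: images are 8-bit and rescaled to [0, 1] so that σ = 0.1 is a fraction of the dynamic range, noisy values are clipped to that range, and the helper names (`bilinear_resize`, `make_super_resolution_pair`) are hypothetical. The exact interpolation routine used by the authors may differ.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize a 2D array with bilinear interpolation (pixel-centre aligned)."""
    in_h, in_w = img.shape
    # Coordinates of the output pixel centres expressed in the input grid.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def make_super_resolution_pair(image, sigma=0.1, factor=2, rng=None):
    """Return (low_res_input, high_res_target) for one uint8 patch."""
    if rng is None:
        rng = np.random.default_rng()
    high_res = image.astype(np.float32) / 255.0
    # 1) Additive Gaussian noise with zero mean and std = sigma.
    noisy = high_res + rng.normal(0.0, sigma, size=high_res.shape)
    noisy = np.clip(noisy, 0.0, 1.0)  # clipping is an assumption
    # 2) Downsample and 3) upsample by the same factor, both bilinear.
    h, w = noisy.shape
    small = bilinear_resize(noisy, h // factor, w // factor)
    low_res = bilinear_resize(small, h, w)
    return low_res, high_res
```

Because the pairs are derived from raw images only, the same function can be applied to every source and target patch, labeled or not.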
During the pre-training step, the network is run for 200 epochs, following a one-cycle learning rate policy [70] with a maximum learning rate of 5e-4, and Adam optimizer. Next, the fine-tuning step is carried out for 60 epochs, using as well a one-cycle learning rate scheduler with a maximum learning rate of 1e-4 and Adam optimizer. In both cases, the optimal batch size was found to be 1. All training images were randomly cropped to patches of 256 × 256 pixels, from which 10% was used for validation. A more detailed description of the hyperparameters can be found in Table S3.1 as well as all combinations tested.
Figure 4: Self-supervised workflow. a) The source dataset is adjusted to the target image histogram and cropped into patches of 256 × 256 pixels; b) crops from both datasets are used to generate low-resolution samples by undersampling them and adding Gaussian noise; c) our Attention U-Net is pre-trained by learning to super-resolve the generated patches to their original versions; d) the encoder of the model is frozen and the rest of the network is fine-tuned for the mitochondria segmentation task using only source training patches and their corresponding binary masks; e) the model is evaluated on the target test dataset.

Multi-task neural networks
Following the idea behind Y-Net [35], we have built a similar architecture taking as a base model the previously mentioned Attention U-Net [32]. We refer to this network as Attention Y-Net. In short, the architecture consists of the classical encoder-decoder setup, where a new second decoder is placed. We can see the architecture as the combination of the Attention U-Net and an autoencoder, where both parts share the same encoder. The architecture is illustrated in Figure 5. The network is trained using a loss function (L) made of two terms: a segmentation term based on the binary cross-entropy between the predicted and ground truth masks (L_BCE), and a reconstruction term based on the mean squared error between the predicted and the original grayscale images (L_MSE), as given by

$$L = \alpha \, L_{MSE} + (1 - \alpha) \, L_{BCE},$$

where the weight α is a numeric value between 0 and 1. For those images without available labels (binary masks), the L_BCE value will be 0.
In its original work, the training of the Y-Net [35] was proposed in two sequential steps. First, the network is trained unsupervised to perform only reconstruction (α = 1). Then, the model is fine-tuned to perform segmentation with the available labels (α = 0). However, we have experienced instability in this step. Namely, quite often, the predicted reconstruction of the network was a flat grey-value image. Therefore, we propose a new additional step before the unsupervised pre-training, which combines both tasks using all the available data. We set α = 0.98, which was experimentally found to help balance both loss terms.
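As an illustration of how the two loss terms interact, the following NumPy sketch combines them as a convex combination in which α weights the reconstruction term. This form is an assumption inferred from the text (α = 1 gives reconstruction only, α = 0 gives segmentation only); the function names are hypothetical, and a real training loop would use the loss functions of a deep learning framework instead.

```python
import numpy as np

def bce(pred_mask, true_mask, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and labels."""
    p = np.clip(pred_mask, eps, 1.0 - eps)
    return float(np.mean(-(true_mask * np.log(p)
                           + (1.0 - true_mask) * np.log(1.0 - p))))

def mse(reconstruction, image):
    """Mean squared error between reconstructed and original images."""
    return float(np.mean((reconstruction - image) ** 2))

def y_net_loss(reconstruction, image, pred_mask=None, true_mask=None,
               alpha=0.98):
    """Assumed form: L = alpha * L_MSE + (1 - alpha) * L_BCE."""
    l_mse = mse(reconstruction, image)
    # Unlabeled (target) images contribute no segmentation term.
    l_bce = bce(pred_mask, true_mask) if true_mask is not None else 0.0
    return alpha * l_mse + (1.0 - alpha) * l_bce
```

With α = 0.98 the reconstruction error is weighted far more heavily than the cross-entropy, which is one way to compensate for the mean squared error of images in [0, 1] typically being numerically much smaller than the cross-entropy.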
With our additional pre-training step, the network consistently outputs improved results and escapes the local minimum represented by the flat grey-value image. Next, we freeze the network encoder (blue blocks in Figure 5). Otherwise, the network forgets the target domain features in the next step. Experimentally, we observed that the network performs better if we leave the bottleneck and the two decoders unfrozen. Remarkably, as observed with the self-supervised approach, the performance of the whole process was greatly enhanced thanks to the use of histogram matching after the first step.
The first step was carried out for 50 epochs. We used an initial learning rate of 1e-3 that was reduced when reaching plateaus, stochastic gradient descent (SGD) as optimizer and a patience of 7 epochs over the monitored validation loss. In the second training step, we train for 40 epochs (with a patience of 6). We use a learning rate of 2e-4, and a "reduce on plateau" scheduler once again, but this time with Adam optimizer. Finally, in the last training step, we train for 100 epochs (the different stopping criteria will be analysed later). We follow a one-cycle learning rate policy [70] with a maximum learning rate of 2e-4, and use Adam as optimizer. For all training steps, the optimal batch size was found to be 1. The input to the model consists of 1000 randomly cropped patches of 256 × 256 pixels, from which 10% is used for validation. This training configuration was found empirically. A more detailed description of the hyperparameters as well as all combinations tested can be found in Table S3.2.

EM Datasets
All the experiments performed in this work are based on the following publicly available datasets:
EPFL Hippocampus or Lucchi dataset [71]. The original volume represents a 5 × 5 × 5 µm³ section of the CA1 hippocampus region of a mouse brain, with an isotropic resolution of 5 × 5 × 5 nm per voxel. The volume of 2048 × 1536 × 1065 voxels was acquired using scanning electron microscopes (SEM), specifically with focused ion beam scanning electron microscopy (FIB-SEM). The mitochondria of two sub-volumes formed by 165 slices of 1024 × 768 pixels were manually labeled by experts, and are used as the official training and test partitions. In particular, we used a more recent version of the labels [30], which we refer to as Lucchi++, produced after two neuroscientists and a senior biologist re-labeled mitochondria by fixing misclassifications and boundary inconsistencies.
Kasthuri++ dataset [30]. This is a re-labeling of the dataset by [72]. The volume corresponds to a part of the somatosensory cortex of an adult mouse and was acquired using scanning electron microscopes (SEM), like Lucchi++, but specifically with serial section electron microscopy (ssEM). The train and test volume dimensions are 1463 × 1613 × 85 voxels and 1334 × 1553 × 75 voxels, respectively, with an anisotropic resolution of 3 × 3 × 30 nm per voxel.
VNC dataset [73]. This dataset represents a 4.7 × 4.7 × 1 µm³ serial section transmission electron microscopy (ssTEM) volume of the Drosophila melanogaster third instar larva ventral nerve cord, acquired using transmission electron microscopy (TEM), with an anisotropic resolution of 4.6 × 4.6 × 45-50 nm per voxel. Two volumes of 1024 × 1024 × 20 voxels were acquired, but only one of them was labeled. For that reason, and following common practice, we use only the labeled volume and split it along the x axis into two subsets of equal size (20 × 512 × 1024 voxels) that constitute our training and test partitions.
For fair comparison with other published work, only the training set labels of the source datasets are used during the supervised or fine-tuning steps of our approaches, while the quantitative evaluation is performed only on the test set of the target datasets.

Evaluation metrics
Since our downstream task is semantic segmentation, we evaluate all methods using the Jaccard index of the positive class or foreground intersection over union (IoU_F), defined as

$$\mathrm{IoU}_F = \frac{TP}{TP + FP + FN},$$

where TP are the true positives, FP the false positives and FN the false negatives. As a convention, the positive class is foreground and the negative class, background. This way, IoU_F values range from 0 to 1, where 0 represents no overlap at all between the ground truth and the predicted mitochondria masks, and 1 means a perfect overlap.
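For reference, the foreground Jaccard index, TP / (TP + FP + FN), can be computed from two binary masks as in the short NumPy sketch below. The function name is hypothetical, and returning 1.0 when both masks are empty is an assumed convention that the text does not specify.

```python
import numpy as np

def foreground_iou(pred_mask, true_mask):
    """Intersection over union of the foreground (positive) class."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    tp = np.count_nonzero(pred & true)    # predicted and real foreground
    fp = np.count_nonzero(pred & ~true)   # predicted foreground only
    fn = np.count_nonzero(~pred & true)   # missed foreground
    denom = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return tp / denom if denom > 0 else 1.0
```

For example, a prediction that shares one foreground pixel with the ground truth, adds one spurious pixel and misses one real pixel yields 1 / 3.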

Stopping criterion
An intrinsic issue of unsupervised domain adaptation methods is blindly deciding when to stop their respective optimization processes since no labels are available from the target domain samples to guide us in such optimization. This problem is common to all our proposed approaches, either to select the number of stylization iterations or to fix the number of epochs to train our self-supervised or multi-task models. For that reason, we have selected a stopping criterion using morphological priors extracted from the source labels. More specifically, we calculate the average solidity S of the mitochondria in the image as

$$S = \frac{1}{N} \sum_{n=1}^{N} \mathrm{solidity}(n),$$

where N is the total number of objects (in our case mitochondria instances) in the image and solidity(n) is the ratio of pixels in the nth object to pixels of the convex hull of that object. In practice, each instance is found by the connected components algorithm on the binarized outputs of the models. The main advantage of the average solidity is that it is agnostic of the dataset resolution and easy to implement. As a criterion, we can monitor the S value of the predictions in the target dataset and stop optimizing our domain adaptation methods when it moves away from the objective S value (measured in the source domain). To select the best model, one can simply take the model producing test masks with the S value that is closest to the objective one. Moreover, to increase the robustness of this criterion, we discard very tiny objects (with fewer than ten pixels) for all datasets.
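A possible implementation of the average solidity and of the model selection rule is sketched below using scikit-image. It is only an illustration: the binarization threshold of 0.5, the default connectivity of `label`, returning NaN when no object survives the size filter, and the function names are assumptions not stated in the text.

```python
import numpy as np
from skimage.measure import label, regionprops

def average_solidity(prob_map, threshold=0.5, min_size=10):
    """Mean solidity of the connected components of a binarized prediction."""
    binary = np.asarray(prob_map) > threshold   # assumed threshold
    instances = label(binary)                   # connected components
    values = [
        region.solidity                         # area / convex hull area
        for region in regionprops(instances)
        if region.area >= min_size              # drop objects under 10 pixels
    ]
    return float(np.mean(values)) if values else float("nan")

def select_epoch(target_solidities, source_solidity):
    """Index of the epoch whose target S is closest to the source S."""
    diffs = np.abs(np.asarray(target_solidities, dtype=float) - source_solidity)
    return int(np.nanargmin(diffs))             # ignores epochs with NaN
```

In use, `source_solidity` would be measured once on the source labels, `average_solidity` evaluated on the target predictions after each epoch, and `select_epoch` applied to the resulting list to choose the checkpoint.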
An example of the connection between the S values of test predictions and their respective segmentation results expressed in terms of IoU_F is shown in Figure 6. One can observe that the range of epochs where the test S values are closer to the objective S (calculated in the source domain) in Figure 6a corresponds, overall, to the epochs with higher IoU_F values in Figure 6b. The same plots for all methods and cross-dataset experiments can be found in Section S2.

Cross-dataset results
All the methods proposed here were applied to all the possible source-target combinations of the three EM datasets introduced in Section 4.1. Moreover, for a more detailed evaluation and comparison with the state of the art, we also ran the same experiments using the publicly available implementation of DAMT-Net [36]. As is common practice in EM image processing, we also tested all methods on the same image data after preprocessing them using contrast limited adaptive histogram equalization (CLAHE) [74]. Note that CLAHE is a contrast equalization method and is thus not intended to match two intensity distributions. However, its effect on the image contrast may bring the histograms of our datasets closer to each other.
To ensure the robustness of the proposed training configurations and hyperparameters, each experiment was repeated ten times using exactly the same setup. A full description of the search of hyperparameters for each approach can be found in Section S3.
The best results based on the average IoU_F of the predicted mitochondria in the corresponding target test images for each method are shown in Table 1. Furthermore, we explored the impact of stopping the model training by each of the following criteria: (1) monitoring the IoU_F value of the source validation set (and also selecting the best model based on that value); (2) letting the model train for a fixed number of epochs; and (3) monitoring the average solidity values of the target test set (and selecting the model that best approaches the known source average solidity value).
First, although expected, it is worth mentioning that all tested methods outperform the baseline in all cases, demonstrating the need for a domain adaptation strategy that addresses the domain shift problem. Secondly, we can observe an evident boost in performance by simply applying either our histogram matching method to the target images or CLAHE as preprocessing for all images, and re-using the baseline models for inference. Interestingly, on one of the source-target combinations (Lucchi++ as source and Kasthuri++ as target) these strategies provide very good segmentation results (IoU_F = 0.679 and 0.620, respectively), but they perform poorly (IoU_F = 0.268 and 0.249) on the opposite experiment (Kasthuri++ as source and Lucchi++ as target). This reflects an asymmetric aspect of the problem and the need for solutions that learn more than just simple histogram image features. Moreover, these results show that our proposed methods generally compare favourably with the state of the art, represented by DAMT-Net [36]. In particular, our style-transfer based approach provides consistent results across all datasets, followed by our proposed multi-task Attention Y-Net.
Finally, the choice of the stopping criterion seems to play an important role in improving the segmentation results, depending on the dataset combination. Although monitoring the source validation results is a good indicator of the performance in the target domain for the multi-task networks (DAMT-Net and Attention Y-Net), we observe that their segmentation can be improved by either letting the training converge (with a maximum number of epochs) or by monitoring the target average solidity instead.
Some qualitative results of the learning-based methods are shown in Figure 7, where the probability maps of mitochondria masks produced by each method are displayed side by side for the same sample images. More specifically, the predictions shown were obtained using average solidity as stopping criterion. In agreement with the quantitative results of

Conclusions and Discussion
In this paper, we address the problem of domain adaptation for the challenging task of semantic segmentation of EM volumes. More specifically, we propose three novel solutions that build on top of the deep-learning based state of the art by means of (1) unsupervised style transfer to transform the target domain images into the "style" of the source domain and then reuse robust models trained on annotated data; (2) self-supervised learning to pre-train our segmentation models without annotations and then fine-tune them using the source labels; and (3) a multi-task deep architecture able to learn from both labeled and unlabeled data. All methods have been evaluated under the same setups using three publicly available EM datasets of different modalities (FIB-SEM, ssEM and ssTEM) and each of their possible source-target combinations. In addition, we propose a novel unsupervised metric to avoid blindly selecting the best model during training. First of all, quantitative and qualitative results show that learning-based methods are needed to deal with the domain shift in five out of the six cross-dataset experiments. Only in one combination (Lucchi++ as source domain and Kasthuri++ as target domain) has an ad-hoc histogram matching method been able to reduce the shift to the level of the learning approaches.
Regarding the proposed approaches, the style-transfer based method produces segmentation results with consistently medium-high IoU F values (∼ 0.5 − 0.6), especially when the stylization is run for a large number of epochs (> 200, see Section S2). The performance of our SSL and Attention Y-Net methods also stabilizes after a fixed number of training epochs (60 and 100, respectively), as can be seen in Section S2. However, their results are not as consistent as those of the style-transfer approach, oscillating between low (0.1 − 0.2) and high (0.6 − 0.7) values of IoU F depending on the specific source and target dataset combination. Nevertheless, we have been able to estimate the correct number of epochs to train the models thanks to the availability of target labels (although they are not used at all during training). In a real scenario, monitoring the proposed average solidity metric is an intuitive and effective way to stop the training process in the absence of validation labels, and to select (on average) models of similar or better accuracy. Although other morphological and area measurements were initially tested, the average solidity correlates better with the IoU F value of the test labels. Nevertheless, the performance of this metric depends on how close its values are in the source and target domains.
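For readers unfamiliar with the metric, the sketch below computes an average solidity over the objects of a binary mask, assuming the standard definition of solidity as object area divided by the area of its convex hull. It is a NumPy-only approximation written for illustration: pixels are treated as unit squares, objects are 4-connected, and all function names are hypothetical. Library implementations (for example the `solidity` property of scikit-image's `regionprops`) count pixels of the rasterized convex hull and may give slightly different values.

```python
from collections import deque

import numpy as np


def connected_components(mask):
    """Yield the pixel coordinates of each 4-connected foreground object."""
    mask = np.asarray(mask).astype(bool)
    seen = np.zeros(mask.shape, dtype=bool)
    rows, cols = mask.shape
    for r0, c0 in zip(*np.nonzero(mask)):
        if seen[r0, c0]:
            continue
        seen[r0, c0] = True
        queue, pixels = deque([(int(r0), int(c0))]), []
        while queue:
            r, c = queue.popleft()
            pixels.append((r, c))
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= nr < rows and 0 <= nc < cols and mask[nr, nc] and not seen[nr, nc]:
                    seen[nr, nc] = True
                    queue.append((nr, nc))
        yield pixels


def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in boundary order."""
    points = sorted(set(points))
    if len(points) <= 2:
        return points

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in points:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(points):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def object_solidity(pixels):
    """Pixel count divided by the convex hull area of the pixel corners."""
    corners = [(r + dr, c + dc) for r, c in pixels for dr in (0, 1) for dc in (0, 1)]
    return len(pixels) / polygon_area(convex_hull(corners))


def average_solidity(mask):
    """Mean solidity over all objects of a binary mask (NaN if the mask is empty)."""
    values = [object_solidity(p) for p in connected_components(mask)]
    return float(np.mean(values)) if values else float("nan")
```

Compact, convex objects yield values close to 1, while fragmented or irregular predictions yield lower values, which is why the metric can act as a morphological prior on mitochondria shape.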
It is also interesting to note that TEM and SEM images are different, with TEM images usually having higher resolution. Consequently, the Lucchi++ and Kasthuri++ datasets (SEM) are, in principle, in closer domains compared to VNC (TEM), as reflected by the baseline results in Table 1. When Lucchi++ or Kasthuri++ are used as sources, the results obtained with VNC are clearly lower than with Kasthuri++ and Lucchi++, respectively. However, when VNC is used as source, the results obtained with Lucchi++ or Kasthuri++ are similar. A similar discussion is applicable to the figures presented in Section S1: going to lower resolution (i.e., from VNC as a source to Lucchi++ or Kasthuri++) could, in principle, be easier than the opposite (from Lucchi++ or Kasthuri++ as a source to VNC). Apart from the intrinsic variability due to the modality, we also need to acknowledge the variability due to the differences in the samples themselves, their preparation and the acquisition protocol.
In summary, from a practical point of view, the style-transfer approach appears as both the safest and simplest way of addressing the domain shift in EM volumes for semantic segmentation. Nevertheless, using self-supervised or multitask models may provide better results on specific datasets at the cost of more complex training setups and a larger set of hyperparameters.
The present work is an initial assessment of the three competing approaches running under the same conditions and compared with the same supervised baseline methods. In future work, we plan to explore the performance of meaningful combinations of the proposed strategies. Namely, the outputs of the style-transfer method could be used as inputs for the self-supervised learning and the multi-task neural network architectures. We expect the combined strategies to outperform the histogram matching approach.
Moreover, current initiatives (e.g., volume EM, http://www.volumeem.org) are developing massive databases of heterogeneous 3DEM data. These initiatives promise to facilitate deep-learning-based model building for automated segmentation [75]. In our view, the style-transfer strategies could be more effective when pre-trained in massive databases of heterogeneous 3DEM data than in a small dataset of well-defined characteristics.
Finally, it is important to highlight that even the best results among all our proposed domain adaptation strategies remain well below those of the fully supervised approaches. As a reference, the average IoU F values obtained by our baseline models trained on the target annotated images are 0.9066 for Lucchi++, 0.9154 for Kasthuri++, and 0.8041 for VNC. This leaves plenty of room for improvement and future lines of research. In particular, we will explore the use of massive databases of heterogeneous 3D EM data, with the combination of some of our proposed strategies and the exploitation of segmentation-specific pretext tasks.

Code Availability
The software developed to support the findings of this study is publicly available at https://github.com/danifranco/EM_domain_adaptation.

S1.1 Source: Lucchi++ - Target: Kasthuri++
The effect of our histogram-matching and style-transfer methods on an image from the Kasthuri++ dataset is shown in Figure S1.1 using the Lucchi++ dataset as the source domain. Remarkably, the domain shift in this source-target combination seems to be the smallest of all cases, and the histogram-matched images (see Figures S1.1a, S1.1b) appear to be very close to the source domain images.
The mitochondria probability maps produced by all our tested methods on the first test image from Kasthuri++ are shown in Figure S1.2 together with its corresponding ground-truth binary labels and original EM image. The best qualitative results seem to be produced by the histogram-matching and style-transfer approaches (see Figures S1.2b, S1.2c), while the state-of-the-art DAMT-Net method struggles to produce compact mitochondria masks and presents border artifacts due to the zero-padding of the Kasthuri++ dataset (see Figure S1.2f). Notice that the displayed results for the style-transfer, SSL, Attention Y-Net, and DAMT-Net approaches correspond to executions using our proposed stop criterion (solidity, see Section 4.3).

S1.2 Source: Lucchi++ - Target: VNC
The effect of our histogram-matching and style-transfer methods on an image from the VNC dataset is shown in Figure S1.3 using the Lucchi++ dataset as the source domain. The domain shift in this source-target combination seems much larger than in the previous case, and the histogram-matched images (see Figures S1.3a, S1.3c) appear to be further away from the source domain images than the stylized ones (see Figure S1.3d). In particular, the style-transfer method successfully adapted the texture of the neural process from one domain (ssTEM) to the other (FIB-SEM).
The mitochondria probability maps produced by all our tested methods on the first test image from VNC are shown in Figure S1.4 together with its corresponding ground-truth binary labels and original EM image. Although all methods seem to approximate the location of mitochondria correctly, the best qualitative results seem to be produced by our style-transfer approach (see Figure S1.4c). In this case, the state-of-the-art DAMT-Net method seems to produce under-segmented results (see Figure S1.4f) while our SSL and Attention Y-Net (see Figures S1.4d, S1.4e) methods output over-segmented masks. Notice that the displayed results for the style-transfer, SSL, Attention Y-Net, and DAMT-Net approaches correspond to executions using our proposed stop criterion (solidity, see Section 4.3).

S1.3 Source: Kasthuri++ - Target: Lucchi++
The effect of our histogram-matching and style-transfer methods on an image from the Lucchi++ dataset is shown in Figure S1.5 using the Kasthuri++ dataset as the source domain. The domain shift in this source-target combination seems larger than in the opposite combination, where the histogram matching obtained excellent results. As in the previous case, here, the histogram-matched images (see Figures S1.5a, S1.5c) appear to be far away from the source domain images. However, the style-transfer method (see Figure S1.5d) has managed to successfully capture the texture of both the neural process and organelles from one domain (FIB-SEM) to the other (ssEM).
The mitochondria probability maps produced by all our tested methods on the first test image from Lucchi++ are shown in Figure S1.6 together with its corresponding ground-truth binary labels and original EM image. In these experiments, all learning-based methods perform notably well (see Figures S1.6c-S1.6f). Nevertheless, the best qualitative results appear to be those produced by our Attention Y-Net approach (see Figure S1.6e), which are very close to the desired ground truth output (Figure S1.6g). Notice that the displayed results for the style-transfer, SSL, Attention Y-Net, and DAMT-Net approaches correspond to executions using our proposed stop criterion (solidity, see Section 4.3).

S1.4 Source: Kasthuri++ - Target: VNC
The effect of our histogram-matching and style-transfer methods on an image from the VNC dataset is shown in Figure S1.7 using the Kasthuri++ dataset as the source domain. The domain shift seems quite large in this case, and the histogram-matched images (Figure S1.7c) appear to be far away from the source domain images (Figure S1.7a). In appearance, the style-transfer results (Figure S1.7d) do not look much better either, but the results from Table 1 indicate the method was quite successful at transferring the style from the ssEM to the ssTEM dataset.
The mitochondria probability maps produced by all our tested methods on the first test image from VNC are shown in Figure S1.8 together with its corresponding ground-truth binary labels and original EM image. In these experiments, most methods struggle to produce proper mitochondria masks. The exception is our style-transfer approach (Figure S1.8c), which correctly finds all mitochondria present in the ground truth (Figure S1.8g) but also produces a couple of large mitochondria-like artifacts. Notice that the displayed results for the style-transfer, SSL, Attention Y-Net, and DAMT-Net approaches correspond to executions using our proposed stop criterion (solidity, see Section 4.3).

S1.5 Source: VNC - Target: Lucchi++
The effect of our histogram-matching and style-transfer methods on an image from the Lucchi++ dataset is shown in Figure S1.9 using the VNC dataset as the source domain. Both the histogram-matched image (Figure S1.9c) and the stylized image (Figure S1.9d) seem to reproduce the appearance of the source domain image (Figure S1.9a). In particular, the style-transfer results (Figure S1.9d) are able to not only reproduce the source intensities but also correctly replicate the textures inside the neural processes.
The mitochondria probability maps produced by all our tested methods on the first test image from Lucchi++ are shown in Figure S1.10 together with its corresponding ground-truth binary labels and original EM image. While all methods identify all the mitochondria present in the ground truth correctly (Figure S1.10g), most of them produce an over-segmentation, except for DAMT-Net, which is under-segmenting (Figure S1.10f). Although some extra low-probability maps are created by the SSL method (Figure S1.10d), its medium-high probability maps nicely capture the real mitochondria. Notice that the displayed results for the style-transfer, SSL, Attention Y-Net, and DAMT-Net approaches correspond to executions using our proposed stop criterion (solidity, see Section 4.3).

S1.6 Source: VNC - Target: Kasthuri++
The effect of our histogram-matching and style-transfer methods on an image from the Kasthuri++ dataset is shown in Figure S1.11 using the VNC dataset as the source domain. As in the previous case, both the histogram-matched image (Figure S1.11c) and the stylized image (Figure S1.11d) seem to reproduce the appearance of the source domain image (Figure S1.11a). Again, the style-transfer results (Figure S1.11d) seem to not only reproduce the source intensities but also correctly replicate the textures inside the neural processes.
The mitochondria probability maps produced by all our tested methods on the first test image from Kasthuri++ are shown in Figure S1.12 together with its corresponding ground-truth binary labels and original EM image. Here all methods struggle to correctly identify the mitochondria present in the ground truth (Figure S1.12g). Some of them produce an over-segmentation (Figures S1.12a, S1.12c, S1.12e), while others are under-segmenting (Figures S1.12d, S1.12f). As observed before, the DAMT-Net method produces artifacts in the border of the tissue areas due to the padding (Figure S1.12f). Notice that the displayed results for the style-transfer, SSL, Attention Y-Net, and DAMT-Net approaches correspond to executions using our proposed stop criterion (solidity, see Section 4.3).

S2 Analysis of solidity as stop condition
In this section, we analyze the effect of using the solidity of the predicted masks (see Section 4.3) as a stop condition in all the tested learning methods. With that aim, we plot the solidity values at each epoch of every cross-dataset experiment and, on a complementary plot, the IoU values produced in the test partition of the target dataset at the same epochs.

[Figure, caption partially recovered] On the right, the test IoU evolution (averaged over ten executions) as a function of the epochs with (b) Lucchi++ and (d) Kasthuri++ as target domains. The magenta lines represent the maximum IoU value obtained by the fully supervised baseline models. In contrast, the blue and orange lines represent the IoU values obtained by the baseline methods applied without adaptation and after histogram matching to the target datasets, respectively.

S3 Hyperparameter search
This section describes in detail the search we performed for the optimal training configuration and set of hyperparameters in all our proposed approaches. The corresponding search space and best values are summarized in the tables below using the following notation: There are also acronyms used in tables:

Table S3.5: Hyperparameter search space for the proposed Attention Y-Net, third and last training step, focused on segmentation.

S3.3 DAMT-Net
To execute DAMT-Net, we follow the publicly available implementation provided by its authors [3]. Since they use two images in each training step, we interpret the batch size as 2. Bearing this in mind, we define an epoch as follows: where |X_train| is the cardinality of the training set. Taking this into account, we explored a few hyperparameters:
- Patch size: 512 × 512 pixels and 256 × 256 pixels
- Epochs: 30, 60, 100
- Save checkpoint every 2 epochs.
Among the different options, the best assignment has been highlighted in bold. The rest of the parameters are those proposed by default in the original publication [3].
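The search space listed above is small enough to enumerate exhaustively. The following sketch builds the six patch size and epoch combinations together with the epochs at which a checkpoint would be saved. The function and constant names are hypothetical, and the assumption that checkpoints fall on epochs 2, 4, 6 and so on is ours; the actual DAMT-Net training script is not reproduced here.

```python
from itertools import product

# Values taken from the list above; the names are illustrative.
PATCH_SIZES = [(512, 512), (256, 256)]
EPOCHS = [30, 60, 100]
CHECKPOINT_EVERY = 2  # epochs between two saved checkpoints


def damtnet_search_space():
    """List every (patch size, epochs) configuration with its checkpoint epochs."""
    configs = []
    for patch_size, epochs in product(PATCH_SIZES, EPOCHS):
        # Assumption: the first checkpoint is saved after epoch 2.
        checkpoints = list(range(CHECKPOINT_EVERY, epochs + 1, CHECKPOINT_EVERY))
        configs.append(
            {"patch_size": patch_size, "epochs": epochs, "checkpoints": checkpoints}
        )
    return configs


if __name__ == "__main__":
    for config in damtnet_search_space():
        print(config["patch_size"], config["epochs"], len(config["checkpoints"]))
```

Saving checkpoints at a fixed interval is what makes a label-free stopping criterion applicable afterwards, since each saved model can be scored on the target predictions without retraining.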