Fluorescence microscopy datasets for training deep neural networks

Abstract

Background
Fluorescence microscopy is an important technique in many areas of biological research. Two factors that limit the usefulness and performance of fluorescence microscopy are photobleaching of fluorescent probes during imaging and, when imaging live cells, phototoxicity caused by light exposure. Recently developed machine learning methods can greatly improve the signal-to-noise ratio of acquired images. This allows researchers to record images with much shorter exposure times, which in turn minimizes photobleaching and phototoxicity by reducing the dose of light reaching the sample.

Findings
To employ deep learning methods, a large amount of data is needed to train the underlying convolutional neural network. One way to obtain such data is to acquire pairs of fluorescence microscopy images with long and short exposure times. We provide high-quality datasets that can be used to train and evaluate deep learning methods under development.

Conclusion
The availability of high-quality data is vital for training the convolutional neural networks used in current machine learning approaches.
Response to Reviewers

Dear Editor,

We would like to submit a revised version of our manuscript "Fluorescence Microscopy Datasets for Training Deep Neural Networks" for consideration as a data note in GigaScience. We thank the editors for their patience as we completed the revisions, and we thank the reviewers for their helpful suggestions. We have made numerous changes in an effort to respond to all of the reviewers' comments; our responses are given below.

Reviewer 1:
I see great reuse potential in these imaging datasets and this Data Note and supporting data should be considered for publication in the GigaScience "Digital Pathology - Translatable Datasets for Clinical Reuse and Machine Learning" Thematic Series.
Thank you for your comment about the reuse potential. We have already been contacted by several researchers asking when the data would be available, so we anticipate continued interest in the paper and the data. We would like the paper to be part of this thematic series if that option is still available.
To ensure reproducibility, I request that the authors submit to GigaDB the denoised image files generated by: 1) CSBDeep toolbox; 2) NVIDIA Self-Supervised Deep Image Denoising software; and 3) BM3D.
We have uploaded the denoised image files to GigaDB as requested. Please note that we switched from the NVIDIA self-supervised denoising network to the Noise2Void network, as requested by reviewer 2; Noise2Void is a similar unsupervised denoising network.

Reviewer 2:
Publicly available training datasets for DL methods are an important driver of research, yet compared to other fields (such as computer vision) such datasets are currently less commonly found for fluorescence microscopy. So I really like that the paper tries to make a contribution towards changing that situation. I similarly like that the authors compared results from several DL methods as well as a strong classical baseline that are used in practice.

Thank you for these encouraging comments about the paper and datasets.
1) The authors write that "High quality, publicly available data of this type has been lacking". However there are some datasets that provide this (e.g. for 3D denoising [17]).

The Zhang paper "A Poisson-Gaussian denoising dataset with real fluorescence microscopy images" is an excellent resource, but the data offered there is limited in a few important ways. The authors collected 50 noisy samples of each image and then averaged these images to generate a ground-truth image, which is not the same as acquiring a ground-truth image with a long exposure time. The images offered in the Zhang paper are 512x512 pixels, while ours range from 512x512 in one dataset up to 2048x2048 in four of the datasets. The Zhang data is also limited to 8 bits of intensity information, whereas ours were recorded at 16 bits.

Furthermore, for a public dataset to be valuable, the distribution of training images has to have a certain heterogeneity, such that evaluation on that data serves as a robust assessment of any method. The proposed dataset however contains only images of the same fixed sample (endothelial cells) of two essentially very stereotypical structures (actin filaments and mitochondria). This makes it very hard to use the dataset for training models to be applied on differing structures (e.g. nuclei, membranes).

We have expanded the paper and datasets to include images of the cell nucleus and membrane as requested. There are now 6 datasets in total, the properties of which are shown in Table 1 of the paper.
2) The current way of presenting the dataset/images (i.e. the main contribution) is suboptimal. Including at least an overview figure with a representative image for each modality/noise level/structure would greatly improve the paper (I essentially had to download the whole dataset just to have a look at a single image for each dataset). Additionally, Figure 1 has severe visual glitches that make it impossible to inspect the different denoising results. Finally, providing insets in the same figure for the denoised images would greatly help to see the differences of the compared methods.

We have included a new figure (now Figure 1; the original Figure 1 is now Figure 2). The new Figure 1 shows example images and thereby provides an overview of the 6 datasets. We have also included an "examples" folder on the FTP site so that users can download a small portion of the total data and see what the rest of the data looks like. We apologize for the severe problems with Figure 1 in the PDF you downloaded; the PDF conversion used by the submission system reproduces images poorly. Please click on the link on that page of the PDF document to download the original high-resolution PNG files.
We removed this and now simply state that we acquired the datasets under different conditions; Table 1 describes these conditions.

- The MSE formula on line 102 misses the lower limit in the sum ("j=0"?)

We corrected the formula.
-"Following the standard implementation of the CSBDeep network, we used the Laplacian loss function" -> The default loss function in CSBDeep is mean absolute error MAE without any probabilistic component (the config default is probabilistic=False). The laplace loss should only be used if the resulting probabilistic model is needed (e.g. when the additional confidence prediction might be useful), which for a normal denoising task is not the case. I therefore would suggest to rerun at least some of the experiments with the default setting (probabilistic=False) and see whether the results change.
We re-ran all of the experiments with the default settings.
- BlindSpot: "uses careful padding and cropping to force the network to…" -> Padding and cropping is not really the main distinction of BlindSpot networks.
- The relatively poor performance of the BlindSpot network seems to me a bit surprising. "We used our own implementation in Python using the Keras library" -> I think it would be more convincing when using one of the official implementations, e.g. https://github.com/juglab/n2v

We switched to the Noise2Void network, using the official implementation, as suggested.
- How was the parameter of BM3D (noise level sigma) tuned?

Following the procedure of [1], on each image we estimated the noise level using the method of Foi et al. [2] and applied a variance-stabilizing transformation [3] before denoising the image with BM3D. This explanation was added to the paper.
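For illustration, a minimal sketch of this type of pipeline is shown below. It assumes the publicly available bm3d Python package, uses the Anscombe transform as the variance-stabilizing transformation, and substitutes scikit-image's wavelet-based noise estimator for the method of Foi et al., so it is a simplified stand-in rather than the exact procedure used in the paper.

```python
import numpy as np
import bm3d
from skimage.restoration import estimate_sigma


def denoise_bm3d_vst(img):
    """Denoise a photon-limited image with a VST followed by BM3D.

    Illustrative stand-in: Anscombe transform as the VST (assuming
    approximately Poisson noise) and a wavelet-based noise estimate
    instead of the Foi et al. estimator.
    """
    img = img.astype(np.float64)

    # Anscombe transform: noise variance becomes approximately constant.
    vst = 2.0 * np.sqrt(img + 3.0 / 8.0)

    # Estimate the (approximately Gaussian) noise level after the VST.
    sigma = estimate_sigma(vst)

    # BM3D denoising in the variance-stabilized domain.
    den = bm3d.bm3d(vst, sigma_psd=sigma)

    # Simple algebraic inverse of the Anscombe transform.
    return (den / 2.0) ** 2 - 3.0 / 8.0
```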
-"We normalized both images by clipping values below the 1st percentile and above the 99th percentile". Doesn't this remove essential information of the image? What was the reason to clip? That is a good point and we removed this unnecessary clipping in the new version of the experiments.
- What stopping criterion was used for the CARE/BlindSpot network training?

We trained each network for 200 epochs. In all experiments, 10% of the patches were withheld for validation during training, and the model with the best validation error observed during training was saved and used for testing. We visually inspected the loss curves and observed that the loss for each training run had converged.
We hope that with these changes the paper will now be acceptable for publication in GigaScience.
Sincerely,

Data description
Context
Fluorescence microscopy is an important technique in many areas of biomedical research, but its use can be limited by photobleaching of fluorescent probe molecules caused by the excitation light. In addition, reactive oxygen species generated by exposing samples to light can cause cell damage and even cell death, limiting imaging of live cells [1,2]. Many strategies have been devised to overcome this problem, including the use of specialized culture media [3,4].

One advantage of deep learning methods is that they can learn a task such as denoising from the data itself, providing a sample-specific method that does not depend on a physical model. In addition, once a network has been trained, denoising an image with a convolutional neural network is fast compared to traditional methods, which are typically much slower.

We also evaluated a self-supervised learning approach called Noise2Void (N2V) [27]. This method learns denoising using only the noisy data. It also uses a U-Net architecture and an MSE loss function, but it masks out random pixels during training to force the network to learn to predict the denoised value of each masked pixel from the neighborhood of that pixel in the noisy input. We used the reference implementation provided by the authors.
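To illustrate how this self-supervised training can be set up, a minimal sketch using the reference n2v package is shown below. The directory layout, file pattern, patch size, and training settings are illustrative assumptions, not the exact values used for our datasets.

```python
import numpy as np
from n2v.models import N2VConfig, N2V
from n2v.internals.N2V_DataGenerator import N2V_DataGenerator

# Load the noisy (short-exposure) images; no clean targets are needed.
# The directory and file pattern are illustrative assumptions.
datagen = N2V_DataGenerator()
imgs = datagen.load_imgs_from_directory(directory="dataset_1/low_snr",
                                        filter="*.tif", dims="YX")

# Extract small 2D patches from the noisy images.
patches = datagen.generate_patches_from_list(imgs, shape=(64, 64))

# Hold out 10% of the patches for validation.
n_val = len(patches) // 10
X, X_val = patches[:-n_val], patches[-n_val:]

# N2V masks a small percentage of pixels in each patch and trains a U-Net
# to predict each masked pixel from its noisy neighborhood (MSE loss).
config = N2VConfig(
    X,
    unet_kern_size=3,
    train_epochs=200,            # illustrative; match your data
    train_steps_per_epoch=400,
    train_batch_size=16,
    train_loss="mse",
    batch_norm=True,
    n2v_perc_pix=0.198,          # percentage of pixels masked per patch
    n2v_patch_shape=(64, 64),
    n2v_manipulator="uniform_withCP",
    n2v_neighborhood_radius=5,
)

model = N2V(config, "n2v_example", basedir="models")
model.train(X, X_val)

# Denoise a single noisy image (loaded images have shape (1, Y, X, 1)).
denoised = model.predict(imgs[0][0, ..., 0].astype(np.float32), axes="YX")
```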

Data Analysis
In each dataset, the last 10% of images were used for testing and the remaining images were used for training. To train the CARE network, we used the following configuration: the ADAM optimizer [28], a training batch size of 16 images, 200 training epochs, an initial learning rate of 0.0004, and 400 iterations (training steps) per epoch. In sampling the training images, 800 patches of size 128x128 pixels per image were used to train the CARE network. In all experiments, 10% of the patches were withheld for validation during training, and the model with the best validation error observed during training was saved and used for testing.
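For concreteness, a minimal sketch of this training configuration using the CSBDeep package is shown below. The folder layout and model name are illustrative assumptions; the hyperparameters mirror the values listed above, and probabilistic=False selects the package's default non-probabilistic loss.

```python
from csbdeep.data import RawData, create_patches
from csbdeep.io import load_training_data
from csbdeep.models import Config, CARE

# Pair low-exposure (noisy) and high-exposure (ground-truth) images;
# the folder names are illustrative assumptions about the data layout.
raw_data = RawData.from_folder(
    basepath="dataset_1",
    source_dirs=["low_snr"],
    target_dir="high_snr",
    axes="YX",
)

# Extract 800 random 128x128 patches per image pair and save them.
X, Y, XY_axes = create_patches(
    raw_data,
    patch_size=(128, 128),
    n_patches_per_image=800,
    save_file="dataset_1_patches.npz",
)

# Reload the patches with a 10% validation split (channel-last axes).
(X, Y), (X_val, Y_val), axes = load_training_data(
    "dataset_1_patches.npz", validation_split=0.1, verbose=True
)

# Configuration mirroring the settings described in the text.
config = Config(
    axes,
    n_channel_in=1,
    n_channel_out=1,
    probabilistic=False,          # default (non-probabilistic) loss
    train_epochs=200,
    train_steps_per_epoch=400,
    train_batch_size=16,
    train_learning_rate=0.0004,   # ADAM is the CSBDeep default optimizer
)

model = CARE(config, "care_dataset_1", basedir="models")
model.train(X, Y, validation_data=(X_val, Y_val))

# A held-out low-exposure test image can then be restored with
# model.predict(test_image, axes="YX").
```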

We acquired six datasets under different conditions. Table 1 provides an overview of the six datasets. For the widefield data, we adjusted the camera exposure time to achieve the desired signal-to-noise levels. For confocal microscopy, we recorded images with two different (high or low) detector gains and laser powers.

After data acquisition, we tested three different methods for image denoising. Figure 2 shows the original low-exposure image (raw), the matching high-exposure image (ground truth), and the results of the CARE method, the N2V method, and a standard denoising method (BM3D). For this comparison we selected an image pair from dataset 1 (60X, noise level 1).

INSERT FIGURE 2

Figure 2: Results of denoising methods. Shown are selected images from dataset 1 (60X, noise 1).

Table 2 provides average denoising performance metrics for each method on each dataset. We used two metrics: peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). Before computing the metrics, we scaled and shifted both images to minimize the mean squared error (MSE) between them [17].
Finally, the PSNR metric was calculated as

PSNR = 10 log10(1 / MSE)

The SSIM metric [32] is an image quality metric designed to approximate human perception of similarity to a reference image. Unlike PSNR, it takes into account structural information in the image. The SSIM metric ranges from 0 to 1, with a higher value indicating higher quality.

As shown in Table 2, the unsupervised N2V method is the weakest performer on both metrics. BM3D is better on both metrics but is surpassed by the supervised CARE method on almost all datasets. All methods exhibit a drop in PSNR of approximately 10 dB or more on the noisier datasets (noise 2) in comparison to noise 1. Each method also performed about 6-7 dB worse on the 20X magnification data than on the 60X magnification data.

Visual inspection of the restored images (example shown in Figure 2) shows that, despite having high SSIM scores, BM3D tends to blur the images more than the other methods. The results of the N2V method are noticeably noisier than the results of the other methods.

Table 3 presents a comparison of the methods in terms of computation time. Using a single NVIDIA V100 GPU, the CARE network took about 3.5 hours to train on a single dataset, while the Noise2Void network took about 3 hours. The CARE network took about 1 second to process a single image, while the Noise2Void network took about half that. The BM3D method does not require training but took about 50 seconds to process a single image in MATLAB on a 2.6 GHz Intel Core i3-7100U processor.
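A minimal sketch of this evaluation is shown below. It assumes scikit-image for the SSIM computation and uses a least-squares fit for the scale-and-shift alignment, with the ground truth normalized to [0, 1]; it is one reasonable implementation of the procedure described above, not necessarily the exact code used.

```python
import numpy as np
from skimage.metrics import structural_similarity


def affine_align(restored, gt):
    """Least-squares scale and shift of `restored` to minimize MSE to `gt`."""
    x = restored.astype(np.float64).ravel()
    y = gt.astype(np.float64).ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (scale, offset), *_ = np.linalg.lstsq(A, y, rcond=None)
    return scale * restored.astype(np.float64) + offset


def psnr_ssim(restored, gt):
    """PSNR and SSIM of a restored image against the ground truth.

    The ground truth is normalized to [0, 1], so the PSNR reduces to
    10 * log10(1 / MSE), matching the formula above.
    """
    gt = gt.astype(np.float64)
    gt = (gt - gt.min()) / (gt.max() - gt.min())
    aligned = affine_align(restored, gt)
    mse = np.mean((aligned - gt) ** 2)
    psnr = 10.0 * np.log10(1.0 / mse)
    ssim = structural_similarity(gt, aligned, data_range=1.0)
    return psnr, ssim
```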

Reuse potential
The provided data can be used to implement new methods in machine learning or to test modifications of existing approaches. For example, the data can be used to evaluate methods for denoising, super-resolution, or generative modeling, as well as new image quality metrics. The data could also be used to evaluate the generalization ability of methods trained on one type of data and tested on another. High-quality, publicly available data of this type has been lacking.
