Atom cloud detection and segmentation using a deep neural network

We use a deep neural network (NN) to detect and place region-of-interest (ROI) boxes around ultracold atom clouds in absorption and fluorescence images, with the ability to identify and bound multiple clouds within a single image. The NN also outputs segmentation masks that identify the size, shape and orientation of each cloud, from which we extract the clouds' Gaussian parameters. This allows 2D Gaussian fits to be reliably seeded, thereby enabling fully automatic image processing. The method developed performs significantly better than a more conventional method based on a standardized image analysis library (Scikit-image), both for identifying ROIs and for extracting Gaussian parameters.


Introduction
Deep neural networks have revolutionized data analysis and led to automation of tasks that previously required human supervision. Image analysis has particularly benefited through the use of convolutional neural networks (CNNs) [1] and their derivatives which have allowed for image classification [2,3], object detection [4,5] and instance segmentation [6]. Although many of these neural networks (NNs) were developed for tasks such as facial recognition by social media networks [7,8], they have also been used to identify laser modes [9], classify phases in condensed-matter systems [10,11], reduce measurement errors for trapped-ion qubits [12] and process images from cold atom experiments [13,14,15]. In this work, we use an instance segmentation NN (see Fig. 1) to detect, classify and segment the atom clouds in our images.

[Figure 1 caption: Overview of the neural network (NN). An experimental image is initially fed into a convolutional neural network (CNN) which produces a 'feature map' of the image. The region-proposal network (RPN) uses this feature map to generate regions-of-interest (ROIs) where atom clouds are likely contained. In the ROI alignment stage, the CNN's feature map is cropped and resized for each cloud's ROI before being fed into three parallel branches. The first branch (ROI) generates refined ROIs, whereas the second branch (class) classifies the type of atom cloud in the ROI. Finally, the third branch (mask) generates a segmentation mask corresponding to the 1/e^2 contour of the atom cloud.]

Neural networks consist of an input layer and an output layer with a number of intermediate hidden layers which are connected to one another via 'weights'.
Rather than employing hard-coded algorithms, NNs learn to emulate data they encounter through training cycles, in which data is iteratively passed through the NN and the output compared to the 'ground truth'. The difference between the two is then used to update the weights between the NN's layers, thereby improving its accuracy. When employing supervised training, this requires a dataset which includes both input data and their associated ground truth values. For object detection NNs, these ground truth values are rectangular bounding boxes for each object, as well as labels classifying the object types in the bounding boxes. Instance segmentation NNs build upon object detection NNs by also requiring pixel-to-pixel segmentation masks in which image pixels comprising the object have mask values of one, whereas all other pixels have mask values of zero.
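For training frameworks such as PyTorch's Torchvision (which the authors use, Sec. 3), the ground truth described above is typically packaged per image as arrays of boxes, class labels and binary masks. A minimal sketch with NumPy — the two-cloud annotation and all numerical values here are illustrative, not taken from the paper's dataset:

```python
import numpy as np

H, W = 1216, 1936  # image dimensions used in the paper

# Hypothetical ground truth for one image containing a MOT and an ODT cloud.
# Boxes are (x_min, y_min, x_max, y_max); labels: 1 = MOT, 2 = ODT (0 = background).
boxes = np.array([[800., 400., 1100., 700.],
                  [900., 520., 1300., 580.]], dtype=np.float32)
labels = np.array([1, 2], dtype=np.int64)

# One binary mask per object: pixels belonging to the cloud are 1, all others 0.
masks = np.zeros((2, H, W), dtype=np.uint8)
yy, xx = np.mgrid[0:H, 0:W]
for i, (x0, y0, x1, y1) in enumerate(boxes):
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    rx, ry = (x1 - x0) / 2, (y1 - y0) / 2
    masks[i] = (((xx - cx) / rx) ** 2 + ((yy - cy) / ry) ** 2 <= 1).astype(np.uint8)

target = {"boxes": boxes, "labels": labels, "masks": masks}
```

A dict of this shape, paired with the image tensor, is the standard target format consumed by Torchvision's detection models.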
Our dataset consists of images of cold atom clouds in a magneto-optical trap (MOT) [16] and an optical dipole trap (ODT) [17] (see Fig. 2a-c). Atom clouds in these traps form approximately Gaussian density distributions [18]. Fitting a cloud allows the parameters describing the distribution (Gaussian parameters) to be extracted and used to ascertain information such as the cloud's size and density. Furthermore, by using time-of-flight measurements [19] the temperature of the atoms can be determined.
A region-of-interest (ROI) [20] centered on the atom cloud is used during fitting as objects in the image other than the atom cloud can cause an inaccurate fit (e.g. an atom cloud in another trap or extraneous noise). Additionally, decreasing the fit area can significantly decrease the fit time when using two-dimensional fitting. Manually determining the ROI is time-consuming when analyzing a large number of images and an algorithmic procedure is often employed, such as taking the 'center of mass' and then iteratively expanding the ROI around this point until the fractional enclosed 'power' exceeds some threshold. Another method involves performing connected component analysis on a binarized version of the image and then measuring the resulting regions using common image processing libraries [21]. However, if the image is noisy (e.g. contains fringing), the proposed ROIs will be inaccurate for these types of methods (Sec. 6).
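The 'center of mass plus iterative expansion' procedure mentioned above can be sketched in a few lines. This is a simplified illustration; the function name, the fractional-power threshold and the expansion step are arbitrary choices, not the paper's implementation:

```python
import numpy as np

def com_roi(img, frac=0.95, step=2):
    """Grow a square ROI around the image's center of mass until it
    encloses `frac` of the total background-subtracted 'power'."""
    img = np.clip(img - np.median(img), 0, None)  # crude background subtraction
    total = img.sum()
    yy, xx = np.indices(img.shape)
    cy = int((yy * img).sum() / total)            # center of mass
    cx = int((xx * img).sum() / total)
    r = step
    while True:
        y0, y1 = max(cy - r, 0), min(cy + r, img.shape[0])
        x0, x1 = max(cx - r, 0), min(cx + r, img.shape[1])
        if img[y0:y1, x0:x1].sum() >= frac * total or (y0, x0) == (0, 0):
            return x0, y0, x1, y1
        r += step

# Synthetic Gaussian cloud for demonstration (1/e^2 radius of 20 px)
yy, xx = np.mgrid[0:200, 0:200]
cloud = np.exp(-2 * ((xx - 120) ** 2 + (yy - 80) ** 2) / 20 ** 2)
roi = com_roi(cloud)
```

As the text notes, such hard-coded procedures work on clean images like this one but degrade quickly in the presence of fringing or a second cloud.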
Here we propose a deep neural network based approach to ROI determination in which a NN finds the ROI for each atom cloud in an image (see Fig. 1). Furthermore, the NN differentiates between clouds (classification) in a MOT and those in an ODT and also outputs a segmentation mask for each cloud from which Gaussian parameters are directly extracted. The classification feature is particularly useful when the cloud type is not a priori known from the experimental sequence (e.g. images containing both MOT and ODT clouds during the ODT loading). Although features such as the position or aspect ratio can be used for classification with additional manual or experimental input, the NN needs the image alone (beyond training on an appropriate dataset) to determine cloud types and so is easily adaptable to other cold atom experiments with different numbers of clouds or cloud types.
The rest of the article is arranged as follows: Sec. 2 describes the experimental dataset used for NN training and validation, Sec. 3 describes the training process, Sec. 4 discusses Bayesian optimization [22,23] of the NN's hyperparameters [23] and Sec. 5 examines how the Gaussian parameters are calculated from the segmentation mask. In Sec. 6 we compare our proposed NN method to a more conventional method of determining both ROIs and Gaussian parameters.

Experiment and Dataset
To produce the ultracold atom clouds, erbium atoms are initially trapped and cooled to ∼20 µK in a narrow-line MOT [24] before being loaded into an ODT formed from a 30 W, 1030 nm laser beam focused down to a ∼40 µm waist (see Fig. 2). Optimization of the trap loading involves maximizing the atom number while minimizing the cloud temperature. The atom number is found by fitting the atom cloud in either a fluorescence or absorption image § with a two-dimensional (2D) Gaussian (see Eq. 1) and then integrating under the curve. The cloud temperature can be determined from how the cloud width evolves during time-of-flight expansion.
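The width evolution used for thermometry is the standard ballistic-expansion result (not written out in this excerpt): for atoms of mass $m$ at temperature $T$, the Gaussian width $\sigma$ of the cloud after a time of flight $t$ obeys

$$\sigma^2(t) = \sigma_0^2 + \frac{k_B T}{m}\,t^2,$$

with the $1/e^2$ radius given by $w = 2\sigma$, so $T$ follows from a linear fit of the squared width against $t^2$.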
The experimental dataset consists of 260 fluorescence and absorption images (1936×1216 pixels) along with their ROIs, labels and segmentation masks. Of these, 130 images contain clouds released from the MOT with no ODT present (see Fig. 2b), and 130 images contain either just atoms released from the ODT (see Fig. 2c) or images where atoms released from the ODT and the MOT are both present (see Fig. 2d). [Footnote §: Experimental images shown in the paper are processed from 2-3 raw images. Fluorescence images require both an image with atoms and a background image without atoms. Absorption images additionally require a probe image in which the probe beam is turned on, but no atoms are present.] We manually label the atom clouds in the images as 'MOT' or 'ODT' and draw a ROI box at the clouds' edges, which we define as the 1/e^2 radii along the x and y axes (see Fig. 2d). This definition prevents the ROI boxes from overlapping when the MOT and ODT are both present; however, the ROI boxes are also easily expandable when analysis requires the wings of the distribution. The manually drawn ROIs were expanded by a factor of two (except where the expanded ROIs would overlap) and the atom clouds inside fit with a 2D Gaussian

I(x, y) = I_b + I_0 exp( -2 [ ((x - x_0)cos θ + (y - y_0)sin θ)^2 / w_x^2 + (-(x - x_0)sin θ + (y - y_0)cos θ)^2 / w_y^2 ] ),    (1)

where I(x, y) is the image intensity, I_b is the background intensity, I_0 is the peak intensity, x_0 and y_0 are the center coordinates, w_x and w_y are the 1/e^2 radii along the major and minor axes and θ is the angular orientation of the distribution. To increase the accuracy of the ROIs used for training, the fit parameters were used to calculate the 1/e^2 radii along the image axes [9] (previously estimated by eye) and the ROI boxes redrawn using these values. The process of fitting and redrawing the ROI boxes from the fit parameters was then completed a second time, with subsequent iterations neglected due to an insignificant increase in accuracy. A segmentation mask was generated for each atom cloud (see Fig. 2e) with the mask borders placed at the 1/e^2 contour of the cloud, calculated from the final fit parameters; pixels within the 1/e^2 contour were set to one, whereas pixels outside were set to zero. Finally, the dataset was randomly split into a training dataset with 200 images and a validation dataset with 60 images.
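The mask construction described above — pixels inside the fitted 1/e^2 contour set to one — follows directly from the fit parameters. A sketch assuming the rotated-Gaussian parametrization of Eq. 1 (the function name and test values are illustrative):

```python
import numpy as np

def elliptical_mask(shape, x0, y0, wx, wy, theta):
    """Binary mask that is 1 inside the 1/e^2 contour of a rotated 2D Gaussian."""
    yy, xx = np.indices(shape)
    # Rotate coordinates into the Gaussian's principal-axis frame
    xr = (xx - x0) * np.cos(theta) + (yy - y0) * np.sin(theta)
    yr = -(xx - x0) * np.sin(theta) + (yy - y0) * np.cos(theta)
    return ((xr / wx) ** 2 + (yr / wy) ** 2 <= 1).astype(np.uint8)

mask = elliptical_mask((200, 300), x0=150, y0=100, wx=60, wy=25, theta=0.3)
```

The enclosed pixel count approximates the contour-ellipse area π·w_x·w_y, which is a quick sanity check on generated masks.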

Neural Network and Training
We use the neural network Mask R-CNN [25] to detect and bound the atom clouds, as well as to provide segmentation masks for each cloud (see Fig. 3a-b). The NN (see Fig. 1) begins with the experimental image being fed into a convolutional neural network base (CNN, ResNet-50 [26]). The CNN's outputted feature map [27] is then passed into a region-proposal network (RPN) which generates ROIs where objects are likely located. Next, these ROIs are cropped from the CNN's feature map in a ROI alignment stage and resized to uniform dimensions. The cropped feature maps are then fed into three parallel branches. The first applies a classifier to determine the object type and give the confidence of its prediction, which is helpful in determining whether to use the ROI in post-NN analysis. The second branch gives a more accurate ROI box prediction (see Fig. 3a) and finally the third outputs a segmentation mask for the object inside the ROI (see Fig. 3b). Since all three branches share the same base, computation speed is significantly increased [28] for both training and evaluation.
During training, the NN output is compared to the ground truth (i.e. the expected output from the training dataset) via a loss function; the loss is then back-propagated [29] through the NN to adjust the weights between layers and refine the NN model (see Fig. 4a). The loss function for the RPN stage is L1 loss [30], for the ROI branch it is Smooth L1 loss [31], the classifier uses categorical cross entropy loss [32] and lastly the mask branch utilizes binary cross entropy loss [33]. Although the loss can be separately back-propagated for each branch, a simpler approach is taken here in which the losses are summed together and then back-propagated to update the weights of the NN [25]. An epoch denotes a single cycle of training in which every image in the training dataset is passed through the NN, the loss calculated and the model weights updated. Increasing the number of training epochs can increase the NN's final accuracy so long as the NN does not overfit on the training data. However, due to finite computational resources, we restrict the training of an individual NN to fifteen epochs (determined to be sufficient, see below) and instead use Bayesian optimization (BO) to increase the accuracy of the final trained model (see Sec. 4).
Five hyperparameters, which tune the learning process, are set before the NN's training phase. The first is the learning rate which determines the size of the step the NN takes during stochastic gradient descent [34]. If the learning rate is too low, the NN will take too long to converge to the minimum of the loss function, whereas if the learning rate is too high, the NN will not be able to descend to the minimum, but will oscillate around it or diverge. Since larger learning rates are useful at the beginning of training and smaller learning rates are useful towards the end, a learning rate scheduler [35] is employed which decreases the learning rate by some scalar (decay, the second hyperparameter) after a fixed number of epochs (step size, the third hyperparameter). The fourth hyperparameter is the momentum which prevents the NN from getting stuck in a local minimum during training [36]. The last hyperparameter is the batch size (the number of images simultaneously passed through the NN) and, unlike the other hyperparameters which are tuned with BO, is fixed to four for all NNs.
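The stepped schedule amounts to simple arithmetic: lr(epoch) = lr0 · decay^⌊epoch/step_size⌋. A sketch — the numerical values are illustrative, not the optimized hyperparameters of Table 1:

```python
def stepped_lr(lr0, decay, step_size, epoch):
    """Learning rate after `epoch` epochs under a step scheduler
    (the behavior implemented by e.g. torch.optim.lr_scheduler.StepLR)."""
    return lr0 * decay ** (epoch // step_size)

# Illustrative schedule: the learning rate halves every 5 epochs
schedule = [stepped_lr(1e-3, 0.5, 5, e) for e in range(15)]
```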
The accuracy of the NN is evaluated (see Fig. 4b) using the COCO evaluation metric, which entails calculating the intersection-over-union

IoU = area(ROI_p ∩ ROI_gt) / area(ROI_p ∪ ROI_gt),

where ROI_p is the ROI prediction from the NN and ROI_gt is the ground truth ROI. Values below a predetermined IoU threshold (e.g. IoU = 0.5) are considered a false prediction (mislabelling the object class within the ROI is also a false prediction), whereas values above are considered a true prediction. A precision-recall curve [38] is then constructed and integrated to give the average precision (AP) for the given threshold (e.g. AP50 for the IoU = 0.5 threshold). Finally, the AP is calculated for ten IoU thresholds (0.50-0.95 with a step size of 0.05) and averaged to give the mean average precision (mAP), which is the metric generally reported for object detection. The mask AP values and mAP can be similarly calculated [39]. The NNs are trained in a Google Colab notebook [40] utilizing a GPU backend and implemented in PyTorch [41] using the pre-built Mask R-CNN model in the Torchvision package. Rather than training the NN from scratch, the model weights are loaded from a network pre-trained on the COCO train2017 dataset [42], which significantly reduces the time required to train the network and increases the model's final accuracy [43]. Additionally, this transfer learning, complemented by data augmentation (in which images are randomly flipped horizontally during training), allows us to use a relatively small training dataset. After each training epoch, the updated NN is evaluated on the validation dataset which yields the mAP for both the object detection and the segmentation mask branches. These two values are averaged together to give the accuracy of the training epoch (average mAP) and the epoch with the highest average mAP determines the overall accuracy of the NN model. After using BO to determine the hyperparameters which give the highest accuracy, we retrain the NN with these parameters for 30 epochs (see Fig. 4c) and verify that the NN's accuracy becomes asymptotic at the 15 epoch mark, with no overfitting seen thereafter.

[Figure 4 caption fragment: ... (see Table 1). The shaded region represents the number of epochs used during Bayesian optimization.]
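The box IoU at the heart of the COCO metric can be computed directly from corner coordinates; a sketch with boxes as (x_min, y_min, x_max, y_max) and illustrative values:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

score = iou((0, 0, 100, 100), (50, 0, 150, 100))  # half-overlapping boxes -> 1/3
```

Identical boxes score 1, disjoint boxes score 0, and the half-overlapping pair above scores 1/3 — below the common 0.5 threshold, so it would count as a false prediction.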

Bayesian Optimization of Hyperparameters
The accuracy of the trained NN is sensitive to the hyperparameters used during training. In the past, the hyperparameters were tuned through grid search, random search [44] or by hand; however, in recent years, Bayesian optimization (BO) has been successfully employed to find the best set of hyperparameters [23,45]. BO is particularly useful when trying to find the minimum (or maximum) of a function which is noisy and expensive to evaluate-such as NN training-thereby making a grid search of the parameter space impractical [22].
Bayesian optimization takes the function value (cost) at previously evaluated points and uses a Gaussian process to model the cost as a function of the parameter space [46]. The model also determines the confidence of its predictions in a given region of the parameter space: perfect certainty at evaluated points, low uncertainty near evaluated points and high uncertainty far from evaluated points. The BO loop then determines where to evaluate the function next by weighing the benefits of evaluating the function near the model's predicted minimum (or maximum) or evaluating the function in an unexplored region of the parameter space [47]. When optimizing our NN training with BO, the average mAP of the NN is evaluated as a function of the hyperparameter space which consists of the learning rate, momentum, and the learning rate scheduler step size and decay (see Table 1). As a warm start, the NN is initially trained and evaluated at five quasi-randomly (Sobol generated [48]) distributed points (see black squares in Fig. 5a-f); further evaluation points (red circles in Fig. 5a-f) are then selected by the BO loop. With an increasing number of BO evaluation trials the best achieved average mAP rises and converges (see Fig. 5g). The best set of hyperparameters (see Table 1) gives a mAP of 86.3 % for the object detection branch (locating 88 of the 89 clouds) and a mAP of 85.8 % for the mask branch. These mAP values are higher than those for similar NNs trained on the COCO validation dataset [25,50], which is likely due to our NN only needing to classify two object types rather than the eighty in the COCO validation dataset [37], as well as the relatively simple features of the MOT and ODT clouds.
The NN also performs well for clouds with a low signal-to-noise ratio (SNR), which we define as the cloud's peak amplitude divided by the standard deviation of the image's background intensity; the validation dataset's noisiest cloud has an SNR of only 2.1 but returns an ROI IoU score of 0.86.
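The BO loop described above — a Gaussian-process surrogate plus an acquisition rule trading off exploitation and exploration — can be illustrated in one dimension. A self-contained sketch using an RBF-kernel GP and an upper-confidence-bound acquisition; all settings (kernel length scale, κ, trial counts) are illustrative and unrelated to the paper's actual optimization:

```python
import numpy as np

def rbf(x1, x2, ls=0.15):
    """RBF kernel matrix between two sets of 1D points."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """GP posterior mean and standard deviation at x_test."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y_train
    var = 1.0 - np.einsum('ij,jk,ki->i', Ks.T, Kinv, Ks)
    return mu, np.sqrt(np.clip(var, 0, None))

def bayes_opt(f, n_init=3, n_iter=12, kappa=2.0, seed=0):
    """Maximize f on [0, 1] with a GP surrogate and UCB acquisition."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0, 1, 201)
    x = rng.uniform(0, 1, n_init)                 # warm-start evaluations
    y = np.array([f(xi) for xi in x])
    for _ in range(n_iter):
        mu, sd = gp_posterior(x, y, grid)
        x_next = grid[np.argmax(mu + kappa * sd)] # exploit/explore trade-off
        x = np.append(x, x_next)
        y = np.append(y, f(x_next))
    return x[np.argmax(y)]

# Toy 'accuracy vs hyperparameter' curve peaking at 0.3
best = bayes_opt(lambda t: -(t - 0.3) ** 2)
```

The uncertainty term κ·σ is what sends the loop into unexplored regions; setting κ = 0 would reduce it to pure exploitation of the current model.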

Extracting Gaussian Parameters from the Segmentation Mask
... the peak amplitude I_0 and the background intensity I_b were both divided by the fitted peak amplitude. The Gaussian parameters extracted from the NN's segmentation mask can either be used directly or as seed parameters for a conventional 2D fit, increasing the likelihood of fit convergence and reducing the fitting time. Without a seed, the fit time for a full image (1936×1216 pixels) is approximately 15 seconds on an Intel Xeon 2.2 GHz processor core. For the validation dataset, this is reduced to an average of 3 seconds when using the ROIs proposed by the NN (expanded by a factor of two except where the expanded ROIs would overlap), which substantially increases the fit speed despite the NN's processing time of 0.17 seconds per image using a Nvidia Tesla V100 GPU. Extracting the Gaussian parameters takes an average of 0.18 seconds, but results in an average fit speed-up of 2 seconds per region when used as a seed. Thus, our method of determining ROIs and seed parameters using a NN offers a significant speedup in conjunction with fitting.
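Although the opening of this section is partly lost in this excerpt, Gaussian parameters can be recovered from a binary 1/e^2 mask via image moments: for a filled ellipse, each semi-axis equals twice the square root of the corresponding eigenvalue of the pixel covariance matrix. A sketch of this approach (the function name and test values are illustrative, not necessarily the paper's exact procedure):

```python
import numpy as np

def mask_to_gaussian(mask):
    """Estimate (x0, y0, wx, wy, theta) from a binary 1/e^2 mask.
    Uses the fact that a uniform ellipse with semi-axis a has
    variance a^2/4 along that axis, so a = 2*sqrt(eigenvalue)."""
    yy, xx = np.nonzero(mask)
    x0, y0 = xx.mean(), yy.mean()                 # centroid
    cov = np.cov(np.vstack((xx, yy)))             # pixel covariance matrix
    evals, evecs = np.linalg.eigh(cov)            # eigenvalues ascending
    wy, wx = 2 * np.sqrt(evals)                   # minor, major 1/e^2 radii
    theta = np.arctan2(evecs[1, 1], evecs[0, 1])  # major-axis orientation
    return x0, y0, wx, wy, theta

# Axis-aligned test ellipse with wx = 60, wy = 25 centered at (150, 100)
yy, xx = np.indices((200, 300))
mask = (((xx - 150) / 60) ** 2 + ((yy - 100) / 25) ** 2 <= 1).astype(np.uint8)
params = mask_to_gaussian(mask)
```

The amplitude I_0 and background I_b would then follow from the image intensities inside and outside this mask, as described for the conventional method in Sec. 6.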

Comparison with Conventional Analysis
For comparison with the NN, a conventional method is used to find the ROIs and Gaussian parameters of the clouds (see Fig. 6b). We utilize standard methods based on Python's Scikit-image (Skimage) package [21] to pre-process the images, determine ROIs for the clouds, and extract the Gaussian parameters. During pre-processing, a histogram is taken of the image's pixel intensities. As the cloud areas are much smaller than that of the overall image, an approximately Gaussian peak corresponding to the image's background level and noise is seen. The mean µ and standard deviation σ of a Gaussian fit to this peak are used to define a threshold value I_thresh = µ + 3σ [52] and the image is binarized by setting pixels with I > I_thresh to one and the remainder to zero.
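This pre-processing step can be sketched as follows. For brevity the sketch estimates µ and σ of the background with the median and the median absolute deviation rather than an explicit histogram fit — a robust stand-in for the Gaussian-peak fit described above:

```python
import numpy as np

def binarize(img, k=3.0):
    """Binarize an image at mu + k*sigma of the background distribution,
    with mu and sigma estimated robustly (median / MAD)."""
    mu = np.median(img)
    sigma = 1.4826 * np.median(np.abs(img - mu))  # MAD -> std for Gaussian noise
    return (img > mu + k * sigma).astype(np.uint8)

# Synthetic image: Gaussian noise background plus one bright cloud
rng = np.random.default_rng(1)
img = rng.normal(0.0, 1.0, (200, 200))
img[80:120, 90:130] += 10.0
binary = binarize(img)
```

Because only ~0.13 % of Gaussian background pixels exceed µ + 3σ, the surviving pixels are overwhelmingly cloud pixels — unless, as noted in the text, fringing raises the background tails.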
After pre-processing, connected component analysis [53] is applied to the binarized image which groups pixels of the same value together into 'regions' (e.g. an atomic cloud). These regions are fed into Skimage's regionprops method which returns both the ROI coordinates and the geometric parameters {x_0, y_0, w_x, w_y, θ} for each region. ROIs with an area > 1/2 or < 1/800 of the image size (i.e. much larger or smaller than the size of the clouds under measurement) are removed and the two largest remaining ROIs (if there are more than one) are taken as the MOT and ODT ROIs, since smaller regions generally coincide with remaining high noise regions. Applying this method to the validation set locates 68 of the 89 clouds with an IoU > 50 %; this is significantly worse than the NN, which locates 88 of the 89 clouds. The mAP is also calculated, but since the conventional method does not differentiate between MOT and ODT clouds, all the validation dataset regions are relabelled as 'cloud'. Even with this simplification, an mAP of only 11.1 % is achieved, much lower than the NN's object detection mAP of 86.3 %. On closer inspection, the conventional method fails on images with low signal-to-noise ratios (SNR), such as images with significant fringing.
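The connected-component and area-filter steps can be sketched with SciPy, used here in place of Skimage's regionprops purely to keep the example minimal; the 1/2 and 1/800 area bounds are those quoted above, while the mock regions are illustrative:

```python
import numpy as np
from scipy import ndimage

def candidate_rois(binary):
    """Label connected regions, keep those between 1/800 and 1/2 of the
    image area, and return bounding boxes sorted largest-first."""
    labeled, n = ndimage.label(binary)
    area = binary.size
    rois = []
    for sl in ndimage.find_objects(labeled):
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        if area / 800 < h * w < area / 2:
            rois.append((sl[1].start, sl[0].start, sl[1].stop, sl[0].stop))
    # the two largest surviving regions are taken as the MOT and ODT ROIs
    rois.sort(key=lambda r: (r[2] - r[0]) * (r[3] - r[1]), reverse=True)
    return rois[:2]

binary = np.zeros((200, 200), dtype=np.uint8)
binary[40:80, 30:90] = 1       # mock MOT region (60x40 px)
binary[150:160, 100:180] = 1   # mock ODT region (80x10 px)
rois = candidate_rois(binary)
```

On clean binary images this works well; the failure mode described in the text arises when fringing survives binarization and produces spurious large regions.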
To determine the Gaussian parameters of the 68 successfully bounded clouds, we apply a numerical scaling factor (to account for our thresholding method [51]) to the {w_x, w_y} returned from regionprops and then generate a binary mask from the resulting {x_0, y_0, w_x, w_y, θ}, which is used to find the cloud's background intensity I_b and amplitude I_0 (see Fig. 6c) as in Sec. 5. The parameters are normalized and compared to those from the 2D fit (see Fig. 7, orange bars). Since the conventional method only recognizes a subset of the clouds, we quantitatively compare against the NN segmentation method for the same cloud subset (shown as filled blue bars in Fig. 7); a direct comparison of the root mean squared errors is shown in Fig. 7h.
When considering both object detection and Gaussian parameter extraction, the NN method significantly outperforms the conventional method. Additionally, the conventional method's efficacy depends on the SNR of the data, whereas the NN is more robust against high noise levels and fringing since it learns higher level features of the atom clouds. Furthermore, the conventional method requires manual determination of the best pre-processing procedures and therefore potentially needs further tuning for new data; however, when faced with new data the NN simply needs to be retrained with more labelled data, thus making it effective in a laboratory setting.

Conclusion
An instance segmentation neural network (Mask R-CNN) was trained to identify ultracold atom clouds in magneto-optical traps and optical dipole traps. The neural network (NN) generates both a region-of-interest (ROI) and a segmentation mask for each cloud (corresponding to the cloud's 1/e^2 radii) with a mean average precision of 86.3 % and 85.8 % on the ROI and the segmentation mask branches, respectively. We show that the Gaussian parameters describing the atom clouds' distributions can also be extracted directly from the segmentation masks. Both ROI determination and Gaussian parameter extraction via the NN are significantly more accurate than a conventional method based on Python's Scikit-image library.