Dark solitons in Bose-Einstein condensates: a dataset for many-body physics research

We establish a dataset of over $1.6\times10^4$ experimental images of Bose--Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About $33~\%$ of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and object detector (OD) as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference of the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even to advance cold atom experiments.


Introduction
Advances in machine learning (ML), and especially in the area of deep learning, are data driven. Yet, in many fields of science it is common practice for researchers either to make data available only upon "reasonable request" or not to share it at all. As a result, the development of specialized ML techniques as well as trained models is often stymied by the lack of high-quality relevant datasets. Cold atom experiments produce vast amounts of data (images of atom clouds), making cold atoms an ideal system where data availability can both advance ML research and provide new applications relevant to experiment [1][2][3][4]. Here we provide a two-component dataset consisting of absorption images of dark solitons in atomic Bose-Einstein condensates (BECs). The first component of this dataset is a curated revision of the Dark solitons in BECs dataset v.1.0 [5,6] containing over $6\times10^3$ images, which we carefully amend to assure high-quality labels. The second component contains approximately $1\times10^4$ additional preprocessed images that are automatically labeled using the SolDet package [20]. This dataset is available via the National Institute of Standards and Technology Public Data Repository [5] to provide an opportunity for the data science community to develop more sophisticated analysis tools for soliton research and to further understand nonlinear many-body physics.

Dataset curating: materials and methods
In 2021, we released the "Dark solitons in BECs dataset" [5] consisting of approximately $6.3\times10^3$ preprocessed absorption images taken from multiple experiments performed in a single lab over a span of two months with human assigned labels. Based upon the number of solitonic excitations observed in a given image of a BEC, the data was organized into three classes: no excitations (class-0, accounting for 19.8 % of the dataset), lone excitation (class-1, accounting for 55.4 %), and other excitations (class-2, accounting for 24.8 %). While the initial agreement rate between annotators was relatively high at 87 %, the remaining 13 % of the dataset had to be "further analyzed and discussed until an agreement" between annotators was reached (footnote 4), as stated in reference [6]. Such discussions of labels might introduce an undesirable bias in the labels, especially when the data is challenging to interpret. This bias is in turn imprinted into any ML model trained using that data, thereby putting the model's reliability into question.
While manually re-examining the dataset in the context of reference [4], we confirmed the presence of inconsistencies in the human assigned labels. We found three types of labeling errors: (e1) Images in class-2 containing only a single excitation.
Moreover, in reference [6], the excitation's location for the lone excitation class was determined using fits centered on the deepest density depletion. Our reexamination showed that some solitonic excitations were far from this point [error type (e4)]. As a result, we decided to use a combination of ML and statistical analyses to identify potentially incorrectly labeled data as well as to curate the dataset.
Building on the deep ensembles approach, originally proposed as a means to estimate models' predictive uncertainty [21], we implement an iterative five-fold cross-validation (footnote 5) with strict agreement constraints to curate the original dataset. As described in detail below, we employ two classification schemes for the cross-validation. First, we verify the human assigned labels using a set of convolutional neural network (CNN) classifiers trained using the original dataset. In addition, to ensure diversity of the trained models [22], we use a set of object detectors (ODs) [23,24] trained to localize all solitonic excitations within each BEC image and compare the number of detected excitations with both the original and CNN labels. After each iteration, images with insufficient agreement are further analyzed and, if necessary, removed from the dataset and set aside. Following the data curation process, we add the location of all excitations obtained from the ODs as an additional label. The curated dataset is used to train an implementation of SolDet [20], a general-purpose framework for feature identification in cold atom experiments [4]. We then use the physics-informed excitation (PIE) classifier module of SolDet to add fine-grained and physically-motivated labels (e.g., longitudinal solitons, solitonic vortices, and "partial" solitons) for the lone excitation class.
Furthermore, we employ SolDet to automatically label about $1\times10^4$ additional experimental images (class-9) as well as all images set aside during the curation process (class-8) [5].

Footnote 4: The agreement was substantially higher for the easier to interpret class-0 (95.7 %) than for the other two cases (88.7 % for class-1 and 76.3 % for class-2), indicating a likely decrease in label reliability with increased data complexity.

Footnote 5: The k-fold cross-validation is a resampling method that involves dividing the full dataset into k partitions and then performing a series of training and testing runs, with each run using k − 1 partitions to train a given module and the remaining partition to test it. The process is repeated k times to fully cover the dataset.
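The five-fold cross-validation used for curation can be illustrated with a minimal sketch of the index bookkeeping. The function name `k_fold_indices`, the random shuffling, and the seed are assumptions made for illustration; they are not taken from the SolDet package.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)   # shuffle before partitioning (assumption)
    folds = np.array_split(order, k)     # k partitions of near-equal size
    for i in range(k):
        test_idx = folds[i]
        # Train on the remaining k - 1 partitions
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Toy example with 23 samples
splits = list(k_fold_indices(n_samples=23, k=5))
```

In each of the k runs a fresh model would be trained on `train_idx` and evaluated on `test_idx`, so every image receives a prediction from a model that never saw it during training.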

Data preprocessing
In the raw data, shown in figures 1(a)-(c), the BEC occupies only a small region of the image (inside the red box) and the long axis of the BEC is rotated with respect to the camera. The horizontal and vertical axes in figures 1(a)-(c) are labeled in terms of camera pixels i and j.
The angle between the camera and the BEC depends on the experimental setup and is obtained from fits to a representative subset of absorption images (in the case of our data the rotation angle is about $\phi = 40$ degrees). The BEC's angle, position, and size (all necessary for proper cropping) are determined by fitting every image to a column-integrated 3D Thomas-Fermi distribution describing the density distribution of 3D BECs integrated along the imaging axis [25].
There are seven parameters in this fit: the rotation angle $\phi$; the BEC center coordinates $[i_0, j_0]$ in the original image frame; the peak 2D density $n_0$; the Thomas-Fermi radii $[R_i, R_j]$; and an offset $\delta n$ (from small changes in probe intensity between images). The rotated coordinates are obtained by rotating the pixel coordinates about the BEC center $[i_0, j_0]$ by the angle $\phi$.

To ease the manual analysis and labeling process and facilitate training of the ML models, the absorption images are first rotated to align the BEC with the image frame and then cropped to discard the large fraction of the image that does not contain relevant information. Finally, an elliptical mask (determined based on the $[R_i, R_j]$ radii) is applied to the image to eliminate the noise outside the BEC [6].
The same preprocessing techniques are applied to the more than $1\times10^4$ previously unlabeled absorption images now included in the expanded "Dark solitons in BECs dataset v.2.0".
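As an illustration of the seven-parameter fit model and the elliptical mask, the sketch below evaluates the textbook column-integrated Thomas-Fermi profile, $n_0 \max[1 - (\tilde{\imath}/R_i)^2 - (\tilde{\jmath}/R_j)^2, 0]^{3/2} + \delta n$, in rotated coordinates. The sign convention of the rotation, the function names, and the toy parameter values are assumptions; the exact expressions used for the dataset are not reproduced here.

```python
import numpy as np

def rotated_coords(i, j, i0, j0, phi):
    """Rotate pixel coordinates about (i0, j0) by phi in radians (sign convention assumed)."""
    di, dj = i - i0, j - j0
    i_rot = np.cos(phi) * di + np.sin(phi) * dj
    j_rot = -np.sin(phi) * di + np.cos(phi) * dj
    return i_rot, j_rot

def thomas_fermi_2d(i, j, phi, i0, j0, n0, Ri, Rj, dn):
    """Column-integrated 3D Thomas-Fermi profile with a constant offset dn."""
    i_rot, j_rot = rotated_coords(i, j, i0, j0, phi)
    inside = np.clip(1.0 - (i_rot / Ri) ** 2 - (j_rot / Rj) ** 2, 0.0, None)
    return n0 * inside ** 1.5 + dn

def elliptical_mask(shape, center, radii):
    """Boolean mask that is True inside an axis-aligned ellipse."""
    ii, jj = np.indices(shape)
    return ((ii - center[0]) / radii[0]) ** 2 + ((jj - center[1]) / radii[1]) ** 2 <= 1.0

# Toy evaluation on a 64 x 96 grid with hypothetical parameters
ii, jj = np.indices((64, 96))
model = thomas_fermi_2d(ii, jj, np.deg2rad(40.0), 32, 48, 2.0, 10.0, 30.0, 0.05)
mask = elliptical_mask((64, 96), center=(32, 48), radii=(10.0, 30.0))
```

In practice such a model function would be handed to a least-squares fitting routine, and the mask would be applied only after the image has been rotated and cropped so that the ellipse is aligned with the image axes.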

Data labeling process: dark solitons in BECs dataset v.1.0
As discussed in reference [6], the Dark solitons in BECs dataset v.1.0 [5] consists of images labeled by three independent annotators. These labels organized the data into three disjoint classes: class-0 indicating no excitations, class-1 indicating lone excitation, and class-2 indicating other excitations (e.g. different types of excitations, multiple excitations, ambiguous data). The expectation was for class-0 to consist of images that unambiguously contain no solitonic excitations; for class-1 to contain images with exactly one solitonic excitation; and for class-2 to contain all other images. The initial labeling process was carried out in batches. At each stage, a subset of between 508 and 1209 images was independently labeled by each annotator and the resulting labels were compared. The labels with full agreement were accepted. When only two out of three annotators agreed (moderate disagreement), the images were reinspected and further discussed until an agreement was reached. Finally, images labeled differently by each annotator (strong disagreement) were added to class-2. The top two rows in table 1 show the distribution of images between classes in Dark solitons in BECs dataset v.1.0, as well as the number of images with full agreement between annotators in each class.
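The three agreement outcomes for a single image can be written as a small decision rule. This is only a sketch of the logic described above; the function name and return convention are hypothetical, and the discussion step for moderate disagreement is a human process that the code merely marks as unresolved.

```python
from collections import Counter

def agreement_level(labels):
    """Classify three annotator labels as full, moderate, or strong (dis)agreement.

    Returns (level, resolved_label); resolved_label is None when the image
    still has to be discussed by the annotators.
    """
    if len(labels) != 3:
        raise ValueError("expected exactly three annotator labels")
    top_label, top_votes = Counter(labels).most_common(1)[0]
    if top_votes == 3:
        return "full", top_label      # label accepted as is
    if top_votes == 2:
        return "moderate", None       # reinspected and discussed
    return "strong", 2                # all three differ: added to class-2
```

For example, `agreement_level((1, 1, 1))` accepts class-1 directly, while `agreement_level((0, 1, 2))` sends the image to class-2.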

Data labeling process: Dark solitons in BECs dataset v.2.0
During the first phase of data curating, we performed a pair of five-fold cross-validation tests using the original dataset. The first cross-validation used CNN models trained on all three classes and the second used ODs trained with only class-0 and class-1. After cross-validation we tagged each excitation located by the OD with a quality estimate [4]. The quality estimator yields the likelihood that a fit to the 1D profile of a given excitation has parameters in the range expected for a solitonic excitation. The likelihood was established based on a statistical analysis of fits to features previously identified as solitonic excitations in comparison with all other density depletions.
The cross-validations result in each image in class-0 and class-1 being assigned two predicted labels that are used to identify ambiguous data. To enable direct comparison of the two models, the OD predicted label is class-0 if no excitations are identified, class-1 if one excitation is found, and class-2 in all other cases. An image is flagged as potentially mislabeled if and only if (i) the CNN prediction disagrees with the assigned class and (ii) the CNN and OD predictions are the same.
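The flagging rule combines the two conditions (i) and (ii) with a logical AND. A minimal sketch, with hypothetical function names, reads:

```python
def od_count_to_class(n_excitations):
    """Map the number of OD-detected excitations to class-0, class-1, or class-2."""
    return n_excitations if n_excitations in (0, 1) else 2

def is_potentially_mislabeled(assigned, cnn_pred, od_count):
    """Flag an image iff the CNN disagrees with the assigned class AND agrees with the OD."""
    od_pred = od_count_to_class(od_count)
    return cnn_pred != assigned and cnn_pred == od_pred
```

Requiring both models to agree with each other, and not only to disagree with the human label, keeps a single unreliable model from flagging an image on its own.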
We found 343 potentially mislabeled images: 14 in class-0, 40 in class-1, and 289 in class-2. We note that our intent is to use cross-validation and deep ensembles to assist data curating, not to change the ground truth. Thus, we do not overwrite the original class labels during the curating process. Rather, all flagged images are further analyzed and all potential excitations are tested with the quality metric. At this stage, all images in the extended Dark solitons in BECs dataset v.2.0 have label v1 (either the original [...] (see table 2). label v2 is set equal to label v1 except for data determined to be truly mislabeled, where label v2 is assigned to a new class-8, effectively removing it from the curated dataset. The resulting distribution between classes is shown in table 1.

Table 2: Complete dictionary of the labels appearing in Dark solitons in BECs dataset v.2.0. For each element we provide the key, the definition together with all possible instances of a given label, and the data type. The excitation position, excitation PIE, and excitation quality labels are assigned only to data where label v3=1. The SolDet labels (soldet CNN, soldet OD, soldet PIE, and soldet QE) are assigned to all of the class-8 and class-9 data as well as to the 10 % of the curated manually labeled data used for testing during SolDet training.

Dictionary key | Definition | Data type
file name | Information about which data file a given set of labels refers to | String
label v1 | The [...]
In the next stage of data curating, we further refine labels for the data that are not in class-8 or class-9 in label v2 using five distinct deep ensembles of size ten trained through a repeated five-fold cross-validation. Prior research suggests that ensembles of ten models are sufficient to reliably assess the predictive uncertainty [21]. Building on that, we use ten five-fold-cross-validated OD models trained using label v2 class-0 and class-1 (4 683 images in total).
Each image is tagged with ten OD predicted labels, each consisting of the number of excitations detected and their positions. Given the random initialization of the training sessions, we treat the deep ensemble as giving a uniformly-weighted set of predicted labels, with each model prediction considered equally reliable [21]. The OD class predictions are used to define a measure of class-based disagreement,

$$D_{\mathrm{class}} = \frac{\#(\text{OD class} \neq \text{label v2})}{M},$$

where #(·) denotes the number of instances for which condition (·) occurs and M is the size of the deep ensemble. A score of 0 indicates full agreement within the deep ensemble with the ground truth label while a score of 1 indicates that all models predict an incorrect class. In addition, each image in class-1 is assigned a preliminary excitation position ($\mathrm{FIT}_{\mathrm{pos}}$): the minimum of the background subtracted 1D density profile [6,19].
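The class-based disagreement score is the fraction of ensemble members whose predicted class differs from the reference label. A minimal sketch (the function name is hypothetical):

```python
import numpy as np

def class_disagreement(od_classes, label):
    """Fraction of the M ensemble predictions that differ from the reference label."""
    od_classes = np.asarray(od_classes)
    return float(np.mean(od_classes != label))

# Ten models, nine of which agree with a class-1 label
score = class_disagreement([1, 1, 1, 1, 1, 1, 1, 1, 1, 0], label=1)
```

With an ensemble of ten models the score takes values in steps of 0.1, so a threshold of 0.1 tolerates at most one dissenting model.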
The subsequent data analysis continues our aim of identifying mislabeled images and is carried out separately for class-0, 1, and 2. The result of this analysis is stored in label v3.
Class-0: Deep ensemble disagreement. For each image, we calculate the class-based disagreement score $D_{\mathrm{class}}$. Since the initial agreement rate between annotators was around 87 %, we opt to require at least 90 % agreement between the ODs. Thus, images with $D_{\mathrm{class}} \le 0.1$ are retained in the curated dataset with label v3 set equal to label v2 (1 130 images). We note that the remaining 104 images had either (a) $D_{\mathrm{class}} \in (0.1, 0.5]$ (75 images); or (b) half or more models predicting class-1 (29 images). All of these are assigned to class-8.

Class-1:
Step 1: Deep ensemble disagreement. As for class-0, we first compare the number of OD-identified excitations with label v2 and find 3 222 images for which $D_{\mathrm{class}} \le 0.1$. Of the remaining 227 there are 178 for which $D_{\mathrm{class}} \in (0.1, 0.5]$, 28 images for which five or more models predict class-0, and 21 images in which five or more models assign class-2. All these images are assigned to class-8. We found that in each case the excitation position was better located by the OD, so the excitation position label was updated to that returned by the OD.

Class-2:
Step 1: Deep ensemble disagreement. Data in class-2, by design, includes ambiguous images. Thus, the goal of curating this class is to ensure that it does not contain images that would with high confidence be classified as belonging to either class-0 or class-1 by the deep ensemble. Since all models were trained only on the class-0 and class-1 data, we use an ensemble consisting of all 50 models. We find that out of the 1 388 images in class-2, 30 had 90 % of models predict class-0 (and were therefore assigned to class-8) and 336 had 90 % of models predict class-1 (and were further analyzed in step 2). The remaining 1 022 images are retained in the curated dataset with label v3 set equal to label v2.
Step 2: Position range check. To confirm the OD predicted class-1, we compare the range of OD positions for those models predicting one excitation and find that $\max(\mathrm{OD}_{\mathrm{pos}}) - \min(\mathrm{OD}_{\mathrm{pos}}) < 3$ for 322 of these images, suggesting that these are very likely class-1 data mislabeled as class-2. These images are also assigned to class-8.
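The position range check can be sketched as follows. The function name is hypothetical, and the input is assumed to contain one position per ensemble member that predicted exactly one excitation, expressed in the same units as the threshold of 3 quoted above.

```python
import numpy as np

def positions_consistent(od_positions, max_range=3.0):
    """True if the single-excitation positions from the ensemble span less than max_range."""
    od_positions = np.asarray(od_positions, dtype=float)
    if od_positions.size == 0:
        return False  # no model predicted a single excitation
    return bool(od_positions.max() - od_positions.min() < max_range)
```

A tight spread means the models agree not only that a single excitation is present but also on where it is, which is what makes the class-1 interpretation convincing.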
The deep-ensembles-based data curating process resulted in assigning to class-8 a total of 693 images from the original dataset. The resulting dataset contains 1 130 images in class-0, 3 212 in class-1, 1 036 in class-2, and 879 in class-8 (data labeled as class-9, associated with unlabeled data, are unchanged). The final classifications are contained in label v3.
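The totals quoted above can be cross-checked with a short bookkeeping script. All numbers are taken from the text, except that the number of class-1 images moved to class-8 is inferred here as the difference between the class-1 count before curation (3 222 + 227) and after curation (3 212), since it is not stated as a single figure.

```python
# Images moved to class-8 during the deep-ensemble stage, per original class
class0_removed = 75 + 29                    # the two class-0 groups with D_class > 0.1
class1_removed = (3222 + 227) - 3212        # inferred from class-1 totals before and after
class2_removed = 30 + 322                   # step 1 (class-0 like) + step 2 (class-1 like)

total_to_class8 = class0_removed + class1_removed + class2_removed

# Class-2 images retained: step-1 survivors plus those failing the position range check
class2_final = 1022 + (336 - 322)

# Size of the full manually labeled set after curation, including class-8
curated_total = 1130 + 3212 + 1036 + 879
```

The script reproduces the 693 images assigned to class-8 at this stage and the 1 036 images remaining in class-2, and the grand total of 6 257 images is consistent with the approximately $6.3\times10^3$ images of the original dataset.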

Label refinement: dark solitons in BECs dataset v.2.0
To further refine labels for images in class-1, we use the PIE classifier and quality estimator from the SolDet package [20]. The PIE classifier partitions class-1 into physically-motivated sub-classes stored in excitation PIE. The PIE classifier operates by splitting each image into top and bottom halves and determining the associated 1D profiles to which the quality estimator is separately applied. In addition to returning an overall quality estimate, stored in excitation quality, this algorithm also returns parameters such as the excitation position, width, and so forth.
Then, a sequence of thresholds driven by different top-bottom combinations of these parameters is applied to determine the label. The values defining all thresholds were arrived at by exploring the data accepted and rejected by the cut to minimize the false positive identification of longitudinal solitons, as described in reference [4].
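The first step of this procedure, splitting an image into halves and reducing each to a 1D profile, can be sketched as below. The assumptions are that the image rows run along the short axis of the BEC and that a 1D profile is obtained by summing over rows; the actual profile definition and the threshold values used by the PIE classifier are described in reference [4] and are not reproduced here.

```python
import numpy as np

def half_profiles(image):
    """Return 1D profiles (summed over rows) of the top and bottom halves of an image."""
    n_rows = image.shape[0]
    top = image[: n_rows // 2]
    bottom = image[n_rows // 2 :]
    return top.sum(axis=0), bottom.sum(axis=0)

# Toy 4 x 6 image
toy_image = np.arange(24, dtype=float).reshape(4, 6)
top_profile, bottom_profile = half_profiles(toy_image)
```

Comparing fit parameters extracted from the two profiles, for example the excitation position in the top half versus the bottom half, is what allows canted or partial excitations to be distinguished from a straight longitudinal soliton.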
Within the 3 212 images in class-1, the PIE classifier categorized 2 229 images as proper longitudinal solitons (class-A). Out of the remaining images, 378 were classified as top "partial" solitons (class-B) and 418 as bottom "partial" solitons (class-C); 28 were categorized as clockwise vortex (class-D) and 38 as counterclockwise vortex (class-E); 121 were categorized as canted excitations (class-F).

Automated expansion of the dataset v.1.0
In this section, we describe how we leverage the full SolDet package [20] to automatically analyze and label previously unlabeled images (a subset of about 1 × 10 4 identified in label v3 as class-9) that were not part of Dark solitons in BECs dataset v.1.0. These data include images representing class-0, 1, and 2; many class-2 candidates possess multiple excitations as well as ambiguous and confusing structures that may hinder human labeling. Since these data were not previously considered, they make an ideal test case for the SolDet package [4]. Roughly 90 % of the class-0, 1, and 2 data from the curated dataset were used to train SolDet, leaving the remaining 10 % of these classes for validation. In addition, we also apply SolDet to all of class-8 allowing us to cross-check the mislabeled assignment. For this application, the SolDet package is configured to give the labeling flow depicted in figure 2(a), involving a sequential application of CNN and OD modules followed by the PIE classifier. The CNN module categorizes images as class-0, 1, or 2 while the OD module tags each image with positions of all detected excitations. If OD assigns class-0, the image is labeled accordingly and the process terminates for that image. Otherwise, the PIE classifier is executed for each excitation detected by the OD. Finally, all excitations located by OD are additionally tagged with a quality estimate [4].
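The labeling flow described above can be summarized in a short sketch. The four callables `cnn`, `od`, `pie`, and `quality` are hypothetical stand-ins for the trained SolDet modules and do not reflect the package's actual interface; the dictionary keys follow the label names used in this note.

```python
def label_image(image, cnn, od, pie, quality):
    """Apply the CNN, OD, PIE classifier, and quality estimator in sequence."""
    labels = {"soldet CNN": cnn(image)}   # class-0, 1, or 2
    positions = list(od(image))           # positions of all detected excitations
    labels["soldet OD"] = positions
    if not positions:
        # OD assigns class-0: nothing left to classify for this image
        labels["soldet PIE"] = []
        labels["soldet QE"] = []
        return labels
    labels["soldet PIE"] = [pie(image, p) for p in positions]
    labels["soldet QE"] = [quality(image, p) for p in positions]
    return labels

# Usage with stub modules
example = label_image(
    image=None,
    cnn=lambda img: 2,
    od=lambda img: [41.5, 87.0],
    pie=lambda img, pos: "A",
    quality=lambda img, pos: 0.9,
)
```

Keeping the output of every module, instead of collapsing it into one class, is what lets a user of the dataset decide afterwards how much agreement between modules to require.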
The output of each module is included as a separate label: soldet CNN for the CNN classifier; soldet OD for the vector of positions; soldet PIE for the vector of classes returned by the PIE classifier; and soldet QE for the vector of quality estimates. This enables the end user to choose a desired level of agreement between the labeling modules or the longitudinal solitons' quality necessary for a particular application. Thus, unlike images in the curated dataset, these previously unlabeled images are not assigned a single ground truth class.

Figures 2(b)-(d) compare the label assignment for three different subsets of our dataset: a subset of the curated data (10 %) automatically generated by SolDet for validation (shown in yellow), class-8 (mislabeled data; shown in green), and class-9 (unlabeled data; shown in blue). Panel (b) shows that the CNN classes for each case follow a very similar distribution, but data in class-8 has a reduced likelihood of being [...] shows that the quality metric for all excitations identified by the OD is consistent across the three subsets of our data.

Figure 3 compares the outcome of the CNN and OD modules for (a) class-8 data and (b) class-9 data. In both cases, when the CNN classifier assigns class-0 or class-1, the OD is unlikely to find any excitation or a number of excitations other than one, respectively. However, when the CNN assigns class-2, the OD assignment is strongly biased to class-1 for the class-8 (mislabeled) data, with 73 % assessed to contain only one excitation. This bias likely results from the process of labeling and curation in which many class-1 candidates were moved into class-8 to avoid false positives. The OD assignment seems almost random for the previously unlabeled class-9 data assigned class-2 by the CNN, with 42 % assessed to contain only one excitation and 49 % assessed to contain two excitations.
For the mislabeled (class-8) data the performance is significantly degraded in the converse case: about 30 % of the OD class-0 and 1 data and almost 50 % of OD class-2 data is assigned one of the alternative CNN classes. For the unlabeled (class-9) data, the disagreement between OD and CNN assigned classes is much lower, at 7 %, 14 %, and 8 % for OD class-0, 1, and 2+, respectively.

[...] illustrative examples from the remaining classes. Because the primary function of the PIE classifier is to reject images that are not class-1-A (avoid false positive longitudinal solitons), the B-F labels are of lower quality. The bottom row displays data from class-0 (no excitations; panel (g)) and class-2 (other excitations; panels (h)-(i)). Together, these show that SolDet is very effective in delineating between class-0, 1, and 2.
In each panel the arrows identify the location of the excitation from the OD, showing that it is effective across the board in locating excitations. The arrows are colored according to the PIE classifier result: red arrows mark the location of longitudinal solitons (class-*-A) and orange arrows mark all other classes. Even in cases with many excitations [panels (h) and (i)], SolDet correctly identifies high quality longitudinal solitons.

Conclusion and outlook
We find that SolDet reliably labels new experimental data (class-9) as well as data categorized as potentially mislabeled (class-8). An inspection of the images shows that the assigned classes are visually similar to the manually labeled data. Furthermore, the labels automatically assigned to class-9 have a statistical distribution that is very similar to the training dataset. By contrast, the labels assigned to class-8, a selectively filtered subset of the original dataset, are significantly different.
As we mentioned, SolDet is configured to identify and correctly locate longitudinal solitons within BECs (class-*-A). The reliability of the additional PIE classes could be improved by, e.g., further refining the cuts defining the physically-motivated categories or slicing the image into more than two pieces.
The enlarged Dark solitons in BECs dataset v.2.0 includes quantitative estimates of the quality of all longitudinal solitons as well as new fine-grained solitonic excitation categories for all detected excitations. It is freely available, giving the whole ML and physics community the opportunity to apply novel ML techniques to cold atom systems and to further explore the intersection of ML and quantum physics.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://doi.org/10.18434/mds2-2363.

[...] density by about 70 %. We then change the DMD pattern to illuminate half of the BEC extension and pulse the light for a variable time to imprint the phase. Since the accumulated phase is proportional to the duration in which the light is pulsed, the pulse duration is varied to create solitons at different speeds. To avoid creating additional density modulations after imprinting the phase, the DMD is reconfigured back to the narrow stripe, the dimple is reapplied and its magnitude is ramped to zero. More details about this protocol can be found in [19].
After solitons are created we let them oscillate in the trap for a variable evolution time that allows us to obtain variation in the soliton properties, such as the oscillation amplitude, initial position, propagation velocity, and lifetime. Since our elongated trap geometry does not produce truly 1D BECs, kink solitons that are initially created can eventually decay into solitonic vortices during the time they oscillate in the trap. After the evolution time, the trap is turned off and the cloud expands for 12 ms before it is imaged using standard absorption imaging [26].
In the standard absorption imaging technique, the fraction of resonant probe light absorbed by the atomic cloud is used to extract information about the atoms. We acquire three images that are combined to obtain the optical density. In the first image $I^{A}_{i,j}$, the atomic cloud is illuminated with the probe light and its shadow is recorded by a CCD camera, see figure 1(a). To compute the fraction of the light that is absorbed by the atoms, a second image $I^{P}_{i,j}$ of the probe beam, without atoms, is acquired, figure 1(b). The third image $I^{BG}_{i,j}$ is then acquired without the probe light to get the background light in the experiment, figure 1(c). All images are acquired with the same duration and the probe beam has the same intensity for the first two images. The three images are then combined to obtain the 2D optical density

$$\sigma_0 n_{i,j} \approx -\ln\left[\frac{I^{A}_{i,j} - I^{BG}_{i,j}}{I^{P}_{i,j} - I^{BG}_{i,j}}\right],$$

where $\sigma_0 = 3\lambda^2/(2\pi)$ is the resonant cross-section and $\lambda$ is the wavelength of the probe laser. All images [figures 1(a)-(c)] are acquired by a 648 × 488 pixel camera (Point Grey FL3) with 5.6 µm square pixels, labeled by i and j. The imaging system, with an optical resolution of ≈ 2.8 µm, has ≈ 6× magnification, generating images with an effective pixel size of 0.93 µm.
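The combination of the three images into an optical density, together with the effective pixel size, can be sketched as follows. The clipping constant `eps`, which guards against non-positive arguments of the logarithm in noisy pixels, and the function name are assumptions added for this illustration.

```python
import numpy as np

def optical_density(I_A, I_P, I_BG, eps=1e-6):
    """Optical density from atom (I_A), probe (I_P), and background (I_BG) images."""
    I_A, I_P, I_BG = (np.asarray(x, dtype=float) for x in (I_A, I_P, I_BG))
    transmitted = np.clip(I_A - I_BG, eps, None)   # probe light after the atoms
    incident = np.clip(I_P - I_BG, eps, None)      # probe light without atoms
    return -np.log(transmitted / incident)

# Toy 2 x 2 example: half of the probe light is absorbed in every pixel
od_map = optical_density(
    I_A=np.full((2, 2), 60.0),
    I_P=np.full((2, 2), 110.0),
    I_BG=np.full((2, 2), 10.0),
)

# Effective pixel size in the object plane: camera pixel size / magnification
effective_pixel_um = 5.6 / 6.0
```

Dividing the result by $\sigma_0$ would convert the optical density into a 2D atomic density, and the effective pixel size of about 0.93 µm sets the length scale of the pixel indices i and j.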