Evaluation of synthetic and experimental training data in supervised machine learning applied to charge-state detection of quantum dots

Automated tuning of gate-defined quantum dots is a requirement for large-scale semiconductor-based qubit initialisation. An essential step of these tuning procedures is charge-state detection based on charge stability diagrams. Using supervised machine learning to perform this task requires a large dataset for models to train on. In order to avoid hand labelling experimental data, synthetic data has been explored as an alternative. While providing a significant increase in the size of the training dataset compared to using experimental data, using synthetic data means that classifiers are trained on data sourced from a different distribution than the experimental data that is part of the tuning process. Here we evaluate the prediction accuracy of a range of machine learning models trained on simulated and experimental data, and their ability to generalise to experimental charge stability diagrams in two-dimensional electron gas and nanowire devices. We find that classifiers perform best on either purely experimental or a combination of synthetic and experimental training data, and that adding common experimental noise signatures to the synthetic data does not dramatically improve the classification accuracy. These results suggest that experimental training data as well as realistic quantum dot simulations and noise models are essential in charge-state detection using supervised machine learning.

A commercially viable quantum computer will require the manufacture of a large number of qubit devices, as well as autonomous procedures for qubit initialisation and tuning. For semiconductor-based qubits such as charge [1][2][3], spin [4][5][6][7] and topological qubits [8][9][10], this implies the initialisation of quantum dots of a known charge state. These quantum dots are formed by choosing appropriate voltages applied to electrostatic gates, depleting the electron gas underneath and thus isolating electrons from surrounding charge carriers. Although the manual formation of these dots is now routine in a wide range of systems, challenges remain in automating this procedure. Material defects and fabrication variances lead to non-uniform device performance, so that nominally identical devices require unique operating gate voltages, which can even differ between successive tune-ups of the same device.
With machine learning and artificial intelligence succeeding in an increasing range of sophisticated tasks, it is worth investigating their capabilities in automating the task of defining quantum dots. Several efforts in automating double quantum dot tune-up have been made using either supervised deep learning [11][12][13][14], unsupervised statistical methods [15,16] or deterministic algorithms [17][18][19]. For these approaches to be useful for large-scale tune-up, the tuning outcome needs to be determined reliably despite noise and without human intervention. This is achieved by verifying the charge state of the qubit device based on a measured charge stability diagram. Using supervised learning to perform this task requires a significant amount of labelled data, where each measurement carries a label indicating its charge state. The process of measuring and labelling experimental data is slow, making synthetic data a way to increase the efficiency of this training process. The success of synthetic training data has been quantified mainly by classifying further synthetic data or curated experimental data [11][12][13]. However, classification results on realistic data acquired during tuning suggest that this performance is lower in practice [12].
Here we evaluate the ability of supervised machine learning models trained on synthetic data to determine the charge state of experimental charge stability diagrams, and compare to ones trained on data from real devices. Two convolutional neural network architectures and six parametric binary classifiers are trained to distinguish single versus double quantum dots when trained on purely synthetic data, a combination of synthetic and experimental data or experimental data only. We also investigate how adding noise to noiseless synthetic data affects classification accuracy, a technique showing promising success in the identification of impurities in scanning tunnelling microscope images [20].
Quantum dots are formed by applying voltages to electrostatic gates fabricated on top of a semiconductor structure, which creates potential wells isolating charges in regions with length scales on the order of the Fermi wavelength. One or two regions of charges can be formed, resulting in a single or double quantum dot. To determine the regime, i.e. single versus double, two gate voltages are stepped while the current through the device is measured, resulting in a so-called charge stability diagram. A single dot features sharp diagonal lines, while a double dot shows either triple points with no charge transition lines between them, or a honeycomb pattern with transition lines connecting bright spots corresponding to triple points. It has been shown that voltage combinations not resulting in the formation of any dots can be excluded through simple gate characterisation steps [17]. Machine learning techniques are therefore only required to distinguish between single and double dots of different qualities to complete the tuning process.
In this work, we assess the accuracy of convolutional neural networks and binary classifiers trained on synthetic data, experimental data or a combination of both. Convolutional neural networks trained on synthetic data benchmarked on either synthetic or curated experimental data have previously shown high classification accuracy [11][12][13]. We use the same neural network architecture with two convolutional layers [11,12], summarised in table A.1. The binary classifiers we compare are logistic regression, multi-layer perceptron, decision tree, random forest, k-nearest neighbours and support vector machine, and were selected based on the accuracy comparison in [17].
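To illustrate the data flow through a two-convolutional-layer network of the kind summarised in table A.1, the following numpy sketch passes a single 50 × 50 charge stability diagram through two convolution + ReLU + max-pooling stages and a dense softmax head. The filter counts, kernel sizes and weight scales here are hypothetical placeholders, not the values of table A.1.

```python
import numpy as np

def conv2d(x, kernels):
    """Valid 2D convolution: x is (c, H, W), kernels is (n, c, kh, kw)."""
    c, H, W = x.shape
    n, _, kh, kw = kernels.shape
    out = np.empty((n, H - kh + 1, W - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def maxpool(x, s=2):
    """2x2 max pooling on a (n, H, W) stack; H and W must be divisible by s."""
    n, H, W = x.shape
    return x.reshape(n, H // s, s, W // s, s).max(axis=(2, 4))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
img = rng.random((1, 50, 50))                   # one-channel current map
k1 = rng.standard_normal((8, 1, 3, 3)) * 0.1    # conv layer 1: 8 filters (placeholder)
k2 = rng.standard_normal((16, 8, 3, 3)) * 0.1   # conv layer 2: 16 filters (placeholder)

h = maxpool(np.maximum(conv2d(img, k1), 0))     # (8, 24, 24)
h = maxpool(np.maximum(conv2d(h, k2), 0))       # (16, 11, 11)
w = rng.standard_normal((2, h.size)) * 0.01     # dense head: single vs double dot
p = softmax(w @ h.ravel())                      # two class probabilities
```

The forward pass reduces a 50 × 50 diagram to a two-way probability over the single- and double-dot classes; training such a network would of course use a deep learning framework rather than hand-written loops.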
All classifiers are trained and tested on the same data combinations. Binary classifiers are trained and tested on both transport measurements and their frequency spectrum, extracted using a Fourier transform [17]. Neural networks are trained and tested on transport measurements only. While binary classifiers are trained on original experimental data, the neural network is trained on an augmented experimental dataset, created using standard augmentation techniques. The approaches presented also apply to other types of measurements such as charge sensing and radio-frequency reflectometry.
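The two input representations for the binary classifiers, the raw transport measurement and its frequency spectrum, can be sketched as follows. The log scaling and standardisation steps are assumptions for illustration; the exact preprocessing of [17] may differ.

```python
import numpy as np

def features(current_map, use_fourier=False):
    """Flatten a charge stability diagram into a classifier feature vector.

    With use_fourier=True, the magnitude of the 2D Fourier transform is used
    instead of the raw transport measurement, making the periodicity of the
    charge transition lines explicit.
    """
    if use_fourier:
        spectrum = np.abs(np.fft.fftshift(np.fft.fft2(current_map)))
        # Log scaling compresses the dominant low-frequency components
        # (an assumption, not necessarily the preprocessing of the paper).
        current_map = np.log1p(spectrum)
    v = current_map.ravel().astype(float)
    # Standardise so classifiers such as logistic regression or support
    # vector machines are not dominated by the absolute current scale.
    return (v - v.mean()) / (v.std() + 1e-12)

rng = np.random.default_rng(1)
diagram = rng.random((50, 50))
x_raw = features(diagram)                    # transport-measurement features
x_fft = features(diagram, use_fourier=True)  # frequency-spectrum features
```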
Our synthetic dataset of simulated single and double dot charge stability diagrams is based on a capacitance model [21] and the Qflow-lite dataset [13], i.e. the Thomas-Fermi approximation. Examples of both dot regimes generated by these models are shown in figure 1. Details about the data generation and post-processing steps can be found in appendix A.
We implement six noise models typically encountered in experiments, which are added to the noiseless synthetic data. We refer to this dataset as the noisy synthetic dataset. The added noise types are white noise, random telegraph noise, 1/f noise, charge fluctuations on gates, low-frequency current modulations and pinch-off current modulation.
Our experimental data originates from quantum dots formed in InSb nanowires [22] as well as GaAs two-dimensional electron gases [17,23]. Each charge stability diagram is hand-labelled with two labels indicating the charge state, i.e. single or double, and quality, i.e. sufficient or insufficient for subsequent tuning steps. As an example, double-dot diagrams are labelled as sufficient if they feature clear triple points suitable for the qubit parameter fine-tuning procedures discussed in [24][25][26], and insufficient otherwise.
Sufficient experimental data is further divided into ideal and good measurements, and we assess the classification accuracy on the subsets 'ideal', 'good' and 'all' measurements. Ideal measurement outcomes show features similar to those found in synthetic data. Good measurement diagrams show some types of noise, but are suitable for further fine tuning. Using standard augmentation techniques, we augment our original experimental dataset of 221 ideal, 1681 good and 4613 bad charge stability diagrams to 10 000 ideal, 13 000 good and 25 000 bad diagrams. Examples of non-augmented measurements are shown in figure 3, illustrating common noise types. Specifically, figure 3(e) shows a gradual current drop towards negative gate voltages, the broadening of transitions and charge jumps, while figure 3(f) shows white noise, 1/f noise and random current modulation. In the noisy synthetic dataset, such noise types are added with varying, randomly sampled strengths in order to simulate the variation seen in experimental data.
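The 'standard augmentation techniques' are not specified in detail; a minimal numpy sketch of label-preserving augmentations for charge stability diagrams (mirroring the gate-voltage axes, swapping the two gates, and random crops resized back with nearest-neighbour interpolation) might look as follows. All specific choices here, including the crop range, are assumptions.

```python
import numpy as np

def nn_resize(img, shape):
    """Nearest-neighbour resize (dependency-free stand-in for skimage resize)."""
    rows = (np.arange(shape[0]) * img.shape[0] / shape[0]).astype(int)
    cols = (np.arange(shape[1]) * img.shape[1] / shape[1]).astype(int)
    return img[np.ix_(rows, cols)]

def augment(img, rng):
    """Return one randomly augmented copy of a 50x50 charge stability diagram."""
    out = img
    if rng.random() < 0.5:          # mirror the first gate-voltage axis
        out = out[::-1, :]
    if rng.random() < 0.5:          # mirror the second gate-voltage axis
        out = out[:, ::-1]
    if rng.random() < 0.5:          # swap the roles of the two gates
        out = out.T
    # Random crop covering at least 80% of the map, resized back to 50x50.
    h = rng.integers(40, 51)
    i = rng.integers(0, 50 - h + 1)
    j = rng.integers(0, 50 - h + 1)
    return nn_resize(out[i:i + h, j:j + h], (50, 50))

rng = np.random.default_rng(2)
diagram = rng.random((50, 50))
copies = [augment(diagram, rng) for _ in range(10)]
```

These transformations preserve the single/double-dot label, since mirroring or transposing the gate axes yields a physically plausible diagram of the same regime.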
The data described above is used to assess the ability of convolutional neural networks and parametric binary classifiers to generalise from synthetic data to a variety of experimental data. Each classifier is trained on the following dataset combinations: noiseless synthetic data, noisy synthetic data, good experimental data, all (i.e. ideal, good and bad) experimental data, noiseless synthetic data combined with good experimental data, and noiseless synthetic data combined with all experimental data. Classification accuracies are evaluated on ideal, good and all experimental data. Where applicable, these datasets are split into an 80% training set and a 20% test set, and none of the data used in training is used in testing. We perform ten random train-test splits and report the average accuracy and standard deviation. These datasets are balanced, meaning they contain the same number of single and double quantum dots. This produces different sizes of training and test data for each dataset combination.
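The balanced, repeated 80/20 splitting can be sketched in numpy as follows; there is no claim that the paper used exactly this routine, and the classifier call is left as a placeholder.

```python
import numpy as np

def balanced_split(labels, rng, test_frac=0.2):
    """Return disjoint (train_idx, test_idx) with equal class counts."""
    idx0 = np.flatnonzero(labels == 0)   # single-dot diagrams
    idx1 = np.flatnonzero(labels == 1)   # double-dot diagrams
    n = min(len(idx0), len(idx1))        # balance: same count per class
    idx0 = rng.permutation(idx0)[:n]
    idx1 = rng.permutation(idx1)[:n]
    n_test = int(round(n * test_frac))
    test = np.concatenate([idx0[:n_test], idx1[:n_test]])
    train = np.concatenate([idx0[n_test:], idx1[n_test:]])
    return train, test

labels = np.array([0] * 600 + [1] * 400)  # toy label vector
rng = np.random.default_rng(3)
accs = []
for _ in range(10):                       # ten random train-test splits
    train, test = balanced_split(labels, rng)
    # ... fit a classifier on `train`, evaluate on `test` here ...
    accs.append(labels[test].mean())      # placeholder for the test accuracy
mean_acc, std_acc = np.mean(accs), np.std(accs)  # reported as mean +/- std
```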
The results illustrated in figure 4 and detailed in table A.2 show that training the neural network on only synthetic data enables it to classify ideal experimental data with an average accuracy of 76.54%. Broadening the scope to predict good and all experimental data sees the accuracies decrease to 64.61% and 60.28% respectively. The accuracies when predicting ideal data improve to 79.49% when good experimental data is added to the training set, and are highest when synthetic and experimental data are combined into a single training dataset, reaching 89.50%.
Overall, adding synthetic data to an experimental training set improves accuracy by up to 10%. Confusion matrices for each classification, detailing which subclasses tend to be misclassified and listed in table A.3, show that single dots tend to be misclassified as double dots more often than double dots as single dots.
Adding noise to the synthetic training data results in lower accuracies than noiseless synthetic data. A detailed study of the effect of individual noise models can be found in table A.5, where only one noise type was added at a time with various amplitudes. All accuracies decrease when noise is added.
We further investigate potential benefits of transfer learning, during which the network is pre-trained on a synthetic dataset and then re-trained using an experimental dataset while keeping weights of all but the last layer fixed. This technique could potentially reduce training time or increase accuracies when not enough training data is available [27]. Results of transfer learning using either synthetic or noisy synthetic data for pre-training and good or all experimental data for re-training are illustrated in figure 4 and detailed in table A.4. We see little improvement when predicting good and all experimental data, but networks predicting ideal experimental data benefit from transfer learning by up to 5% compared to the pre-trained network. Overall, these accuracies are significantly lower than when both datasets are used together in a single training step.
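The essence of this transfer-learning step, keeping all but the last layer fixed and retraining only the output layer on experimental data, can be illustrated with a numpy stand-in. Here a fixed random projection plays the role of the pre-trained convolutional stack, and a logistic-regression head is retrained on random toy data; everything except the freeze-and-retrain structure is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)

# "Pre-trained" feature extractor: in the paper this is the convolutional
# stack trained on (noisy) synthetic data; a fixed random projection stands in.
W_frozen = rng.standard_normal((2500, 32)) * 0.02
W_snapshot = W_frozen.copy()

def extract(x):
    """Frozen layers: these weights are never updated during re-training."""
    return np.tanh(x @ W_frozen)

# Toy "experimental" re-training set (random stand-in data and labels).
X = rng.standard_normal((200, 2500))
y = rng.integers(0, 2, 200)

F = extract(X)                       # features from the frozen stack
w, b = np.zeros(32), 0.0             # last layer: the only trainable part
losses = []
for _ in range(200):                 # gradient descent on the head only
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    losses.append(-np.mean(y * np.log(p + 1e-12)
                           + (1 - y) * np.log(1 - p + 1e-12)))
    g = p - y
    w -= 0.1 * F.T @ g / len(y)
    b -= 0.1 * g.mean()
```

Only `w` and `b` change during re-training, which is what makes the procedure cheap compared with retraining the full network.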
Classification accuracies of binary classifiers are summarised in figure 5 and detailed in table A.6. Here, classification accuracies are highest when only experimental data is used for training. Exceptions are the decision tree classifier predicting ideal experimental data, and the k-nearest neighbour and multi-layer perceptron classifiers predicting all experimental data, which benefit from added synthetic training data. Unlike for the neural network, training with ideal experimental data does not yield higher accuracies than good or all experimental data combined. The multi-layer perceptron performs best, followed by k-nearest neighbour, logistic regression and the support vector machine. Adding noise to synthetic training data increases accuracies for the multi-layer perceptron, k-nearest neighbour and logistic regression, while it decreases accuracies for the decision tree classifier and support vector machine. These classifiers also show high training accuracies, suggesting that overfitting occurs. However, when trained with the augmented experimental data, we saw a significant overall decrease in accuracies.
To summarise, we find that the highest prediction accuracies are achieved by training classifiers on either experimental data or a combination of synthetic and experimental training data. Adding even a small experimental dataset to a large synthetic dataset allows the convolutional neural network to learn the specific types of noise present in real measurements and hence improves classification accuracy. Adding more variety to the training data, from either improved device models or more experimental data, is necessary to achieve higher success rates. Even small improvements will go a long way, as inaccurate classifications within a tuning algorithm multiply, resulting in a negative cascading effect that significantly reduces the overall tuning performance [14].
In particular, neural networks with additional convolutional layers could reach higher accuracies and learn a larger variety of charge stability diagrams originating from different materials and device architectures. However, a deeper architecture is more prone to overfitting and requires an even larger training dataset. Segmenting experimental data into regions with fewer regime variations may also increase accuracies.
The noise added to synthetic data does not improve classification accuracies, showing that it does not match the noise found in experimental data. More realistic noise models and quantum dot simulations that take into account impurities and fabrication defects are expected to improve the accuracy of classifiers trained on synthetic data. Realistic semiconductor quantum dot simulations are complex, and the noise encountered in today's state-of-the-art devices, which have been used in this work, is not well understood. Future devices with fewer fabrication variances and impurities may reduce noise and facilitate charge-state detection based on supervised machine learning using synthetic data. But until these devices are reliable and simulations are sophisticated enough to reproduce their behaviour, investing time into labelling experimental data is required.

Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors.

Acknowledgment
We thank Rachpon Kalra and John M Hornibrook for helpful discussions and critical feedback.

Appendix A. Noiseless synthetic data
Our synthetic dataset of simulated single and double dot charge stability diagrams is based on data generated by a capacitance model [21] and the Qflow-lite dataset [13], which uses the Thomas-Fermi approximation. Examples of both dot regimes generated by these models are shown in figure 1. The capacitance model replicates a device made of six gates coupled to two dots, similar to device architectures used to define charge and spin qubits [23,[28][29][30][31][32][33]. A set of 2000 diagrams was generated by randomly sampling capacitances from a Gaussian distribution centred on one of several capacitance combinations that generate diagrams resembling those encountered in experiments.
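As a simplified illustration of how a capacitance model produces a stability diagram, the following numpy sketch uses a constant-interaction energy for two coupled dots and marks the gate-voltage points where the ground-state charge occupation changes. The charging and coupling energies are hypothetical values, and the full six-gate model of [21] is far richer than this two-parameter toy.

```python
import numpy as np

# Hypothetical charging (E1, E2) and inter-dot coupling (Em) energies,
# in arbitrary units; the paper samples capacitance parameters from
# Gaussians around experimentally motivated values.
E1, E2, Em = 1.0, 1.0, 0.4

def ground_state(ng1, ng2, nmax=4):
    """Charge occupation (n1, n2) minimising the electrostatic energy."""
    best, best_u = (0, 0), np.inf
    for n1 in range(nmax + 1):
        for n2 in range(nmax + 1):
            u = (E1 * (n1 - ng1) ** 2 + E2 * (n2 - ng2) ** 2
                 + Em * (n1 - ng1) * (n2 - ng2))
            if u < best_u:
                best, best_u = (n1, n2), u
    return best

# Sweep two (dimensionless) gate-induced charges, as in a stability diagram.
v = np.linspace(0, 3, 50)
occ = np.array([[ground_state(a, b) for b in v] for a in v])  # (50, 50, 2)

# Charge transition lines: pixels where the ground-state occupation changes.
diagram = ((np.abs(np.diff(occ[:, :, 0], axis=0)) +
            np.abs(np.diff(occ[:, :, 1], axis=0)))[:, :-1] +
           (np.abs(np.diff(occ[:, :, 0], axis=1)) +
            np.abs(np.diff(occ[:, :, 1], axis=1)))[:-1, :]) > 0
```

With a finite coupling `Em` the transition lines split into the honeycomb-like pattern characteristic of a double dot; setting `Em = 0` recovers two independent, single-dot-like line families.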
We use segments of the original Qflow-lite dataset made available online, divided into 15 subregions of 30 × 30 pixels per original charge stability diagram. We use Python's scikit-image [34] resize method to resize these segments to 50 × 50 pixels, the size of our data. We first normalise each diagram individually to a range between 0 and 1 and then multiply them by an overall factor of 0.
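The resize-and-normalise preprocessing can be sketched as follows. The paper uses scikit-image's interpolating resize; to keep the example dependency-free we substitute a nearest-neighbour resize, and the final overall scaling factor (truncated in the text above) is omitted.

```python
import numpy as np

def nn_resize(img, shape):
    """Nearest-neighbour resize (stand-in for skimage.transform.resize)."""
    rows = (np.arange(shape[0]) * img.shape[0] / shape[0]).astype(int)
    cols = (np.arange(shape[1]) * img.shape[1] / shape[1]).astype(int)
    return img[np.ix_(rows, cols)]

def preprocess(segment):
    """Resize a 30x30 Qflow-lite segment to 50x50 and normalise to [0, 1]."""
    img = nn_resize(segment, (50, 50)).astype(float)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-12)

rng = np.random.default_rng(5)
segment = rng.random((30, 30))   # stand-in for one Qflow-lite subregion
out = preprocess(segment)
```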

Appendix B. Noisy synthetic data
We implement six noise models typically encountered in experiments that are added to the noiseless synthetic data; the resulting datasets are referred to as noisy synthetic datasets. These noise types are white noise, random telegraph noise, 1/f noise, charge fluctuations on gates, low-frequency current modulations and pinch-off current modulation. White noise typically arises due to thermal fluctuations, while 1/f noise and charge fluctuations on gates are two types of random fluctuations due to defects in the semiconductor. Random telegraph noise, on the other hand, is a low-frequency modulation of the current caused by the spontaneous capture and emission of charge carriers. Low-frequency current modulations and pinch-off current modulation are consequences of the electron gas being depleted for decreasing gate voltages. Additional examples of each noise type are shown in figure A.1.
Noise is generated as follows. The 1/f noise is generated in the frequency domain by creating a 2D frequency mesh and taking the inverse of its norm as the magnitude of the spectral coefficients, |c_{k,l}| = 1/√(f_k² + f_l²). The phases of these coefficients are set to random values, c_{k,l} = |c_{k,l}| e^{iϕ_{k,l}}, where ϕ_{k,l} is chosen randomly from a uniform distribution over [0, 2π). The inverse Fourier transform is then added to the images. White noise is generated as a 2D map of random, normally distributed coefficients with zero mean and unit variance. Pinch-off current modulation is achieved by convolving the image with a pinch-off profile parametrised by α and β, which are drawn from uniform distributions over [0, 10) and [−5, 5) respectively, with x_{i,j} a pixel coordinate matrix. Random current modulation is realised by convolving an image with a 2D map of Gaussian blobs of random mean and standard deviation, drawn uniformly from [−1, 1) and [0.3, 0.8) respectively. Random telegraph noise is simulated by adding charge jumps following a Poisson distribution with an expected number of occurrences drawn from a uniform distribution over [0, 0.2).
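The 1/f and white noise generation described above can be sketched in numpy as follows. Because the random phases are not given Hermitian symmetry, the inverse FFT is complex and we keep its real part; that handling, and the suppression of the DC component, are assumptions about details the text leaves open.

```python
import numpy as np

def one_over_f_noise(shape, rng):
    """2D 1/f noise: spectral magnitude = inverse frequency norm, random phases."""
    fy = np.fft.fftfreq(shape[0])[:, None]   # 2D frequency mesh
    fx = np.fft.fftfreq(shape[1])[None, :]
    norm = np.sqrt(fx ** 2 + fy ** 2)
    norm[0, 0] = np.inf                      # suppress the diverging DC term
    mag = 1.0 / norm                         # |c_{k,l}| = 1 / ||(f_k, f_l)||
    phase = rng.uniform(0.0, 2.0 * np.pi, shape)
    coeffs = mag * np.exp(1j * phase)        # c_{k,l} = |c_{k,l}| e^{i phi_{k,l}}
    return np.real(np.fft.ifft2(coeffs))     # real-space noise map

def white_noise(shape, rng):
    """Normally distributed noise with zero mean and unit variance."""
    return rng.standard_normal(shape)

rng = np.random.default_rng(6)
pink = one_over_f_noise((50, 50), rng)
white = white_noise((50, 50), rng)
```

Each map would then be normalised to [0, 1] and added to a noiseless diagram with a randomly sampled amplitude, as described in the text.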
Charge jumps, which appear as voltage jumps on gates, are achieved by randomly choosing a location in gate voltage space and a step size, defining the subregion of the current map which will be removed. The new image is then resized to its original size using Python's scikit-image resize method. A 2D Gaussian convolution is applied to all images to simulate thermal broadening. We generate 10 000 random noise maps for each noise type, which are normalised to a range between 0 and 1. These are then added to noiseless synthetic data by choosing a random sub-selection of maps and random amplitudes distributed uniformly between 0 and 0.2, an amplitude range over which we saw the highest accuracy variation. Varying the noise strength ensures that the different noise levels found in experimental data are covered. Random telegraph, white and 1/f noise maps are added to noiseless current maps, while random current and pinch-off current modulation maps are convolved with them. The resulting diagrams are scaled by the ratio of old and new maximum current values to ensure previous normalisations are preserved. Charge jumps are added to a number of charge stability diagrams determined by the random amplitude times the total number of diagrams in the target dataset.