Unsupervised machine learning of topological phase transitions from experimental data

Identifying phase transitions is one of the key challenges in quantum many-body physics. Recently, machine learning methods have been shown to be an alternative way of localising phase boundaries, even from noisy and imperfect data and without knowledge of the order parameter. Here, we apply different unsupervised machine learning techniques, including anomaly detection and influence functions, to experimental data from ultracold atoms. In this way, we obtain the topological phase diagram of the Haldane model in a completely unbiased fashion. We show that these methods can successfully be applied to experimental data at finite temperatures and to data from Floquet systems, when the data is postprocessed to a single micromotion phase. Our work provides a benchmark for the unsupervised detection of new exotic phases in complex many-body systems.


I. INTRODUCTION
Machine learning techniques have recently achieved remarkable successes in analysing large data sets in various areas. These developments have also led to promising applications in quantum physics [1,2]. Examples include the efficient representation of quantum many-body states [3], efficient state tomography from restricted experimental data [4][5][6], the optimisation of experimental preparation [7][8][9][10] and the identification of phases of matter [11][12][13][14][15][16][17][18][19][20]. For the latter, machine learning methods have been applied to data both from numerical simulations and from experiments, such as scanning tunneling microscopy images of condensed matter systems [21,22], neutron scattering data from spin ice systems [23], as well as real- and momentum-space images of ultracold atom systems [24][25][26]. When analysing experimental data, machine learning can unfold its full potential by identifying the relevant information despite noise and other imperfections such as finite temperature or restricted access to relevant observables. Another prospect of the machine learning analysis is the identification of novel phases and order parameters in exotic regimes [26,27]. While supervised machine learning methods, i.e. those with labelled training data, have been broadly applied, unsupervised methods dealing with unlabelled data have so far mainly been restricted to numerical studies [14,18,[28][29][30][31][32][33].
Machine learning techniques can be employed for various tasks related to phase transitions, including the unbiased comparison of experimental data to competing theory descriptions [25], the analysis of patterns in the trained filters of convolutional neural networks [26,27], the generation of new images [34], or the extraction of physical parameters and concepts [35,36]. Finally, the question of the interpretability of neural networks has received a new stimulus from the application to physical problems [15,27,[37][38][39][40][41].
Quantum simulators based on ultracold atoms in optical lattices allow engineering a variety of quantum many-body systems and probing them with detection methods complementary to those of solid-state systems [42]. In particular, topological systems can be created by adding artificial gauge fields [43,44] using periodic driving, i.e. so-called Floquet engineering [45,46]. Topological phases of matter are an active field of study, but the absence of a local order parameter generically poses a challenge to their detection [47]. Therefore, the classification of topological phases receives particular attention with machine learning techniques [31,[48][49][50][51][52][53]. With cold atoms, many detection methods have been demonstrated, including the transverse Hall drift [54][55][56][57], Berry phase measurements [58], quantized circular dichroism [59,60] as well as Bloch state tomography [61][62][63][64][65]. The latter is based on momentum-space images after quench dynamics, which also form the basis of the machine learning analysis in this article.
Here, we apply unsupervised machine learning techniques to experimental data from topological phases of a Haldane-like model [66] realised in ultracold atom quantum simulators. We also address the problem of dealing with the micromotion inherently arising in Floquet systems by using machine learning for data postprocessing, which allows us to effectively change the micromotion phase of all data to a desired value. While supervised machine learning can deal with data of different micromotion phases [24], the unsupervised techniques applied within this work cannot.
Unsupervised learning of phase transitions can roughly be divided into two categories: clustering-based methods [15,16,29,30,32,52,[67][68][69] and learning-success-based methods [14,18,19,28,31]. In this work, we apply methods of both categories to the data, which we postprocess to a single micromotion phase. Clustering-based methods identify the phases by clustering the data in a suitably chosen space and associating each cluster with a different phase, employing concepts such as PCA, t-SNE, autoencoders or diffusion maps. In this category, we find that a k-means cluster analysis in the latent space of an autoencoder identifies the phase transitions when applied separately to cuts through the phase diagram. This approach, however, cannot distinguish between the different signs of the Chern number. Learning-success-based methods use the success of the training process for different trial classifications to judge the similarity of the data. In this context, we use anomaly detection [18] and influence functions [39]. By carefully combining the information from these techniques, we can uncover the full phase diagram from noisy experimental data in a completely unsupervised way. Our results provide an important benchmark for unsupervised machine learning of phases of matter and evaluate methods that might be useful for revealing new exotic order in complex systems.
The structure of the article is as follows. We start with a description of the methods used (Sec. II). In section II A, we describe the experimental setup and the protocols used to obtain the data. Section II B gives an overview of the different machine learning methods we use within this work. In the results section, we first employ a latent-space analysis to detect the different topological phases (Sec. III A). Afterwards, we describe how to postprocess the data to a desired micromotion phase in section III B and check the validity of this approach with influence functions in section III C. We then use the postprocessed data to detect the different topological phases with the latent-space analysis again (Sec. III D). Section III E describes an anomaly detection scheme to separate the different topological phases. We finally use influence functions in section III F to distinguish between the two topologically non-trivial phases.

II. METHODS

A. Experimental setup and data acquisition
The data is taken in experiments performed with ultracold atoms in optical lattices [42], which are established as a well-controllable system for studying solid-state physics in general and topological phases in particular [43,44]. The topological Haldane model [66] is realised by Floquet driving of a honeycomb lattice [56,63,70,71]. In this specific configuration, the experiments start with a hexagonal lattice with a large offset ∆_AB = 2π · 6.1 kHz between the two sublattices, realised by a suitable polarisation of the three interfering laser beams that form the optical lattice [63] (Fig. 1a). The lattice is then accelerated on elliptical trajectories by a phase modulation of the lattice beams, characterised by the shaking phase ϕ between the modulations along the x and y directions. The resulting effective Floquet Hamiltonian features non-trivial Chern numbers and gives rise to a topological phase diagram closely related to that of the original Haldane model (Fig. 1d) [24,65]. The control parameters are the shaking phase ϕ, which controls the breaking of time-reversal symmetry, and the shaking frequency f_sh, which gives rise to non-trivial Chern numbers C = ±1 for shaking near-resonant with the sublattice offset, f_sh ≈ ∆_AB/2π.
The numerical prediction for the phase boundary (Fig. 1d) results from a Floquet calculation for a tight-binding model of the hexagonal lattice based on the shaking parameters and the calibrated parameters of the static lattice. It has been shown to agree well with previous measurements of topological properties in the system [24,60,63,65], except for a slight shift of the topological region towards higher frequencies for the experimental data. This shift might be due to the uncertainty in the calibration of the static lattice or to contributions of higher bands, which were neglected in the two-band tight-binding model. Note that the calibration uncertainty of the polarisation of the lattice beams of 0.2° leads to an uncertainty of the expected phase transition points of around 200 Hz [24].
The experiments are performed with ultracold spin-polarised fermionic 40K atoms with mass m = 40 u, prepared in the lowest band of the optical lattice formed by laser beams with a wavelength of λ = 1064 nm, as in earlier work [24,63]. The characteristic energy scale is the recoil energy E_rec = h²/(2mλ²). In the transverse direction, the cloud is weakly harmonically confined. In order to adiabatically prepare the lowest band of the Floquet system, we gradually ramp up the Floquet drive in two steps (Fig. 1b): 1) we ramp up the shaking amplitude to 1 kHz within 5 ms at the far off-resonant shaking frequency f_sh^ini = 4.5 kHz; 2) we ramp the shaking frequency to its final value f_sh^fin within t_ramp = 2 ms at fixed shaking amplitude. Due to Floquet heating, this procedure leads to a population of the lowest band of typically 50-75%. Previous work on supervised machine learning has shown that the Chern number of the lowest band can be faithfully obtained despite the non-zero temperature [24].
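As a quick numerical cross-check of the quoted energy scale, the recoil energy E_rec = h²/(2mλ²) can be evaluated directly (a minimal sketch using standard constants; the small mass defect of 40K is ignored):

```python
# Recoil energy E_rec = h^2 / (2 m lambda^2) for 40K in a 1064 nm lattice.
h = 6.62607015e-34        # Planck constant (J s)
u = 1.66053906660e-27     # atomic mass unit (kg)
m = 40 * u                # mass of 40K (mass defect neglected)
lam = 1064e-9             # lattice laser wavelength (m)

E_rec = h**2 / (2 * m * lam**2)   # recoil energy in joules
f_rec = E_rec / h                 # recoil frequency in Hz

print(f"E_rec/h = {f_rec / 1e3:.2f} kHz")
```

The resulting recoil frequency of a few kilohertz sets the scale against which the sublattice offset and shaking frequencies above can be compared.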
For the detection of the state, all potentials are switched off, leading to a free expansion of the system known as time-of-flight imaging. The expansion maps the original momentum distribution onto the real-space density, which is then imaged by absorption imaging. This procedure can be related to Bloch-state tomography [62,63,65], which is based on the quench dynamics after projection onto the static lattice with large ∆_AB, realising the special case of zero hold time in the static lattice. This connection motivates the usefulness of the experimental images for detecting topology, although proper Bloch-state tomography explicitly relies on the full quench dynamics for disentangling the parameters [62].
In the experimental protocol, we hold the atoms in the Floquet system for different hold times t_hold at the final shaking frequency, in steps smaller than the Floquet period, in order to sample different instances of the Floquet micromotion φ. The micromotion phase is then given by φ = f_sh^fin (t_ramp/2 + t_hold) + f_sh^ini t_ramp. This convention traces the micromotion back to the start of the driving with a kick in a fixed direction and allows relating the micromotion phases of data with different shaking frequencies. The micromotion is an intrinsic property of Floquet systems, and while it can give rise to new physics [72,73], it is often a nuisance when studying the effective Floquet Hamiltonian [74,75].
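The sampling of micromotion phases can be sketched as follows (a minimal example; we assume here that φ is measured in drive cycles and that only its value modulo one cycle matters, and the numerical values are illustrative rather than the experimental ones):

```python
def micromotion_phase(f_fin, t_hold, f_ini=4.5e3, t_ramp=2e-3):
    """Micromotion phase (in drive cycles, taken mod 1) following the
    convention phi = f_fin * (t_ramp/2 + t_hold) + f_ini * t_ramp."""
    return (f_fin * (t_ramp / 2 + t_hold) + f_ini * t_ramp) % 1.0

# Hold times sampled in steps of a quarter Floquet period (example values).
f_fin = 6.0e3                      # final shaking frequency (Hz)
T = 1.0 / f_fin                    # Floquet period
phases = [micromotion_phase(f_fin, k * T / 4) for k in range(8)]
```

Stepping the hold time by a quarter period advances the phase by 0.25 of a cycle, so eight steps sweep the micromotion twice.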
For the analysis, we restrict the images to a square region of 56×56 pixels centered around zero momentum, k = 0, where 56 pixels correspond to the length of a reciprocal lattice vector (Fig. 1c). The images are furthermore individually normalised to the interval [0,1]. In total, we use 10,436 images with varying shaking phase, shaking frequency and micromotion phase, with just a few images per parameter set. While supervised learning often requires an additional large training data set at parameters that allow labelling, the unsupervised methods discussed below can identify the phase transitions from the data homogeneously sampled across the parameter space alone.
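The preprocessing described above amounts to a crop around a chosen center followed by a per-image rescaling; a minimal numpy sketch (the raw image and crop center below are synthetic placeholders):

```python
import numpy as np

def preprocess(image, center, size=56):
    """Crop a size x size window around the (row, col) center and rescale
    the pixel values of the crop to the interval [0, 1]."""
    r, c = center
    half = size // 2
    crop = image[r - half:r + half, c - half:c + half].astype(float)
    lo, hi = crop.min(), crop.max()
    return (crop - lo) / (hi - lo) if hi > lo else np.zeros_like(crop)

# Synthetic stand-in for a raw absorption image.
raw = np.random.default_rng(0).normal(size=(120, 140))
img = preprocess(raw, center=(60, 70))
```

The per-image normalisation removes overall intensity fluctuations between shots, leaving only the spatial structure.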

B. Machine learning methods
In the various machine learning applications of this article, we use deep neural networks (NNs) composed of combinations of fully-connected and convolutional layers [76]. After each layer, the output is processed by a non-linear activation function, which in this work is mainly the so-called rectified linear unit, ReLU(x) = max(0, x), or its variations (e.g. leaky ReLU [77]). The archetypical task of a NN is supervised learning, where the network output y_i^out is trained to approximate a desired label y_i for every input x_i of the dataset D = {x, y}. In order to achieve this task, we define a loss function l_i = l(y_i^out, y_i) that captures the success of this endeavour. Training then comes down to minimising the total loss function L = (1/N) Σ_{i=1}^N l_i with respect to the trainable parameters {ω} of the deep NN. Commonly used loss functions are the mean squared error and the binary cross-entropy. This high-dimensional optimisation problem can be tackled with gradient descent, where the parameters {ω} are iteratively shifted in the direction of the negative gradient, i.e. ω → ω − α∇_ω L, where the so-called learning rate α is a hyperparameter. We here mainly use more involved gradient-based optimisation strategies like ADAM [78] in order to speed up the training process.
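The training loop described above can be illustrated on a toy problem (a hedged sketch with ReLU features, a mean-squared-error loss and plain gradient descent, rather than the deep networks and ADAM optimiser used in this work):

```python
import numpy as np

def relu(x):
    # Rectified linear unit: ReLU(x) = max(0, x), applied element-wise.
    return np.maximum(0.0, x)

# Toy supervised task: labels generated by a known linear readout of
# ReLU features, which gradient descent should recover.
rng = np.random.default_rng(1)
x = rng.normal(size=(256, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = relu(x) @ w_true

w = np.zeros(4)
alpha = 0.1                        # learning rate (a hyperparameter)
for _ in range(500):
    y_out = relu(x) @ w
    # Gradient of the total MSE loss L = (1/N) sum_i (y_i^out - y_i)^2.
    grad = 2.0 / len(x) * relu(x).T @ (y_out - y)
    w -= alpha * grad              # gradient-descent update w -> w - alpha * grad
```

After a few hundred updates the parameters converge to the generating weights, illustrating the minimisation of L by repeated steps along the negative gradient.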
In this work, we employ a special NN architecture called an autoencoder (AE) [79][80][81]. An AE is composed of two subsequent deep NNs, called the encoder and the decoder. The neurons z at the output of the encoder are called bottleneck neurons, and the dimension of this space is typically chosen to be smaller than that of the input space. The task of an AE is to find an efficient compression of the input data through the encoder at the bottleneck, from which the decoder is able to reproduce the original input x_i at the output stage y_i^out. The loss function l(x_i, y_i^out) is therefore defined between the input data and the output of the AE, and there is no need for labelled data. In this work, we found it sufficient to use the mean squared error as the total loss function. AEs are typically used in the context of unsupervised learning for tasks such as dimensionality reduction or anomaly detection. We will later use the success of this compression, as measured by the loss L, to differentiate between the phases of a phase diagram in section III E. Finally, an autoencoder can also be trained in a supervised way, given a series of inputs associated with corresponding outputs. Such a supervised method has applications in image denoising or colorisation [82][83][84].
A variational autoencoder (VAE) [85,86] differs from the previous architecture by the introduction of an engineered regularisation at the bottleneck. Instead of encoding the input x into a feature z(x) in the latent space, it is encoded into a probability distribution p(z|x). A feature z is then sampled from p(z|x) and passed to the decoder. This introduces two main advantages. First, the feature space is regularised such that the neuron activations at the bottleneck are more interpretable [87]. Second, this way we can generate new data after training by sampling at the bottleneck. Here, however, we use the variational autoencoder to gain more stability in the transformation of experimental data. Additionally, one can introduce a question neuron, i.e. an extra input neuron feeding directly into the bottleneck. The information provided there can, for example, be a physical parameter corresponding to the input image. We will later use VAEs to postprocess the data to a fixed micromotion phase.
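The sampling step at the bottleneck is commonly implemented with the reparameterisation trick; a minimal sketch for a latent space of 13 Gaussian distributions, each parameterised by a mean and a log-variance (the specific dimensions and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bottleneck(mu, log_var, rng):
    """Reparameterisation trick: draw z ~ N(mu, sigma^2) as
    z = mu + sigma * eps with eps ~ N(0, 1), keeping the mapping
    differentiable with respect to mu and log_var."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# 13 latent distributions, i.e. 26 encoder outputs (a mean and a
# log-variance per distribution); the values here are arbitrary.
mu = rng.normal(size=13)
log_var = -2.0 * np.ones(13)

z = sample_bottleneck(mu, log_var, rng)
samples = np.stack([sample_bottleneck(mu, log_var, rng) for _ in range(4000)])
```

Averaged over many draws, the samples reproduce the encoded means and standard deviations, which is exactly what the decoder is trained against.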
The influence function [88,89] is an interpretability method that can be understood as a numerically feasible approximation of leave-one-out (LOO) training. LOO training consists of retraining the model after removing a single training point and checking how this changes the test loss of the prediction on a chosen test point. If the prediction got worse (better), i.e. the test loss got larger (smaller), the removed training point was a helpful (harmful) one. However, such an analysis with LOO trainings is prohibitively expensive because, for the full picture, it requires a number of training runs equal to the training dataset size multiplied by the number of chosen test points. In contrast, the most demanding step in the numerical approximation by influence functions is a single computation of the inverse of the Hessian of the training loss with respect to the model parameters.
An influence function estimates the influence that a single training point has on a prediction made for a single test point. In this paper, we apply it to a convolutional neural network (CNN) trained in a supervised way. Analysis of the most influential points indicates which data features are dominant in the model's predictions. We use this property in section III C. Moreover, we interpret training points which are similarly influential for a particular prediction as similar. In this way, the influence-function values, I, provide a notion of the similarity found by a trained network in a given problem. Thanks to this property, they allow finding distinctive similarity regions within the same class [39], which indicates improper labelling and proves useful in section III F. Analysing which training points the model regards as similar and which data features are dominant in its predictions increases the model's interpretability. Influence-function values, I(x_train, x_test), can only be compared for a fixed test point and various training points, and within the same model. Therefore, we need to fix a test point whenever we calculate a set of I values for the similarity analysis.
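The leave-one-out estimate behind influence functions can be sketched on a tiny model. Removing a training point shifts the parameters by roughly H⁻¹∇_ω l(x_train)/N, so the predicted change in test loss is ∇_ω l(x_test)ᵀ H⁻¹ ∇_ω l(x_train)/N [88]. Below, a regularised logistic regression on synthetic data stands in for the trained CNN (all data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny logistic-regression model standing in for the trained CNN.
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = (X @ w_true > 0).astype(float)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Fit by Newton's method on the L2-regularised cross-entropy training loss.
lam = 0.1
w = np.zeros(3)
for _ in range(50):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(X) + lam * w
    H = (X * (p * (1 - p))[:, None]).T @ X / len(X) + lam * np.eye(3)
    w -= np.linalg.solve(H, grad)

# Influence of each training point on the loss at one fixed test point:
# estimated increase of the test loss if the point were removed,
# so positive values mark helpful points and negative values harmful ones.
x_te, y_te = np.array([1.0, 0.0, 0.5]), 1.0
g_te = x_te * (sigmoid(x_te @ w) - y_te)          # grad of test loss
g_tr = X * (sigmoid(X @ w) - y)[:, None]          # one gradient per training point
influence = (g_tr @ np.linalg.inv(H) @ g_te) / len(X)
```

The single Hessian inversion replaces 200 separate retrainings, which is the computational advantage described above.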
All machine learning techniques were implemented using NumPy, PyTorch and Tensorflow [90][91][92]. The specifics of the architectures with reproducible code for all performed tasks can be found in our notebooks [93].

III. RESULTS

A. Latent-space interpretation of autoencoders
As a first step, we produce and analyse a low-dimensional representation of the data in the latent space of an autoencoder formed by the activations of the bottleneck neurons. Autoencoders are important tools for unsupervised learning [94]. The autoencoder consists of several convolutional layers and a fully connected bottleneck formed by two neurons (Fig. 2a). We choose a 2D latent space to create an easy-to-understand visual representation of the given samples. The complete implementation details can be found in our notebooks [93]. We checked that choosing more dimensions in latent space does not lead to an improvement. The autoencoder is trained on the complete dataset.
The two-dimensional latent-space representation of all images yields a dense cloud of data points without an apparent clustering (Fig. 2b). The picture becomes clearer when we restrict the data to fixed shaking phases, i.e. vertical cuts through the phase diagram (Fig. 2c-d). The data then lies on elliptical structures, with the radius related to the shaking frequency. For further analysis, we fit an ellipse using direct least-squares fitting [95] and perform a coordinate transformation to extract the elliptical coordinates, the radius r and the azimuthal angle θ measured from the major axis of the fitted ellipse.
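The ellipse fit and the change to elliptical coordinates can be sketched as follows (a simplified algebraic least-squares conic fit on synthetic latent-space points, not the exact method of [95]):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic latent-space points on a noisy ellipse, standing in for the data
# of one vertical cut (semi-axes 3 and 1.5, center (1, -0.5)).
t = rng.uniform(0, 2 * np.pi, 300)
x = 1.0 + 3.0 * np.cos(t) + 0.02 * rng.normal(size=t.size)
y = -0.5 + 1.5 * np.sin(t) + 0.02 * rng.normal(size=t.size)

# Algebraic least-squares conic fit: a x^2 + b xy + c y^2 + d x + e y = 1.
A = np.c_[x**2, x * y, y**2, x, y]
a, b, c, d, e = np.linalg.lstsq(A, np.ones(t.size), rcond=None)[0]

# Center from the stationary point of the conic; major-axis direction as the
# eigenvector of the smallest eigenvalue of the quadratic form.
cx, cy = np.linalg.solve([[2 * a, b], [b, 2 * c]], [-d, -e])
M = np.array([[a, b / 2], [b / 2, c]])
major = np.linalg.eigh(M)[1][:, 0]     # eigh sorts eigenvalues ascending
theta0 = np.arctan2(major[1], major[0])

# Elliptical coordinates: radius and azimuth measured from the major axis.
u, v = x - cx, y - cy
radius = np.hypot(u, v)
theta = (np.arctan2(v, u) - theta0) % (2 * np.pi)
```

With the major axis of the synthetic ellipse along x, the fitted orientation comes out near zero (up to the sign ambiguity of the eigenvector) and the maximal radius near the semi-major axis.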
The azimuthal angle can be clearly connected to the micromotion phase, showing a linear dependence (Fig. 2e) for a shaking phase of ϕ = 90°; the plots for other shaking phases look similar. The same dependence can also be seen in the azimuthal coordinate of the center of mass of the raw images, which provides a direct connection between the time-of-flight images and the latent space; see appendix A for further details. We furthermore explore possible information hidden in the radial coordinate (Fig. 2f). The mean radius, with the standard deviation over all images of a given shaking frequency as error bar, traces out a monotonously decreasing curve with shaking frequency, with some signs of plateaus but without a sufficiently clear separation for making a prediction of phase transitions. The latent-space representation can thus be physically interpreted via the micromotion, but it cannot provide an identification of the topological phases: the micromotion is the dominant feature, hiding possible signatures of the topological phase transitions. We therefore try to eliminate it in the following section III B.
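The azimuthal center-of-mass coordinate of an image mentioned above can be computed directly; a minimal sketch (the Gaussian blob image is synthetic and the sign conventions are illustrative):

```python
import numpy as np

def com_azimuth(img):
    """Azimuthal angle of the image center of mass, measured from the
    geometric center of a square (e.g. 56 x 56) momentum-space image."""
    n = img.shape[0]
    idx = np.arange(n) - (n - 1) / 2
    total = img.sum()
    cy = (img.sum(axis=1) @ idx) / total   # row (y) center-of-mass offset
    cx = (img.sum(axis=0) @ idx) / total   # column (x) center-of-mass offset
    return np.arctan2(cy, cx)

# A blob displaced along +y (rows) should give an azimuth near pi/2.
yy, xx = np.mgrid[0:56, 0:56]
blob = np.exp(-(((xx - 27.5) ** 2) + (yy - 37.5) ** 2) / 20.0)
angle = com_azimuth(blob)
```

Tracking this angle over the hold times gives the direct image-space analogue of the azimuthal latent-space coordinate.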

B. Data postprocessing to desired micromotion phase
In Floquet systems, the micromotion poses an additional challenge for identifying phase transitions. We find that the unsupervised machine learning only works if all images have the same micromotion phase, i.e. the center-of-mass displacement is in the same direction. Because the data was taken with a sampling of different micromotion phases, we need to postprocess it to the desired sampling of a fixed micromotion phase. We show that this can be accomplished by machine learning techniques based on VAEs, which are a powerful tool for data transformation and generation [85,86]. Our architecture uses an additional question neuron at the bottleneck, which has previously proven successful in identifying relevant physical properties [36]. Here, we use the additional question neuron for supervised training of the VAE. The given challenge is related to tasks such as fringe removal in absorption imaging [96] or the removal of timing jitter in pump-probe experiments [97], but we believe that our VAE-based method is very broadly applicable for postprocessing to a desired sampling in different experimental scenarios.
The encoder of the VAE consists of several convolutional stages and several layers of fully connected neurons. The last layer of the encoder has 26 fully connected neurons, so the latent space comprises 13 probability distributions (a mean and a variance each). The decoder again has several fully connected layers, followed by a few transposed convolutional stages. The first fully connected layer is also attached to the input of the question neuron. In total, the autoencoder has over 3 million trainable parameters. The complete implementation details can be found in our notebooks [93]. To optimise the hyperparameters of our autoencoder, we use the hyperparameter optimisation library optuna [98] and train over 60,000 different network architectures. To identify the best-working network, we use the structural similarity index [99] as a performance measure. For each point in the phase diagram with a fixed shaking frequency and shaking phase, we took several images with different micromotion phases by varying the hold time. As a new dataset, we select all combinations and permutations of images for a fixed shaking frequency and shaking phase and calculate their micromotion phase difference ∆φ = φ_output − φ_input. The dataset thus includes 63,050 image pairs with a given ∆φ. In contrast to the other autoencoders employed in the context of this paper, the input and output therefore differ for this VAE. We randomly choose 10% of the dataset for validation purposes and hide it from the network during training. We train with one image as input, an image with the same shaking frequency and shaking phase but a different micromotion phase as the target output, and the micromotion phase difference as the input to the question neuron. Samples of original images are given in figure 3a.
After training, we use the variational autoencoder to transform all original images in the dataset to a micromotion phase of φ_0 = 0 by feeding the negative of each image's micromotion phase into the question neuron. The postprocessed images with a single micromotion phase are similar to the original data, except for some noise removal and a squeezing of the distribution of pixel values to a range of 0.3 to 0.7, which we attribute to the non-linear activations in the network (Fig. 3b).
Because the micromotion is directly related to the center of mass of the images, one can see the success of the postprocessing in the example images in Fig. 3b: after the micromotion is set to a fixed value, the images with different original micromotion phases look very similar and have the same direction of the center-of-mass displacement. This is further confirmed by a comparison of the distributions of the azimuthal center-of-mass coordinates for all images before and after rephasing (Fig. 3c,d). The highlighted blue data in Fig. 3c are the centers of mass for the micromotion phase φ = 0. The narrow distribution in Fig. 3d shows that the rephasing was successful. For the identification of the phase transitions with unsupervised machine learning discussed below, it is important that we use the postprocessed data with a single micromotion phase.

C. Confirming the postprocessing to a desired micromotion phase with influence functions
To get further evidence that the postprocessing described in section III B was successful, we confirm it with the influence functions introduced in section II B. Influence functions provide an interpretation of the machine learning model by indicating which training points are influential for a chosen prediction. Analysing the most influential examples can reveal the characteristics which impact the machine learning predictions.
Firstly, we train a convolutional neural network (CNN) in a supervised way to classify the original images, which still carry their micromotion phase. Instead of analysing the whole 2D diagram, we consider only a single cut at the fixed shaking phase of −90°, which simplifies the visualisation of the results without changing them. Within this cut, the Chern number of the system changes from 0 to −1 and back to 0 with increasing shaking frequency. The labelled training data then contains only two phases (C = 0 and C = −1). To avoid influence from experimental imperfections, we exclude data close to the theoretically predicted phase transitions. With the trained CNN, we calculate the influence functions to determine how influential the whole training dataset is for a prediction chosen to lie in the transition region. The results are presented in figure 4b,c, with the black cross indicating the test point and dots representing the training data and their color-coded influence-function values. Colors vary from red for helpful training points, through green for least influential (ignored), to blue for harmful.
We see in panel b that the most influential (both helpful and harmful) data for the chosen prediction are those with both the most similar shaking frequency and micromotion phase. Learning the shaking frequency is expected as it is the parameter governing the phase transition. However, the CNN also regards the micromotion phase as influential when making a prediction, while we know that this property is physically irrelevant for the transition. Micromotion phase is an intrinsic property of the Floquet realization of the topological Hamiltonian, but does not change the topology of the effective Floquet Hamiltonian.
We perform the analogous analysis for the postprocessed training data, i.e. with the micromotion phase removed. Panel c in figure 4 shows that the most influential points are now randomly distributed along the original micromotion-phase axis. This tells us that the CNN no longer sees this parameter as influential and confirms that the data was successfully postprocessed to a constant micromotion phase.
Let us note that the validation and test accuracies of the trained CNN are similar whether it is trained on data with or without the micromotion. Interestingly, this means that in this set-up the predictive power of the network is not impacted by learning a quantity which is physically irrelevant.

D. Clustering in latent space
In the following sections, we apply different unsupervised machine learning methods to the postprocessed data with a constant micromotion phase and compare their performance. We start with a method of the clustering category, which identifies clusters in a low-dimensional representation of the data as different phases. Specifically, we employ the same convolutional autoencoder as in section III A, but now applied to the postprocessed data.
The latent space representation of the complete dataset is shown in figure 5b. The data is color encoded by the theoretical predictions for the Chern number. It appears that the different topological classes tend to form ring structures in the 2D latent space. As in section III A, we now restrict ourselves to single shaking phase cuts. Figure 5c and d show the latent space for two such cuts and we observe three main clusters related to three frequency regimes.
Using k-means clustering, a standard clustering method, it is possible to automate the clustering of the different latent-space representations of the observed image data. We use the k-means functionality of Scikit-learn [100], set the number of clusters to 3 and the maximum number of iterations to 500, and keep all other parameters at their documented defaults. We tried different random seeds to verify stability. The results of the cluster analysis are shown in figure 5e and f. This allows predicting phase boundaries in good agreement with the theoretical predictions given by the dashed lines. The slight shift to higher frequencies is in accordance with all other methods and is explained in section II A. Evaluating the data for all shaking phases allows reconstructing the complete 2D Haldane phase diagram shown in figure 5f.
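The clustering step can be sketched with a plain implementation of Lloyd's algorithm on synthetic blobs (a minimal stand-in for the scikit-learn KMeans call used here; the deterministic seeding and the toy data are for illustration only):

```python
import numpy as np

def kmeans(X, k, n_iter=500, init=None, seed=0):
    """Plain Lloyd's algorithm, a minimal stand-in for scikit-learn's
    KMeans(n_clusters=k, max_iter=500). `init` optionally fixes the
    indices of the initial centers for a deterministic start."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False) if init is None else init
    centers = X[np.asarray(idx)].astype(float)
    for _ in range(n_iter):
        # Assign each point to its nearest center, then move each center
        # to the mean of its assigned points; stop when nothing changes.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Three well-separated blobs standing in for the three frequency regimes of
# one latent-space cut; one seed point per blob keeps the demo deterministic.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(m, 0.3, size=(50, 2))
                    for m in ([0, 0], [4, 0], [0, 4])])
labels, centers = kmeans(X, k=3, init=[0, 50, 100])
```

Each blob is assigned a single label, mirroring how the three latent-space clusters are mapped to the three frequency regimes of a cut.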
The procedure of separately analysing vertical cuts through the phase diagram fundamentally cannot distinguish between the C = 1 and C = −1 phases at positive and negative shaking phases. Furthermore, a similar analysis of horizontal cuts along constant shaking frequencies through the phase diagram does not produce a clustering. Therefore, further methods are required to fully identify the topological phases.

E. Anomaly detection scheme
We followed the approach of [18] and performed an unsupervised learning scheme called anomaly detection to map out the phase diagram in a few training iterations. We used a convolutional autoencoder, similar to the network described in section III B. The network consisted of an encoder and decoder made of two convolutional layers each, with a fully-connected bottleneck of 50 units. For full details about the model and process, we refer the interested reader to [93], where all steps described in this section can be exactly reproduced. The idea is the following: we started by defining a region of the phase diagram in which we trained the autoencoder to encode and decode the images with low mean-squared error L_MSE(A_in, A_out) between input and output images A_in, A_out ∈ R^(56×56). The network learns the characteristic features of the phase that the images were taken from and fails to reproduce images from the other phases. By looking at the loss for all images after training on only part of the phase diagram, we distinguished the phase the network was trained on from the remaining phases via different plateaus of the loss function. Furthermore, by fitting a sum of tanh functions to the loss curve as a function of shaking frequency, we obtained a phase boundary. We then repeated this process by training in the region of high loss from the previous training round until we found all boundaries.
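The boundary extraction from the loss curve can be sketched as follows (synthetic loss data with a single step; a simple grid scan over the step position and width stands in for the non-linear tanh fit, with the amplitudes solved by linear least squares):

```python
import numpy as np

# Synthetic reconstruction loss vs shaking frequency: a low plateau in the
# training region and a high plateau beyond a boundary at f0 = 6.1 kHz.
f = np.linspace(4.0, 8.0, 81)
true_f0, true_w = 6.1, 0.15
loss = 0.02 + 0.04 * (1 + np.tanh((f - true_f0) / true_w))
loss += 0.003 * np.random.default_rng(0).normal(size=f.size)

# Fit A + B * tanh((f - f0) / w) by scanning f0 and w on a grid; for each
# candidate, the amplitudes A and B follow from linear least squares.
best_res, boundary = np.inf, None
for f0 in np.linspace(4.5, 7.5, 301):
    for width in (0.05, 0.1, 0.15, 0.2, 0.3):
        basis = np.c_[np.ones_like(f), np.tanh((f - f0) / width)]
        coef = np.linalg.lstsq(basis, loss, rcond=None)[0]
        res = np.sum((basis @ coef - loss) ** 2)
        if res < best_res:
            best_res, boundary = res, f0
```

The fitted step position `boundary` is then read off as the phase boundary; with several plateaus, a sum of such steps is fitted instead.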
We show this process in figure 6, where we started by training in the low-frequency regime (Training 1); the training region is indicated by the cyan rectangle. We identified three different plateaus in the loss, between which we obtained two boundaries. As seen in panel 1b), showing a single cut of the phase diagram at −90°, the different phases form plateaus, so we fitted tanh functions to estimate the boundaries, indicated by the grey lines. We continued the process and trained in the high-frequency regime, where the first training round had yielded the highest loss. As seen in Training 2 (Fig. 6), we found the reverse picture, with a clear boundary slightly above the theoretically predicted transition. This boundary from Training 2 nicely coincides with the second boundary from Training 1. To complete the picture, we also trained in the intermediate-frequency regime, which yielded higher loss in the previous training iterations. Here the training region is significantly smaller, yet the results still match those of the previous training iterations nicely. Note that, generally, the results are robust with respect to the size of the training region in frequency.
We show this in figure 11 in appendix C. The images provide sufficient information to separate the different phases and to map out the phase diagram with this method. We present the predicted phase diagram in figure 7. We note that with this method it is not possible to differentiate the non-trivial topological phases with Chern numbers 1 and −1, because the trained compression does not generalize well in the shaking-phase parameter. We provide further details in appendix C. In the following section, we overcome this shortcoming and complete the phase diagram, i.e. separate the two topological regions, by employing influence functions. We see that for all three training rounds the transition between intermediate and high frequencies lies slightly above the theoretically predicted one, which is due to a mismatch between theory and experiment, as discussed in section II A.

F. Analysis of data similarity within three phases with influence functions
After obtaining the phase boundaries from the anomaly detection scheme described in the previous section, we analyse how similar the data are within the three distinguished phases. Such an analysis can not only confirm the predictions of unsupervised machine learning schemes but also reveal the existence of additional phase transitions. To this end, we train a CNN on the post-processed experimental data, i.e. with a single micromotion phase, with labels assigned by the anomaly detection scheme. We therefore have three labels, corresponding to the low-frequency, topological, and high-frequency phases. We employ influence functions, described in section II B, to analyse which training data are influential for a given prediction. Training data {x_train} with similar influence values I(x_train, x_test) for a particular test point x_test can then be interpreted as similar from the machine learning model's point of view in a given problem.
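A minimal sketch of the influence-function computation may help here. Instead of the paper's CNN, the example below uses logistic regression on toy data, where the Koh–Liang influence I(x_train, x_test) = ±∇L(x_test)ᵀ H⁻¹ ∇L(x_train) is tractable in closed form; the sign convention is chosen here, as an assumption, so that helpful training points get positive values, matching the red/blue colouring described in the text. The data set, damping constant, and optimiser settings are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy binary classification standing in for the CNN.
n, d = 120, 2
X = rng.standard_normal((n, d))
y = (X[:, 0] + 0.3 * rng.standard_normal(n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit theta by plain gradient descent on the mean logistic loss.
theta = np.zeros(d)
for _ in range(2000):
    p = sigmoid(X @ theta)
    theta -= 0.1 * X.T @ (p - y) / n

def grad_loss(x, label, th):
    """Gradient of the per-sample logistic loss w.r.t. the parameters."""
    return (sigmoid(x @ th) - label) * x

# Hessian of the mean training loss, damped for numerical stability.
p = sigmoid(X @ theta)
H = (X * (p * (1 - p))[:, None]).T @ X / n + 1e-3 * np.eye(d)

def influence(x_train, y_train, x_test, y_test):
    """Influence of one training point on the test loss.
    Sign chosen so helpful points (upweighting them lowers the
    test loss) are positive -- an assumed convention."""
    return grad_loss(x_test, y_test, theta) @ np.linalg.solve(
        H, grad_loss(x_train, y_train, theta))

x_test, y_test = X[0], y[0]
I = np.array([influence(X[i], y[i], x_test, y_test) for i in range(n)])
# A training point identical to the test point is necessarily helpful.
print(I[0] > 0)
```

Scanning I over the whole training set for a fixed test point, as done here for the CNN, produces exactly the kind of similarity maps shown in figure 8.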
To analyse the similarity of training data, we need to compare I(x_train, x_test) and therefore fix the test point for which I(x_train, x_test) is calculated. In figure 8, we plot three sets of I(x_train, x_test) calculated for all training data and three different test points: the first is located in the low-frequency regime (panel a), the second in the intermediate-frequency regime (panel b), and the third in the high-frequency regime (panel c). Each element of the phase diagrams in the upper row indicates the color-coded I value for the corresponding test point, marked with a black cross. If an element corresponds to more than one training point (i.e. if multiple measurements were performed for a given frequency and shaking phase), we plot the mean of I(x_train, x_test). Red (blue) color indicates the most helpful (harmful) training points for a given prediction. White dots correspond to the lack of available training data. The lower row of figure 8 contains the mean of I(x_train, x_test) for a single cut across the phase diagram at a fixed shaking frequency, f_sh. The error bars represent the standard deviation, which is non-zero for all shaking phases for which multiple measurements were taken.
In panel a, we see that the low-frequency training data are all similarly influential for the model when it predicts that the black-cross test point belongs to the same low-frequency phase. This uniformity is also well visible in the cut through the phase diagram in the lower panel of figure 8. Apart from single I(x_train, x_test) values with large variation, indicating experimental outliers in the training set, the similarity pattern is quite uniform. Notice, however, the symmetric logarithmic scale for I. Ignoring the outliers, the I values span almost one order of magnitude. The lowest I values, of around 5·10^−6, are located at negative shaking phases, and the largest I values, around 3·10^−5, belong to training points whose positive shaking phase is similar to that of x_test. This tells us that the shaking phase is an influential factor for the predictions in the low-frequency phase, but not a determining one; otherwise, the largest I values would be much more localized along the shaking-phase axis. We also note that the I values always highlight the boundaries between phases, for two reasons. Firstly, data around the phase transitions are usually the most confusing for the model: they are labelled as belonging to one of the phases while being representative of none. The second reason is of a purely numerical nature. Regardless of whether the boundaries coincide with the physical ones, the data around them play a unique role in the training, containing the most important information for the model. In general, we expect the confusing phase-transition regions, indicated by large I values, to be broader in experimental data than in numerical studies [39], because the experimental system is finite and inhomogeneous and the phase transition is therefore intrinsically broadened.
Panel c shows even more uniform behaviour. It contains I values for a test point localised in the high-frequency regime. What may seem surprising is that almost all I values are practically zero, meaning that none of the training points has a significant influence on the chosen prediction. The reason is that the prediction for the test point in panel c is made with extremely high certainty, which directly affects the I values: |I(x_train, x_test)| is proportional to the uncertainty of the prediction made on x_test. When the prediction's uncertainty is very low, the I values are also very small.
Panel b is analogous to the previous panels, but this time the test point for which the I values are computed is localised in the intermediate-frequency regime (which we know contains two topological phases). The striking feature of panel b is the lack of uniformity in the intermediate-frequency regime, well visible in the lower plot of panel b, which contains the I values for a single cut through the phase diagram at a fixed frequency of 6.6 kHz. Between the two plateaus, i.e. around shaking phases of 0° and 180°, there are significant dips in the influence functions' values, reaching negative values. They show that the training data in this part of the diagram are different enough to be harmful for the analysed prediction: they are misleading for the model because they carry the same label (belonging to the topological phase) while actually being quite different. This is analogous to the reason why influence functions always highlight boundaries between phases. It leads to the conclusion that within the anomaly-detected intermediate-frequency regime there are additional boundaries, and therefore more phases. Another observation supporting this conclusion is the presence of two similarity plateaus at negative and positive shaking phases, separated by the detected boundaries. They are well visible in the lower plot of panel b. The average values of the two plateaus differ by almost an order of magnitude, indicating two distinct patterns. At the same time, these patterns are more similar to each other than to the low- or high-frequency phase, which suggests a similar character of the two phases detected in the intermediate-frequency regime.
The similarity analysis described above reveals the existence of two phases within the anomaly-detected topological phase. We note that this analysis is vastly simplified by removing the micromotion phase from the time-of-flight images. The results from section III C show that, before the post-processing, the micromotion phase was a very influential factor for the trained CNN, so the analysis would otherwise need to include the impact of the micromotion phase on the CNN's predictions.

IV. CONCLUSION
In this article, we have applied different unsupervised machine learning methods to identify topological phase transitions in experimental data of a Haldane-like model realised with ultracold atoms. The topological phase diagram of the elliptically shaken hexagonal lattice hosts topologically non-trivial phases at intermediate shaking frequencies and trivial phases at both low and high shaking frequencies. Furthermore, the sign of the Chern number changes with the sign of the shaking phase, i.e. the orientation of the shaking, giving rise to two distinct non-trivial phases.
A necessary step for the successful unsupervised learning was fixing the micromotion phase, inherent to the Floquet realisation of the topological phases, via a variational autoencoder with a question neuron. This post-processing of the experimental data to the desired sampling is an exciting tool in its own right. Both a clustering analysis in an appropriate low-dimensional representation of the data and anomaly detection in the loss function correctly identified the three regions as a function of shaking frequency. The correct identification of the two regions with opposite signs of the Chern number was only possible by combining this information with the insights from influence functions applied to supervised training on the incomplete phase diagram. In total, the full phase diagram, which can also be identified via supervised machine learning on labelled data, could be obtained in a fully unsupervised way by combining the different methods.
The successful identification of the phase diagram demonstrates that unsupervised machine learning can correctly identify phases even for noisy data and despite the finite temperature of the system. In the future, these methods can be applied to exotic quantum many-body systems with unknown phase diagrams or hidden order [26,27] to support the interpretation of the data and to guide the experimental exploration of the parameter space.

Appendix A: Micromotion of the center of mass

In figure 9 we show the dependence of the azimuthal coordinate of the center of mass on the micromotion phase. For circular shaking, i.e. a shaking phase of ±90°, the center of mass moves in a circular fashion, yielding a linear dependence between the azimuthal coordinate of the center of mass and the micromotion phase. For linear shaking, i.e. a shaking phase of 0° or 180°, the center of mass moves along a diagonal line, yielding a constant azimuthal coordinate of ±45°, with only an irrelevant jump by 180° for certain micromotion phases. Other shaking phases interpolate between these two behaviors. In conclusion, the movement of the center of mass of the momentum distribution follows the shaking trajectories as expected.

Appendix B: PCA Analysis
In addition to the methods in the main text, we also applied principal component analysis to the processed data. In figure 10 we show four selected components of this analysis. The different principal components clearly show features that correspond well with the theoretical predictions, such as sharp local minima at the expected phase transitions. However, the data also show features that are not related to phase transitions, such as a strong dependence on the shaking phase in the trivial regions. Therefore, the analysis does not provide a clear recipe for identifying the topological phase transitions in a completely unsupervised way; this holds in particular for the choice of the components to be analysed.
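For reference, the PCA used here can be sketched in a few lines. The example below works on synthetic data (a single step feature standing in for a transition signature; the frequency range and step position are invented) and shows how the leading component tracks such a step, as well as why component selection matters: lower components mostly capture noise.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data standing in for the processed images: flat vectors whose dominant
# variation is a step at a hypothetical critical frequency of 7 kHz.
n_samples, n_features = 100, 30
freqs = np.linspace(5.0, 9.0, n_samples)
X = 0.1 * rng.standard_normal((n_samples, n_features))
X[:, 0] += np.where(freqs > 7.0, 1.0, -1.0)   # step across the "transition"

# PCA via SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T            # projections onto the principal components

# The leading component tracks the step: its score changes sign
# across the transition frequency (up to the overall sign of the PC).
pc1 = scores[:, 0]
print(np.sign(pc1[:10].mean()) != np.sign(pc1[-10:].mean()))
```

On the real images, the analogous component scores are what is plotted in figure 10; the difficulty noted above is that several components carry comparable structure, with no unsupervised criterion for picking the right one.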
Appendix C: Anomaly detection in phase direction
We perform a consistency check to confirm that the method is well behaved in the frequency parameter. For this, we train with different sizes of the training region in each of the three respective phases. As seen in fig. 11, the obtained results are consistent and we are confident about the three boundaries presented in the main text. In panel c) there are discrepancies inside the blue-detuned trivial phase where we perform the training. However, the onset of the plateau of the topologically non-trivial phase is consistent for all four training-region sizes. We find that the method does not generalize well when performing the same analysis in the shaking-phase parameter. We show in fig. 12, panels 1a-c, how for smaller boxes in the phase parameter the network has problems reproducing images from the same phase but with different, unseen shaking-phase parameters. This is the reason we are not able to differentiate between the