Introduction

Today’s scientific computing and experiments often produce petabytes of floating-point data that need to be stored for post-processing or transferred to different computing centers. For example, modern lattice quantum chromodynamics (QCD) simulations targeting high precision generate \(O({\text {PB}})\) of data1,2 that must be kept on storage systems for long-term analysis. In many applications, only a few significant figures of the stored data are required for the analysis, so lossy data compression algorithms are viable approaches to reducing the data storage requirement and increasing the effective bandwidth for data movement.

Various lossy data compression algorithms recently proposed for floating-point arrays of scientific data include ISABELA3, ZFP4, SZ5,6,7, and NUMARCK8. ISABELA provides in-situ compression based on interpolation using B-splines9 after sorting multidimensional scientific data. ZFP uses block transformation for decorrelation and bit-plane encoding for a fixed-rate lossy compression. SZ is an error-bounded lossy compression algorithm based on fitting and predicting successive data points. NUMARCK achieves data compression by approximating the temporal changes using the K-means clustering algorithm10.

For statistical data, it is possible to detect the correlation pattern of the data components using machine learning techniques and exploit the learned correlation for efficient lossy data compression. Examples are the approaches based on the autoencoder11,12,13. Unsupervised machine learning techniques allow us to find efficient codings of the input data. They work as data compression algorithms since the coding is typically a lower-dimensional representation of the original data. The compression performance of representation learning can be maximized by restricting the codes to binary variables so that each code can be stored in a single bit. However, an encoder with binary codes generally involves binary optimization to find the optimal codes, which is an NP-hard problem.

Quantum annealing is an approach to solving such optimization problems using adiabatic quantum computation (AQC). In contrast to gate-model quantum computation14,15, AQC performs an adiabatic time evolution from the ground state of a simple initial Hamiltonian to the ground state of the final Hamiltonian of the target problem16,17,18. When the Hamiltonian is evolved sufficiently slowly, the adiabatic theorem guarantees that the state remains in the ground state of the instantaneous Hamiltonian. Quantum annealing is AQC with a relaxation of the adiabatic condition in an open system at finite temperature19. It solves combinatorial optimization problems using quantum tunneling through barriers between local minima20,21,22,23. Although whether it can achieve a quantum speedup over classical computation is still controversial24, quantum annealing has been successfully applied to various NP-hard and NP-complete problems25,26,27,28,29.

The quantum processor of the D-Wave systems realizes quantum annealing to find low-lying energy states of a target Ising Hamiltonian, starting from the transverse-field Hamiltonian30. In general, quantum annealing requires an exponentially large number of samples to recover the optimal solution of a large binary optimization problem. However, the annealing time for each sample is only O(10) microseconds. Furthermore, we show in this work that low-energy solutions, which can be obtained from only a small number of samples, are sufficient for our proposed sparse-coding compression algorithm.

In this paper, we propose a new data compression algorithm based on representation learning. For maximum compression, we use binary codes and formulate the problem in a quadratic unconstrained binary optimization (QUBO) form so that it can be solved on Ising solvers such as the D-Wave quantum annealer, as described in “Method” section. As a result, the algorithm guarantees the compression ratio while optimizing for the smallest loss. We also present a bias correction procedure that removes the bias and estimates the errors due to the inexact reconstruction of the lossy-compressed statistical data. In “Numerical experiments” section, the proposed compression algorithm is demonstrated for two different lattice QCD datasets using simulated annealing and D-Wave’s 2000Q and Advantage systems.

Method

Data compression

The goal of this algorithm is to find a matrix \({\varvec{\phi }}\in {\mathbb {R}}^{D\times N_q}\) and binary coefficients \({\user2{a}}^{(k)}\in \{0,1\}^{N_q}\) that precisely reconstruct the input vectors \({\mathbf {X}}^{(k)}\in {\mathbb {R}}^{D}\) such that \({\mathbf {X}}^{(k)} \approx {\varvec{\phi }} {\user2{a}}^{(k)}\) for all data indices \(k=1,2,3,\ldots , N\). The procedure defines a mapping from \({\mathbf {X}}\)-space to \({\user2{a}}\)-space:

$$\begin{aligned} \left\{ {\mathbf {X}}^{(k)}\vert {\mathbf {X}}^{(k)}\in {\mathbb {R}}^{D}, k=1,2,\ldots ,N\right\} \longrightarrow \left( \left\{ {\user2{a}}^{(k)} \vert {\user2{a}}^{(k)}\in \{0,1\}^{N_q}, k=1,2,\ldots ,N\right\} , {\varvec{\phi }}\in {\mathbb {R}}^{D\times N_q}\right) . \end{aligned}$$
(1)

Here the coefficients in the \({\user2{a}}\)-space are restricted to binary variables so that each can be stored in a single bit. Additionally, we restrict \(N_q \ll N\) so that the memory usage of \({\varvec{\phi }}\) is small compared to the uncompressed data; for the high-statistics datasets where compression is necessary, \(N\gtrsim {\mathrm {O}}(10^4)\). As a result, the data in \({\user2{a}}\)-space use less memory than the data in \({\mathbf {X}}\)-space, which results in data compression.

One possible solution of the mapping (\(\{{\user2{a}}^{(k)}\}\) and \({\varvec{\phi }}\)) can be obtained by minimizing the mean square error of the reconstruction as follows:

$$\begin{aligned} \min \limits _{ {\varvec{\phi }} }\sum _{k}\min \limits _{ {\user2{a}}^{(k)} } \left[ \, \sum _{i=1}^D \left( X^{(k)}_i - \big [{\varvec{\phi }} {\user2{a}}^{(k)}\big ]_i \right) ^2 \, \right] . \end{aligned}$$
(2)

When the underlying data exhibits heteroskedasticity, each term of the least-squares loss function should be weighted by the inverse variance, \(1/\sigma _{X_i}^2\), to prevent the algorithm from focusing on the reconstruction of the large-variance components of the input vector and to make the reconstruction error uniform. The same effect can be achieved by standardizing the input data \({\mathbf {X}}\) in the data preparation. The resulting optimization problem is mapped to a QUBO

$$\begin{aligned} H({\user2{h}}, J, {\user2{s}}) = \sum _i^{N_q} {h_i s_i} + \sum _{i<j}^{N_q} {J_{ij} s_i s_j }, \end{aligned}$$
(3)

through the transformation given below:

$$\begin{aligned} J = 2{\varvec{\phi }}^T{\varvec{\phi }}, \qquad {h}_i = -2\left[ {\varvec{\phi }}^T {\mathbf {X}}\right] _i + \left[ {\varvec{\phi }}^T{\varvec{\phi }}\right] _i, \qquad {\user2{s}} = 2{\user2{a}} -1. \end{aligned}$$
(4)

Note that the structure of the transformation is similar to the one used in sparse coding31,32,33, but the compression algorithm does not require the constraints \(\left[ {\varvec{\phi }}^T{\varvec{\phi }}\right] _i=1\) imposed in sparse coding.
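For concreteness, the construction of the optimization coefficients for a single input vector can be sketched as follows. This is a minimal NumPy sketch under our own naming, producing a dictionary-of-pairs QUBO compatible with D-Wave's Ocean samplers; the constant term \(\mathbf{X}^T\mathbf{X}\) is dropped since it does not affect the minimizer.

```python
import numpy as np

def qubo_from_phi(phi, x):
    """QUBO coefficients for the inner minimization of Eq. (2) at fixed phi.

    Expanding ||x - phi @ a||^2 with binary a (a_i^2 = a_i) gives
      linear terms:     -2 (phi^T x)_i + (phi^T phi)_ii
      quadratic terms:   2 (phi^T phi)_ij   for i < j,
    cf. Eq. (4), up to a constant offset.
    """
    G = phi.T @ phi                              # Gram matrix of the basis vectors
    lin = -2.0 * (phi.T @ x) + np.diag(G)
    Q = {(i, i): lin[i] for i in range(len(lin))}
    Nq = phi.shape[1]
    for i in range(Nq):
        for j in range(i + 1, Nq):
            Q[(i, j)] = 2.0 * G[i, j]
    return Q
```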

After obtaining the solution of \({\user2{a}}^{(k)}\) for a given \({\varvec{\phi }}\), we update \({\varvec{\phi }}\) using stochastic gradient descent on a classical computer. The optimizations of \({\user2{a}}^{(k)}\) and \({\varvec{\phi }}\) are iterated until they reach a stationary solution. The procedure can be summarized as follows.

  (1) Initialize \(\{{\user2{a}}^{(k)}\}\) and \({\varvec{\phi }}\) with random numbers or initial guesses.

  (2) Take a random mini-batch of size \(N_b\) from the N samples of \({\mathbf {X}}^{(k)}\).

  (3) Within the mini-batch, fix \({\varvec{\phi }}\) and find \(\{{\user2{a}}^{(k)}\}\) that minimizes Eq. (2).

  (4) Within the mini-batch, fix \(\{{\user2{a}}^{(k)}\}\) and update \({\varvec{\phi }}\) towards the optimum solution of Eq. (2) with a learning rate \(\eta\).

  (5) Repeat (2)–(4) until it reaches the minimum reconstruction error.

Here the mini-batch size \(N_b\) and the learning rate \(\eta\) control the convergence of the algorithm.
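The alternating procedure of steps (1)–(5) can be sketched as follows. This is a schematic NumPy implementation under our own naming: `solve_a` stands for any binary optimizer (exact, simulated annealing, or quantum annealing), and `fit_phi` for the classical least-squares update described later in the numerical experiments.

```python
import numpy as np

def train_compressor(X, Nq, solve_a, fit_phi, n_epochs=30, Nb=50, eta0=0.9, seed=0):
    """Alternating optimization of Eq. (2), following steps (1)-(5) above.

    X       : (N, D) array of input vectors
    solve_a : callable(phi, x) -> length-Nq binary array (annealer or exact solver)
    fit_phi : callable(X_batch, A_batch, phi) -> phi minimizing the batch error
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    phi = rng.normal(size=(D, Nq))                      # step (1): random initial basis
    A = rng.integers(0, 2, size=(N, Nq))                # step (1): random binary codes

    for epoch in range(n_epochs):
        eta = eta0 * 0.8 ** epoch                       # decaying learning rate
        batch = rng.choice(N, size=Nb, replace=False)   # step (2): mini-batch
        for k in batch:                                 # step (3): binary optimization
            A[k] = solve_a(phi, X[k])
        phi_tilde = fit_phi(X[batch], A[batch], phi)    # step (4): batch-optimal phi
        phi = phi + eta * (phi_tilde - phi)             # relaxed update toward phi_tilde
    return phi, A
```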

Bias correction

In many scientific applications, such as Monte Carlo simulations, our major concern is the expectation value of a function of the statistical variables \(\langle f({\mathbf {X}})\rangle\). With the samples \({\mathbf {X}}^{(k)}\), the expectation value is usually estimated by a simple average over k. When using the compressed data in \({\user2{a}}\)-space, however, the lossy compression introduces a reconstruction error \({\mathbf {X}}^{(k)} \ne {\varvec{\phi }} {\user2{a}}^{(k)} \equiv \tilde{{\mathbf {X}}}^{(k)}\). As a result, a simple average \(\frac{1}{N}\sum _{k=1}^N f(\tilde{{\mathbf {X}}}^{(k)})\) as an estimator of \(\langle f({\mathbf {X}})\rangle\) is biased.

An unbiased estimator \({\bar{O}}^{\text {BC}}\) can be defined by using a small portion of the original data \({\mathbf {X}}^{(k)}\):

$$\begin{aligned} {\bar{O}}^{\text {BC}} = \frac{1}{N}\sum _{k=1}^N f(\tilde{{\mathbf {X}}}^{(k)}) + \frac{1}{N_{\text {bc}}}\sum _{k=1}^{N_{\text {bc}}} \left( f({\mathbf {X}}^{(k)}) - f(\tilde{{\mathbf {X}}}^{(k)})\right) . \end{aligned}$$
(5)

Here the first term on the right hand side is a sloppy estimator of \(\langle f({\mathbf {X}})\rangle\), and the second term is a bias correction term that makes the estimator satisfy \(\langle {\bar{O}}^{\text {BC}} \rangle = \langle f({\mathbf {X}})\rangle\). Note that in the second term, we use the first \(N_{\text {bc}}\) samples out of the total N samples as the bias correction dataset, assuming the data samples are independent and identically distributed. Depending on the data characteristics, however, one could take maximally separated or randomly chosen \(N_{\text {bc}}\) samples for the bias correction dataset.

In addition to \(\{{\user2{a}}^{(k)}\}\) and \({\varvec{\phi }}\), the bias correction requires storing the \(N_{\text {bc}}\) samples of the original data \(\{{\mathbf {X}}^{(k)}\vert k=1,2,\ldots ,N_{\text {bc}}\}\). As explained in “Quality indicator for lossy-compression” section, the statistical error of \({\bar{O}}^{\text {BC}}\) induced by the bias correction term depends on \(N_{\text {bc}}\) and the correlation between \(f({\mathbf {X}}^{(k)})\) and \(f(\tilde{{\mathbf {X}}}^{(k)})\). For a good compression, which yields a high correlation between \(f({\mathbf {X}}^{(k)})\) and \(f(\tilde{{\mathbf {X}}}^{(k)})\), the bias \(f({\mathbf {X}}^{(k)}) - f(\tilde{{\mathbf {X}}}^{(k)})\) can be estimated precisely from a small number of samples, so one can take \(N_{\text {bc}} \ll N\). A similar structure of bias correction has been demonstrated in machine learning regressions on statistical data34,35.
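As an illustration, Eq. (5) can be evaluated directly from the reconstructed samples and the stored subset. The sketch below assumes NumPy arrays and, as in the text, that the stored originals correspond to the first \(N_{\text {bc}}\) reconstructions; the function name is ours.

```python
import numpy as np

def bias_corrected_mean(f, X_rec, X_orig_bc):
    """Bias-corrected estimator of <f(X)>, Eq. (5).

    X_rec     : (N, D) reconstructions  X~^(k) = phi @ a^(k)
    X_orig_bc : (N_bc, D) stored original samples, matching X_rec[:N_bc]
    f         : function mapping a (D,) vector to a scalar (or array)
    """
    N_bc = len(X_orig_bc)
    sloppy = np.mean([f(x) for x in X_rec], axis=0)                 # first term of Eq. (5)
    bias = np.mean([f(x) - f(xr) for x, xr in zip(X_orig_bc, X_rec[:N_bc])], axis=0)
    return sloppy + bias                                            # unbiased estimate
```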

In the calculation of the statistical error of \({\bar{O}}^{\text {BC}}\), the correlation between the sloppy estimator and the bias correction term should be taken into account. A simple approach is to bin the data so that each bin contains a fixed number of bias correction samples and the data in different bins are uncorrelated with each other. Assuming that the number of bins \(N_{\text {bin}}\) divides N and \(N_{\text {bc}}\), Eq. (5) can be rewritten as

$$\begin{aligned} {\bar{O}}^{\text {BC}}= \frac{1}{N_{\text {bin}}} \sum _{i=1}^{N_{\text {bin}}} \left[ \frac{1}{M}\sum _{k=(i-1)M+1}^{iM} f({\tilde{{\mathbf {X}}}}^{(k)}) + \frac{1}{M_{\text {bc}}}\sum _{k=(i-1)M+1}^{(i-1)M+M_{{\text {bc}}}} \left( f({\mathbf {X}}^{(k)}) - f({\tilde{{\mathbf {X}}}}^{(k)})\right) \right] \end{aligned}$$
(6)
$$\begin{aligned}&\equiv \frac{1}{N_{\text {bin}}} \sum _{i=1}^{N_{\text {bin}}} {\bar{O}}^{\text {BC},b}_i, \end{aligned}$$
(7)

where \(M = N/N_{\text {bin}}\) and \(M_{\text {bc}} = N_{\text {bc}}/N_{\text {bin}}\). In this rearrangement, the statistical error of \({\bar{O}}^{\text {BC}}\) can be calculated by \(\sigma _{{\bar{O}}^{\text {BC}}} = \sigma _{{\bar{O}}^{\text {BC},b}}/\sqrt{N_{\text {bin}}}\). Again, note that the first \(M_{\text {bc}}\) samples in each bin are used for the bias correction, but one can take maximally separated or randomly chosen samples for the bias correction dataset, depending on the characteristics of the data.
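A sketch of the binned estimate of Eqs. (6)–(7) and its statistical error is given below, assuming pre-evaluated values of f and the convention that the first \(M_{\text {bc}}\) samples of each bin are the stored originals; the function name is ours.

```python
import numpy as np

def binned_bias_corrected(f_rec, f_orig, N_bin):
    """Binned bias-corrected mean and its error, Eqs. (6)-(7).

    f_rec  : (N,) values f(X~^(k)) for all reconstructed samples
    f_orig : (N_bc,) values f(X^(k)) for the stored originals; the first
             M_bc = N_bc/N_bin samples of each bin (N_bin must divide N and N_bc)
    """
    N, N_bc = len(f_rec), len(f_orig)
    M, M_bc = N // N_bin, N_bc // N_bin
    O_bins = np.empty(N_bin)
    for i in range(N_bin):
        sloppy = f_rec[i * M:(i + 1) * M].mean()
        bias = (f_orig[i * M_bc:(i + 1) * M_bc] - f_rec[i * M:i * M + M_bc]).mean()
        O_bins[i] = sloppy + bias                     # per-bin estimator O^{BC,b}_i
    err = O_bins.std(ddof=1) / np.sqrt(N_bin)         # sigma_{O^{BC,b}} / sqrt(N_bin)
    return O_bins.mean(), err
```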

Quality indicator for lossy-compression

To measure the quality of the lossy compression of statistical data, we define \(Q^2\) as

$$\begin{aligned} Q^2 \equiv \frac{1}{D} \sum _{i=1}^D \frac{\sigma _{X_i-{\tilde{X}}_i}^2}{\sigma _{X_i}^2}, \end{aligned}$$
(8)

where \(\sigma ^2_{X_i-{\tilde{X}}_i}\) is the variance of \({X_i-{\tilde{X}}_i}\). This parameter indicates the statistical error increase due to the lossy compression after the bias correction, as follows. Consider a simple bias-corrected average of independent observables

$$\begin{aligned} \bar{{\mathbf {X}}}^{\text {BC}} = \frac{1}{N}\sum _{k=1}^N \tilde{{\mathbf {X}}}^{(k)} + \frac{1}{N_{\text {bc}}}\sum _{k=1}^{N_{\text {bc}}} \left( {\mathbf {X}}^{(k)} - \tilde{{\mathbf {X}}}^{(k)}\right) . \end{aligned}$$
(9)

The variance of the i-th component of \(\bar{{\mathbf {X}}}^{\text {BC}}\) can be approximated as

$$\begin{aligned} \sigma ^2_{{\bar{X}}_i^{\text {BC}}}&\approx \frac{1}{N} \sigma ^2_{{\tilde{X}}_i} + \frac{1}{N_{\text {bc}}}\sigma ^2_{X_i-{\tilde{X}}_i} \end{aligned}$$
(10)
$$\begin{aligned}&\approx \frac{\sigma ^2_{X_i}}{N} \left( 1 + \frac{N}{N_{\text {bc}}} \frac{\sigma ^2_{X_i-{\tilde{X}}_i}}{\sigma ^2_{X_i}} \right) , \end{aligned}$$
(11)

where the first approximation assumes a small correlation between the two terms in Eq. (9), and the second approximation assumes a good lossy-compression that gives \(\sigma ^2_{{\tilde{X}}_i} \approx \sigma ^2_{X_i}\). Assuming a small reconstruction error satisfying \(\sigma ^2_{X_i-{\tilde{X}}_i} /\sigma ^2_{X_i} \ll N/N_{\text {bc}}\), the expected statistical error increase due to the bias correction can be estimated as

$$\begin{aligned} \frac{\sigma _{{\bar{X}}_i^{\text {BC}}}}{\sigma _{{\bar{X}}_i}} \approx 1 + \alpha \frac{N}{2N_{\text {bc}}} \frac{\sigma ^2_{X_i-{\tilde{X}}_i}}{\sigma ^2_{X_i}}, \end{aligned}$$
(12)

where \(\alpha =1\) and \(\sigma ^2_{{\bar{X}}_i} = \sigma ^2_{X_i}/N\). It shows that the increase of the statistical error compared to that of the original data is proportional to the ratio of the total number of samples to the number of bias correction samples, \(N/N_{\text {bc}}\), and to the normalized variance of the reconstruction error, \(\sigma ^2_{X_i-{\tilde{X}}_i} /\sigma ^2_{X_i}\). Hence, we define the quality of the compression by averaging \(\sigma ^2_{X_i-{\tilde{X}}_i} /\sigma ^2_{X_i}\) over all vector elements, as given in Eq. (8).

Note that, when the data have autocorrelation, the bias correction dataset can be chosen to have a smaller autocorrelation than the original data by taking widely separated samples along the trajectory direction of the autocorrelation. This makes the bias correction more efficient, suppresses the statistical error increase, and yields \(\alpha <1\).
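In code, Eq. (8) amounts to a per-component ratio of residual variance to data variance; a small NumPy sketch with an assumed function name:

```python
import numpy as np

def compression_quality(X, X_rec):
    """Q^2 of Eq. (8): average over components of var(X_i - X~_i) / var(X_i)."""
    var_res = np.var(X - X_rec, axis=0, ddof=1)   # per-component residual variance
    var_dat = np.var(X, axis=0, ddof=1)           # per-component data variance
    return float(np.mean(var_res / var_dat))
```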

Boosting

In practice, the binary optimization in Eq. (2) is difficult to solve for large \(N_q\). Even when a quantum annealer is employed to solve the optimization problem, the maximum number of fully connected qubits is limited to \({\mathcal {O}}(100)\) on current quantum processors. However, the problem can be decomposed into a sequence of smaller-\(N_q\) problems by applying the idea of boosting36,37.

Assume that we have a matrix \({\varvec{\phi }}_1\) and vectors \(\{{\user2{a}}_1^{(k)}\}\) of \(N_{q1}\) binary elements that approximately reconstruct the input vectors, \({\mathbf {X}}^{(k)} \approx {\varvec{\phi }}_1 {\user2{a}}^{(k)}_1\). We can then find another set of solutions, \({\varvec{\phi }}_2\) and \(\{{\user2{a}}_2^{(k)}\}\) of \(N_{q2}\) binary elements, that reconstructs the residual of \({\varvec{\phi }}_1 {\user2{a}}^{(k)}_1\) by taking \({\mathbf {X}}^{(k)} - {\varvec{\phi }}_1 {\user2{a}}^{(k)}_1\) as new input vectors. By combining the two sets of solutions, we can build a precise reconstruction of \({\mathbf {X}}^{(k)}\) as

$$\begin{aligned} {\mathbf {X}}^{(k)} \approx {\varvec{\phi }}_1 {\user2{a}}^{(k)}_1 + {\varvec{\phi }}_2 {\user2{a}}^{(k)}_2 = \begin{pmatrix} {\varvec{\phi }}_1 &{}\quad 0 \\ 0 &{}\quad {\varvec{\phi }}_2 \end{pmatrix} \begin{pmatrix} {\user2{a}}^{(k)}_1 \\ {\user2{a}}^{(k)}_2 \end{pmatrix}, \end{aligned}$$
(13)

where the total number of binary coefficients representing an input vector \({\mathbf {X}}^{(k)}\) is \(N_{q1}+N_{q2}\). This procedure can be repeated for an arbitrary number of sets of solutions. For an ideal binary optimizer, decomposing the problem into smaller numbers of binary elements yields a worse solution than the full problem, because the decomposition ignores the correlation between the different sets, which could otherwise be exploited to reduce the reconstruction error. For realistic binary optimizers, however, boosting can provide a better solution.
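A boosting sketch is given below; `fit_block` stands for a single-block compressor such as the alternating optimization sketched earlier, and the function names are ours.

```python
import numpy as np

def boosted_compression(X, Nq_list, fit_block):
    """Boosting of Eq. (13): each block compresses the residual of the previous ones.

    Nq_list   : e.g. [32, 32] for N_q1 = N_q2 = 32
    fit_block : callable(residual, Nq) -> (phi_s, A_s) with shapes (D, Nq), (N, Nq)
    """
    residual = np.array(X, dtype=float)
    blocks = []
    for Nq_s in Nq_list:
        phi_s, A_s = fit_block(residual, Nq_s)    # compress the current residual
        residual = residual - A_s @ phi_s.T       # subtract this block's reconstruction
        blocks.append((phi_s, A_s))
    return blocks   # full reconstruction: sum over blocks of A_s @ phi_s.T
```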

Numerical experiments

Test data

In this study, we use Monte Carlo simulation data from lattice QCD, the theory of quarks, gluons, and their interactions. Lattice QCD simulations produce large amounts of data that need to be stored for analysis, but the data are correlated with each other, so a data compression algorithm exploiting the correlation can achieve better compression. Among various lattice QCD observables, we use the three-point correlation function data of the nucleon vector and axial-vector (axial) charges, which describe the response of a nucleon to particles such as the neutrino. For most of the test cases, we shape the data into 10 independent sets of 3200 vectors with 16 components, so each dataset has \(N=3200\) and \(D=16\). As illustrated in Fig. 1, there are strong correlations between the 16 components. Since the vector data show a stronger correlation than the axial-vector data, we expect the proposed algorithm to give a better compression (smaller \(Q^2\)) for the vector data than for the axial-vector data. We standardize the data as a pre-processing step to obtain a homogeneous reconstruction error on all 16 components.

Figure 1
figure 1

Correlation pattern of the 16 components of the vector (left) and the axial-vector (right) data. Red indicates high correlation (correlation coefficient = 1), and white indicates no correlation.

Experiments with simulated annealing

First, we demonstrate the proposed compression algorithm with simulated annealing38,39 on classical computers, employing the simulated annealing sampler implemented in D-Wave’s Ocean library40. In the simulated annealing, we take the minimum energy solution from 150 runs (num_reads=150), with all other parameters set to their default values. As described in supplementary section 1, we find that simulated annealing with num_reads=150 gives solutions close to the exact solution up to around \(N_q=20\), and the quality of the solution deteriorates as \(N_q\) is increased further.
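The per-vector binary optimization with the Ocean simulated annealing sampler can be sketched as follows, using the `neal` package and the `qubo_from_phi` helper sketched in the “Data compression” section; the wrapper name is ours.

```python
import numpy as np
from neal import SimulatedAnnealingSampler

def solve_a_sim_ann(phi, x, num_reads=150):
    """Minimum-energy binary code from num_reads simulated annealing runs."""
    Q = qubo_from_phi(phi, x)                     # QUBO of Eq. (4) for this vector
    sampleset = SimulatedAnnealingSampler().sample_qubo(Q, num_reads=num_reads)
    best = sampleset.first.sample                 # lowest-energy read
    return np.array([best[i] for i in range(phi.shape[1])])
```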

To find the solutions \(\{{\user2{a}}^{(k)}\}\) and \({\varvec{\phi }}\) of the optimization problem in Eq. (2), we alternate the optimizations of \({\varvec{\phi }}\) and \(\{{\user2{a}}^{(k)}\}\) as described in “Data compression” section. In this study, \({\varvec{\phi }}\) is updated as follows. After obtaining the solution \(\tilde{{\varvec{\phi }}}\) that minimizes the reconstruction error of the mini-batch using the L-BFGS-B algorithm41, we update \({\varvec{\phi }}\) as

$$\begin{aligned} {\varvec{\phi }} \leftarrow {\varvec{\phi }} + \eta (\tilde{{\varvec{\phi }}}-{\varvec{\phi }}). \end{aligned}$$
(14)

Here the learning rate \(\eta\) is continuously decreased from the initial value \(\eta _0\) as the number of training epochs (\(n_{\text {epoch}}\)) is increased, following \(\eta = \eta _0 \times 0.8^{n_{\text {epoch}}}\). For the batch size and initial learning rate, we use \(N_b=50\) and \(\eta _0=0.9\), as we find that these give the best or close to the best results after exploring a grid of \(N_b\) and \(\eta _0\). The final results are obtained after 30 training epochs.
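A possible realization of this update step with SciPy's L-BFGS-B is sketched below, under our own naming; the analytic gradient follows directly from the quadratic loss of Eq. (2).

```python
import numpy as np
from scipy.optimize import minimize

def fit_phi_lbfgsb(X_batch, A_batch, phi_init):
    """phi_tilde of Eq. (14): the phi minimizing the mini-batch reconstruction error."""
    D, Nq = phi_init.shape

    def loss_and_grad(phi_flat):
        phi = phi_flat.reshape(D, Nq)
        R = X_batch - A_batch @ phi.T             # residuals, shape (Nb, D)
        loss = np.sum(R ** 2)
        grad = -2.0 * R.T @ A_batch               # d(loss)/d(phi), shape (D, Nq)
        return loss, grad.ravel()

    res = minimize(loss_and_grad, phi_init.ravel(), jac=True, method="L-BFGS-B")
    return res.x.reshape(D, Nq)
```

The relaxed update of Eq. (14) then interpolates between the current \({\varvec{\phi }}\) and this batch optimum with the learning rate \(\eta\).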

To compare with the compression performance of the proposed algorithm, we study conventional data compression algorithms based on principal component analysis (PCA) and a neural-network autoencoder. PCA finds orthogonal directions that maximize the variance as principal components. By saving only the coefficients of the first few principal components, PCA works as a lossy data compression algorithm. We reconstruct the data from the first \(N_z\) principal components to obtain the data compression. An autoencoder also provides data compression by constraining the number of codes (\(N_z\)) to a small number42,43. We use a fully connected neural-network encoder and decoder with three hidden layers, of sizes \((D, 128, 64, 32, N_z)\) and \((N_z, 32, 64, 128, D)\), respectively, with rectified linear unit (ReLU) activation functions. For the training, we use the Adam optimizer44 implemented in the PyTorch Python library45 with a learning rate of 0.01 and a batch size of 3200, the optimal hyperparameters determined from a grid search. After 5000 epochs of training, we continue training until we reach a reconstruction error better than the best obtained in the first 5000 epochs, and then stop.
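The comparison autoencoder can be written compactly in PyTorch. The sketch below uses the quoted layer sizes and ReLU activations; details such as the linear output layers are our assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Fully connected autoencoder with hidden layers (128, 64, 32) and Nz codes."""
    def __init__(self, D, Nz):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(D, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, Nz),
        )
        self.decoder = nn.Sequential(
            nn.Linear(Nz, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, D),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# training sketch: full-batch (3200) Adam with lr = 0.01 and an MSE loss
# model = Autoencoder(D=16, Nz=2)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# loss_fn = nn.MSELoss()
```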

The results are summarized in Table 1 and Fig. 2. The results show that the boosting approach gives better results than the full calculation for \(N_q>32\), where the simulated annealing fails to find close-to-ground-state solutions. The comparison between different algorithms shows that the autoencoder outperforms the PCA, and the proposed binary compression outperforms the autoencoder. The compression quality (\(Q^2\)) of the autoencoder with \(N_z\) codes can be matched by the proposed binary compression algorithm with roughly \(N_q\approx 10N_z\) bits. Considering single-precision floating-point numbers, which occupy 32 bits each, the proposed algorithm provides the same quality of compression as the autoencoder approach using about 3–3.5 times less memory.

Table 1 \(Q^2\), defined in Eq. (8), of the binary compression (BC) algorithm we propose with \(N_q\) qubits (left) and classical approaches of the principal component analysis (PCA) and autoencoder (AE) with \(N_z\) codes (right) for the vector and axial-vector data. The results with \(N_q = N_{q1}+N_{q2}\) show the compression with the boosting explained in “Boosting” section. Results are averaged over 10 independent sets, and the errors are calculated as the standard deviation of the mean. A smaller \(Q^2\) indicates a better reconstruction.
Figure 2
figure 2

\(Q^2\), defined in Eq. (8), as a function of the number of storing bits for data dimension \(D=16\). For the principal component analysis (PCA) and autoencoder (AE) approaches, the number of storing bits is calculated as \(32\times N_z\), assuming single-precision floating-point numbers. For the binary compression algorithm with \(N_q \ge 48\), we use the boosting approach with \(N_{q1} = N_{q2} = N_q/2\).

We also carry out a scaling test by observing the data compression performance for different dimensions of the lattice QCD simulation data, \(D=8, 16, 32\), and 48. In this test, the data correlation pattern is kept the same as the one described in Fig. 1: the D components can be decomposed into four blocks, and the data points within a block are more strongly correlated than those in different blocks. Results are presented in Fig. 3. The two upper plots show that the compression algorithm becomes more efficient, i.e., gives a smaller reconstruction error at a fixed compression rate, as the data dimension becomes larger. This is because higher-dimensional data have more correlated components that can be exploited for efficient data compression. The two bottom plots show that the data compression performance of the binary compression algorithm with \(N_q=32\) is similar to that of the autoencoder with \(N_z=3{-}3.5\) for all data dimensions. These results support that, assuming single-precision floating-point numbers for the autoencoder codes, the binary compression is about 3–3.5 times more efficient than the autoencoder. Note, however, that this performance gain depends on the correlation pattern of the data.

Figure 3
figure 3

Scaling of \(Q^2\) for various dimensions D of the vector (left) and axial-vector (right) data. Upper figures show the results using the binary compression algorithm at fixed compression rates (\(N_q = D\) and \(N_q = 2D\)), and the bottom figures show the comparison between the binary compression and autoencoder algorithms at fixed sizes in the compressed space (\(N_z=1,2,3\) for the autoencoder, and \(N_q=32\) for the binary compression). For the binary compression algorithm with \(N_q \ge 48\), we use the boosting approach with \(N_{q1} = N_{q2} = N_q/2\).

Experiments with D-Wave 2000Q

To verify the usability of existing quantum hardware for the proposed compression algorithm, we carry out the \({\user2{a}}^{(k)}\)-optimization of Eq. (2) using the D-Wave 2000Q quantum processor with the \({\varvec{\phi }}\) obtained using simulated annealing as described in “Experiments with simulated annealing” section. The major issue with the D-Wave quantum annealer is that the h and J parameters, even though they are specified as double-precision floating-point numbers in the program, have poor precision when implemented on the D-Wave QPU due to the integrated control errors46. Since the parameters are normalized by the maximum absolute value of h and J, small-value elements are more sensitive to the integrated control errors. Furthermore, when there are large scale variations in the h and J parameters, the information in the small-value elements can be washed out by the errors, and the annealing results may differ from the true solution of the original problem.

In the data compression problem, to minimize the effect of this fidelity loss on the final results, we restrict the maximum absolute value of the matrix elements of \({\varvec{\phi }}\) to 1. This prevents a large maximum absolute value of h and J, which would introduce a large distortion of the small-value elements. Due to the limited D-Wave access time, we carry out the study only for one set of \(N=200\) samples. For this study, we obtain solutions to the QUBO problem on the D-Wave 2000Q quantum hardware drawn from 5000 reads for each of 20 different chain strengths within the range (2.0, 3.0). Results are taken from the lowest-energy points among the overall \(5000 \times 20\) solutions of the 2000Q machine. The embedding procedure is repeated only for a new input and kept unchanged as the chain strength is varied. In a control run, we find that even if one computes a new embedding each time the chain strength changes, the final results do not differ from the method described above; however, the latter approach, which requires a new embedding for each chain strength, needs more pre-processing time.
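The chain-strength scan described above can be sketched with the Ocean SDK as follows. The solver access, the fixed-embedding reuse, and the `qubo_from_phi` helper are our assumptions based on the procedure in the text, not a verbatim reproduction of the production code.

```python
import numpy as np
import dimod
from minorminer import find_embedding
from dwave.system import DWaveSampler, FixedEmbeddingComposite

def solve_a_dwave(phi, x, chain_strengths=np.linspace(2.0, 3.0, 20), num_reads=5000):
    """Lowest-energy code over all reads and chain strengths, one embedding per input."""
    bqm = dimod.BinaryQuadraticModel.from_qubo(qubo_from_phi(phi, x))
    qpu = DWaveSampler()                                            # e.g. a 2000Q solver
    embedding = find_embedding(list(bqm.quadratic), qpu.edgelist)   # reused below
    sampler = FixedEmbeddingComposite(qpu, embedding)

    best, best_energy = None, np.inf
    for cs in chain_strengths:
        ss = sampler.sample(bqm, num_reads=num_reads, chain_strength=float(cs))
        if ss.first.energy < best_energy:                           # keep overall minimum
            best, best_energy = ss.first.sample, ss.first.energy
    return np.array([best[i] for i in range(phi.shape[1])])
```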

Table 2 shows the \(Q^2\) values of the binary compression algorithm on the D-Wave 2000Q in comparison with the simulated annealing optimizer. When \(N_q \le 16\), D-Wave shows performance similar to simulated annealing, but when \(N_q > 16\), D-Wave performs worse. With simulated annealing, the reconstruction error, represented by \(Q^2\), decreases as \(N_q\) is increased, but no significant decrease of \(Q^2\) is observed on the D-Wave for \(N_q > 32\) compared to the results at \(N_q=32\). As expected, constraining \({\text {max}}(|{\varvec{\phi }}_{ij}|)=1\) improves the D-Wave results for \(N_q \ge 32\), but they are still worse than those from simulated annealing. Note that, due to the limited D-Wave access time, the results were obtained with a fixed \({\varvec{\phi }}\) from simulated annealing; hence, they compare the optimization performance for a given problem. If \({\varvec{\phi }}\) were obtained directly from the D-Wave quantum annealer, however, constraints suited to the hardware limitations would have been imposed naturally, which might have resulted in a better compression performance than that with the \({\text {max}}(|{\varvec{\phi }}_{ij}|)=1\) constraint. Note also that the quality of the solutions from the D-Wave quantum annealer depends on the nature of the data: in general, we expect better quantum annealing results for a dataset that gives smaller scale variations in the h and J parameters through Eq. (4).

Table 2 \(Q^2\) values of the binary compression algorithm on the D-Wave 2000Q quantum annealer (D-Wave) and the simulated annealing (Sim.Ann.) with (\({\text {max}}(|{\varvec{\phi }}_{ij}|)=1\)) and without (Free \({\varvec{\phi }}\)) the constraint on the elements of \({\varvec{\phi }}\). Results are obtained from a set of \(N=200\) vector and axial-vector data. Numbers in parentheses are the statistical errors of the 200 samples estimated by the bootstrap method47.

Comparison of D-Wave 2000Q with Advantage systems

We benchmark the D-Wave Advantage system against the 2000Q using the \({\user2{a}}^{(k)}\)-optimization problem in Eq. (2). For the axial and vector data, we compute the cumulative distribution function (CDF) of the normalized reconstruction error for problem sizes \(N_q=32\) and 60. To minimize possible biases due to a specific choice of embedding, we employ the heuristic embedding solvers provided by D-Wave to find an embedding for each configuration and collect at least 1500 samples per configuration. The chain strength during embedding is set to the maximal coupling in absolute value, multiplied by a hyperparameter which we call the chain strength multiple. In practice, the number of physical qubits is many times larger than the number of logical qubits required, due to the limited hardware connectivity. There are cases where the physical qubits that are strongly coupled to behave as one logical qubit return different values; we discard these samples as non-viable solutions in the calculation of the distribution function. (For the axial data, \(N_q=32\): 150 qubits on Advantage / 350 qubits on 2000Q; \(N_q=60\): 600 qubits on Advantage / 1600 qubits on 2000Q. For the vector data, \(N_q=32\): around 180 qubits on Advantage / 380 qubits on 2000Q; \(N_q=60\): 600 qubits on Advantage / 1400 qubits on 2000Q.) As the number of qubits increases, the fraction of feasible samples decreases, and the fraction of the CDF with small reconstruction error also decreases. By trial and error, we find that setting the chain strength multiple to a value greater than 1 reduces the number of viable solutions from the Advantage system but improves the results from the 2000Q. For the samples collected for the axial data, we set it to 0.8 and 1.6 for the Advantage system and the 2000Q, respectively.
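The feasibility filter and CDF construction can be sketched as follows. The chain-break information attached to the sampleset by the embedding composite, the per-read iteration, and the normalization of the reconstruction error are our assumptions.

```python
import numpy as np

def reconstruction_error_cdf(sampleset, phi, x):
    """CDF of the normalized reconstruction error over feasible (unbroken-chain) samples."""
    errors = []
    for rec in sampleset.record:
        if rec.chain_break_fraction > 0:          # discard non-viable samples
            continue
        a = np.asarray(rec.sample, dtype=float)   # binary code in variable order
        err = np.sum((x - phi @ a) ** 2) / np.sum(x ** 2)   # assumed normalization
        errors.extend([err] * int(rec.num_occurrences))
    errors = np.sort(np.asarray(errors))
    cdf = np.arange(1, len(errors) + 1) / len(errors)
    return errors, cdf
```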

Figure 4
figure 4

Cumulative distribution function (CDF) of the normalized reconstruction error from all feasible samples obtained from the D-Wave 2000Q (red) and Advantage system (blue) for the axial-vector data. About 50% and 38% of the samples were feasible from the D-Wave 2000Q and Advantage for \(N_q=32\), respectively. For \(N_q=60\), the fractions were about 51% and 18%, respectively.

As can be seen from Figs. 4 and 5, when \(N_q=32\), both processors perform rather well, and the newer Advantage system has better statistics and an overall higher quality of suboptimal solutions. In the case of \(N_q=60\), the difference between the two processors becomes less distinct, and the CDF is peaked on solutions with high reconstruction error. As we did not apply boosting in these experiments, the quality degradation with increasing number of qubits can be ascribed to the limited connectivity of the hardware.

Figure 5
figure 5

Cumulative distribution function (CDF) of the normalized reconstruction error from all feasible samples obtained from the D-Wave 2000Q (red) and Advantage system (blue) for the vector data. About 95% and 91% of the samples were feasible from the D-Wave 2000Q and Advantage for \(N_q=32\), respectively. For \(N_q=60\), the fractions were about 63% and 73%, respectively.

Discussion

In this paper, we presented a new lossy compression algorithm for statistical data based on representation learning with binary variables. The algorithm finds a set of basis vectors, common to all data, and the \(N_q\) binary coefficients for each D-dimensional input vector that precisely reconstruct it. The algorithm provides data compression because the \(N_q\)-dimensional binary representation requires much less storage space than the original D-dimensional floating-point data. We also presented a bias correction procedure estimating the errors due to the inexact reconstruction of the lossy compression in “Bias correction” section. The compression algorithm was applied to two lattice QCD datasets in “Numerical experiments” section. With simulated annealing, the binary compression algorithm was able to achieve the same quality of reconstruction with 3–3.5 times less storage than the neural-network autoencoder. The binary optimization carried out on the D-Wave 2000Q for the compression problems showed promising results, but the performance was limited by the integrated control error of the D-Wave QPU, which introduces large uncertainties in the h and J parameters. The comparison of the D-Wave 2000Q and Advantage systems showed that the Advantage is more efficient than the 2000Q in obtaining low-energy solutions.

The proposed compression algorithm is a natural outlier detector because input data with large reconstruction errors can be marked anomalous48. Using the proposed algorithm, furthermore, many operations that need to be performed on the floating-point numbers \({\mathbf {X}}^{(k)}\) can be replaced by operations on the single-bit coefficients \({\user2{a}}^{(k)}\) with much smaller computational cost, because the relationship between \({\mathbf {X}}^{(k)}\) and \({\user2{a}}^{(k)}\) is linear (\({\mathbf {X}}^{(k)} \approx {\varvec{\phi }} {\user2{a}}^{(k)}\)), and the single-bit coefficients satisfy \(\left( a_j^{(k)}\right) ^n = a_j^{(k)}\) for any n, which simplifies power operations. Here are two examples of the operations in the compressed space:

  • Sum of vectors

    $$\begin{aligned} \sum _{k=1}^N {\mathbf {X}}^{(k)}&\approx \sum _{k=1}^N {\varvec{\phi }} {\user2{a}}^{(k)} = {\varvec{\phi }} \left( \sum _{k=1}^N {\user2{a}}^{(k)}\right) , \end{aligned}$$
    (15)
  • Sum of \(l^2\)-norm squares

    $$\begin{aligned} \sum _{k=1}^N ||{\mathbf {X}}^{(k)}||^2&\approx \sum _{k=1}^N \sum _{i=1}^D\left( \sum _{j=1}^{N_q} \phi _{ij} a_j^{(k)} \right) ^2 \nonumber \\&= \sum _{i=1}^D \left[ \sum _{j=1}^{N_q} \phi _{ij}^2 \left( \sum _{k=1}^N a_j^{(k)} \right) + 2\sum _{l<m} \left( \sum _{k=1}^N a_l^{(k)} a_m^{(k)}\right) \phi _{il} \phi _{im} \right] . \end{aligned}$$
    (16)

The cost reduction is maximized when \(D, N_q \ll N\), which is a typical case of many statistical datasets.
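The two examples above translate into a few matrix operations on the bit matrix A, whose rows are the codes \({\user2{a}}^{(k)}\); a NumPy sketch with assumed function names:

```python
import numpy as np

def compressed_sum(phi, A):
    """Sum of vectors, Eq. (15): accumulate the bits first, apply phi once."""
    return phi @ A.sum(axis=0)

def compressed_sq_norm_sum(phi, A):
    """Sum of l2-norm squares, Eq. (16), using a_j^2 = a_j."""
    c = A.sum(axis=0)                   # sum_k a_j^(k)
    C = A.T @ A                         # sum_k a_l^(k) a_m^(k)
    G = phi.T @ phi                     # sum_i phi_il phi_im
    diag = float(np.sum(np.diag(G) * c))
    offdiag = 2.0 * float(np.sum(np.triu(C * G, k=1)))
    return diag + offdiag
```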

In this study, we presented only the results with \({\varvec{\phi }}\) calculated from the whole dataset. In general, however, a \({\varvec{\phi }}\) obtained from a smaller subset of the data provides reasonably good compression performance. When using a \({\varvec{\phi }}\) obtained from a subset, some unseen data vectors could yield large reconstruction errors. To control the error and maintain the quality of the compression, one needs to define a threshold and save the original data whenever a vector gives a reconstruction error larger than the threshold.