Introduction

Artificial neural networks are the computational models composed of interconnected nodes, and can ‘learn’ to deal with complicated tasks such as image feature recognition, language translation, medical diagnosis, etc., through ‘training’ the parameters1,2,3,4. The optical neural networks (ONNs) can perform the function of the artificial neural networks by using optical elements. They have drawn much attentions in recent years because of the potential to go beyond their electrical counterparts. The advantages of the ONNs include the low crosstalk, neglectable time delay in propagation, low heat generation, etc.5,6. Especially, they are expected to break the bandwidth limits of the electrical neural networks, achieving ultrahigh computing frequency enabled by THz-wide telecommunications band7. Also, the ONNs can get rid of the troubles caused by the Von Neumann bottleneck, avoiding the restrictions rooted in the energy and time consumption when reading and transmitting data from the memories8. With these benefits, the ONNs are proven to perform the image processing9,10,11,12,13,14,15,16, hand-written digits recognition17,18,19,20,21, and many other tasks22,23,24,25,26. The related techniques are also found to be applicable for logic27 or matrix28,29,30 operations. However, all the ONNs at present cannot exhibit a speedup as powerful as that indicated by the models such as quantum neural networks31,32,33,34,35,36,37,38,39, because they are implemented in a straightforward manner that closely follows the traditional neural networks without incorporating an algorithmic advances. With the proliferation of the data generated by the daily communication, the traditional neural networks must contain millions of parameters in order to capture the feature of the data, so that the training of them for practical use is becoming more and more consumptive. Hence, the same obstacle will also occur in the current development direction of the ONNs. So, how to build a new ONN which could provide an algorithmic speed-up is highly demanded by the social needs and generally challenging.

On the other hand, the recent experimental progress towards the realization of quantum information processors has led to the emerged research field of quantum neural networks31. Excitingly, the advance of the quantum machine learning would fulfill the needs for data processing with an algorithmic speed up. By far, several quantum machine learning algorithms are proposed, including the quantum generative network34, quantum Boltzmann machine35, quantum transfer learning38, etc. Very recently, the quantum convolutional neural network (QCNN) is theoretically constructed37. Such a network has displayed its unique property for identifying the symmetry of quantum states. More importantly, it has been numerically shown that the convergence of a QCNN model is faster than the traditional convolutional neural network (CNN) model in the task of classifying the classical data39, which would greatly contribute to the application in daily life. However, the realization of the QCNN requires a large amount of the 2-qubit gates and the sufficient long coherence time of the multi-qubit system, which is technically hard for the current quantum devices. So, the demonstration of the advantageous quantum machine learning models like QCNN on the quantum platforms has not been given.

Inspired by the theoretical model of the QCNN, in this work, we introduce the optical correlation to the design of ONNs and propose the correlated optical convolutional neural networks (COCNNs). Such a new type of ONNs exhibits the ability to achieve the “quantum speed up” as effective as that shown by the QCNN. Such a speed-up is verified from two aspects. For the first, the lost function of the COCNN has shown a faster convergence behavior in the classification tasks we consider when compared to that of the traditional CNN model. The results coincides with behavior of the QCNN39. For the second, the COCNN can realize the function of the QCNN, such as the recognition of the Haldane phase. This further validates the close relation between the COCNN and the QCNN. Last but not least, we take the 3-qubit phase recognition circuit as an example and perform an experimental realization of its function using the framework of the COCNN. The results fit well with the theoretical calculation. It means that our proposal reveals a new way of realizing the ONNs with the “quantum speedup”, which will benefit the information processing in the coming years.

Results

The structure of a COCNN and its performances

The sketch of a general COCNN is shown in Fig. 1. It is composed of four basic parts: the correlated light source, the convolution, the pooling, and the detections. As the most basic component, we firstly introduce the part of the correlated light source. This setup is used for encoding the information. Different from the previous ONNs9,13,15 which only employ the amplitude of the light for the encoding, we employ the correlation of the multimode polarized beams to encode data. In fact, using such a special type of classical beams, one can obtain the classical analogy of the quantum correlations between qubits. For example, a polarized beam field \({\boldsymbol{E}}{\boldsymbol{=}}\alpha {\boldsymbol{h}}\,{+}\,\beta {\boldsymbol{v}}\), with \({\boldsymbol{h}}\) (\({\boldsymbol{v}}\)) being the horizontal (vertical) polarization vector, is analog to the qubit state \(\alpha {\boldsymbol{|}}0{\rangle }\,{+}\,\beta |1{\rangle }\) under the mapping \({\boldsymbol{h}}{\boldsymbol{\to }}{\boldsymbol{|}}0{\boldsymbol{\rangle }}\) and \({\boldsymbol{v}}{\boldsymbol{\to }}{\boldsymbol{|}}1{\boldsymbol{\rangle }}\). Here, \(\alpha\) and \(\beta\) represent the complex coefficients of the polarization basis. In what follows, we use the notation | ) to denote the classical states analog to the qubit states, e.g., \(\left|{\boldsymbol{E}}\right)=\alpha \left|{\boldsymbol{h}}\right)\,{\boldsymbol{+}}\,\beta \left|{\boldsymbol{v}}\right)\). More generally, by employing \(N\) multi-mode polarized beams, one can obtain a classical state

$$\left|{NE}\right)={c}_{{h}_{1}{h}_{2}\ldots {h}_{N}}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right){\boldsymbol{\ldots }}\left|{{\boldsymbol{h}}}_{N}\right)\,{\boldsymbol{+}}\,{c}_{{h}_{1}{h}_{2}\ldots {v}_{N}}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right){\boldsymbol{\ldots }}\left|{{\boldsymbol{v}}}_{N}\right)\,{\boldsymbol{+}}{\boldsymbol{\ldots }}{\boldsymbol{+}}\,{c}_{{v}_{1}{v}_{2}\ldots {v}_{N}}\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right){\boldsymbol{\ldots }}\left|{{\boldsymbol{v}}}_{N}\right)$$
(1)

which is the analogy of a general N-qubit quantum state \(\left|{\psi }_{N}\right\rangle ={q}_{00\ldots 0}\,\left|00\,\ldots 0\right\rangle \,{\boldsymbol{+}}\,{q}_{00\ldots 1}\left|00\ldots 1\right\rangle \,{\boldsymbol{+}}\,{\boldsymbol{\ldots }}\,{\boldsymbol{+}}\,{q}_{11\ldots 1}\left|11\ldots 1\right\rangle\). The N multi-mode polarized beams that give the state of Eq. (1) can be denoted by \({{\boldsymbol{E}}}_{i}(i=1,\ldots ,N)\), each of which can be expressed by \(\mathop{\sum }\nolimits_{k=1}^{M}{f}_{k}{{\boldsymbol{p}}}_{i,k}\). The \({f}_{k}\) of \({{\boldsymbol{E}}}_{i}\) denotes the orthonormal modes satisfying the relation \(\int {f}_{{k}_{1}}\left({\boldsymbol{r}},t\right)\,\ldots\, {f}_{{k}_{M}}\left({\boldsymbol{r}},t\right)d\varOmega ={\delta }_{{k}_{1},\ldots ,{k}_{M}}\), and Ω represents the parameter domain where the condition holds. \({{\boldsymbol{p}}}_{i,k}{\boldsymbol{=}}\,{p}_{i,k}^{{\rm{H}}}{\boldsymbol{h}}\,{\boldsymbol{+}}\,{p}_{i,k}^{{\rm{V}}}{\boldsymbol{v}}\) denotes the polarization of the mode \({f}_{k}\). Such a setup of the beams \({{\boldsymbol{E}}}_{i}\) (\(i=1,\ldots ,N\)) encodes the information to be processed, which is schematically shown in the leftmost part of Fig. 1. Fundamentally, the N multi-mode polarized beams can be used to analogize the N-qubit state, because the correlation of beams defined by the integral \(\int \left({{\boldsymbol{e}}}_{1}\,{\cdot }\,{{\boldsymbol{E}}}_{1}\right)\left({{\boldsymbol{e}}}_{2}\,{\cdot }\,{{\boldsymbol{E}}}_{2}\right)\ldots \left({{\boldsymbol{e}}}_{N}\,{\cdot }\,{{\boldsymbol{E}}}_{N}\right)d\varOmega\) is formally related to the quantum correlated projection \(\left(\left\langle {{\boldsymbol{e}}}_{1}\right|\left\langle {{\boldsymbol{e}}}_{2}\right|\ldots \left\langle {{\boldsymbol{e}}}_{N}\right|\right)\left|{\psi }_{N}\right\rangle\). The unit vector \({{\boldsymbol{e}}}_{i}(i=1,\ldots ,N\)) denotes the direction of the polarization projection of the beam \({{\boldsymbol{E}}}_{i}\), and \(\left|{{\boldsymbol{e}}}_{i}\right\rangle\) is the corresponding projection state of the \(i\) th qubit40. A detailed explanation of the correspondence between beam states and qubit states is given in the first section of the Materials and methods. Based on Eq. (1), a data sample expressed by a \({2}^{N}\) -dimensional complex vector can be encoded by the correlations of above \(N\) beams. In what follows, we show that encoding the data by the correlations enables a better way to process it.

Fig. 1: The general setup of a COCNN. The correlated light source, whose polarized modes are marked by arrows with different colors, is manipulated by the parts of the convolution (blue) and the pooling (brown).
figure 1

The part of the convolution is composed of the 2-beam operations marked by blue, and the details of a 2-beam operation is given in the blue dashed box. As shown inside the box, a 2-beam operation is implemented by eight Q-H-Qs and three NLUs. The part of the pooling is composed of the combiners marked by brown, and the details of a combiner is given in the brown dashed box. As shown inside the box, a combiner is implemented by a BC, an HWP-PBS-HWP, and an NLU. Finally, the detection of the output is performed by a homodyne interferometer and a processor. The interferometer is mainly built by a pair of mirrors (MP) and two beam splitters (BSs)

Secondly, we introduce the part of the convolution, as shown by the layers of the connected blue squares in Fig. 1. This is an essential part for processing the encoded data in our proposal. Traditionally, the convolution of the data is mathematically expressed by the linear transformation of a vector, which is the main function of the convolution blocks in the previous ONNs5,9,18. Here, the part of the convolution is also the linear transformation, but applied to the correlation rather than the amplitude of the beams of light. This leads to a completely different way of design for implementing the transformation. Following Eq. (1), the convolution can be described by a unitary operation \({U}_{c}\) on \(\left|{NE}\right)\). Here, we restrict that \({U}_{c}\) is composed of general 2-beam operations \({U}_{2E}\), as shown by the blue squares in Fig. 1. Given a state of the two correlated beams \(\left|2E\right)\,{=}\,{c}_{{h}_{1}{h}_{2}}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\,{+}\,{c}_{{h}_{1}{v}_{2}}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\,{+}\,{c}_{{v}_{1}{h}_{2}}\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\,{+}\,{c}_{{v}_{1}{v}_{2}}\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\), \({U}_{2E}\) represents the operations that can rotate \(\left|2E\right)\) to an arbitrary state in the space spanned by \({\{}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right){,}\,\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right){,}\, \left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right){,}\,\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right){\}}\). It is the analogy of the universal 2-qubit gate in the quantum computing theory. The recipe for implementing \({U}_{2E}\) is shown in the blue dashed box in Fig. 140, including two basic arrangements of the optical elements. The first is the Q-H-Q, composed of two quarter-wave plates (QWP) and one half-wave plate (HWP). A Q-H-Q can arbitrarily rotate the polarization state of a single beam, which is equivalent to the rotation of a single qubit. The second is an optical modulation unit (OMU), shifting the phase of the state component \(\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\) by a factor of \(\pi\). To realize such an element is actually tricky, and we provide one kind of design in S1 of the Supplementary Information. The function of an OMU is equivalent to the quantum CZ gate. The reason why the setup of the Q-H-Qs and the OMUs in Fig. 1 implements an arbitrary rotation of \(\left|2E\right)\) can be explained in refereeing to the quantum computing theory. In the theory of quantum computing, a universal 2-qubit gate can be decomposed into three CNOT gates and eight single qubit rotations40, and a CNOT gate can be synthesized by a CZ gate and two Hadamard gates41. Therefore, the decomposition of \({U}_{2E}\) into Q-H-Qs and OMUs can be correspondingly given. According to the definition of a \({U}_{2E}\), it can be parameterized by 15 real numbers, which is the same with the parameter number of a 2-qubit gate. So, as shown by Fig. 1, one layer of \({U}_{2E}\) acting on the adjacent beams of an \(N\) -beam array may have \(15\times \left(N-1\right)\) trainable parameters in total. This is larger than the convolution kernel applied in the standard convolutional neural networks (CNNs) framework42, or other equivalent ONN schemes5,9,18, which has at most \(N\) parameters for an N -dimensional input.

Thirdly, we introduce the part of the pooling, as shown by the layers of the brown squares in Fig. 1 with two inputs and one output. This is the key step for reducing the size of the data. Particularly, each brown square is called a combiner whose basic function is to generate simpler correlated states by decreasing the beam number. The function of the combiner is strictly given by

$$\left|2E\right)\left(2E\right|\to \left[\left({{\boldsymbol{h}}}_{1}\right|\otimes I\right]\left|2E\right)\left(2E\right|\left[\left|{{\boldsymbol{h}}}_{1}\right)\otimes I\right]+\left[\left({{\boldsymbol{v}}}_{1}\right|\otimes {I}_{2}\right]\left|2E\right)\left(2E\right|\left[\left|{{\boldsymbol{v}}}_{1}\right)\otimes I\right]$$
(2)

where \(\left(2E\right|\) (\(\left({{\boldsymbol{h}}}_{1}\right|\), \(\left({{\boldsymbol{v}}}_{1}\right|\)) is the Hermitian conjugate of \(\left|2E\right)\) (\(\left({{\boldsymbol{h}}}_{1}\right|\), \(\left({{\boldsymbol{v}}}_{1}\right|\)). The form \(\left|2E\right)\left(2E\right|\) is analog to the density matrix of a quantum state, and we also call it the density matrix of the beam state \(\left|2E\right)\). I is the identity operation on a single beam state. Notice that Eq. (2) is analog to the partial trace in quantum 2-partite system, which physically means looking into the single particle subspace of the system. Therefore, via the combiner defined by Eq. (2), the correlated state of the two beams is embedded into the state of one beam, losing part of the original information. Hence, if the combiner is applied for k times, an array of N beams is reduced to an array of Nk beams. The recipe of implementing Eq. (2) is shown in the brown dashed box in Fig. 1. At first, the upper beam is fed into a birefringence crystal (BC). The thickness of the BC is demanded to break the coherence of the horizontal and vertical polarization components. Then, the beam passes through an element set of HWP-PBS-HWP. Such an element set is equivalent to a polarizer, which can change an arbitrary state \(\alpha {\boldsymbol{h}}\,{+}\,\beta {\boldsymbol{v}}\) to \(\left(\alpha +\beta \right){\boldsymbol{(}}{\boldsymbol{h}}\,{+}\,{\boldsymbol{v}}{\boldsymbol{)}}{\boldsymbol{/}}2\). Then, the beam is modulated onto the lower beam through a nonlinear crystal unit (NLU). The basic function of an NLU is to generate \(\left[\epsilon \left(\alpha +\beta \right)/2\right]{\boldsymbol{h}}\,{+}\,\left[\gamma \left(\alpha +\beta \right)/2\right]{\boldsymbol{v}}\) with the input \(\left(\alpha +\beta \right){\boldsymbol{(}}{\boldsymbol{h}}\,{+}\,{\boldsymbol{v}}{\boldsymbol{)}}{\boldsymbol{/}}2\) and another beam \(\epsilon {\boldsymbol{h}}\,{+}\,\gamma {\boldsymbol{v}}\), where ϵ and γ are also complex coefficients of the polarization basis. The function of an NLU means that the h (or v) component of one mode in the output beam is the product of the h (or v) components of the corresponding modes of the input beams. This is generally the second-order nonlinear crystal in nature. A detailed analysis is given in the S2 of the Supplementary Information. Notice that by using one layer of the combiner, the information of the data encoded in the N -beam correlation is suppressed into those of \(N-k\) beams. It indicates that the dimension of the state space shrinks from \({2}^{N}\) to \({2}^{N-k}\), which is more efficient than the layer for average pooling or max pooling in the traditional CNNs, or other ONN models involving equivalent structures9.

Fourthly, we introduce the part the of the detection. The basic aim of the detection is to measure the correlation defined by ∫(e1E1)(e2E2) … (eNEN) . In our proposal, we consider a two-step procedure to achieve so. In the first step, each beam interferes with a local oscillator (LO) signal by using a homodyne interferometer, so that the projection of the \(i{\mathrm{th}}\) beam \({{\boldsymbol{e}}}_{i}\,{\cdot }\,{{\boldsymbol{E}}}_{i}\) can be obtained by the intensity difference of the outputs of the interferometer. The polarization of LO signal for measuring the i th beam \({{\boldsymbol{E}}}_{i}\) is set to be at the projection direction \({{\boldsymbol{e}}}_{i}\). The LO signal is coherent with all modes of \({{\boldsymbol{E}}}_{i}\), and can be generated by splitting the beam of \({{\boldsymbol{E}}}_{i}\), as shown by the rightmost part of Fig. 1. In the second step, all the projections \({{\boldsymbol{e}}}_{i}{\,{\cdot }}\,{{\boldsymbol{E}}}_{i}{\boldsymbol{(}}i{\boldsymbol{=}}1{\boldsymbol{,}}{{\ldots }}{\boldsymbol{,}}N{\boldsymbol{)}}\) measured by the homodyne detections are multiplied together by a processor. Then, the obtained result is proportional to the correlation defined by the above integral. In fact, the detection setup here is identical to the one employed in ref. 40. The underlying background of the setup is also the correspondence between the beam states defined by Eq. (1) and the qubit states, which is given in the first section of the Materials and methods as mentioned above.

The training of a COCNN is similar to other machine learning algorithms. By properly setting the loss function, one can assess the performance of the COCNN on a given dataset and update the parameters according to it. Here, we also employ the mean square error (MSE) as the loss function. If the target output of the ith data sample is denoted by \({y}_{i}\), and the corresponding output of the COCNN is denoted by \({y}_{i}^{{\prime} }\), the MSE can be given by

$${MSE}=\frac{1}{2D}\mathop{\sum }\limits_{i=1}^{D}{\left({y}_{i}-{y}_{i}^{{\prime} }\right)}^{2}$$
(3)

where D is the total number of the samples.

According to the above description, it can be noticed that the COCNN has a good correspondence with the QCNN proposed by ref. 37. This can be seen by the similarity between the correlation \(\int \left({{\boldsymbol{e}}}_{1}\,{\cdot }\,{{\boldsymbol{E}}}_{1}\right)\left({{\boldsymbol{e}}}_{2}\,{\cdot }\,{{\boldsymbol{E}}}_{2}\right)\ldots \left({{\boldsymbol{e}}}_{N}\,{\cdot }\,{{\boldsymbol{E}}}_{N}\right)d\varOmega\) and the quantum measurement \(\left(\left\langle {{\boldsymbol{e}}}_{1}\right|\left\langle {{\boldsymbol{e}}}_{2}\right|\ldots \left\langle {{\boldsymbol{e}}}_{N}\right|\right)\left|{\psi }_{N}\right\rangle\). Such a relation indicates that the COCNN can exhibit the properties of the QCNN accordingly. We show the relevant evidence from the following two aspects. The first one is about the speed-up in training. We numerically explore the training process of a COCNN model on two datasets, and take the performance of the CNN model on the same tasks as reference. In general, the considered tasks are to give the correct label of the input data sample by training the model parameters. The results are shown in Figs. 2a and b. The results in Fig. 2a are the convergence curves of the loss function for a binary classification task. The dataset we consider is composed of 0-1 vectors whose sizes are 8 × 1. There are 8 types of vectors in the set, each of which is labeled by 0 or 1. The distribution of the vectors is uniform, and each type of the vectors is randomly labeled. In the COCNN scheme, it only requires the three-beam states, \(\left|3E\right){\boldsymbol{=}}{c}_{{h}_{1}{h}_{2}{h}_{3}}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{h}}}_{3}\right){\boldsymbol{+}}{c}_{{h}_{1}{h}_{2}{v}_{3}}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{v}}}_{3}\right){\boldsymbol{+}}\,{\ldots }\,{\boldsymbol{+}}\,{c}_{{v}_{1}{v}_{2}{v}_{3}}\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\left|{{\boldsymbol{v}}}_{3}\right)\), to encode this kind of data samples according to our setup. The COCNN model for the task is composed of one convolutional layer (C-layer) and one pooling layer (P-layer). The C-layer contains two 2-beam operation \({U}_{2E}{\rm{s}}\), each of which is restricted to has 6 real trainable parameters. The P-layer contains two combiners, leaving only one beam as the output. The predicted label is given by the projection of the output beam, and the projective measurement is parameterized by 3 real numbers (two for the direction and one for the phase). By sequentially updating the 15 parameters based on the derivative of the loss function, the model finally converges. To verify the robustness of the results, we randomly generate the training dataset for 12 times and take the averages as the final results, as shown by the solid dots in the lower place of Fig. 2a. The green curve is fitted by this part of the results. As the reference, we establish a 15-parameter CNN to learn the same series of datasets, and adopt the MSE as the loss function as well. We also utilize the averages of the 12 trails as the results, as shown by the hollow squares in the upper place of Fig. 2a. The blue curve is fitted by this part of the results. Apart from the case of Fig. 2a, the COCNN can also be applied to complicated multiclass classification tasks. In order to show the point, we also consider the task on classifying a dataset containing four classes, shown in Fig. 2b. The dataset we consider is composed of 0-1 vectors whose sizes are 16 × 1. There are one hundred types vectors in the set, each of which is randomly labeled by 0, 1, 2, or 3. The distribution is approximately uniform. Using the COCNN scheme, the four-beam states, \(\left|4E\right)\,{\boldsymbol{=}}\,{c}_{{h}_{1}{h}_{2}{h}_{3}{h}_{4}}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{h}}}_{3}\right)\left|{{\boldsymbol{h}}}_{4}\right)\,{\boldsymbol{+}}\,{\ldots }\,{\boldsymbol{+}}\,{c}_{{v}_{1}{v}_{2}{v}_{3}{v}_{4}}\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\left|{{\boldsymbol{v}}}_{3}\right)\left|{{\boldsymbol{v}}}_{4}\right)\), are employed to encode the data samples. The COCNN model for the task is also composed of one C-layer and one P-layer. The C-layer contains three 2-beam operation \({U}_{2E}{\rm{s}}\), each of which is also restricted to has 6 real trainable parameters as well. The P-layer contains two combiners, leaving two beams as the output. The predicted label is given by the correlated projection of the two output beams, each local measurement of which is parameterized by 3 real numbers as the above case. The total parameter number is 26. We also generate the training data set for 12 times, and take the averages to obtain the solid dots in the lower place of Fig. 2b. The magenta curve is fitted by this part of the results. The reference results shown by the red curve are fitted by the loss function value (marked by hollow squares in Fig. 2b) of a 26-parameter CNN, and also averaged over 12 trails. The loss function is the same with the above. The details are presented in the second section of the Materials and methods. It can be seen clearly from Figs. 2a and b that the COCNN model converges faster than the CNN model. Also, the loss function of the COCNN model eventually converges to a smaller value than that of the CNN model, indicating a better learning accuracy. It worth noticing that the performance revealed by Figs. 2a and b is comparable with the numerical results in ref. 39. Therefore, the speed-up we find here is as effective as that of a QCNN.

Fig. 2: The performance of COCNN.
figure 2

a The comparison between the training of a COCNN model and that of a traditional CNN model on a binary classification task, and (b) that on a multiclass classification task. In (a) and (b), the hollow squares and solid dots are the numerical results. The blue curve in (a) and the red curve in (b) are fitted by the squares obtained by the traditional CNN model, which is composed of a traditional convolutional layer and a traditional pooling layer (the classical C-P layer model). The green curve in (a) and the magenta curve in (b) are fitted by the dots obtained by the COCNN model, which is composed of a convolutional layer and a pooling layer in our scheme (the correlated C-P layer model). The error bar is marked on the corresponding dots. c The simulation of the 12-qubit QCNN ansatz for recognizing the Haldane phase by our COCNN. The background of the plot is colored according to the derivative of the ground state energy density of the Haldane Hamiltonian. The upper zone represents the paramagnetic phase. The middle zone represents the \({Z}_{2}\times {Z}_{2}\) phase, or the Haldane phase. The lower zone represents the antiferromagnetic phase. The red triangles are obtained by the output of the COCNN analog to the QCNN

The second one is about performing the specific function of a QCNN. As pointed out by ref. 37, one function of a QCNN is to identify the phase of a many-body quantum system. We consider the QCNN circuit for recognizing the Haldane phase. As mentioned above, a correlated projection of a quantum state can be mapped to the correlation of the beams. So, the ground state of the Haldane Hamiltonian,

$$H=-J\mathop{\sum }\limits_{i=1}^{N-2}{Z}_{i}{X}_{i+1}{Z}_{i+2}-{h}_{1}\mathop{\sum }\limits_{i=1}^{N}{X}_{i}-{h}_{2}\mathop{\sum }\limits_{i=1}^{N}{X}_{i}{X}_{i+1}$$
(4)

can also be encoded by the correlated beams we consider. \({X}_{i}\) and \({Z}_{i}\) in Eq. (4) are the Pauli-X and Pauli-Z operator on the ith particle. Integer N is the total number of the particles, and J, h1 and h2 are the coefficients of the Hamiltonian43. In Ref. 37, a strict QCNN is presented to justify the phase of the ground states under different J, \({h}_{1}\) and \({h}_{2}\). Because the convolution and pooling of our scheme have a good correspondence with the quantum counterparts in QCNN, our COCNN can also implement the phase recognition algorithm based on the QCNN. An instruction about the detailed setup is given in the second section of the Materials and methods. We numerically simulate the situation when \(N=12\) and the results are shown in Fig. 2c. The phases identified with the ground state energy density is marked by the color of the background as the reference. The red triangles in Fig. 2c denotes the phase boundary information obtained by the second order derivative of the simulated results of the COCNN for phase recognition. Corresponding to the QCNN scheme, the results of the COCNN are the projections of the output beam states on the Pauli-X basis of the polarization. Notably, due to the layers of pooling, the correlation of the output is involved by fewer beams than the input ones encoding the ground state. Hence, such an output contains the information for identifying the phase and is much easier to measure, just like the characters shown by the QCNN. In the particular case here, the phase of the state encoded by the correlation of the 12 beams is eventually recognized by measuring the correlation of the 3 output beams. It also worth noticing that the results in Fig. 2(c) is comparable with the numerical illustration in ref. 37, validating the correspondence between the COCNN and the QCNN. Therefore, our scheme is capable of carrying out data processing similar to that of a quantum network. We also provide a detailed analysis of the connections between COCNNs and QCNNs in S3 of the Supplementary Information.

The experimental realization of a COCNN

In this part, we explore the experimental demonstration of the above scheme. Taking the COCNN analog to the 3-qubit phase recognition QCNN as an example, we show our experimental setup in Fig. 3. The setup also performs the functions explained in Fig. 1, including the part of the correlated light source, convolution, pooling, and measurements. The part of correlated light source in the experimental setup is implemented by adopting the spatial modes as the fk of the beam state. The spatial modes are generated by one 632.8 nm He-Ne laser (ThrolabsHNL210LB). The light output by the laser is polarized by a beam displacer (BD) and then split into four spatial modes by three beam splitters (BS, ThorlabsBS016). For each mode, the polarization state is adjusted by a Q-H-Q. The filters thereafter are used for balancing the intensity of different modes. This is a simplified version of the strategy for generating the correlated light source in the above discussions. The simplification is based on the intrinsic relation between the beam state defined by Eq. (1) and the corresponding qubit states. As mentioned above, the intrinsic relation is strictly characterized by the Eqs. (S6) in S1 of the Supplementary Information, indicating that there could be more than one setup for the beams in order to mimic a given quantum state. Therefore, it can be explored to identify the most experiment-friendly beam setup among the alternatives. In our case, we choose to settle down the polarization components of the two beams of the set, and implement the required operation effectively by manipulating the states of the rest beam. Hence, the analogy of the quantum circuit can be realized by only one beam involving four spatial modes, as shown by the left of Fig. 3. This largely reduces the requirements on the experiments. A strict explanation of the simplification is presented in the third section of the Materials and methods.

Fig. 3: The experimental realization of a simplified COCNN analog to the 3-qubit QCNN circuit for the phase recognition.
figure 3

The pooling and the detection part are combined into one due to the simplification. The optical elements used in the experiment is listed below. The related elements for performing an operation are marked by the area of a certain color

Next, the modes are fed into the part of convolution, corresponding to the blue layers in Fig. 1. The first operation in this part is an operation analog to the two CZ gates on adjacent qubits. After simplification, the operation is implemented by the pair of HWPs in the left yellow region. The one on the second mode shifts the phase of the vertical component by a factor of \(\pi\), and the one on the third mode shifts the phase of the same component by the same magnitude. The second operation is a combination of the two single-beam rotations, which is analog to the two quantum Hadamard gates on the first and the third qubit respectively. In the general proposal, it can be realized by two Q-H-Qs. Here, it is implemented by the four BSs in the gray region. The upper left BS mix the first and the third modes, and the lower left BS mix the second and the fourth modes. The right two BSs mix the outputs of the left ones. The third operation is a three-beam operation, which is the analogy of quantum Toffoli gate with the second qubit serving as the target qubit. According to our general proposal, it can be realized by using a series of 2-beam operations. Through the simplification strategy here, it is implemented by the HWP in the orange region. This HWP exchange the horizontal and vertical component of the fourth mode. The fourth operation is the same with the first operation. Therefore, it is also implemented by a pair of HWPs whose area is marked by yellow. The fifth operation is a single beam rotation, analog to the single Hadamard gate on the second qubit. This operation is applied for changing the basis of the projection in the next part. Here, it is implemented by four HWPs in the red region. Each HWP in the region changes the \({\boldsymbol{h}}\) polarization component to \(\left({\boldsymbol{h}}\,{\boldsymbol{+}}\,{\boldsymbol{v}}\right){\boldsymbol{/}}\sqrt{2}\) and changes the v polarization component to \(\left({\boldsymbol{h}}\,{\boldsymbol{-}}\,{\boldsymbol{v}}\right){\boldsymbol{/}}\sqrt{2}\). Then, the following projection of the beam is going to be casted in the Pauli-X basis.

After the part of convolution, the spatial modes go into the region for pooling and measurements, corresponding to the brown layers and the detections in Fig. 1. Because of the simplification we consider, the pooling operation in the experiment is realized by measuring the sum of the intensity difference of all the modes (see S4 of the Supplementary Information for details). Hence, it is merged into the measurements as shown in Fig. 3. In the region, the polarization of each mode is divided by a polarized beam splitter (PBS), and finally recorded by the CCDs (Thorlabs BC106N-VIS/M). By using such a detection setup, one can obtain the intensity sum of the vertical or horizontal components of the output. Further, the results corresponding to the quantum measurements can be acquired. Suppose the output state of our optical setup is denoted by \(\left|{E}_{{out}}\right)\). Adding up the intensities of all the components gives the result \({\rm{Tr}}\left\{\left({E}_{{out}}\right)\left({E}_{{out}}\right)\right\}\), and subtracting the vertical component sum from the horizontal component sum gives the result \({\rm{Tr}}\left\{Z\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}\), where \({\rm{Tr}}\{\,\}\) means taking the trace. If one arranges an HWP or a QWP oriented at 22.5° before each PBS in the last region, the results \({\rm{Tr}}\left\{X\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}\) or \({\rm{Tr}}\left\{Y\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}\) can also be obtained by the subtraction, where X, Y, and Z are the Pauli operators. These results correspond to the quantum measurements \({\rm{Tr}}\left\{{\rho }_{{out}}\right\}\), \({\rm{Tr}}\left\{Z{\rho }_{{out}}\right\}\), \({\rm{Tr}}\left\{X{\rho }_{{out}}\right\}\) and \({\rm{Tr}}\left\{Y{\rho }_{{out}}\right\}\) respectively, where \({\rho }_{{out}}\) is the output state of a quantum circuit. Similar to the strategy for estimating a single qubit state, the estimation of the single beam state \(\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\) can be given by \(I\cdot {\rm{Tr}}\left\{\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}+X\,\cdot\, {\rm{Tr}}\left\{X\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}+Y\,\cdot\, {\rm{Tr}}\left\{Y\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}+Z\,\cdot\, {\rm{Tr}}\left\{Z\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}\). Notice that function of our experimental setup also has a one-to-one correspondence with the 3-qubit phase recognition QCNN37. The 3-qubit circuit is shown by Fig. S3 in S4 of the Supplementary Information, composed of four CZ gates, three Hadamard gates and one Toffoli gate. The final output of the circuit is given by the measurements on a single qubit state. A detailed instruction of the whole theoretical background is provided in S4 of the Supplementary Information.

To benchmark the performance, we firstly check the output of the above optical setup when the inputs are the analogies of several special quantum states. Particularly, we consider the ten states of the beams, including \(\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{h}}}_{3}\right)\), \(\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{h}}}_{3}\right)\), \(\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\left|{{\boldsymbol{h}}}_{3}\right)\), \(\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\left|{{\boldsymbol{v}}}_{3}\right)\), \(\left|{{\boldsymbol{p}}}_{1}\right)\left|{{\boldsymbol{m}}}_{2}\right)\left|{{\boldsymbol{p}}}_{3}\right)\), \(\left|{{\boldsymbol{m}}}_{1}\right)\left|{{\boldsymbol{p}}}_{2}\right)\left|{{\boldsymbol{m}}}_{3}\right)\), \(\left|{{\boldsymbol{l}}}_{1}\right)\left|{{\boldsymbol{r}}}_{2}\right)\left|{{\boldsymbol{l}}}_{3}\right)\), \(\left|{{\boldsymbol{r}}}_{1}\right)\left|{{\boldsymbol{l}}}_{2}\right)\left|{{\boldsymbol{r}}}_{3}\right)\), \(\left[\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{h}}}_{3}\right)\,{\boldsymbol{+}}\,\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\left|{{\boldsymbol{v}}}_{3}\right)\right]/\sqrt{2}\), and \(\left[\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{v}}}_{3}\right)\,{\boldsymbol{+}}\,\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\left|{{\boldsymbol{h}}}_{3}\right)\,{\boldsymbol{+}}\,\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{h}}}_{3}\right)\right]/\sqrt{3}\). \(\left|{\boldsymbol{p}}\right)\) and \(\left|{\boldsymbol{m}}\right)\) denote \(\left[\left|{\boldsymbol{h}}\right)\,{\boldsymbol{+}}\,\left|{\boldsymbol{v}}\right)\right]/\sqrt{2}\) and \(\left[\left|{\boldsymbol{h}}\right)\,{\boldsymbol{-}}\,\left|{\boldsymbol{v}}\right)\right]/\sqrt{2}\) respectively. \(\left|{\boldsymbol{r}}\right)\) and \(\left|{\boldsymbol{l}}\right)\) denote \(\left[\left|{\boldsymbol{h}}\right)\,{\boldsymbol{+}}\,i\left|{\boldsymbol{v}}\right)\right]/\sqrt{2}\) and \(\left[\left|{\boldsymbol{h}}\right)\,{\boldsymbol{-}\,}i\left|{\boldsymbol{v}}\right)\right]/\sqrt{2}\) respectively. The corresponding ten quantum states are \(\left|000\right\rangle\), \(\left|100\right\rangle\), \(\left|110\right\rangle\), \(|111\rangle\), \(\left|+-+\right\rangle\), \(\left|-+-\right\rangle\), \(\left|{\rm{LRL}}\right\rangle\), \(\left|{\rm{RLR}}\right\rangle\), \(\left|{\rm{GHZ}}\right\rangle =\left(\left|000\right\rangle +\left|111\right\rangle \right)/\sqrt{2}\) and \(\left|{\rm{W}}\right\rangle =\left(\left|001\right\rangle +\left|010\right\rangle +\left|100\right\rangle \right)/\sqrt{3}\), where \(\left|\pm \right\rangle\) denotes \(\left(\left|0\right\rangle \pm \left|1\right\rangle \right)/\sqrt{2}\), and \(\left|{\rm{R}}\right\rangle\) (\(\left|{\rm{L}}\right\rangle\)) denotes \(\left(\left|0\right\rangle +i\left|1\right\rangle \right)/\sqrt{2}\) (\(\left(\left|0\right\rangle -i\left|1\right\rangle \right)/\sqrt{2}\)). The theoretical and experimental results of \(\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\) under different inputs are shown from Figs. 4a to j, in the order of the above ten states. The fidelities of the results are 0.997, 0.9998, 0.9995, 0.9996, 0.9985, 0.9840, 0.9986, 0.9989, 0.9853, and 0.9988, respectively. The theoretical calculation of the experimental output for the ten input states is given in S5 of the Supplementary Information.

Fig. 4: The density matrices of the COCNN outputs.
figure 4

The results are obtained by measuring the quantities \({\rm{Tr}}\left\{\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}\), \({\rm{Tr}}\left\{X\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}\), \({\rm{Tr}}\left\{Y\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}\), and \({\rm{Tr}}\left\{Z\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}\). The input beam states of (aj) are analog to the qubit states \(\left|000\right\rangle\), \(\left|100\right\rangle\), \(\left|110\right\rangle\), \(|111\rangle\), \(\left|+-+\right\rangle\), \(\left|-+-\right\rangle\), \(\left|{LRL}\right\rangle\), \(\left|{RLR}\right\rangle\), \(\left|{GHZ}\right\rangle\), and \(\left|W\right\rangle\) respectively. The heights of the colored inertia are the experimental data, and the heights of the black frame of the cuboids are the theoretical data. The theoretical expression of the results can be found in S5 of the Supplementary Information

Next, we present the phase recognition results. The input states here are set to be the analogies of the ground states of the 3-site Haldane Hamiltonian. The basic strategy to encode the ground states into the beams is given in S6 of the Supplementary Information. The ground states are calculated by the diagonalization. Here, \({h}_{1}/J\) is set to be 0.4, 0.8, 1.2 and 1.6, and \({h}_{2}/J\) is taken from -2.0 to 2.0 with an interval of 0.25. According to the numerical simulation of the COCNN scheme, we measure \({\rm{Tr}}\left\{Z\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}\) in this case, which is equivalent to the measurement of the QCNN phase recognition circuit due to the basis transformation enabled by the four HWPs in the red region. The results are marked by the red dots in the left panel of Fig. 5, and the data for obtaining the results is provided in S6 of the Supplementary Materials.

Fig. 5: The results when the input states encode the Haldane ground states.
figure 5

The red dots in the left panel are the experimental data. The X- and Y-axis of the coordinates represent the ratios \({h}_{1}/J\) and \({h}_{2}/J\). The Z-axis represent the expectation \({\rm{Tr}}\left\{Z\left|{E}_{{out}}\right)\left({E}_{{out}}\right|\right\}\) in our basis, equivalent to the measurements in the QCNN phase recognition circuit. The light blue curves are obtained by the SOP. The dark blue curves are the phase boundaries obtained by the second-order derivative of the experimental data. The right panel display the comparison of the phase boundaries obtained by the experimental data and those shown in Fig. 2(c)

For comparison, we plot the standard phase recognition results obtained by string-order parameters (SOP)37, shown by the light blue curves in left panel of Fig. 5. In the 3-site case, the SOP is given by \(\left\langle {ZXZ}\right\rangle\). From the left panel, we can see that the experimental data fits well with the SOP results, validating the effectiveness of the setup. Besides, the phase boundaries can also be obtained by the second order derivative of the experimental data, shown by the dark blue curves in Fig. 5. Compared with the numerical simulation in Fig. 2c, it can be found that the dark blue curves match the boundaries obtained by the numerical results. A direct illustration is shown by the right panel of Fig. 5. It worth noticing that the original SOP is a quantity that requires to measure the three-particle correlations. Using the COCNN here, it is effectively reduced to measuring the character of a single particle. This phenomenon in the COCNN experiment reveals the benefit of applying the QCNN for recognizing Haldane phase proposed by ref. 37.

Additionally, it can be found that the 3-qubit phase recognition QCNN functions as the fundamental building block of the N -qubit phase recognition QCNN due to the repeated structure of the network (as shown in Fig. S5 of the Supplementary Information). This implies that the experimental setup of the COCNN, which performs the same operation as the 3-qubit phase recognition QCNN, can also be viewed as the fundamental building block of the phase recognition COCNN that utilizes N -beam states. Moreover, for implementing other complicated tasks, the parameters of our general COCNN explained in the second section will be trained on the specific datasets of the tasks. The setups for those tasks can also be given by using the similar arrangements as in Fig. 3, according to our general proposal and the first section of the Materials and methods. Meanwhile, the simplification strategy applied here can also be extended to those tasks involving N beams. Thus, in principle, the N -beam COCNNs for the task or others are also implementable.

Discussion

In summary, we have proposed to introduce the correlation of the light fields for establishing a new ONN framework, which is called as the COCNN. Different from the previous ONN, which only adopts the superposition property of light, the COCNN can exhibit similar characters of the quantum neural networks. We numerically show that for the classification tasks we consider, the loss function of a COCNN converges faster than that of a CNN. Moreover, we have also shown the COCNN can be applied to implement the function of the QCNNs, such as the one for the recognition of the Haldane phase. Considering the fact that the COCNN we propose has a one-to-one correspondence with the quantum circuit, the speed-up here could be as effective as the speed-up of a QCNN. Taking the COCNN analog to the 3-qubit phase recognition QCNN as an example, we have explored the experimental demonstration of the COCNN. The function of the setup has been firstly checked by setting the input to be the analogy of ten quantum states, and secondly set to perform the phase recognition for the ground states of the 3-site Haldane Hamiltonian. All the experimental results are in good agreement with the theoretical results of the QCNN, indicating that the function of the QCNN can be realized by using our COCNN scheme in principle.

As mentioned above, the main character of the COCNN strategy is the modulation of the correlated beam states. It is the major cause of the acceleration in the training process. On the basis of the modulation, a convolutional operation can be given, enabling a quite effective capture of the data feature. More importantly, the pooling operation in this manner can reduce the size of the processed data faster than the traditional pooling of the CNN. The two operations of the COCNN are equivalent to those applied in the QCNN. Hence, we think that such a scheme potentially advances the boundaries of optical acceleration. Meanwhile, the results also indicate that the COCNN allows for the realization of the properties of quantum neural network in a more affordable way. Despite the potential advantages of quantum neural networks, implementing them practically requires deep quantum circuits with many multi-qubit gates and complicated measurements. This necessitates significant resources to stabilize the circuits and correct errors, which is technically challenging due to the unavoidable environmental disturbances. A potentially better alternative to implementing advantageous algorithms suggested by quantum computing theory is to find a system described by the same math as quantum theory and interrupted less by the environment. The proposed COCNN serves as an example of such a system, as evidenced by the ease of element arrangements and low requirements on the circumstances in our experiments. In all, given the exponential growth of data and the scarcity of resources for high-quality computation, the COCNN we propose presents a cost-effective and high-performance solution that could have widespread applications in various data science research fields.

Materials and methods

The correspondence between beam states and qubit states

The array of the beams we consider has been introduced in ref. 40. Here we briefly review the basic setup. Consider the 2-beam state \(\left|2E\right)\,{\boldsymbol{=}}\,{c}_{{h}_{1}{h}_{2}}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\,{\boldsymbol{+}}\,{c}_{{h}_{1}{v}_{2}}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\,{\boldsymbol{+}}\,{c}_{{v}_{1}{h}_{2}}\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\,{\boldsymbol{+}}\,{c}_{{v}_{1}{v}_{2}}\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\) as an example. The state can be given by two beams

$${{\boldsymbol{E}}}_{1}{\boldsymbol{=}}\mathop{\sum }\limits_{k=1}^{M}{f}_{k}{{\boldsymbol{p}}}_{1,k},\,{{\boldsymbol{E}}}_{2}\,{\boldsymbol{=}}\,\mathop{\sum }\limits_{k=1}^{M}{f}_{k}{{\boldsymbol{p}}}_{2,k}$$
(5)

As we mentioned in the main text, the LO beam for measuring \({{\boldsymbol{E}}}_{1}\) and \({{\boldsymbol{E}}}_{2}\) are expressed by \({{\boldsymbol{E}}}_{1}^{{\rm{LO}}}\,{\boldsymbol{=}}\,F{{\boldsymbol{e}}}_{1}\) and \({{\boldsymbol{E}}}_{2}^{{\rm{LO}}}\,{\boldsymbol{=}}\,F{{\boldsymbol{e}}}_{2}\). \(F\) represents the mode coherent with all fks, such that \(F\cdot {f}_{k}\propto {f}_{k}\). Using the homodyne detection, one obtains the real part of the projection of E1

$$\mathrm{Re}\left\{{D}_{1}\right\}={\left|{{\boldsymbol{E}}}_{1}+{{\boldsymbol{E}}}_{1}^{{\rm{LO}}}\right|}^{2}-{\left|{{\boldsymbol{E}}}_{1}-{{\boldsymbol{E}}}_{1}^{{\rm{LO}}}\right|}^{2}=2\mathrm{Re}\left\{{{\boldsymbol{E}}}_{1}^{{\rm{LO}}* }\cdot {{\boldsymbol{E}}}_{1}\right\}$$
(6)

as well as that of \({{\boldsymbol{E}}}_{2}\) denoted by \(\mathrm{Re}\left\{{D}_{2}\right\}\). \(\mathrm{Re}\left\{\,\right\}\) means taking the real part. The imaginary part can be obtained by shifting the phase of \({{\boldsymbol{E}}}_{1}^{{\rm{LO}}}\) and \({{\boldsymbol{E}}}_{2}^{{\rm{LO}}}\) by \(\pi /4\). Then, one can obtain the correlation by multiplying \({D}_{i}\) and \({D}_{i+1}\) and integrating the product in the domain where the orthonormal condition of \({f}_{k}\) holds. This is given by \(\int {D}_{1}{D}_{2}d\varOmega \propto \int \left({{\boldsymbol{e}}}_{1}^{* }\,\cdot\, {{\boldsymbol{E}}}_{1}\right)\left({{\boldsymbol{e}}}_{2}^{* }\,{\cdot\, {\boldsymbol{E}}}_{2}\right)\,d\varOmega\). Considering Eq. (5), one has

$$\begin{array}{l}\int \left({{\boldsymbol{e}}}_{1}^{* }\cdot {{\boldsymbol{E}}}_{1}\right)\left({{\boldsymbol{e}}}_{2}^{* }\,{\cdot\, {\boldsymbol{E}}}_{2}\right)\,d\varOmega \,{\boldsymbol{=}}\mathop{\sum }\limits_{k=1}^{M}\left({{\boldsymbol{e}}}_{1}^{* }\cdot {{\boldsymbol{p}}}_{1,k}\right)\left({{\boldsymbol{e}}}_{2}^{* }\cdot {{\boldsymbol{p}}}_{2,k}\right)=\left[{{\boldsymbol{e}}}_{1}^{* }{{\boldsymbol{e}}}_{2}^{* }\right]\cdot \mathop{\sum }\limits_{k=1}^{M}\left[{{\boldsymbol{p}}}_{1,k}{{\boldsymbol{p}}}_{2,k}\right]\\ =\left[{{\boldsymbol{e}}}_{1}^{* }{{\boldsymbol{e}}}_{2}^{* }\right]\left({c}_{{h}_{1}{h}_{2}}\left[{{\boldsymbol{h}}}_{1}{{\boldsymbol{h}}}_{2}\right]+{c}_{{h}_{1}{h}_{2}}\left[{{\boldsymbol{h}}}_{1}{{\boldsymbol{v}}}_{2}\right]+{c}_{{h}_{1}{h}_{2}}\left[{{\boldsymbol{v}}}_{1}{{\boldsymbol{h}}}_{2}\right]+{c}_{{h}_{1}{h}_{2}}\left[{{\boldsymbol{v}}}_{2}{{\boldsymbol{v}}}_{2}\right]\right)\end{array}$$
(7)

where,

$$\begin{array}{l}{c}_{{h}_{1}{h}_{2}}=\mathop{\sum }\limits_{k=1}^{M}{p}_{1,k}^{{\rm{H}}}{p}_{2,k}^{{\rm{H}}},{c}_{{h}_{1}{v}_{2}}=\mathop{\sum }\limits_{k=1}^{M}{p}_{1,k}^{{\rm{H}}}{p}_{2,k}^{{\rm{V}}},{c}_{{v}_{1}{h}_{2}}=\mathop{\sum }\limits_{k=1}^{M}{p}_{1,k}^{{\rm{V}}}{p}_{2,k}^{{\rm{H}}},{c}_{{v}_{1}{v}_{2}}\\\qquad\,=\mathop{\sum }\limits_{k=1}^{M}{p}_{1,k}^{{\rm{V}}}{p}_{2,k}^{{\rm{V}}}\end{array}$$
(8)

In Eq. (7), \(\left[{{\boldsymbol{e}}}_{1}^{* }{{\boldsymbol{e}}}_{2}^{* }\right]\) denotes dyadic vector generated by \({{\boldsymbol{e}}}_{1}^{* }\) and \({{\boldsymbol{e}}}_{2}^{* }\), and those applied for other vectors are similar. By using the compact notation | ) in the main text, one can reform Eq. (7) into \(\left[\left({{\boldsymbol{e}}}_{1}\right|\left({{\boldsymbol{e}}}_{2}\right|\right]\left|2E\right)\), which is of the same form with the projection of the 2-qubit state. More generally, one can obtain the state given by Eq. (1) with N correlated beams, analog to the N-qubit state

$$\mathop{\sum }\limits_{{j}_{1},{j}_{2},\ldots ,{j}_{N}=0}^{1}{q}_{{j}_{1}{j}_{2}\ldots {j}_{N}}\left|{j}_{1}{j}_{2}\ldots {j}_{N}\right\rangle$$
(9)

As mentioned in the main text, each of the beams is expressed by \(\mathop{\sum }\nolimits_{k=1}^{M}{f}_{k}{{\boldsymbol{p}}}_{i,k}\), such that the general expressions of the \({c}_{{h}_{1}{h}_{2}\ldots {h}_{N}},\ldots ,{c}_{{v}_{1}{v}_{2}\ldots {v}_{N}}\) in Eq. (1) are given by,

$$\begin{array}{c}{c}_{{h}_{1}{h}_{2}\ldots {h}_{N}}=\mathop{\sum }\limits_{k=1}^{M}{p}_{1,k}^{{\rm{H}}}{p}_{2,k}^{{\rm{H}}}\ldots {p}_{N,k}^{{\rm{H}}}\\ {c}_{{h}_{1}{h}_{2}\ldots {v}_{N}}=\mathop{\sum }\limits_{k=1}^{M}{p}_{1,k}^{{\rm{H}}}{p}_{2,k}^{{\rm{H}}}\ldots {p}_{N,k}^{{\rm{V}}}\\ \begin{array}{c}\vdots\qquad\qquad\qquad\qquad \vdots \\ {c}_{{v}_{1}{v}_{2}\ldots {v}_{N}}=\mathop{\sum }\limits_{k=1}^{M}{p}_{1,k}^{{\rm{V}}}{p}_{2,k}^{{\rm{V}}}\ldots {p}_{N,k}^{{\rm{V}}}\end{array}\end{array}$$
(10)

In the main text, we point out that \({c}_{{h}_{1}{h}_{2}\ldots {h}_{N}},\ldots ,{c}_{{v}_{1}{v}_{2}\ldots {v}_{N}}\) in Eq. (1) have a one-to-one relationship with \({q}_{00\ldots 0},\ldots ,{q}_{11\ldots 1}\) in Eq. (9). The underlying conditions for the one-to-one relationship are precisely characterized by Eq. (10). In fact, the coefficients \({c}_{{h}_{1}{h}_{2}\ldots {h}_{N}},\ldots ,{c}_{{v}_{1}{v}_{2}\ldots {v}_{N}}\) in Eq. (1) can be viewed as the components of a vector in the Hilbert space40. More generally, any other physical system that can be described by the similar mathematics would be also characterized by the Hilbert space.

The details of the numerical simulation shown by Fig. 2

The specific models of the numerical simulation in Fig. 2 are instructed below. In Fig. 2a, we compare the convergence of the loss function of the two networks in a binary classification task. The task we consider is to learn the labels of the eight-dimensional 0-1 vectors. The labels of the vectors are either “0” or “1”. In our consideration, the labels of the vectors are generated randomly. The results shown in Fig. 2(a) are averaged over the data obtained by 12 times of training. In each training, the vector set is re-labeled randomly.

To accomplish the task, we set the COCNN by adopting one C-layer and one P-layer on three input beams, such as \({\widetilde{{\boldsymbol{E}}}}_{1}\), \({\widetilde{{\boldsymbol{E}}}}_{2}\) and \({\widetilde{{\boldsymbol{E}}}}_{3}\). The state of the beams can be given by \(\left|3E\right)={c}_{{h}_{1}{h}_{2}{h}_{3}}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{h}}}_{3}\right)+{c}_{{h}_{1}{h}_{2}{v}_{3}}\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{v}}}_{3}\right)+\,{\ldots}\,+{c}_{{v}_{1}{v}_{2}{v}_{3}}\left|{{\boldsymbol{v}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right)\left|{{\boldsymbol{v}}}_{3}\right)\). The C-layer here contains two 2-beam operations shown in the dashed blue box of Fig. 1. Because the task is not so complicated, each 2-beam operation only has six trainable parameters. The 2-beam operation can be given in a matrix multiplication form,

$$\left[{U}_{Z}\left({\theta }_{1}\right)\otimes {U}_{Y}\left({\theta }_{2}\right)\right]\cdot {U}_{{\rm{CX}}}\cdot \left[{U}_{Y}\left({\theta }_{3}\right)\otimes {U}_{Z}\left({\theta }_{4}\right)\right]\,\cdot {U}_{{\rm{CX}}}^{{\prime} }\cdot \left[{U}_{Y}\left({\theta }_{5}\right)\otimes I\right]\cdot {U}_{{\rm{CX}}}\cdot \left[{U}_{Z}\left({\theta }_{6}\right)\otimes H\right]$$
(11)

whose basis is \(\left\{\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{h}}}_{2}\right){\boldsymbol{,}}\,\left|{{\boldsymbol{h}}}_{1}\right)\left|{{\boldsymbol{v}}}_{2}\right){\boldsymbol{,}}\,\left|{{\boldsymbol{v}}}_{1}\right)\left({{\boldsymbol{h}}}_{2}\right){\boldsymbol{,}}\,\left({{\boldsymbol{v}}}_{1}\right)\left({{\boldsymbol{v}}}_{2}\right)\right\}\) (or \(\left\{\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{h}}}_{3}\right){\boldsymbol{,}}\,\left|{{\boldsymbol{h}}}_{2}\right)\left|{{\boldsymbol{v}}}_{3}\right){\boldsymbol{,}}\,\left|{{\boldsymbol{v}}}_{2}\right)\left|{{\boldsymbol{h}}}_{3}\right){\boldsymbol{,}}\,\left|{{\boldsymbol{v}}}_{2}\right)\left|{{\boldsymbol{v}}}_{3}\right)\right\}\)). The notation in Eq. (11) is defined by,

$$\begin{array}{c}{U}_{Z}\left(\theta \right)=\exp (-{iZ}\theta /2),{U}_{Y}\left(\theta \right)=\exp (-{iY}\theta /2)\\ {U}_{{\rm{CX}}}=\left(\begin{array}{cccc}1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 1 & 0\end{array}\right),{U}_{{\rm{CX}}}^{{\prime} }=\left(\begin{array}{cccc}1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 1 & 0\\ 0 & 1 & 0 & 0\end{array}\right)\end{array}$$
(12)

In fact, \({U}_{Y}\left(\theta \right)\) and \({U}_{Z}\left(\theta \right)\) are the Pauli-Y and Pauli-Z rotations in the group theory. \({U}_{{\rm{CX}}}\) and \({U}_{{\rm{CX}}}^{{\prime} }\) are the same with the matrix form of the quantum CNOT gates. The two 2-beam operations act on the states encoded by the first pair of beams (\({\widetilde{{\boldsymbol{E}}}}_{1}\), \({\widetilde{{\boldsymbol{E}}}}_{2}\)) and the second pair of beams (\({\widetilde{{\boldsymbol{E}}}}_{2}\), \({\widetilde{{\boldsymbol{E}}}}_{3}\)) individually, while the parameters of the two operations are independently trained. After the C-layer, a P-layer is applied such that only one beam (such as \({\widetilde{{\boldsymbol{E}}}}_{2})\) of the three is left, which contains the correlation information of the others. The specific formula can be given by applying Eq. (2) twice,

$$\begin{array}{l}\left|3E\right)\left(3E\right|\to \left[\left({{\boldsymbol{h}}}_{1}\right|\otimes I\otimes \left({{\boldsymbol{h}}}_{3}\right|\right]\left|3E\right)\left(3E\right|\left[\left|{{\boldsymbol{h}}}_{1}\right)\otimes I\otimes \left|{{\boldsymbol{h}}}_{3}\right)\right]\\\qquad\qquad\quad+\,\left[\left({{\boldsymbol{h}}}_{1}\right|\otimes I\otimes \left({{\boldsymbol{v}}}_{3}\right|\right]\left|3E\right)\left(3E\right|\left[\left|{{\boldsymbol{h}}}_{1}\right)\otimes I\otimes \left|{{\boldsymbol{v}}}_{3}\right)\right]\\\qquad\qquad\quad+\,\left[\left({{\boldsymbol{v}}}_{1}\right|\otimes I\otimes \left({{\boldsymbol{h}}}_{3}\right|\right]\left|3E\right)\left(3E\right|\left[\left|{{\boldsymbol{v}}}_{1}\right)\otimes I\otimes \left|{{\boldsymbol{h}}}_{3}\right)\right]\\ \qquad\qquad\quad+\,\left[\left({{\boldsymbol{v}}}_{1}\right|\otimes I\otimes \left({{\boldsymbol{v}}}_{3}\right|\right]\left|3E\right)\left(3E\right|\left[\left|{{\boldsymbol{v}}}_{1}\right)\otimes I\otimes \left|{{\boldsymbol{v}}}_{3}\right)\right]\end{array}$$
(13)

Notice that Eq. (13) is an analogy of looking into the second qubit of a 3-qubit system. The final output of the COCNN here is given by projecting the polarization state of beam \({\widetilde{{\boldsymbol{E}}}}_{2}\) onto a direction and taking its modular square. The direction is parameterized by 3 real variables, corresponding to the horizontal and vertical components and their difference in phases respectively. In summary, the COCNN here has 15 trainable parameters. During the training process, the three parameters of the final projection are normalized after being updated in each iteration. In a real experimental setup, the 15 trainable parameters can be tuned by adjusting the fast-axis angles of the waveplates. Particularly, because the two 2-beam operations defined by Eq. (11) are implemented by the setup in the blue dashed box of Fig. 1, the 12 parameters of them can be tuned by the corresponding waveplates in the setup. Also, because the projection is implemented by an interferometer with an LO input whose polarization is at the direction of projection, the three parameters of projection are tuned by the corresponding waveplates for modulating the polarization of the LO input, as illustrated in the detection part of Fig. 1.

In order to perform a fair comparison, the CNN we employ also has one layer for traditional convolution and one layer for traditional pooling, with 15 parameters in total. Because the input is an 8 × 1 vector, the convolution kernel is set to be a 3 × 1 vector, the elements of which are trainable. We apply three independent kernels in the layer for convolution, so the parameter number of the layer is 9. After the convolution, one input vector is transformed into three 6 × 1 vectors. Then, the data is fed into the layer for pooling. We apply max pooling here, and the strip is set to be 3. Then, three 4 × 1 vectors are obtained. The final output is given by the weighted sum of the three vectors modulated by the sigmoid function. In specific, the elements of each vector are firstly summed up so that three values in total are obtained. Then, each of the three numbers are multiplied by one parameter and added by one bias respectively, so three new values are obtained. In the end, the three new values are substituted into the sigmoid function, and the average of the sigmoid function outputs is used as the final output of the network.

In Fig. 2b, we compare the convergence of the loss function of the two networks in a more complicated task, the multiclass classification task. The task we consider is to learn the labels of the sixteen-dimensional 0-1 vectors. The labels of the vectors are “0”, “1”, “2” or “3”. In our consideration, the labels of the vectors are generated randomly. The results shown in Fig. 2b are averaged over the data obtained by 12 times of training trials. In each trail, the vector set is also re-labeled randomly. The fundamental setups of the COCNN and the reference CNN are similar to those applied for the cases of Fig. 2a. The COCNN in this case also adopts one C-layer and one P-layer, acting on four input beams. The C-layer contains three 2-beam operations defined by Eq. (11). They operate on the 1st-2nd, the 2nd-3rd, and the 3rd-4th beams respectively, and all the parameters are trained independently. The P-layer contains two combiners, acting on the 1st-2nd and the 3rd-4th beams. Therefore, only two beams are left. The correlated measurement of the two beams is parameterized by 6 real variables in total. Each local projection is parameterized by 3 real variables as the above case. Adding them all, the whole COCNN here has 24 trainable parameters. During the training process, the three parameters of the final projection are also normalized after being updated in each iteration. The CNN here also has one layer for traditional convolution and one layer for traditional pooling, with 24 parameters in total. With the 16 × 1 input, the convolution kernel is set to be 5 × 1 and 6 × 1 vectors, the elements of which are trainable. We apply two 5 × 1 kernels and one 6 × 1 kernel, which are independently trained. After the convolution, one input vector is transformed into two 12 × 1 vectors and one 11 × 1 vector. Then, the data is fed into the layer for pooling. We apply average pooling here, and the strip is set to be 4. Then, two 9 × 1 vectors and one 8 × 1 vector are obtained. The final output is given by the weighted sum of the three vectors modulated by the sigmoid function. In specific, the elements of the three vectors are combined to a 26 × 1 vector, and then divided into four sets. Three of the sets have six elements, one has eight elements. Then, sum up the values of each set, and multiply the outcomes by four parameters with four biases being added respectively. Hence, four new values are obtained. In the end, the four new values are substituted into the sigmoid function, and the average of the sigmoid function outputs is used as the final output of the network.

In summary of the setup for Fig. 2a, b, the COCNN model for the task contains one C-layer and one P-layer, with 15 and 24 parameters respectively. As the reference, the CNN model for the same task contains one layer for convolution and one layer for pooling, with the same number of parameters correspondingly as well. Therefore, a faster convergence of the loss function of the COCNN than that of the CNN is observed. As mentioned in the main text, it is comparable with the results in ref. 39.

In Fig. 2c, we further show the connection of the COCNN with the QCNN by simulating the phase recognition circuit in ref. 37. The task in this case is to identify the phase of the input states, which are the ground states of the different Haldane models. The Hamiltonian of the Haldane model is given by Eq. (4). We numerically simulate the Hamiltonian when the number of sites is 12, which is sufficient to show the phase boundary according to our results in Fig. 2(c). Based on Eqs. (1) and (10), the ground states can be encoded by the 12 correlated beams discussed above. Then, the beams are modulated by the COCNN analogy to the QCNN shown in the Fig. 2b of ref. 37. In the main text, we mentioned that the C-layer is composed of a series of 2-beam operations. Each operation can generate any type of correlated states of the two beams, so it can realize the analogy of all the quantum gates in the QCNN. The essential gates of the QCNN in Fig. 2b of ref. 37 are CZ gates, Toffoli gates, SWAP gates, and measurement-based phase-flip gates. The corresponding operations of our COCNN scheme are thoroughly discussed in S3 of the Supplementary Information which specifies the connections of two networks. After being modulated by the composite of one C-layer, one P-layer, and an additional fully connected layer, only four beams are left and measured on their X-basis. If the four output beams are denoted by \({\widetilde{{\boldsymbol{E}}}}_{4}\), \({\widetilde{{\boldsymbol{E}}}}_{5}\), \({\widetilde{{\boldsymbol{E}}}}_{6}\), and \({\widetilde{{\boldsymbol{E}}}}_{7}\) and the state of them is denoted by \(\left|4\widetilde{E}\right)\left(4\widetilde{E}\right|\), the result of the measurement on the X basis can be expressed by \({\rm{Tr}}\left\{{X}_{1}{\otimes }{X}_{2}{\otimes }{X}_{3}{\otimes }{X}_{4}\left(4\widetilde{E}\right)\left(4\widetilde{E}\right|\right\}\). The correlation form of the measurement is given by

$$\mathop{\sum }\limits_{{{\boldsymbol{e}}}_{4},{{\boldsymbol{e}}}_{5},{{\boldsymbol{e}}}_{6},{{\boldsymbol{e}}}_{7}={{\boldsymbol{p}}}^{-}}^{{{\boldsymbol{p}}}^{+}}{\left(-1\right)}^{C\left({{\boldsymbol{e}}}_{4},{{\boldsymbol{e}}}_{5},{{\boldsymbol{e}}}_{6},{{\boldsymbol{e}}}_{7}\right)}{\left|\int \left({{\boldsymbol{e}}}_{4}\,{\cdot }\,{\widetilde{{\boldsymbol{E}}}}_{4}\right)\left({{\boldsymbol{e}}}_{5}\,{\cdot }\,{\widetilde{{\boldsymbol{E}}}}_{5}\right)\left({{\boldsymbol{e}}}_{6}\,{\cdot }\,{\widetilde{{\boldsymbol{E}}}}_{6}\right)\left({{\boldsymbol{e}}}_{7}\,{\cdot }\,{\widetilde{{\boldsymbol{E}}}}_{7}\right)d\varOmega \right|}^{2}$$
(14)

where \({{\boldsymbol{e}}}_{a}\) (\(a=4,\ldots ,7\)) here are restricted to be either \({{\boldsymbol{p}}}^{-}=\left({\boldsymbol{h}}-{\boldsymbol{v}}\right){\boldsymbol{/}}\sqrt{2}\) or \({{\boldsymbol{p}}}^{+}{\boldsymbol{=}}\left({\boldsymbol{h}}+{\boldsymbol{v}}\right){\boldsymbol{/}}\sqrt{2}\), and \(C\left({{\boldsymbol{e}}}_{4},{{\boldsymbol{e}}}_{5},{{\boldsymbol{e}}}_{6},{{\boldsymbol{e}}}_{7}\right)\) returns the number of \({{\boldsymbol{p}}}^{{\boldsymbol{-}}}\) among the eas. By changing the coefficient ratios h1/J and h2/J, one can calculate the different ground states and encode them by the 12 beams as mentioned above. Then, using Eq. (14) to do the measurements, a two-dimensional surface can be obtained by plotting the results. The red dots in Fig. 2c are plotted by marking the turning point of the surface in accordance with its second order derivative. The colors of background are determined by the ground-state energy density, whose boundaries are identified by the derivative of the energy density function. Notice that identifying the phase of the ground state usually requires to measure the string order parameter of all the particles. The motivation for proposing the QCNN is to decrease the number of the particles needed to measure. In our simulated example, the ground state of the 12-site Haldane Hamiltonian is mapped to the correlation of the 12 beams, and a corresponding phase graph can be obtained by measuring the four output beams. Therefore, the results can be viewed as an illustration of the QCNN spirit, showing a good match with the results in ref. 37.

The strict explanation of the simplification strategy used in our experiments

We perform the experiment to show a feasible COCNN analog to the 3-qubit QCNN. In general, an arbitrary three-qubit state can be denoted by

$$\begin{array}{c}\left|{\psi }_{3}\right\rangle ={q}_{000}\left|000\right\rangle +{q}_{001}\left|001\right\rangle +{q}_{010}\left|010\right\rangle +{q}_{011}\left|011\right\rangle \\ \qquad\qquad+\,{q}_{100}\left|100\right\rangle +{q}_{101}\left|101\right\rangle +{q}_{110}\left|110\right\rangle +{q}_{111}\left|111\right\rangle \end{array}$$
(15)

Using the strategy in the first section of Materials and methods and the main text, we adopt three beams to encode Eq. (15). The number of modes is required to support the solvability of Eq. (10). For example, if the mode number \(M=2\), one has

$$\begin{array}{c}{\widetilde{{\boldsymbol{E}}}}_{r}=\left({p}_{r,1}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{r,1}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{1}+\left({p}_{r,2}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{r,2}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{2}\\ {\widetilde{{\boldsymbol{E}}}}_{s}=\left({p}_{s,1}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{s,1}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{1}+\left({p}_{s,2}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{s,2}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{2}\\ {\widetilde{{\boldsymbol{E}}}}_{t}=\left({p}_{t,1}^{{\rm{H}}}{\boldsymbol{h}}+{q}_{t,1}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{1}+\left({q}_{t,2}^{{\rm{H}}}{\boldsymbol{h}}+{q}_{t,2}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{2}\end{array}$$
(16)

Notice that, there are 12 unknown variables (\({p}^{H}\) and \({p}^{V}\)) in Eq. (16). According to Eq. (10), the conditions of the 12 variables for mimicking state (15) can be characterized by 8 equations. So, the equation set has a solution in principle. However, if more modes are introduced, the experimental realization of the modulations can be further simplified. Consider the case when \(M=4\), one has

$$\begin{array}{c}{{\boldsymbol{E}}}_{r}=\left({p}_{r,1}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{r,1}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{1}+\left({p}_{r,2}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{r,2}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{2}+\left({p}_{r,3}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{r,3}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{3}+\left({p}_{r,4}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{r,4}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{4}\\ {{\boldsymbol{E}}}_{s}=\left({p}_{s,1}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{s,1}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{1}+\left({p}_{s,2}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{s,2}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{2}+\left({p}_{s,3}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{s,3}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{3}+\left({p}_{s,4}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{s,4}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{4}\\ {{\boldsymbol{E}}}_{t}=\left({p}_{t,1}^{{\rm{H}}}{\boldsymbol{h}}+{q}_{t,1}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{1}+\left({q}_{t,2}^{{\rm{H}}}{\boldsymbol{h}}+{q}_{t,2}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{2}+\left({p}_{t,3}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{r,3}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{3}+\left({p}_{t,4}^{{\rm{H}}}{\boldsymbol{h}}+{p}_{t,4}^{{\rm{V}}}{\boldsymbol{v}}\right){f}_{4}\end{array}$$
(17)

The equation set can be expressed in a matrix form,

$$\left(\begin{array}{cccccccc}{p}_{r,1}^{{\rm{H}}}{p}_{t,1}^{{\rm{H}}} & {p}_{r,2}^{{\rm{H}}}{p}_{t,2}^{{\rm{H}}} & {p}_{r,3}^{{\rm{H}}}{p}_{t,3}^{{\rm{H}}} & {p}_{r,4}^{{\rm{H}}}{p}_{t,4}^{{\rm{H}}} & 0 & 0 & 0 & 0\\ {p}_{r,1}^{{\rm{H}}}{p}_{t,1}^{{\rm{V}}} & {p}_{r,2}^{{\rm{H}}}{p}_{t,2}^{{\rm{V}}} & {p}_{r,3}^{{\rm{H}}}{p}_{t,3}^{{\rm{V}}} & {p}_{r,4}^{{\rm{H}}}{p}_{t,4}^{{\rm{V}}} & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & {p}_{r,1}^{{\rm{H}}}{p}_{t,1}^{{\rm{H}}} & {p}_{r,2}^{{\rm{H}}}{p}_{t,2}^{{\rm{H}}} & {p}_{r,3}^{{\rm{H}}}{p}_{t,3}^{{\rm{V}}} & {p}_{r,4}^{{\rm{H}}}{p}_{t,4}^{{\rm{V}}}\\ 0 & 0 & 0 & 0 & {p}_{r,1}^{{\rm{H}}}{p}_{t,1}^{{\rm{V}}} & {p}_{r,2}^{{\rm{H}}}{p}_{t,2}^{{\rm{V}}} & {p}_{r,3}^{{\rm{H}}}{p}_{t,3}^{{\rm{V}}} & {p}_{r,4}^{{\rm{H}}}{p}_{t,4}^{{\rm{V}}}\\ {p}_{r,1}^{{\rm{V}}}{p}_{t,1}^{{\rm{H}}} & {p}_{r,2}^{{\rm{V}}}{p}_{t,2}^{{\rm{H}}} & {p}_{r,3}^{{\rm{V}}}{p}_{t,3}^{{\rm{H}}} & {p}_{r,4}^{{\rm{V}}}{p}_{t,4}^{{\rm{H}}} & 0 & 0 & 0 & 0\\ {p}_{r,1}^{{\rm{V}}}{p}_{t,1}^{{\rm{V}}} & {p}_{r,2}^{{\rm{V}}}{p}_{t,2}^{{\rm{V}}} & {p}_{r,3}^{{\rm{V}}}{p}_{t,3}^{{\rm{V}}} & {p}_{r,4}^{{\rm{V}}}{p}_{t,4}^{{\rm{V}}} & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & {p}_{r,1}^{{\rm{V}}}{p}_{t,1}^{{\rm{H}}} & {p}_{r,2}^{{\rm{V}}}{p}_{t,2}^{{\rm{H}}} & {p}_{r,3}^{{\rm{V}}}{p}_{t,3}^{{\rm{H}}} & {p}_{r,4}^{{\rm{V}}}{p}_{t,4}^{{\rm{H}}}\\ 0 & 0 & 0 & 0 & {p}_{r,1}^{{\rm{V}}}{p}_{t,1}^{{\rm{V}}} & {p}_{r,2}^{{\rm{V}}}{p}_{t,2}^{{\rm{V}}} & {p}_{r,3}^{{\rm{V}}}{p}_{t,3}^{{\rm{V}}} & {p}_{r,4}^{{\rm{V}}}{p}_{t,4}^{{\rm{V}}}\end{array}\right)\left(\begin{array}{c}{p}_{s,1}^{{\rm{H}}}\\ {p}_{s,2}^{{\rm{H}}}\\ {p}_{s,3}^{{\rm{H}}}\\ {p}_{s,4}^{{\rm{H}}}\\ {p}_{s,1}^{{\rm{V}}}\\ {p}_{s,2}^{{\rm{V}}}\\ {p}_{s,3}^{{\rm{V}}}\\ {p}_{s,4}^{{\rm{V}}}\end{array}\right)=\left(\begin{array}{c}{q}_{000}\\ {q}_{001}\\ {q}_{010}\\ {q}_{011}\\ {q}_{100}\\ {q}_{101}\\ {q}_{110}\\ {q}_{111}\end{array}\right)$$
(18)

Because the equation set is under determined, we can fix several unknown variables while keeping the solvability of the equation set. In our experiment, we consider to set \({p}_{r}{\rm{s}}\) and \({p}_{t}{\rm{s}}\) as follows,

$$\begin{array}{c}{p}_{r,1}^{{\rm{H}}}=1,{p}_{r,1}^{{\rm{V}}}=0,{p}_{r,2}^{{\rm{H}}}=1,{p}_{r,2}^{{\rm{V}}}=0,{p}_{r,3}^{{\rm{H}}}=0,{p}_{r,3}^{{\rm{V}}}=1,{p}_{r,4}^{{\rm{H}}}=0,{p}_{r,4}^{{\rm{V}}}=1\\ {p}_{t,1}^{{\rm{H}}}=1,{p}_{t,1}^{{\rm{V}}}=0,{p}_{t,2}^{{\rm{H}}}=0,{p}_{t,2}^{{\rm{V}}}=1,{p}_{t,3}^{{\rm{H}}}=1,{p}_{t,3}^{{\rm{V}}}=0,{p}_{t,4}^{{\rm{H}}}=0,{p}_{t,4}^{{\rm{V}}}=1\end{array}$$
(19)

Then, the coefficient matrix in reference of \({p}_{s}{\rm{s}}\) changes to

$${M}_{s}=\left(\begin{array}{cccccccc}1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\end{array}\right)$$
(20)

Notice that \({M}_{s}\) is a symmetric and sparse matrix. Given a unitary operator \({U}_{3}\) on \(\left|{\psi }_{3}\right\rangle\), the matrix form of \({U}_{3}\) is denoted by \({M}_{U}\). Hence, the corresponding modulation on the beams that generates the analogy of \({U}_{3}\left|{\psi }_{3}\right\rangle\) can be deduced by following relations,

$$\begin{array}{c}{\rm{qubit\; state}}:\left|{\psi }_{3}\right\rangle =\left(\begin{array}{c}{q}_{000}\\ {q}_{001}\\ {q}_{010}\\ {q}_{011}\\ {q}_{100}\\ {q}_{101}\\ {q}_{110}\\ {q}_{111}\end{array}\right)\mathop{\to }\limits^{{U}_{3}}{U}_{3}\left|{\psi }_{3}\right\rangle {\boldsymbol{=}}{M}_{U}\left(\begin{array}{c}{q}_{000}\\ {q}_{001}\\ {q}_{010}\\ {q}_{011}\\ {q}_{100}\\ {q}_{101}\\ {q}_{110}\\ {q}_{111}\end{array}\right)=\left(\begin{array}{c}{q{\prime} }_{000}\\ {q{\prime} }_{001}\\ {q{\prime} }_{010}\\ {q{\prime} }_{011}\\ {q{\prime} }_{100}\\ {q{\prime} }_{101}\\ {q{\prime} }_{110}\\ {q{\prime} }_{111}\end{array}\right)\\ {\rm{beam\; state}}:{M}_{s}\left(\begin{array}{c}{p}_{s,1}^{{\rm{H}}}\\ {p}_{s,2}^{{\rm{H}}}\\ {p}_{s,3}^{{\rm{H}}}\\ {p}_{s,4}^{{\rm{H}}}\\ {p}_{s,1}^{{\rm{V}}}\\ {p}_{s,2}^{{\rm{V}}}\\ {p}_{s,3}^{{\rm{V}}}\\ {p}_{s,4}^{{\rm{V}}}\end{array}\right)\mathop{\to }\limits^{{M}_{U}}{M}_{U}{M}_{s}\left(\begin{array}{c}{p}_{s,1}^{{\rm{H}}}\\ {p}_{s,2}^{{\rm{H}}}\\ {p}_{s,3}^{{\rm{H}}}\\ {p}_{s,4}^{{\rm{H}}}\\ {p}_{s,1}^{{\rm{V}}}\\ {p}_{s,2}^{{\rm{V}}}\\ {p}_{s,3}^{{\rm{V}}}\\ {p}_{s,4}^{{\rm{V}}}\end{array}\right)={M}_{s}{M}_{U}^{{\prime} }\left(\begin{array}{c}{p}_{s,1}^{{\rm{H}}}\\ {p}_{s,2}^{{\rm{H}}}\\ {p}_{s,3}^{{\rm{H}}}\\ {p}_{s,4}^{{\rm{H}}}\\ {p}_{s,1}^{{\rm{V}}}\\ {p}_{s,2}^{{\rm{V}}}\\ {p}_{s,3}^{{\rm{V}}}\\ {p}_{s,4}^{{\rm{V}}}\end{array}\right)={M}_{s}\left(\begin{array}{c}{p{\prime} }_{s,1}^{{\rm{H}}}\\ {p{\prime} }_{s,2}^{{\rm{H}}}\\ {p{\prime} }_{s,3}^{{\rm{H}}}\\ {p{\prime} }_{s,4}^{{\rm{H}}}\\ {p{\prime} }_{s,1}^{{\rm{V}}}\\ {p{\prime} }_{s,2}^{{\rm{V}}}\\ {p{\prime} }_{s,3}^{{\rm{V}}}\\ {p{\prime} }_{s,4}^{{\rm{V}}}\end{array}\right)\end{array}$$
(21)

where \({M}_{U}^{{\prime} }={M}_{s}^{-1}{M}_{U}{M}_{s}\). The meaning of Eq. (21) is that applying the operation \({M}_{U}^{{\prime} }\) on \({{\boldsymbol{E}}}_{s}\) is equivalent to performing \({U}_{3}\) on \(\left|{\psi }_{3}\right\rangle\). In our experiment, we use Eq. (21) to map all the gates of the 3-qubit QCNN to the operations on \({{\boldsymbol{E}}}_{s}\). The optical setup in Fig. 3 is the implementation of the operations. A step-by-step calculation of the circuit is provided in S4 of the Supplementary Information.