Learning-based attacks for detecting the vulnerability of computer-generated hologram based optical encryption

Abstract: Optical encryption has attracted wide attention for its remarkable characteristics. Inspired by the development of double random phase encoding, many researchers have developed optical encryption systems for practical applications. Computer-generated holograms (CGHs) have also been found to be highly promising for optical encryption, and CGH-based optical encryption offers simplicity and high feasibility for practical implementation. An input image, i.e., plaintext, can be iteratively or non-iteratively encoded into one or several phase-only masks via phase retrieval algorithms. Without the security keys, unauthorized receivers are considered unable to correctly extract the input image from the ciphertext. However, cryptoanalysis of CGH-based optical encryption systems has not been effectively carried out before, and it remains an open question whether CGH-based optical encryption is sufficiently secure for practical applications. In this paper, a learning-based attack is proposed, for the first time to our knowledge, to demonstrate the vulnerability of CGH-based optical security systems without direct retrieval of the optical encryption keys. Many pairs of the extracted CGH patterns and their corresponding input images (i.e., ciphertext-plaintext pairs) are used to train a designed learning model. After training, unknown plaintexts can be retrieved directly from the given ciphertexts (i.e., phase-only masks) by using the trained learning model without subsidiary conditions. Moreover, the proposed learning-based attacks are also feasible and effective for the cryptoanalysis of CGH-based optical security systems with multiple cascaded phase-only masks. The proposed learning-based attacking method paves the way for the cryptoanalysis of CGH-based optical encryption.


Introduction
With the growing demand for secure communication and storage, information security has received increasing attention, and many information security systems [1][2][3][4][5][6][7] have been proposed.
Recently, optical encryption has been demonstrated to be promising, and it can open up a new direction in cryptography owing to its distinctive characteristics, e.g., multiple degrees of freedom and parallel processing. Double random phase encoding (DRPE) [6] was proposed first; in this scheme, an image (i.e., plaintext) is encoded into white stationary noise (i.e., ciphertext) by using two statistically independent random phase-only masks. The DRPE scheme has been continuously developed in different domains, e.g., the Gyrator transform, fractional Fourier transform and Fresnel transform domains [17][18][19][20][21][22][23]. In addition, optical encryption systems based on imaging mechanisms [8,14,28,29], such as ghost imaging, diffractive imaging and interferometric imaging, have been shown to be applicable for securing information. It has also been found that computer-generated holograms (CGHs) [30][31][32][33][34][35] can be applied to optical encryption, and their advantages, e.g., simple implementation, have been explored for practical applications. In CGH-based optical encryption, the input image can be iteratively or non-iteratively encoded into one or several phase-only masks (i.e., CGHs) [31][32][33][34][35]. CGH-based optical encryption systems can achieve high security, and convenient optical implementations become possible in practice, e.g., by using spatial light modulators for optical decryption.
In recent years, there has been growing interest in the vulnerability analysis of optical encryption systems [36][37][38][39][40][41]. Carnicer et al. [36] first proposed a chosen-ciphertext attack to show that the DRPE-based infrastructure is not secure under certain conditions. Subsequently, different vulnerability analysis methods for optical encryption systems have been proposed, e.g., chosen-plaintext attacks and ciphertext-only attacks [37][38][39][40]. These attacking methods have been continuously improved, and they play an important role in the evolution of cryptoanalysis of various optical encryption systems. However, they usually estimate the optical encryption keys by using elaborately designed plaintext-ciphertext pairs or complex phase retrieval algorithms. The applicability of conventional cryptoanalysis methods is therefore limited in practice, and a dedicated phase retrieval algorithm usually needs to be designed to analyze the security of each different optical encryption system. In addition, the vulnerability of CGH-based optical encryption systems has not been effectively studied before, and it remains an open question whether the security of CGH-based optical encryption is sufficiently high for practical applications.
In this paper, we propose learning-based attacks on CGH-based optical encryption for the first time to our knowledge. Using many pairs of the extracted CGH patterns and their corresponding input images, a designed learning model [41][42][43][44] is trained to emulate the inner representations (e.g., optical setup parameters and phase-only masks) of the training data. Then, the trained learning model is applied to recover unknown plaintexts from the given CGH patterns without the use of optical encryption keys. The proposed learning-based attacks are demonstrated to be feasible and effective for analyzing the vulnerability of CGH-based optical cryptosystems.

CGH-based optical encryption
Figure 1 shows a schematic setup for a typical optical security system based on CGH with two cascaded phase-only masks (i.e., one phase-only mask M1 to be extracted as ciphertext and one fixed phase-only mask M2 as security key). The encoding process of the CGH-based optical cryptosystem uses an iterative phase retrieval algorithm to extract the phase-only mask M1 under the constraint of the original input image (i.e., plaintext). In the initial iteration, phase-only mask M1 is initialized randomly in the range [0, 2π], and the random phase-only mask M2 is pre-defined and fixed. For the encoding, the phase-only masks M1 and M2 are respectively denoted as $M_1^{n}(\mu, \upsilon)$ and $M_2(\xi, \eta)$, where $n$ (i.e., integers 1, 2, 3, ...) denotes the nth iteration. The wavefront $f^{n}(x, y)$ in the input image plane can be described by
$$f^{n}(x, y) = \mathrm{FrT}_{\lambda, d_2}\left\{ \mathrm{FrT}_{\lambda, d_1}\left[ M_1^{n}(\mu, \upsilon) \right] M_2(\xi, \eta) \right\},$$
where FrT denotes free-space wave propagation in the Fresnel domain [45], $\lambda$ denotes the wavelength of the incident wave, $d_1$ denotes the axial distance between phase-only mask M1 and phase-only mask M2, and $d_2$ denotes the axial distance between phase-only mask M2 and the input image plane. The angular spectrum method is adopted to describe the free-space wave propagation. To update the wavefront $f^{n}(x, y)$ in the input image plane, the original input image $I(x, y)$ is applied as a constraint:
$$f^{n}_{\mathrm{update}}(x, y) = I(x, y) \frac{f^{n}(x, y)}{\left| f^{n}(x, y) \right|},$$
where $|\cdot|$ denotes the modulus operation, and $f^{n}_{\mathrm{update}}(x, y)$ denotes the updated wavefront in the input image plane. Subsequently, phase-only mask M1 is further updated by
$$M_1^{n+1}(\mu, \upsilon) = \frac{\mathrm{FrT}_{\lambda, -d_1}\left\{ \mathrm{FrT}_{\lambda, -d_2}\left[ f^{n}_{\mathrm{update}}(x, y) \right] M_2^{*}(\xi, \eta) \right\}}{\left| \mathrm{FrT}_{\lambda, -d_1}\left\{ \mathrm{FrT}_{\lambda, -d_2}\left[ f^{n}_{\mathrm{update}}(x, y) \right] M_2^{*}(\xi, \eta) \right\} \right|},$$
where the asterisk denotes the complex conjugate. Then, the wavefront in the input image plane can be further updated by
$$f^{n+1}(x, y) = \mathrm{FrT}_{\lambda, d_2}\left\{ \mathrm{FrT}_{\lambda, d_1}\left[ M_1^{n+1}(\mu, \upsilon) \right] M_2(\xi, \eta) \right\}.$$
To evaluate the difference between the estimated input image $|f^{n}(x, y)|$ and the original input image $I(x, y)$, the correlation coefficient (CC) is calculated by using the built-in Matlab function 'corr2'. A higher CC value means a higher similarity between the estimated input image and the original input image. When a preset CC value (i.e., threshold) is achieved, the iterative process is stopped and the updated phase-only pattern $M_1^{n}(\mu, \upsilon)$ is used as the final estimate of phase-only mask M1, i.e., as $M_1(\mu, \upsilon)$.
It has been widely reported [31][32][33][34][35] that, without the optical encryption keys (e.g., wavelength, axial distances and the fixed phase-only mask M2), the original input image (i.e., plaintext) cannot be extracted from a given phase-only mask M1 during optical decryption.
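The encoding loop described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the pixel pitch of 8 µm, the 64 × 64 toy plaintext and the reduced iteration count are all assumptions made for illustration, and free-space propagation is modelled with a plain angular-spectrum transfer function.

```python
import numpy as np


def propagate(field, wavelength, distance, pitch):
    """Angular-spectrum propagation of a square complex field (lengths in metres)."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=pitch)                      # spatial frequencies
    fxx, fyy = np.meshgrid(fx, fx, indexing="ij")
    arg = 1.0 / wavelength**2 - fxx**2 - fyy**2
    kz = 2.0 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    # Evanescent components (arg <= 0) are discarded.
    transfer = np.where(arg > 0, np.exp(1j * kz * distance), 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)


def corr2(a, b):
    """2-D correlation coefficient, analogous to Matlab's corr2."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a**2).sum() * (b**2).sum()))


def forward(mask1, mask2, wavelength, d1, d2, pitch):
    """Wavefront in the image plane produced by M1 followed by M2."""
    at_m2 = propagate(mask1, wavelength, d1, pitch) * mask2
    return propagate(at_m2, wavelength, d2, pitch)


def encode(image, mask2, wavelength, d1, d2, pitch,
           cc_threshold=0.98, max_iter=1000, seed=0):
    """Iteratively extract the phase-only mask M1 (ciphertext) for `image`."""
    rng = np.random.default_rng(seed)
    mask1 = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, image.shape))
    f = forward(mask1, mask2, wavelength, d1, d2, pitch)
    cc = corr2(np.abs(f), image)
    for _ in range(max_iter):
        if cc >= cc_threshold:
            break
        # Amplitude constraint: keep the phase of f, impose the plaintext.
        f_update = image * np.exp(1j * np.angle(f))
        # Back-propagate through the conjugated key mask to the M1 plane.
        at_m2 = propagate(f_update, wavelength, -d2, pitch) * np.conj(mask2)
        back = propagate(at_m2, wavelength, -d1, pitch)
        mask1 = np.exp(1j * np.angle(back))              # phase-only constraint
        f = forward(mask1, mask2, wavelength, d1, d2, pitch)
        cc = corr2(np.abs(f), image)
    return np.angle(mask1) % (2.0 * np.pi), cc


# Toy usage with assumed sizes (the paper uses 512 x 512 masks).
rng = np.random.default_rng(1)
image = np.zeros((64, 64))
image[20:44, 24:40] = 1.0                                # toy plaintext
mask2 = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, (64, 64)))  # fixed key mask
cipher, cc = encode(image, mask2, 632.8e-9, 0.17, 0.12, 8e-6, max_iter=50)
print(cipher.shape, round(cc, 3))
```

The returned CC always refers to the returned mask, so the stopping rule and the final ciphertext stay consistent with each other.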

Learning-based attacks
Learning-based attacks, which can extract high-level representations from the given data [42][43][44][46][47][48][49], are proposed and designed here to analyze the security of CGH-based optical encryption. One of the most popular learning architectures for imaging problems is the convolutional neural network (CNN), which is widely used for object classification [46] and object reconstruction [42][43][44]. The CNN architecture belongs to supervised learning algorithms, which are trained by using input-output pairs $D = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n$ denotes an input vector and $y_n$ denotes the corresponding output vector. Based on a loss function $L(y, \hat{y})$, the CNN model seeks the optimal model parameters $\Theta$ for the given data pairs. After training, the trained model can be denoted as $f(x; \Theta)$, and $\hat{y} = f(x; \Theta)$ denotes the prediction for an arbitrary input $x$. Through end-to-end learning, the designed learning models are trained to learn the inner mapping relationships between the input data and output data without any subsidiary conditions (e.g., parameters of the environments for data acquisition). This end-to-end scheme [42][43][44] has been widely applied in various areas, and in this study it is applied to advance the cryptoanalysis of CGH-based optical encryption systems. With sufficient pairs of ciphertexts and the corresponding plaintexts fed to a designed learning model, the trained model can be used to retrieve the unknown plaintext from a given ciphertext without the use of the optical encryption keys existing in the typical CGH-based optical encryption setup shown in Fig. 1, e.g., optical setup parameters and phase-only mask M2.
A framework of learning-based attacks using a CNN model for analyzing the vulnerability of CGH-based optical encryption is shown in Fig. 2. Many pairs of ciphertexts and the corresponding plaintexts are sent to the designed learning model. Each ciphertext is processed by n groups of cascaded convolution and pooling layers. The convolution layers are labeled $C_1, C_2, \ldots, C_n$, and the pooling layers are labeled $P_1, P_2, \ldots, P_n$. It is worth noting that the sequence of convolution and pooling layers is not confined to a convolution layer followed by a pooling layer; the arrangement shown in Fig. 2 is one example of the proposed CNN-based attacks. Assume that the ciphertext (denoted as $x$) is of size $m \times m$. The input ciphertext is convolved with $p$ kernels of size $L_1 \times L_1$ to form the first convolution layer $C_1$. The kernels are initialized as $\Theta_1 = \{w_1, b_1\}$, where $w_1$ denotes a set of weights and $b_1$ denotes a set of biases for the first convolution layer $C_1$. Hence, the feature map of the first convolution layer, $x_1$, can be described by
$$x_1 = \sigma\left( w_1 * x + b_1 \right),$$
where $*$ denotes convolution and $\sigma$ denotes the activation function of the neural network. The size of the first convolution layer is $(m - L_1 + 1) \times (m - L_1 + 1) \times p$. By down-sampling, the convolution layer $C_1(x_1)$ is transformed to the first pooling layer $P_1$, whose feature map is denoted as $x_{1,p}$. Similarly, the feature map of the nth convolution layer, $x_n$, is given by
$$x_n = \sigma\left( w_n * x_{n-1,p} + b_n \right),$$
where $x_{n-1,p}$ denotes the feature map of the (n−1)th pooling layer $P_{n-1}$, and $w_n$ and $b_n$ denote the parameters of the nth convolution layer $C_n$, i.e., $\Theta_n = \{w_n, b_n\}$. Followed by a reshaping layer, the nth pooling layer $P_n$ is converted to a one-dimensional vector $A$.

Fig. 2. A framework for the proposed learning-based attacks using a designed CNN model for analyzing the security of CGH-based optical encryption. $C_1$: the first convolution layer; $C_n$: the nth convolution layer; $P_1$: the first pooling layer; $P_n$: the nth pooling layer; R: the reshaping layer; FC: fully connected layer. Many pairs of ciphertexts and plaintexts are fed to the designed CNN model. Each ciphertext is processed by n groups of convolution and pooling layers, followed by a reshaping layer that converts the three-dimensional nth pooling layer into a one-dimensional vector. Then, the reshaped vector is fully connected to a one-dimensional vector whose size coincides with that of the plaintext transformed into one dimension. Through reshaping, the one-dimensional vector is converted to the size of the original plaintext. After multiple layers of processing, the input ciphertext is decomposed and fully connected to the original plaintext. After a number of ciphertext-plaintext pairs are fed to the designed learning model, the CNN model is adequately trained and ready to make real-time predictions of unknown plaintexts from the given ciphertexts.

Then, the reshaped layer is fully connected to a one-dimensional layer of the desired plaintext by $\Theta_{FC} = \{w_{FC}, b_{FC}\}$, where $w_{FC}$ denotes a set of weights and $b_{FC}$ denotes a set of biases for the fully connected layer FC. Finally, the fully connected layer is reshaped to the size of the original plaintext, and a prediction $\hat{y}$ of the plaintext is generated. To evaluate the difference between the predicted plaintext $\hat{y}$ and the original plaintext $y$, the mean squared error (MSE) is used as the loss function:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( Y_i - \hat{Y}_i \right)^2,$$
where $Y_i$ and $\hat{Y}_i$ respectively denote the ith pixel value of the original plaintext $y$ and of the predicted plaintext $\hat{y}$ (i.e., $i$ ranges from 1 to $N$), and $N$ denotes the total number of pixels. The weights and biases are updated until the preset MSE is reached: when the calculated MSE value is larger than the preset threshold, backpropagation is implemented to further update them. The error between the prediction and the original plaintext is denoted as $\Delta d$. The error $\Delta F$ at the fully connected layer is obtained from $\Delta d$ by using $[w_{FC}]^{T}$, the transpose of $w_{FC}$. Then, $\Delta F$ is reshaped to the size of the nth pooling layer $P_n$. The reshaped error $\Delta P_n$ at the nth pooling layer $P_n$ is backpropagated to the nth convolution layer $C_n$ by upsampling, and the error at the nth convolution layer is denoted as $\Delta C_n$. Through convolution, the error at the (n−1)th pooling layer $P_{n-1}$ is obtained and denoted as $\Delta P_{n-1}$. Following these error propagation rules, the error at the first convolution layer is $\Delta C_1$.

The gradients of $w_{FC}$ and $b_{FC}$, denoted as $w_{FC}^{g}$ and $b_{FC}^{g}$, are then calculated from $\Delta d$. The gradient of $w_n$ at the nth convolution layer is the convolution of $x_{n-1,p}$ and $\Delta C_n$, and $b_n^{g}$ is given by $b_n^{g} = \Delta C_n$. Following the same method, the gradient of $w_1$ at the first convolution layer is the convolution of $x$ and $\Delta C_1$, and similarly $b_1^{g} = \Delta C_1$. Here, stochastic gradient descent with momentum is applied to update the weights and biases [49]. For the weights, the velocity is updated by
$$v_w^{n} = m v_w^{n-1} + \alpha w_n^{g},$$
and the velocity of the biases, $v_b^{n}$, is updated analogously from $b_n^{g}$. Here, $v_w^{n}$ and $v_b^{n}$ respectively denote the velocity of the weights and biases, $m$ denotes the momentum, $\alpha$ denotes the learning rate, $w_n^{g}$ and $b_n^{g}$ respectively denote the gradients of the weight and bias at the nth convolution layer, and $w_{n-1}^{g}$ and $b_{n-1}^{g}$ respectively denote the gradients of the weight and bias at the (n−1)th convolution layer. With this updating rule, the weights and biases are continuously updated until the MSE value approaches the preset value. When the number of ciphertext-plaintext pairs used for training is insufficient, the existing ciphertext-plaintext pairs can be reused for several epochs. Finally, the designed learning model is adequately trained and can be applied to retrieve the unknown plaintext from a given ciphertext without the use of the optical encryption keys existing in the CGH-based optical encryption setup shown in Fig. 1.
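The momentum-based update can be sketched on a toy fully connected layer trained with the MSE loss. This is an illustrative sketch only: the velocity rule follows the expression given in the text, whereas the parameter step `param - velocity`, the toy data and the hyperparameter values (learning rate 0.1, momentum 0.9, 300 steps) are assumptions and are not the settings used in this study.

```python
import numpy as np


def momentum_step(param, grad, velocity, lr, momentum):
    """One update: v <- m * v + alpha * g (as in the text), then param <- param - v (assumed)."""
    velocity = momentum * velocity + lr * grad
    return param - velocity, velocity


def mse(y, y_hat):
    """Mean squared error over all elements."""
    return float(np.mean((y - y_hat) ** 2))


# Toy problem: learn y = W a + b for a small fully connected layer.
rng = np.random.default_rng(0)
a = rng.normal(size=(8, 64))            # 64 reshaped vectors A of length 8
w_true = rng.normal(size=(3, 8))
b_true = rng.normal(size=(3, 1))
y = w_true @ a + b_true                 # target outputs

w = np.zeros((3, 8))
b = np.zeros((3, 1))
v_w = np.zeros_like(w)
v_b = np.zeros_like(b)
losses = []
for _ in range(300):
    y_hat = w @ a + b
    losses.append(mse(y, y_hat))
    delta = 2.0 * (y_hat - y) / y.size  # derivative of the MSE w.r.t. y_hat
    w_grad = delta @ a.T                # gradient of the weights
    b_grad = delta.sum(axis=1, keepdims=True)  # gradient of the biases
    w, v_w = momentum_step(w, w_grad, v_w, lr=0.1, momentum=0.9)
    b, v_b = momentum_step(b, b_grad, v_b, lr=0.1, momentum=0.9)

print(losses[0], losses[-1])
```

In a CNN, the same `momentum_step` would be applied to the weights and biases of each convolution layer, using the gradients obtained from the backpropagated errors.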

Results and discussion
The typical CGH-based encryption setup shown in Fig. 1 is used to illustrate the feasibility and effectiveness of the proposed method. In practice, a laser beam with a wavelength of 632.8 nm can be expanded by a pinhole and then collimated by a collimating lens. Subsequently, the expanded and collimated laser beam illuminates the extracted phase-only mask M1 and is further modulated by the fixed random phase-only mask M2. The axial distances $d_1$ and $d_2$ are 17.0 cm and 12.0 cm, respectively. In practice, the random phase-only masks can be embedded into spatial light modulators with a dimension of 512 × 512 pixels. In Fig. 1, a CCD camera can be placed at the input image plane for optical decryption. It is worth noting that in CGH-based optical encryption, a digital approach usually needs to be used for the encoding, and either a digital or an optical approach can be flexibly applied for the decoding.
The input images (i.e., plaintexts) are handwritten-digit patterns from the MNIST database [47] and patterns of fashion products from the fashion MNIST database [48]. All input images randomly selected from the handwritten-digit patterns and fashion images are 8-bit images of 512 × 512 pixels, digitally resized from the original 28 × 28 pixels. From each database, 5000 images are randomly selected to verify the validity of the proposed method. Therefore, 10000 CGH patterns, i.e., phase-only masks M1, are correspondingly extracted by using the iterative phase retrieval algorithm. The maximum number of iterations is set as 1000 in the iterative phase retrieval algorithm, and the CC threshold is set as 0.98. In CGH-based optical encryption, it is feasible to recover the plaintext from the ciphertext during the decoding when all optical setup parameters and phase-only mask M2, i.e., security keys, are given and correctly applied. However, until now there has been no systematic study of the security of CGH-based encryption systems, and it remains an open question whether CGH-based optical encryption is sufficiently secure for practical applications. In this paper, using the proposed learning-based attacks, we demonstrate for the first time to our knowledge that CGH-based optical encryption is vulnerable, and that the optical encryption keys, e.g., setup parameters and the fixed phase-only mask M2, are not required for retrieving unknown plaintexts from the given ciphertexts in the proposed learning-based attacking method.
The structure of the designed learning-based attacks for the cryptoanalysis of CGH-based optical encryption is shown in Fig. 3. The designed learning model has two convolution layers, two pooling layers, one reshaping layer and one fully connected layer. To lower the computational cost, the extracted phase-only mask M1 (i.e., ciphertext) with a dimension of 512×512 pixels is resized to 200×200 pixels. The resized ciphertext is convolved with 30 kernels of size 1×1 and then activated by the sigmoid function to generate the first convolution layer of size 200×200×30. The first convolution layer is down-sampled to the first pooling layer of size 100×100×30. Then, the down-sampled feature maps are further convolved with 30 kernels of size 1×1 to form the second convolution layer of size 100×100×30. After the second down-sampling, the second convolution layer is converted to the second pooling layer of size 50×50×30. Subsequently, the feature maps obtained after two rounds of convolution and down-sampling are reshaped to a vector of size 1×75000. The reshaped vector is processed by the fully connected layer of size 1×784, followed by reshaping, so that the output layer has a size of 28×28. In the training phase, 4800 pairs of the extracted CGH patterns and their corresponding plaintexts from each database are fed to the designed learning model. The learning rate is set as $10^{-6}$, and the momentum is set as −0.00095. The number of training epochs is set as 5. The weights are initially set as random values between 0 and 1, and the biases are initially set as 0. It takes about 8.0 h to train a CNN model for each database. The proposed method is implemented on the Matlab platform with an Nvidia GeForce GTX 1080 Ti GPU and 64 GB of RAM. After training, the trained CNN model for each database can predict unknown plaintexts from the given ciphertexts in real time without the use of the optical encryption keys.
Fig. 3. A designed CNN architecture for attacking the CGH-based optical encryption shown in Fig. 1. Two convolution layers and two pooling layers are used in this study. The convolution layers are activated by the sigmoid function. The ciphertexts are resized from 512×512 pixels to 200×200 pixels to lower the computational load. With 4800 ciphertext-plaintext pairs sent to the designed learning model, the designed CNN model is well trained. Then, 200 ciphertexts without prior knowledge about their plaintexts are tested, and the trained CNN model is able to predict unknown plaintexts from the given ciphertexts.
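The layer sizes quoted above can be checked with a small NumPy sketch of the forward pass. Several details are assumptions rather than facts from this study: 2×2 average pooling is used for down-sampling, the second 1×1 convolution mixes all 30 input maps into each output map, the fully connected layer has a linear output, and the helper names are hypothetical. The demonstration runs on a reduced 40×40 input to keep the fully connected weight matrix small.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv1x1(x, w, b):
    """1x1 convolution plus sigmoid; x: (H, W, C_in), w: (C_in, C_out), b: (C_out,)."""
    return sigmoid(x @ w + b)


def pool2x2(x):
    """2x2 average pooling (assumed down-sampling rule); H and W must be even."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))


def flattened_size(m, n_kernels=30):
    """Length of the reshaped vector after two conv + pool stages on an m x m input."""
    return (m // 4) ** 2 * n_kernels


def init_params(m, n_kernels=30, n_out=784, seed=0):
    """Weights drawn uniformly from [0, 1] and zero biases, as stated in the text."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.uniform(0.0, 1.0, (1, n_kernels)),
        "b1": np.zeros(n_kernels),
        "w2": rng.uniform(0.0, 1.0, (n_kernels, n_kernels)),
        "b2": np.zeros(n_kernels),
        "w_fc": rng.uniform(0.0, 1.0, (n_out, flattened_size(m, n_kernels))),
        "b_fc": np.zeros(n_out),
    }


def forward(cipher, params, out_shape=(28, 28)):
    """Ciphertext (m x m) -> predicted plaintext of shape out_shape."""
    x = cipher[..., None]                                  # add a channel axis
    x = pool2x2(conv1x1(x, params["w1"], params["b1"]))    # C1 then P1
    x = pool2x2(conv1x1(x, params["w2"], params["b2"]))    # C2 then P2
    a = x.reshape(-1)                                      # reshaping layer R
    y = params["w_fc"] @ a + params["b_fc"]                # fully connected layer
    return y.reshape(out_shape)


# Size check for the configuration in the text: 200 -> 100 -> 50, 50*50*30 values.
print(flattened_size(200))

# Reduced-size demonstration (40 x 40 input instead of 200 x 200).
m_small = 40
params = init_params(m_small)
cipher = np.random.default_rng(1).uniform(0.0, 2.0 * np.pi, (m_small, m_small))
prediction = forward(cipher, params)
print(prediction.shape)
```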
Two databases are used to illustrate the performance of the proposed attacks on the CGH-based cryptosystem. Figure 4 shows the plaintexts recovered from the given CGH patterns by using the trained models. Figures 4(a), 4(g) and 4(m) in the first column, Figs. 4(c), 4(i) and 4(o) in the fourth column, and Figs. 4(e), 4(k) and 4(q) in the seventh column show the encrypted CGH patterns (i.e., ciphertexts). It can be seen from the ciphertexts in Fig. 4 that the original input images are completely concealed. Figures 4(b), 4(d) and 4(f) in the first row show the unknown plaintexts retrieved by the model trained on the fashion MNIST database. Figures 4(h), 4(j) and 4(l) in the second row show the unknown plaintexts retrieved by the model trained on the handwritten-digit MNIST database. Figures 4(bb), 4(dd), 4(ff), 4(hh), 4(jj) and 4(ll) show the original plaintexts. To evaluate the quality of the predicted plaintexts, the peak signal-to-noise ratio (PSNR) and CC are adopted here. The PSNRs of the retrieved plaintexts in the first and second rows are 34.96 dB, 28.34 dB, 25.63 dB, 30.02 dB, 28.90 dB and 29.81 dB, respectively. The CCs of the retrieved plaintexts in the first and second rows are 0.96, 0.92, 0.89, 0.87, 0.83 and 0.86, respectively. The PSNR and CC values indicate that the unknown plaintexts are retrieved from the ciphertexts with high quality. It is demonstrated that the designed attacks can retrieve the unknown plaintexts without any of the setup keys. Therefore, CGH-based encryption systems are not secure enough for practical applications under the designed attacks. To illustrate the robustness and universality of the attacks on CGH-based encoding, the trained model is further applied to extract plaintexts from different databases, which comprise distinct patterns (e.g., lowercase letters, double digits and uppercase letters). It is worth mentioning that these distinct patterns have never been used in the training stage.
Figures 4(m), 4(o) and 4(q) in the third row present the ciphertexts of these input images encrypted by the CGH-based encryption setup in Fig. 1. Figures 4(n) [...] It is demonstrated that the trained CNN model can correctly predict data from different databases. According to the learning concept, the nonlinear mapping relationship between the input and output data can be learned. Hence, the trained learning model for CGH-based encryption is also applicable to data that is not used in the training phase. Although the plaintexts are from different databases, the ciphertexts obtained by the CGH-based cryptosystem are similar in some aspects. Hence, the trained model can be applied to predict data from different databases, and the vulnerability of CGH-based encryption is effectively detected.
It has been illustrated in the literature [31][32][33][34][35] that higher security can be achieved when more random phase-only masks are used in the CGH-based encryption setup. In this study, the proposed attacks on CGH-based encryption with multiple random phase-only masks are also investigated. The CGH-based setup with multiple random phase-only masks (i.e., phase-only mask M1 to be extracted and fixed phase-only masks M2 and M3 as principal security keys) is schematically shown in Fig. 5. The axial distances $d_1$, $d_2$ and $d_3$ are 17.0 cm, 12.0 cm and 15.0 cm, respectively. From each database, 5000 images are randomly selected as the plaintexts, and the CGH patterns, i.e., M1, are sequentially retrieved as ciphertexts. The 4800 ciphertext-plaintext pairs generated from each database are used to train the model, and another 200 extracted CGH patterns are used for testing. The time taken to train the model for each database is about 8.0 h. Figures 6(a), 6(g) and 6(m) in the first column, Figs. 6(c), 6(i) and 6(o) in the fourth column, and Figs. 6(e), 6(k) and 6(q) in the seventh column show the ciphertexts obtained by using the CGH-based encoding setup shown in Fig. 5. The ciphertexts shown in the first row are processed by the model trained on the fashion MNIST database. The ciphertexts shown in the second row are processed by the model trained on the handwritten-digit MNIST database. The predicted plaintexts are shown in Fig. 6 immediately after the corresponding ciphertexts. The ciphertexts given in the third row are obtained by using input images selected from different databases. It is also worth mentioning that these images have never been used to train the model. By using the model trained on the handwritten-digit MNIST database, the ciphertexts in the third row can be successfully processed, and the correspondingly predicted plaintexts are shown in Figs. 6(n), 6(p) and 6(r). The original plaintexts are shown in Figs. 6(bb), 6(dd), 6(ff), 6(hh), 6(jj), 6(ll), 6(nn), 6(pp) and 6(rr), respectively. PSNRs of the retrieved plaintexts in Fig. 6 [...] It is demonstrated that the method presented here is feasible and effective for detecting the vulnerability of CGH-based cryptosystems with multiple cascaded random phase-only masks.
The results and analyses above demonstrate that the designed attacks are feasible and effective for vetting the security of CGH-based cryptosystems. The trained CNN model can recover original images from the given CGH patterns in real time. Moreover, the robustness of the proposed attacks is illustrated by the fact that the trained CNN model is applicable to attacking data from different databases. Even when multiple random phase-only masks are used, the trained model still performs well in retrieving the original images. The method presented here provides a powerful tool for analyzing the vulnerability of CGH-based cryptosystems, which has not been studied before. It is believed that the method presented here could promote the further development of more secure CGH-based encryption systems.
[...] The second row, (h), (j) and (l): the plaintexts retrieved by the MNIST database trained model, corresponding to (g), (i) and (k), respectively. The third row, (n), (p) and (r): the plaintexts (from different databases) retrieved by the MNIST database trained learning model, corresponding to (m), (o) and (q), respectively. The third column ((bb), (hh), (nn)), the sixth column ((dd), (jj), (pp)) and the ninth column ((ff), (ll), (rr)): original plaintexts corresponding to the ciphertexts in the first column ((a), (g), (m)), the fourth column ((c), (i), (o)) and the seventh column ((e), (k), (q)), respectively.
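The two quality metrics used in this section, PSNR and CC, can be computed as in the following sketch. The peak value used for the PSNR is not stated in the text, so it is left as a parameter here (an assumption), and `corr2` is a NumPy analogue of the Matlab function of the same name; the toy images are hypothetical.

```python
import numpy as np


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the maximum possible pixel value."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")                 # identical images
    return float(10.0 * np.log10(peak**2 / mse))


def corr2(a, b):
    """2-D correlation coefficient between two images of the same size."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a**2).sum() * (b**2).sum()))


# Toy usage: a random "plaintext" and a noisy "prediction" of it.
rng = np.random.default_rng(0)
plain = rng.uniform(0.0, 1.0, (28, 28))
predicted = np.clip(plain + rng.normal(0.0, 0.05, plain.shape), 0.0, 1.0)
print(psnr(plain, predicted), corr2(plain, predicted))
```

For 8-bit images stored with values from 0 to 255, `peak=255.0` would be the natural choice.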

Conclusions
We have presented learning-based attacks on CGH-based encryption and have demonstrated the vulnerability of CGH-based encryption. The feasibility and effectiveness of the designed model are verified, and input images from different databases are also tested and successfully predicted. Furthermore, the method presented here can also effectively analyze CGH-based encryption with multiple cascaded random phase-only masks. The method paves the way for the development of cryptoanalysis of CGH-based encryption and can eventually promote the development of CGH-based encryption.

Disclosures
The authors declare that there are no conflicts of interest.