Reconstructing Images Through Multimode Fibers From the Up-Conversion Speckle Patterns via Deep Learning

Mode mixing and mode dispersion in a multimode fiber (MMF) produce complex speckle patterns at the distal end of the fiber when an object image is transmitted through it, which makes image reconstruction challenging. In recent years, convolutional neural networks have been successfully applied to image reconstruction from speckles. However, most of these studies work in the visible spectral range and require complete speckle information for reconstruction. In this paper, we build an optical imaging system that employs up-conversion imaging to collect the speckle patterns generated when infrared light passes through a multimode fiber and then a frequency-doubling crystal. We propose a speckle restoration network (SRNet) based on a generative adversarial network (GAN) to reconstruct images from the speckles. The generator uses ResNeSt and atrous spatial pyramid pooling (ASPP) to extract multi-level features and multi-scale context information, respectively. The discriminator significantly improves the quality of the images produced by the generator. In addition, we pre-train the generator before introducing the discriminator to avoid vanishing or exploding gradients during training. With the designed network, high-quality images were successfully reconstructed even from only a portion of the speckle information.


I. INTRODUCTION
Optical fiber has been extensively used in endoscopic imaging [1], [2], [3]. A multimode fiber (MMF) can transmit a large number of modes simultaneously [1], which significantly enhances the information transmission efficiency compared with single-mode fiber [4]. However, dispersion, coupling, and other phenomena occur between the modes during transmission through the MMF, resulting in unrecognizable speckle patterns at the distal end of the fiber. Some researchers have developed methods for reconstructing images from speckle patterns, such as the transmission matrix [5], [6], [7], wavefront shaping [8], and digital phase conjugation [9]. These methods have been proven able to reconstruct images from speckles. However, they require very precise control and measurement and are sensitive to the optical system and imaging environment. Even minor changes in the scattering medium may seriously degrade the imaging quality. To address these issues, many researchers have proposed imaging methods based on deep learning, which overcome the limitations of traditional imaging methods and reconstruct images more stably. For example, the widely used convolutional neural network (CNN) has been employed for speckle reconstruction [10], [11], [12], [13]. To train a CNN end to end, a large number of pairs of speckle and label images are needed as training data [14], [15]. During training, the neural network learns the mapping between the input image and the label image without modeling the structure of the optical system or the light propagation process.
The associate editor coordinating the review of this manuscript and approving it for publication was Md. Moinul Hossain.
The works mentioned above [10], [11], [12], [13] show that deep learning can be effectively applied to speckle image reconstruction. However, most current research is based on imaging in the visible range (wavelength: 400-700 nm). When the imaging spectrum lies outside the visible range, as in THz imaging, imaging through an MMF is difficult to achieve [16], [17]. In addition, some previous works [18], [19] used complete speckle information for image reconstruction, but complete speckle fields may not be obtainable in practical applications. Therefore, it is of great significance to reconstruct the complete original image from an incomplete speckle image. In this paper, we collect the visible speckle images produced when an infrared laser beam passes through the MMF and subsequently through a frequency-doubling crystal. A deep learning method is then employed to reconstruct the original infrared images from the speckles recorded by a detector working in the visible range. Specifically, because the infrared spectrum is outside the visible range, up-conversion imaging technology (such as frequency doubling) is used in the system to transfer the imaging wavelength from the infrared to the visible, so that a charge-coupled device (CCD) can detect the output speckle pattern. We then built a network called SRNet (speckle restoration network), which is inspired by Deeplabv3+ [20] and GAN [21]. Like most image restoration architectures, our generator is based on an encoder-decoder architecture, but it does not use the traditional U-shaped structure. In the proposed model, ResNeSt-101 and ASPP [20] modules are embedded in the encoder to extract features, and the decoder then recovers the image from the extracted features. To improve the restoration of image details, we designed a discriminator to supervise the output of the network.
We evaluated the performance of SRNet on the collected data set. Specifically, two experiments prove the powerful capability of the proposed network. Firstly, we collected three groups of speckle patterns generated by different incident energy, and the structural similarity (SSIM) of reconstruction results of each group of data can reach above 0.9. Secondly, only local speckle information is used to train the model. The experimental results show that the proposed model can reconstruct high-quality images when only one-quarter of the original speckle image is used.

A. EXPERIMENTAL SETUP AND DATA ACQUISITION
The data acquisition device is shown in Fig. 1. The incident light is emitted by an infrared laser with a wavelength of 1028 nm. The polarizer adjusts the polarization direction of the laser beam to match the modulation polarization direction of the spatial light modulator (SLM). When the label image is uploaded to the SLM, the beam reflected by the SLM carries the object information and is reflected by the BS to objective lens O1. Objective lens O1 couples the laser beam into the MMF. Dispersion, mode coupling, and other phenomena during transmission through the MMF lead to the formation of infrared speckles. The infrared speckle output from the MMF is collected by objective lens O2 and up-converted to visible speckle by an LBO frequency-doubling crystal. The remaining infrared light is filtered out by a band-pass filter. The visible speckle pattern is recorded by the CCD.
FIGURE 1. Experimental apparatus for collecting data. At the distal end of the MMF, an infrared speckle pattern is generated, and the infrared speckle patterns are up-converted to visible patterns at a wavelength of 514 nm by second harmonic generation (SHG). P is a polarizer, BS is a beam splitter, and SLM is a spatial light modulator. O1 and O2 are two objective lenses (O1: 10×, NA = 0.25, WD = 7.316 mm; O2: 20×, NA = 0.4, WD = 1.875 mm). MMF is a multimode fiber of length 20 m with a core diameter of 62.5 µm and an NA of 0.275. LBO is a frequency-doubling crystal, FB is a bandpass filter, and CCD is a common charge-coupled device.
We use the device shown in Fig. 1 to build our dataset, which is based on the open-source MNIST dataset [22]. Specifically, 10000 pictures are extracted from MNIST and uploaded to the SLM; after the beam passes through the device of Fig. 1, the corresponding visible speckle pattern is recorded by the CCD. Data are collected in three groups by changing the incident light power, with incident energies of 1000 nJ, 2000 nJ, and 4000 nJ. Each group contains 10000 image pairs and is divided into a training set and a test set at a ratio of 8:2.
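As a rough illustration of the 8:2 split described above, the following sketch divides 10000 image pairs into training and test subsets. The function name `split_pairs`, the fixed seed, and the placeholder file names are hypothetical, and the text does not state whether the pairs were shuffled before splitting, so the shuffle is an assumption.

```python
import random


def split_pairs(pairs, train_ratio=0.8, seed=0):
    """Shuffle (label, speckle) pairs and split them into train and test lists."""
    shuffled = list(pairs)
    # Assumption: a seeded shuffle; the paper does not say how pairs were ordered.
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_ratio))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical placeholders for the 10000 (label, speckle) pairs of one energy group.
pairs = [(f"label_{i}", f"speckle_{i}") for i in range(10000)]
train_set, test_set = split_pairs(pairs)
print(len(train_set), len(test_set))
```

With 10000 pairs per energy group, this gives 8000 training pairs and 2000 test pairs, matching the test-set size quoted in the results.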

B. NETWORK STRUCTURE
The architecture of SRNet is shown in Fig. 2. Inspired by the GAN [21], the structure is divided into a generator and a discriminator. The generator is responsible for generating the reconstructed image from the speckle, and the discriminator is responsible for judging the difference between the generated image and the real image. In this way the discriminator pushes the generator toward higher-quality images. The generator and the discriminator compete with each other until they reach a balance, at which point the discriminator can no longer distinguish generated images from real ones. Fig. 2 (a) and Fig. 2 (b) show the encoder and decoder of the generator, respectively. In the encoder, the feature extraction backbone is ResNeSt-101, in contrast to the ResNet-101 [23] used in Deeplabv3+ [20]. We expect the split-attention in ResNeSt to perform better in extracting features from speckle images, which helps improve the quality of the reconstructed images. The backbone outputs two feature maps of different sizes, low_level_features and X. X is first sent into the atrous spatial pyramid pooling (ASPP) module, which contains five operations: an ordinary 1 × 1 convolution, three 3 × 3 convolutions with different dilation rates, and a global average pooling [24]. The five feature maps output by these operations are concatenated to form the output of ASPP. The output of ASPP is connected to the decoder through a skip connection, in which we use a 1 × 1 convolution and up-sampling to adjust the dimensions of the feature map.
55562 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
The other output of the feature extraction backbone, low_level_features, is input directly into the decoder. In the decoder, we first apply a 1 × 1 convolution to low_level_features to reduce its number of channels, so that its dimensions match those of X6. After the two are concatenated, several 3 × 3 convolutions refine the features, and a further bilinear up-sampling produces the final output.
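To make the role of the dilated (atrous) 3 × 3 convolutions in the ASPP module more concrete, the sketch below computes the span covered by a dilated kernel and applies a small dilated convolution to a toy array. It is a plain-Python illustration, not the network code. The dilation rates 6, 12, and 18 are the usual Deeplabv3+ defaults and are assumed here, since the text does not list the rates used in SRNet.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span in pixels covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv2d(image, kernel, dilation):
    """'Valid' 2-D dilated cross-correlation on nested lists (no padding)."""
    k = len(kernel)
    span = effective_kernel_size(k, dilation)
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - span + 1):
        out_row = []
        for c in range(cols - span + 1):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spaced `dilation` pixels apart.
                    acc += kernel[i][j] * image[r + i * dilation][c + j * dilation]
            out_row.append(acc)
        out.append(out_row)
    return out


# Assumed rates (Deeplabv3+ defaults); the paper does not specify them.
for rate in (1, 6, 12, 18):
    print(rate, effective_kernel_size(3, rate))

# Toy 8 x 8 image and a 3 x 3 box kernel with dilation 2.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
box = [[1.0] * 3 for _ in range(3)]
out = dilated_conv2d(image, box, dilation=2)
print(len(out), len(out[0]))
```

A 3 × 3 kernel with dilation rate r covers 3 + 2(r − 1) pixels per axis with the same nine weights, which is how ASPP gathers context at several scales without adding parameters.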
The predicted image from the generator and the ground truth are input to the discriminator and converted to an 18 × 18 tensor by a series of convolution operations, as shown in Fig. 2 (c). Each dis_block module is composed of Conv2D, BatchNormalization, and LeakyRelu. The yellow dis_block uses a 3 × 3 convolution kernel with a stride of 1, and the blue dis_block uses a 3 × 3 convolution kernel with a stride of 2. The discriminator calculates its loss against an all-ones tensor for the ground-truth image and an all-zeros tensor for the generator output. In this way it attempts to classify the input image as true or false [25].

C. LOSS FUNCTION
The objective function in this paper is a combination of a global loss and an adversarial loss. Such a multi-term loss helps the convolutional network avoid local optima [26]. The expression is
$$\mathrm{Loss} = \mathrm{global\_Loss} + \alpha V, \tag{1}$$
where V is the adversarial loss, global_Loss is the global loss, and α is the weight of the adversarial loss. In (1), global_Loss is responsible for repairing the main content of the image, while the adversarial loss enriches the image details and makes the reconstructed image more realistic. The adversarial loss of the proposed model is given by
$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x}\left[\log D(x)\right] + \mathbb{E}_{z}\left[\log\left(1 - D(G(z))\right)\right], \tag{2}$$
where x is the label image and z is the speckle image. G(z) and D(x) represent the outputs of the generator and discriminator. Equation (2) expresses that V is maximized with respect to the discriminator and minimized with respect to the generator. D(x) represents the probability that an image is a real image. When the discriminator judges an image likely to be real, D(x) approaches 1; otherwise, D(x) approaches 0. In the training process, G is first fixed and D is trained. Since x is a real label image and G(z) is generated from the speckle image z, D(x) is expected to be large and D(G(z)) small, that is, max_D. Then D is fixed and G is trained. G is expected to generate a picture realistic enough to fool the discriminator, so D(G(z)) is expected to be large, that is, min_G. Through this alternating training, the generator and discriminator are gradually optimized. Finally, a balance is reached in which D(G(z)) is close to 0.5. For the global loss in (1), i.e., the generator loss, the binary cross-entropy loss is adopted:
$$\mathrm{global\_Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right], \tag{3}$$
where ŷ_i is the prediction, with a value between 0 and 1, y_i is the ground truth, equal to 0 or 1, and N is the total number of pixels.
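The loss terms described above can be sketched in plain Python as follows. This is a minimal illustration on flat lists of values, not the authors' training code. The function names are hypothetical, and two points are assumptions: the adversarial term seen by the generator is taken to be the mean of log(1 − D(G(z))), and the discriminator loss is written as binary cross-entropy against all-ones and all-zeros targets, following the description of the discriminator in subsection B.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy; pred values in (0, 1), target values 0 or 1."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(pred)


def discriminator_loss(d_real, d_fake):
    """BCE of D(x) against ones plus BCE of D(G(z)) against zeros."""
    ones = [1.0] * len(d_real)
    zeros = [0.0] * len(d_fake)
    return binary_cross_entropy(d_real, ones) + binary_cross_entropy(d_fake, zeros)


def adversarial_term(d_fake, eps=1e-7):
    """Assumed generator-side term: mean of log(1 - D(G(z)))."""
    return sum(math.log(max(1.0 - d, eps)) for d in d_fake) / len(d_fake)


def generator_loss(pred, target, d_fake, alpha=0.00004):
    """Weighted objective: global (BCE) loss plus alpha times the adversarial term."""
    return binary_cross_entropy(pred, target) + alpha * adversarial_term(d_fake)


# Hypothetical toy values: four pixels and two discriminator outputs.
pred = [0.9, 0.2, 0.7, 0.1]
target = [1.0, 0.0, 1.0, 0.0]
d_fake = [0.5, 0.5]
print(generator_loss(pred, target, d_fake))
print(discriminator_loss([0.5, 0.5], d_fake))
```

With α as small as 0.00004, the global loss dominates the generator objective and the adversarial term acts as a small correction, consistent with the stated aim of keeping the discriminator feedback from destabilizing training.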

D. NETWORK TRAINING
Experiments were performed using the data collected as described in subsection A. Each group of data is composed of 10000 image pairs, with 80% used as the training set and 20% as the test set.

1) DETERMINE THE WEIGHT OF THE LOSS FUNCTION
The total loss function in (1) is a weighted sum of the global loss and the adversarial loss. The weight of the global loss is set to 1. The value of α must be chosen carefully so that the loss fed back from the discriminator to the generator does not become too large and cause the model to diverge. After trying several values between 0 and 0.01, we selected α = 0.00004.

2) TRAINING METHOD
In the training process, the generator is used to reconstruct the target image, and the discriminator is used to judge whether the generated image is true or false. In theory, the discriminator can be discarded once the two reach Nash equilibrium. However, because the generator performs poorly at the initial stage of training and the difference between real and generated images is very obvious, the discriminator can quickly become highly discriminative and feed back large losses to the generator, which makes it difficult for the generator to converge. Therefore, the generator is first trained alone for n epochs (n is the number of pre-training epochs) so that it acquires a basic reconstruction ability, and then the discriminator is introduced. Experiments show that with n = 1 the generator already performs well enough to confuse the discriminator. The generator output after one epoch of training is shown in Fig. 3, where the generator has already formed the corresponding contour of the image. After this epoch, the discriminator is introduced to constrain the generator.
Adam is used as the optimizer. After repeated tuning, the learning rate is set to 0.00003 and batch_size to 16, and all data are trained for 60 epochs. The detailed training parameters are shown in Table 1.
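The pre-training strategy and the hyperparameters quoted above can be summarized in a small configuration sketch. The names `TrainConfig` and `loss_weights` are hypothetical. The sketch also assumes that the single pre-training epoch counts toward the 60 epochs, which the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 0.00003  # Adam learning rate from the text
    batch_size: int = 16
    epochs: int = 60
    pretrain_epochs: int = 1        # n: epochs with the generator trained alone
    alpha: float = 0.00004          # weight of the adversarial loss


def loss_weights(epoch, cfg):
    """Return (global_weight, adversarial_weight) for a 0-based epoch index."""
    if epoch < cfg.pretrain_epochs:
        return 1.0, 0.0  # pre-training: the discriminator is not yet used
    return 1.0, cfg.alpha


cfg = TrainConfig()
# Assumption: the pre-training epoch is included in the 60 epochs.
schedule = [loss_weights(e, cfg) for e in range(cfg.epochs)]
print(schedule[0], schedule[1])
```

Expressing the schedule as a per-epoch weight keeps the switch from pure global-loss training to the combined objective explicit and makes n easy to vary in experiments.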

III. RESULT
A. EVALUATING INDICATOR
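To evaluate the quality of the generated images, we use the structural similarity index (SSIM) [27] and the Pearson correlation coefficient (PCC) [28]. SSIM evaluates image quality in terms of brightness, contrast, and structure, which is more consistent with human visual perception. Given two signals x and y, it is calculated as
$$\mathrm{SSIM}(x, y) = \frac{2\mu_x \mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1} \cdot \frac{2\sigma_x \sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2} \cdot \frac{\sigma_{xy} + c_3}{\sigma_x \sigma_y + c_3}, \tag{4}$$
where µ_x and µ_y are the means of x and y, σ_x and σ_y are their standard deviations, σ_xy is their covariance, and c_1, c_2, c_3 are constants. The Pearson correlation coefficient measures the correlation between two images and is calculated as
$$\mathrm{PCC}(x, y) = \frac{\sum_{i} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i} (x_i - \mu_x)^2} \, \sqrt{\sum_{i} (y_i - \mu_y)^2}}. \tag{5}$$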
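The two metrics described above can be sketched in plain Python on flattened images. This is a simplified illustration: SSIM is computed once over the whole image, whereas practical SSIM implementations usually average it over local windows. The constants c_1 = (0.01 L)^2, c_2 = (0.03 L)^2, and c_3 = c_2 / 2, with L the data range, are the conventional choices and are assumed here, since the text does not give their values.

```python
import math


def _stats(x, y):
    """Means, variances, and covariance of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, vx, vy, cov


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM: luminance * contrast * structure."""
    c1 = (0.01 * data_range) ** 2  # assumed conventional constants
    c2 = (0.03 * data_range) ** 2
    c3 = c2 / 2.0
    mx, my, vx, vy, cov = _stats(x, y)
    sx, sy = math.sqrt(vx), math.sqrt(vy)
    luminance = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    contrast = (2 * sx * sy + c2) / (vx + vy + c2)
    structure = (cov + c3) / (sx * sy + c3)
    return luminance * contrast * structure


def pcc(x, y):
    """Pearson correlation coefficient; undefined for constant inputs."""
    _, _, vx, vy, cov = _stats(x, y)
    return cov / math.sqrt(vx * vy)


# Hypothetical flattened images with values in [0, 1].
x = [0.0, 0.2, 0.9, 1.0]
inverted = [1.0 - v for v in x]
print(global_ssim(x, x), pcc(x, x))
print(global_ssim(x, inverted), pcc(x, inverted))
```

An image compared with itself scores 1 on both metrics, while a contrast-inverted copy gives a PCC of −1 and a much lower SSIM.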

B. RESULTS ON THE ORIGINAL SPECKLE IMAGE
The speckle images we collected from the speckle field are 256 × 256 pixels, a size that contains almost all the useful information in the speckle field. Fig. 4 shows the loss curves during training for the different incident energies. Fig. 4 (a) is the generator loss during the first (pre-training) epoch; the loss converges quickly. At this stage, the images output by the generator are already somewhat confusing to the discriminator. Fig. 4 (b) shows the generator loss after the discriminator is added; the network still converges well.
In each group of data, 2000 pairs of images are used as the test set to verify the training effect of the model. Fig. 5 shows the results of each model on its corresponding test set, with the SSIM value in the upper left corner of each image. The reconstructed digits are visually very similar to the label images. The reconstruction quality appears to decline slightly with increasing energy. This is related to the physical mechanism of frequency doubling, which we employ to realize up-conversion imaging. The intensity of the frequency-doubled light is proportional to the square of the intensity of the incident fundamental light, that is, I(2ω) ∝ I(ω)². Consequently, the up-converted speckle pattern generally differs from the initial speckle pattern at the fundamental frequency, I(ω), as shown in Fig. 6, and increasing the intensity results in a larger deviation from the initial speckle pattern, leading to a decrease in reconstruction quality.
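The quadratic relation between the fundamental and frequency-doubled intensities can be illustrated with a few numbers. The sketch below models only the idealized proportionality stated above and ignores phase matching, depletion, and detector response. The function names and intensity values are hypothetical.

```python
def shg_intensity(fundamental, scale=1.0):
    """Idealized second-harmonic intensity per pixel: scale * I(w)**2."""
    return [scale * v * v for v in fundamental]


def normalize(pattern):
    """Scale a pattern so that its brightest value is 1."""
    peak = max(pattern)
    return [v / peak for v in pattern]


# Hypothetical normalized infrared speckle intensities (dim, medium, bright grain).
fundamental = [0.1, 0.5, 1.0]
second_harmonic = normalize(shg_intensity(fundamental))
print(second_harmonic)

# Doubling the fundamental intensity quadruples the second-harmonic intensity.
doubled = shg_intensity([2.0 * v for v in fundamental])
print([d / s for d, s in zip(doubled, shg_intensity(fundamental))])
```

In this idealized model a grain ten times dimmer than the brightest one becomes one hundred times dimmer after frequency doubling, so the visible speckle is not a simple copy of the infrared speckle.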
The average SSIM and PCC on the test sets are given in Fig. 7. The results show that our network can reconstruct high-fidelity images of handwritten digits. Notably, because our loss function contains an adversarial loss, the network restores image details well, which gives the recovered images a higher SSIM.

C. RECONSTRUCTION OF SPECKLE IMAGE USING ONLY LOCAL INFORMATION
In some practical applications, it is difficult to detect the full speckle pattern. It is therefore meaningful to verify whether partial information from the speckle field suffices to reconstruct images. From the speckle images obtained, we crop local regions of different sizes and explore how to reconstruct the complete image from only part of the speckle information. This task is challenging because a portion of the speckle pattern may not contain enough information about the original image.
Specifically, we crop regions of different sizes at the center of the speckle field, as shown in Fig. 8. The cropped sizes are 128 × 128 and 64 × 64, so only a small part of the speckle information is used to reconstruct a complete image.
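The central cropping described above can be sketched as follows for a 256 × 256 speckle image stored as nested lists. The helper name `center_crop` and the synthetic pixel values are hypothetical, and the sketch assumes the crop is exactly centered.

```python
def center_crop(image, size):
    """Return the central size x size region of a 2-D nested-list image."""
    rows, cols = len(image), len(image[0])
    top = (rows - size) // 2
    left = (cols - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


# Synthetic stand-in for a 256 x 256 speckle image.
speckle = [[(r * 256 + c) % 255 for c in range(256)] for r in range(256)]

crops = {size: center_crop(speckle, size) for size in (128, 64)}
for size, crop in crops.items():
    fraction = (size * size) / (256 * 256)
    print(size, len(crop), len(crop[0]), fraction)
```

The 128 × 128 crop keeps one quarter of the pixels of the original speckle image and the 64 × 64 crop keeps one sixteenth.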
To this end, we train the proposed network on speckle data of each size, using only the speckles with an incident energy of 1000 nJ as an example. Fig. 9 shows the generator loss during training on each group of data; every group converges well. Fig. 10 compares the reconstruction results for speckle images of different sizes. The reconstruction results differ with size. Specifically, the result obtained from the original speckle image is the best, and the quality of the reconstructed image decreases slightly as the speckle image size decreases. Nevertheless, the reconstructed images still have a good appearance and high SSIM. Table 2 compares the metrics of speckle image reconstruction at the different sizes.
Figure caption (continued): The second row shows the speckles without frequency doubling (i.e., at a wavelength of 1028 nm), and the third row shows the speckles after frequency doubling (i.e., at a wavelength of 514 nm).

D. ABLATION STUDY
To analyze the effectiveness of the feature extraction backbone network and discriminator, ablation experiments were performed on the data set with an incident energy of 1000 nJ. The experimental results are given in Table 3.
First, we compare the effects of ResNeSt-101 and ResNet-101 on the reconstruction quality without a discriminator (Row 2 and Row 3 in Table 3). When ResNeSt-101 is used instead of ResNet-101, both SSIM and PCC improve. We then add the discriminator (Row 4 and Row 5 in Table 3), and ResNeSt-101 still performs better. In addition, whichever backbone is chosen, adding the discriminator also improves SSIM and PCC, which supports the effectiveness of the designed discriminator.

IV. CONCLUSION
In this paper, we combine up-conversion imaging with deep learning to study the imaging of infrared objects through a multimode fiber. First, we convert infrared speckles into visible speckles by using a frequency-doubling crystal. Then, to recover useful information, we build a neural network to reconstruct images from the speckles. The proposed network, named SRNet, uses ResNeSt-101 as the backbone of the generator to extract multi-level features. A discriminator is introduced to improve the quality of the images reconstructed by the generator. Moreover, the generator is pre-trained to avoid the model divergence that may be caused by introducing the discriminator.
The results show that the trained SRNet can reconstruct high-quality images from partial speckle images and that the network performs well at different incident energies. Because complete speckle images are difficult to detect in some practical scenarios, the ability to reconstruct from partial speckles is of practical value. In addition, our work shows that combining up-conversion imaging with deep learning is a promising way to extend the imaging wavelength range.