A Variational Multi-Scale Error Compensation Network for Single-Pixel Imaging

Single-pixel imaging is an advanced computational imaging technique based on compressive sensing that offers higher signal-to-noise ratio and broader application scope compared to traditional imaging techniques. However, conventional reconstruction algorithms suffer from issues such as long processing time and low reconstruction accuracy during the sampling and reconstruction processes. Deep learning-based compressed reconstruction networks can circumvent the complex iterative computations of traditional algorithms and achieve fast, high-quality reconstruction. In this paper, we propose a Variational Multi-Scale Error Compensation Network (VMSE) based on variational autoencoders. VMSE designs an error compensation network to enhance the feature representation capability of the sampling reconstruction network. We employ multiple latent variables to generate error features at different scales in the intermediate layers of the error compensation network, which are used to compensate the reconstructed image. Additionally, we design a module that learns simultaneously in the spatial and frequency domains, performing upsampling while complementing the missing high-frequency information in the frequency domain. On the MNIST dataset, at a sampling rate of 0.025, VMSE achieved higher Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) scores, notably an SSIM score of 0.963, significantly surpassing Reconnet and DR2Net's scores of 0.930 and 0.920, respectively. This was further corroborated by practical experiments, where at low sampling rates, VMSE could reconstruct outlines more clearly than TVAL3.


Jian Lin, Qiurong Yan, Member, IEEE, Quan Zou, Shida Sun, Zhen Wei, and Hua Du

I. INTRODUCTION
Single-pixel imaging is an emerging technique that exploits sparse sampling and computational imaging for image reconstruction. Compared to conventional array-based imaging techniques, single-pixel imaging enables two-dimensional imaging using a spatially non-resolving single-pixel detector, thereby holding great promise for applications in the infrared and terahertz spectral regions. Moreover, the single-pixel detector has the capability to simultaneously measure the optical intensity of multiple pixel points, leading to a significant improvement in the signal-to-noise ratio. As a result, single-pixel imaging holds great potential for various applications in fields such as optical encryption [1], [2], [3], optical computing [4], [5], radar [6], and multispectral imaging [7], [8].
We find that existing deep neural networks typically employ a fully connected layer or convolutional layer for the initial reconstruction of the measurements [9], [10], [11], followed by learning at the original image size, without fully leveraging the underlying prior information of the image. When dealing with common data types such as images, audio, or video, it is often assumed that they are generated by more fundamental variables. Variational autoencoders (VAE) [12] provide a probabilistic approach to describing these variables. Therefore, this paper proposes a variational multi-scale error compensation network (VMSE) based on VAE. By performing error compensation and sub-pixel convolutional upsampling, VMSE can learn rich image semantic information during the initial reconstruction stage. The main contributions of this paper are as follows:

- A variational multi-scale error compensation network (VMSE) is proposed, which designs an error compensation network to enhance the feature representation of the sampling reconstruction network. Multiple latent variables are utilized to generate features at different scales in the intermediate layers of the error compensation network, enriching the texture information of the reconstructed image.
- A module named SFBlock is designed, which enables simultaneous learning in both the spatial and frequency domains and utilizes sub-pixel convolution for efficient upsampling. While extracting features in the spatial domain, SFBlock also complements the missing high-frequency information in the frequency domain.
- We conducted extensive testing of VMSE on multiple datasets and single-pixel imaging systems, achieving remarkable experimental results. The experiments demonstrate that even at low sampling rates, VMSE is capable of reconstructing the major contours of real objects.

II. RELATED WORK

A. Traditional Compressive Reconstruction Algorithm
Traditional signal processing requires signal sampling to satisfy the Nyquist sampling theorem to ensure recovery of the original signal. As a result, traditional reconstruction methods spend a long time acquiring and transmitting image information [13], [14]. Single-pixel imaging requires projecting a certain number of speckle matrix sequences to modulate the image, resulting in far more encoding modulation steps than traditional imaging methods and severely limiting its real-time application potential. Compressive sensing-based single-pixel imaging leverages prior information such as image sparsity to achieve compressed sampling and reconstruction well below the Nyquist sampling frequency. Examples include Orthogonal Matching Pursuit (OMP) [15], Gradient Projection for Sparse Reconstruction (GPSR) [16], Bayesian Compressive Sensing (BCS) [17], and the Total Variation Augmented Lagrangian Alternating Direction Algorithm (TVAL3) [18]. Although compressive sensing-based reconstruction algorithms significantly reduce sampling time, they still require lengthy iterative operations to obtain high-quality images. Most of these algorithms rely on the assumption that the image is sparse in some transform domain, providing theoretical guarantees but requiring long computation times. Additionally, the modulation speed of spatial light modulators significantly affects imaging time. The Digital Micromirror Device (DMD) is the most widely used modulator, meeting the requirements of common-band and moving-object imaging [19], with the best overall balance of programmability, modulation speed, grayscale modulation, and price. Other modulators include the Liquid Crystal Spatial Light Modulator (LC-SLM) and Optical Phased Arrays (OPA) [20]. Among these, the LC-SLM is cost-effective and can load grayscale masks, making it a good alternative to the DMD.

B. Deep Learning Based Compressive Reconstruction Network
To improve both the speed and quality of reconstruction, many researchers have shifted their focus to deep learning-based approaches for single-pixel imaging [21], [22], [23]. In 2017, Lyu et al. proposed a novel computational ghost imaging (GI) framework utilizing deep learning [24]. In 2018, He et al. improved the conventional convolutional neural network and introduced a new deep learning-based ghost imaging method [25], which enables faster reconstruction of target images at low measurement rates. In the same year, Higham et al. achieved real-time recovery of high-resolution videos using a deep convolutional autoencoder [26]. In 2019, Wang et al. designed an end-to-end neural network that directly recovers the target image from measured bucket signals [27]. In 2020, Li et al. achieved high-quality reconstruction under high-scattering conditions through deep learning [28]. Also in 2020, Dou et al. proposed a novel ghost imaging method capable of reconstructing the shape of highly transparent objects [29]. Zhu et al. introduced a Y-net-based ghost imaging scheme in the same year, which performed well under both deterministic and uncertain illumination conditions [30].
In 2021, Kim et al. proposed a Bayesian denoising method to enhance the quality of ghost imaging, which exhibits strong robustness to additive Gaussian noise [31]. Yang et al. presented an underwater ghost imaging method based on generative adversarial networks in 2021, effectively improving the target reconstruction performance of underwater ghost imaging [32]. In 2021, a Hadamard-based single-pixel imaging system was proposed, which holds great promise for underwater imaging due to its high resolution and anti-scattering properties [33]. In 2022, He et al. constructed a ghost imaging system called GI-RNN based on recurrent neural networks, capable of recovering high-quality images at low sampling rates [34]. In 2022, Wang et al. combined the physical model of GI image formation with a deep neural network to reconstruct far-field images at a resolution exceeding the diffraction limit [35]. Stojek et al. proposed a differential binary high-resolution high-compression sampling scheme, which laid the foundation for high-resolution single-pixel imaging [36].
Based on deep learning, the quality of single-pixel imaging has greatly improved. However, most existing methods simply employ a fully connected layer or convolutional layer to reconstruct the measurements to the original size and then proceed with further learning, neglecting the importance of this stage. We believe that this stage plays a crucial role in imaging quality; thus, we choose to utilize a VAE for learning at this stage.

C. Variational Autoencoder
The Variational Autoencoder (VAE) is based on the idea of inserting a latent-variable layer between the encoder and decoder, allowing the model to generate new data by learning a low-dimensional representation of the data. A VAE consists of an encoder q_ϕ(z|x) and a decoder p_θ(x|z). The encoder's main task is to map the input data to the latent space and generate a probability distribution over the latent variables. The decoder, in turn, reconstructs the latent variables into outputs that closely resemble the original data.
The optimization objective of the VAE is to maximize the log-likelihood of the observed data, log p(x) = log ∫ p(x, z) dz. However, this integral has no analytical solution, so variational inference is employed for approximation. Variational inference introduces a distribution q_ϕ(z|x) as an approximation to the true posterior p_θ(z|x), constructing a variational lower bound on the log-likelihood log p(x). By maximizing this variational lower bound, we approximately maximize log p(x).
The loss function of the VAE can be expressed as:

L_VAE = E_{z∼q_ϕ(z|x)}[−log p_θ(x|z)] + D_KL(q_ϕ(z|x) ‖ p(z)),

where E_{z∼q_ϕ(z|x)}[−log p_θ(x|z)] represents the reconstruction loss, aiming to make the output of the decoder as close as possible to the true data. When the prior on the latent variables is the standard normal distribution, z ∼ N(0, I), and q_ϕ(z|x) = N(μ, σ²I), the KL divergence has a closed-form solution:

D_KL = (1/2) Σ_j (μ_j² + σ_j² − log σ_j² − 1).

VAE has many advantages, such as learning high-quality latent representations, strong interpretability, good generalization ability, and flexibility. Therefore, it has been widely used in various fields including image generation, data compression, and feature extraction [37], [38], [39], [40], [41], [42], [43].
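The two loss terms above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the squared error stands in for −log p(x|z), and the closed-form KL assumes a standard normal prior.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    """Illustrative VAE loss: reconstruction term plus closed-form KL
    between N(mu, sigma^2 I) and the standard normal prior N(0, I)."""
    # Reconstruction loss (squared error stands in for -log p(x|z))
    recon = np.sum((x - x_hat) ** 2)
    # Closed-form KL: 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)
    return recon + kl

# When x_hat = x, mu = 0, and sigma = 1, both terms vanish
assert vae_loss(np.ones(4), np.ones(4), np.zeros(2), np.zeros(2)) == 0.0
```

For mu = 1 and log_var = 0, the KL term alone contributes 0.5 per latent dimension, which matches the closed-form expression term by term.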

III. METHOD

A. Single-Pixel Imaging System
We present our deep learning-based single-pixel imaging system in Fig. 1. The parallel light source is composed of an LED (Cree Q5), an attenuation sheet (LOPF-25C-405), and an objective lens (OLBQ25.4-050). After the weak parallel light emitted by the parallel light source irradiates the imaging object, it passes through an imaging lens (OLBQ25.4-050) and forms an image on the DMD (0.7XGA 12° DDR). The DMD consists of 1024 × 768 rotatable tiny mirrors, each of size 13.68 μm × 13.68 μm. The binary matrix is loaded onto the control module of the DMD frame by frame through an FPGA (Altera DE2-115); each mirror corresponding to a 1 element is deflected +12°, and each mirror corresponding to a 0 element is deflected −12°. The light modulated by the DMD is directed to a photon-counting PMT (Hamamatsu H10682) by a focusing lens (OLBQ25.4-050) for counting. An appropriate algorithm then reconstructs the image from the photon-count values. The imaging principle of the single-pixel imaging system can be expressed as:

y = φx + e,

where e represents the noise interference in the system, x ∈ R^{N×1} denotes the imaging target, φ ∈ R^{M×N} represents the measurement matrix, and y ∈ R^{M×1} represents the measurements. M/N is called the measurement rate (M is much smaller than N). The image x can be reconstructed from the measurements y and the measurement matrix φ.
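The measurement model y = φx + e can be simulated in a few lines. This is a minimal sketch with made-up sizes (N = 1024 pixels, M = 102 measurements, i.e. a rate of about 0.1) and a random binary matrix standing in for the DMD patterns.

```python
import numpy as np

# Sketch of the single-pixel measurement model y = phi @ x + e.
rng = np.random.default_rng(0)
N, M = 1024, 102                                 # pixels, measurements (rate ~0.1)
x = rng.random((N, 1))                           # flattened imaging target
phi = (rng.random((M, N)) > 0.5).astype(float)   # binary measurement matrix (DMD patterns)
e = 0.01 * rng.standard_normal((M, 1))           # detector noise
y = phi @ x + e                                  # photon-count measurements
assert y.shape == (M, 1)
```

Reconstruction then amounts to recovering x from y and phi, which is what the network (or TVAL3, OMP, etc.) performs.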

B. Network Architecture
VMSE consists of two sub-networks: the sampling reconstruction network and the error compensation network, as shown in Fig. 2. Unlike existing compressed reconstruction networks, we do not directly perform a preliminary reconstruction of the measurements y to restore them to the original image size. Instead, we first restore them to a smaller scale and then use sub-pixel convolution for upsampling. These operations are performed by the SFBlock upsampling module. The error compensation network is composed of the decoder part of a pre-trained VAE, which is fine-tuned during the training process. We will provide detailed information about the structure of the VAE used later. The error compensation network takes multiple trainable latent variables as input and outputs the features of error images (the difference between the original image and the reconstructed image) at different scales. These error features are used to compensate the features of the sampling reconstruction network at different scales. Specifically, the intermediate features F_mid from the error compensation network are passed through two convolutional blocks (each consisting of two 3 × 3 convolutional layers with ReLU activations) to obtain a set of affine transformation parameters (α, β). These parameters are then used to correct the features of the sampling reconstruction network at different stages. Since (α, β) and F_mid have the same size, the size of F_c is also the same as F_mid.
F_c = α ⊙ F + β,

where F denotes a feature map of the sampling reconstruction network and ⊙ represents element-wise multiplication between the two matrices.
For 32 × 32 images, the sizes of F_c1, F_c2, and F_c3 can be set to 4 × 4, 8 × 8, and 16 × 16, respectively. At this point, after the final SFBlock, a preliminary reconstructed image of size 32 × 32 is obtained. Any deep reconstruction network can be inserted into VMSE to further enhance the details of VMSE's output. Finally, the output is obtained through a 1 × 1 convolution and a sigmoid function. The network parameters used will be detailed later in this paper.
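The compensation step amounts to an element-wise affine transform of the sampling network's features. A minimal sketch, assuming the (α, β) maps have already been predicted by the two convolutional blocks:

```python
import numpy as np

def compensate(F, alpha, beta):
    """Feature compensation F_c = alpha * F + beta (element-wise),
    where (alpha, beta) come from the error-compensation branch.
    Illustrative sketch; all inputs share the same spatial size."""
    assert F.shape == alpha.shape == beta.shape
    return alpha * F + beta

F = np.ones((8, 8))           # a sampling-network feature map (F_c2 scale)
alpha = np.full((8, 8), 2.0)  # scale predicted from the error features
beta = np.full((8, 8), 0.5)   # shift predicted from the error features
F_c = compensate(F, alpha, beta)
assert F_c.shape == F.shape and F_c[0, 0] == 2.5
```

The same operation is applied at each of the three scales (4 × 4, 8 × 8, 16 × 16) with its own (α, β) pair.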
Sampling reconstruction network: In the process of compressed sensing reconstruction, upsampling is essential. Traditional upsampling methods include interpolation algorithms and deconvolution, but they often introduce irrelevant information and affect the accuracy of image reconstruction. In contrast, sub-pixel convolution, which uses multi-scale convolution for dimension expansion, is a more efficient upsampling method. It can effectively improve the resolution and quality of the image while preserving image details.

We propose a module called SFBlock, which performs feature extraction in both the spatial and frequency domains simultaneously. Its structure is shown in Fig. 4. First, we use a 3 × 3 convolutional layer to extract initial features from the input. Then, we pass the initial features through three different branches, each applying a different processing method. In the first branch, we apply a 1 × 1 convolutional layer to expand the dimensions of the initial features, restore the feature size using sub-pixel interpolation, and finally use a 3 × 3 convolutional layer to extract features while ensuring that the output feature maps have the same dimensions as those of the other branches. In the second branch, we use a 5 × 5 convolutional layer followed by a PReLU activation to extract features, then apply a 1 × 1 convolutional layer and sub-pixel interpolation to obtain the upsampled result, and finally use a 3 × 3 convolutional layer and a 1 × 1 convolutional layer for feature extraction and dimension matching, respectively, each followed by a PReLU activation. In the third branch, we first perform a Fourier transform on the initial features and remove the low-frequency components. Next, we use a 1 × 1 convolutional layer to expand the dimensions of the magnitude and phase components, and apply sub-pixel interpolation to obtain the upsampled result. Finally, we use a 3 × 3 convolutional layer to further extract features and perform an inverse Fourier transform to return to the spatial domain. In the end, the outputs of the three branches are summed with a weight ratio of 1:0.1:1, producing the final output of the SFBlock module.
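Two building blocks of SFBlock can be sketched directly: sub-pixel upsampling (the pixel-shuffle rearrangement behind sub-pixel convolution) and the frequency-domain step of the third branch. This is an illustrative NumPy sketch; the `cutoff` radius and shapes are made up, not taken from the paper.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel upsampling: rearrange (C*r^2, H, W) -> (C, H*r, W*r)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # -> (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

def high_pass(x, cutoff=2):
    """Frequency-domain step (illustrative): FFT, zero out the low
    frequencies around the centre, inverse FFT back to the spatial domain."""
    f = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    f[h // 2 - cutoff : h // 2 + cutoff, w // 2 - cutoff : w // 2 + cutoff] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

feat = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)  # (C*r^2, H, W), r = 2
up = pixel_shuffle(feat, 2)
assert up.shape == (1, 8, 8)
assert high_pass(up[0]).shape == (8, 8)
```

In the full SFBlock, the convolutions that surround these two operations learn the dimension expansion and feature extraction; only the rearrangement and the FFT masking are shown here.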
For the activation function after the convolutional layers, we recommend using PReLU [44] instead of ReLU. Compared to ReLU, the PReLU activation function introduces a learnable parameter: f(x) = max(x, 0) + α·min(0, x), where α is a learnable parameter. This enables PReLU to better handle negative inputs and offers better stability.
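The PReLU definition above is a one-liner; here α is fixed at 0.25 purely for illustration (in the network it is learned):

```python
import numpy as np

def prelu(x, alpha=0.25):
    """PReLU: f(x) = max(x, 0) + alpha * min(0, x)."""
    return np.maximum(x, 0.0) + alpha * np.minimum(x, 0.0)

x = np.array([-2.0, 0.0, 3.0])
assert np.allclose(prelu(x), [-0.5, 0.0, 3.0])  # negatives are scaled, not clipped
```

Unlike ReLU, the negative inputs keep a small gradient proportional to α, which is why PReLU handles negative activations more gracefully.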
Deep reconstruction network: The deep reconstruction network consists of 8 ResBlocks [45] and employs skip connections. The specific structure is shown in Fig. 5. A ResBlock consists of three convolutional layers and PReLU functions. In the first branch, the input features undergo feature extraction through two 3 × 3 convolutional layers with PReLU activations. The second branch performs dimension matching using a 1 × 1 convolutional layer. The outputs of both branches are added together and fused through a PReLU activation to obtain the final output features. The output dimensions of the 8 ResBlocks are set to 128, 64, 16, 8, 8, 16, 64, and 128, respectively. Furthermore, when utilizing the deep reconstruction network, it is advisable to incorporate the preliminary reconstructed image into the final output for more stable training.
Error compensation network: The error compensation network consists of the decoder part of a VAE. In a variational autoencoder, latent variables are learned by the encoder network and represent high-level features extracted during the encoding of the input data. The structure of the VAE used for training the error compensation network is shown in Fig. 6. The encoder part consists of three 3 × 3 convolutions with a stride of 2, with output channels of 32, 32, and 64, respectively. After the input image passes through the encoder, three feature maps at different scales, F_1, F_2, and F_3, are obtained. F_1 is fed into two fully connected layers with a dimension of 512 each to learn the mean μ_1 and variance σ_1. F_2 is fed into two fully connected layers with a dimension of 128 each to learn the mean μ_2 and variance σ_2. F_3 first goes through a dimension reduction using a fully connected layer with a dimension of 128, and is then processed by two fully connected layers with a dimension of 64 each to learn the mean μ_3 and variance σ_3. The decoder consists of three 3 × 3 transposed convolutional layers with a stride of 2, and six fully connected layers.
z_0 is a trainable parameter and is concatenated with z_3. After concatenation, dimensionality expansion is performed by a fully connected layer with dimension 128 followed by one with a dimension equal to the original image size, finally giving F′_3, which has the same size as F_3. Next, F′_3 is passed through a deconvolution to obtain F̂_2. Similarly, z_2 is passed through a fully connected layer with a dimension of 256 and another fully connected layer with a dimension equal to the size of the original image; this upsamples z_2 to F′_2, which has the same size as F_2. The concatenation of F′_2 and F̂_2 is used as the input to the next deconvolutional layer, which produces F̂_1 as its output. z_1 is passed through a fully connected layer with a dimension of 512 and another fully connected layer with a dimension equal to the size of the original image; this upsamples z_1 to F′_1, which has the same size as F_1. After concatenating F′_1 and F̂_1, they are passed through a deconvolutional layer to generate the output x̂ of the VAE.

C. Loss Function Design
The error compensation network used by VMSE is first pre-trained as part of a VAE and then fine-tuned. As explained in Section II-C, the VAE's loss function consists of two components: the reconstruction loss and the KL divergence. We take the difference between the input and output of the VAE as the reconstruction loss. z_1, z_2, and z_3 each contribute a KL divergence term, combined into the final KL divergence term with weights of 0.25, 0.25, and 0.5, respectively.
where E denotes the encoder of the VAE and D denotes the decoder. VMSE uses three loss functions, corresponding to the sampling reconstruction network, the error compensation network, and latent variable optimization. The sampling reconstruction network takes the measurements and the intermediate-layer features from the error compensation network as inputs and performs the reconstruction process to obtain the reconstructed image x̂. To measure the difference between the reconstructed image x̂ and the original image x, we use the l_2 norm as the loss function. The error compensation network takes the latent variable z as input and uses the l_2 norm to measure the difference between its output image x_z and the error image x_r (x_r = x − x̂).
where φ_θ represents the sampling reconstruction network, A is the measurement matrix, F_mid1, F_mid2, and F_mid3 represent the intermediate-layer features of the error compensation network, and D_Ψ represents the error compensation network. The sampling network, error compensation network, and latent variables are updated alternately.
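The two l_2 losses can be sketched as follows; a minimal sketch, where x̂ is the sampling network's output and x_z is the error-compensation network's output (both are placeholder arrays here, not network outputs):

```python
import numpy as np

def vmse_losses(x, x_hat, x_z):
    """Sketch of the two l2 objectives: the sampling-reconstruction loss
    ||x_hat - x||^2 and the error-compensation loss ||x_z - (x - x_hat)||^2,
    where x - x_hat is the error image x_r."""
    loss_recon = np.sum((x_hat - x) ** 2)  # sampling reconstruction network
    x_r = x - x_hat                        # error image
    loss_error = np.sum((x_z - x_r) ** 2)  # error compensation network
    return loss_recon, loss_error

x = np.ones(16)
x_hat = 0.75 * np.ones(16)
x_z = 0.25 * np.ones(16)  # here x_z predicts the error image exactly
lr, le = vmse_losses(x, x_hat, x_z)
assert np.isclose(lr, 1.0) and np.isclose(le, 0.0)
```

During training these two objectives (plus the latent-variable update) are minimized alternately, never simultaneously.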

D. Training Method
It is worth noting that, during both training and testing, when any one of φ_θ, D_Ψ, and z is updated, the other two remain unchanged. This is outlined in Algorithm 1.
When our network is applied to a single-pixel imaging system, a binary measurement matrix is required. Inspired by binary neural networks [46], we use the sign function to binarize the first fully connected layer during training:

A_b = sign(A),

where A is a floating-point weight and A_b is a binary weight. However, the derivative of the sign function is almost zero everywhere, so we use the tanh function in place of the sign function when computing the gradient during weight updates.
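The binarize-forward / tanh-backward trick can be sketched as below. This is an illustrative straight-through-style sketch, not the authors' code; the ±1 convention here is for illustration (mapping to the {0, 1} values loaded on the DMD is a separate step).

```python
import numpy as np

def binarize_forward(A):
    """Forward pass: binarize floating-point weights with the sign function."""
    return np.where(A >= 0, 1.0, -1.0)

def binarize_backward(A, grad_out):
    """Backward pass: since d sign/dA is ~0 almost everywhere, reroute the
    gradient through tanh, i.e. use d tanh(A)/dA = 1 - tanh(A)^2 as a
    surrogate so the underlying float weights keep receiving updates."""
    return grad_out * (1.0 - np.tanh(A) ** 2)

A = np.array([-0.8, 0.1, 2.0])
A_b = binarize_forward(A)
assert np.array_equal(A_b, [-1.0, 1.0, 1.0])
# gradient still flows where the sign function would block it
assert np.all(binarize_backward(A, np.ones(3)) > 0)
```

The float weights A are what the optimizer updates; A_b is regenerated from them at every forward pass.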
IV. RESULTS AND DISCUSSION

The datasets we use are MNIST and Fashion-MNIST, interpolated to obtain 32 × 32 images. Each dataset comprises 70,000 images in total, with 60,000 used for training and 10,000 for testing. We set the learning rate to 0.0001, the batch size to 64, and the number of training epochs to 100. It should be noted that the sampling reconstruction network of VMSE is trained for 100 epochs; these parameters also apply to the pre-training of the error compensation network. In the testing phase, we selected 64 images for evaluation and took their average as the final result. In this section, we refer to the VMSE variant that uses the deep reconstruction network as VMSE+ for easy distinction.

A. Performance Evaluation Experiment of VMSE
In this subsection, VMSE does not use a deep reconstruction network. To verify the effectiveness of the VMSE structure, we compare it with TVAL3, Reconnet [47], DR2Net [48], and VMSE without the error compensation network (VMSE-). Table I and Fig. 7 present the imaging results of these networks at measurement rates of 0.025, 0.05, 0.1, 0.15, and 0.2. It is evident from the results that VMSE outperforms the other networks by a significant margin. SFBlock demonstrates excellent feature extraction capabilities and achieves higher SSIM at low sampling rates, indicating its ability to recover clearer texture details and contour boundaries. The error compensation network exhibits strong compensatory abilities, enabling the rapid addition of image details and resulting in more realistic and refined reconstructions, ensuring that VMSE can reconstruct higher-quality images.

B. Exploring the Influence of Error Compensation Network on VMSE
In a variational autoencoder (VAE), the dimension of the latent variable is an important factor that represents different attributes or features of the data. Each dimension corresponds to a specific attribute, and by adjusting different dimensions of the latent variable, we can control the VAE's generation of samples with different attributes. For a latent variable of dimensionality n, there are n means and variances, and variations in different dimensions lead the decoder to generate different features. To verify whether a higher-dimensional error compensation network brings more benefits, we selected two VAEs with different latent variable dimensions to train the error compensation network. The higher-dimensional VAE has latent variable dimensions of 512, 128, and 64, while the lower-dimensional VAE has latent variable dimensions of 128, 32, and 8. Pre-training can improve the generalization ability and learning effect of the model, so we also compared the influence of pre-training the error compensation network on VMSE performance. In Table II, we present the reconstruction results of VMSE using different error compensation networks at five measurement rates, and Fig. 8 shows the reconstructed images at a measurement rate of 0.025.
From the experimental results, it can be observed that when the dimensionality of the latent variable is low, the generated digits already exhibit some similarity to the original images, but compared to high-dimensional latent variables, there is a slight decrease in SSIM. By increasing the dimensionality of the latent variable, VMSE can capture richer data features and attributes, resulting in higher-quality reconstructed images. Pre-training the error compensation network likewise improves VMSE's generalization ability and reconstruction quality.

C. SFBlock Performance Verification Experiment
Unlike the majority of upsampling methods, which focus solely on spatial-domain learning, SFBlock also incorporates frequency-domain information. To evaluate the effectiveness of frequency-domain enhancement, we conducted experiments comparing SFBlock with and without it. The results, presented in Fig. 9, demonstrate the benefits of incorporating frequency-domain information: with frequency-domain enhancement, SFBlock consistently achieves higher PSNR values at all five measurement rates. This shows that, combined with frequency-domain information, SFBlock is able to enhance the details and quality of images, thus improving the reconstruction results.

D. Comparison of VMSE+ and Other Existing Networks
To further investigate the effectiveness of the proposed framework, we compare VMSE+ with DR2Net, CSNet [10], and MR-CSGAN [11]. Both CSNet and MR-CSGAN utilize 8 MSRBs (Multi-Scale Residual Blocks) to generate multi-scale features of the image after the initial reconstruction process. To ensure fairness, VMSE+ replaces these with 8 simplified ResBlocks. VMSE+ applies an additional l_1 norm constraint on the latent variables with a weight of 0.0001. Table III presents the experimental results of these networks.
On the Fashion-MNIST dataset, VMSE+ continues to demonstrate strong modeling capability, achieving excellent results on both the PSNR and SSIM metrics. Fig. 10 shows the reconstructed images of these reconstruction algorithms at a measurement rate of 0.025. It can be seen that VMSE+ performs well in terms of image quality improvement.

E. Performance Verification of VMSE on a Single-Pixel Imaging System
We also applied VMSE to a single-pixel imaging system. We binarized the weights of the first fully connected layer and loaded this binary measurement matrix onto the digital micromirror device (DMD) to acquire the light intensity information of the imaged object. By reconstructing the measurements from the single-pixel imaging system, we obtained reconstructed images of the digits "3" and "5" at measurement rates of 0.025, 0.05, 0.1, 0.15, and 0.2. The reconstruction results are shown in Fig. 11.

V. CONCLUSION
In this paper, a variational multi-scale error compensation network, VMSE, is proposed. Experimental results show that the designed network structure extracts multi-scale image features more effectively. Through the error compensation network, which compensates the sampling reconstruction network at different stages, and SFBlock's unique upsampling method, VMSE can learn rich image semantic information in the initial reconstruction stage. In addition, VMSE is highly adaptable and can be combined with arbitrary deep reconstruction networks. Compared to existing algorithms, VMSE utilizing only eight ResBlocks with residual connections as its deep reconstruction network (referred to as VMSE+) possesses superior feature representation capabilities. Under the same iteration conditions, VMSE+ performs better on metrics such as PSNR and SSIM. Moreover, VMSE can be directly applied to single-pixel imaging systems. Practical experiments show that VMSE can restore specific images more accurately and completely.
In our future work, we plan to design the deep reconstruction network as a Generative Adversarial Network (GAN), combining both VAE and GAN, instead of using the existing simple residual structure.This hybrid model may fully exploit the data representation capability of VAE while leveraging the generative power of GAN, leading to further improvements in image quality and stability.

Fig. 2 .
Fig. 2. Overall structure diagram of VMSE. x represents the original image, which undergoes downsampling via a fully connected layer to obtain measurements y. Subsequently, y is upsampled multiple times to restore it to the original image size, yielding intermediate features F_mid1, F_mid2, and F_mid3 at different scales. The error compensation network is composed of the decoder part of the VAE, which generates features corresponding to the scale of the intermediate features. It also enhances the intermediate features through specific operations to obtain the augmented features F_c1, F_c2, and F_c3. Finally, an appropriate deep reconstruction network can be selected to obtain the prediction of the original image x.

Fig. 4 .
Fig. 4. Structure of SFBlock. It consists of three branches, each employing a different approach to process features. Finally, the outputs of the three branches are merged using a residual structure.

Fig. 6 .
Fig. 6. The structure of the VAE used to train the error compensation network. It incorporates latent variables of multiple dimensions into training, fully harnessing the generative capacity of the VAE.

Fig. 7 .
Fig. 7. PSNR of images reconstructed by different networks on the MNIST dataset.

Algorithm 1: VMSE Algorithm. Training of the error compensation network: initialize the encoder and decoder weights of the VAE, D_e and D_d; for each iteration, input a batch of data (x_i), compute the coded values, and update D_e and D_d to minimize L_vae. Training of VMSE: alternately update the sampling reconstruction network φ_θ, the error compensation network D_Ψ, and the latent variables z_1, z_2, z_3.

Fig. 8 .
Fig. 8. Reconstructed images using VMSE with different error compensation networks at a measurement rate of 0.025; from top to bottom: (a) original image; (b) low dimensionality and pre-trained; (c) high dimensionality and not pre-trained; (d) high dimensionality and pre-trained.

Fig. 11 .
Fig. 11. Reconstruction results on a single-pixel imaging system: (a) the reconstruction algorithm is TVAL3; (b) the reconstruction algorithm is VMSE.

VMSE achieved better imaging quality and visual effects compared to TVAL3. Particularly at a measurement rate of 0.025, VMSE reconstructed the contours of the digits, while the images reconstructed by TVAL3 were difficult to recognize. At higher measurement rates, VMSE consistently produced clearer images than TVAL3, with effective recovery of fine details along the image edges.
(Algorithm 1, continued: if the number of iterations is divisible by 10, update z_3, z_2, z_1 to minimize the reconstruction error; otherwise, fix the z_3, z_2, z_1 weights and update the sampling reconstruction network φ_θ and the error compensation network D_Ψ to minimize the reconstruction error.)

TABLE I
PSNR (dB) and SSIM of images reconstructed on the MNIST dataset by different networks at different measurement rates.

TABLE II
PSNR (dB) and SSIM of VMSE reconstruction results using different error compensation networks at different measurement rates.