Neural Invertible Variable-degree Optical Aberrations Correction

Optical aberrations of optical systems cause significant degradation of imaging quality. Correcting aberrations with sophisticated lens designs and special glass materials generally incurs a high manufacturing cost and increases the weight of optical systems, so recent work has shifted to aberration correction with deep learning-based post-processing. Though real-world optical aberrations vary in degree, existing methods cannot eliminate variable-degree aberrations well, especially at severe degrees of degradation. Moreover, previous methods use a single feed-forward neural network and suffer from information loss in the output. To address these issues, we propose a novel aberration correction method with an invertible architecture, leveraging its information-lossless property. Within the architecture, we develop conditional invertible blocks to allow the processing of aberrations of variable degrees. Our method is evaluated on both a synthetic dataset from physics-based imaging simulation and a real captured dataset. Quantitative and qualitative experimental results demonstrate that our method outperforms the compared methods in correcting variable-degree optical aberrations.


Introduction
Optical aberrations are defects introduced in the design, manufacturing, and assembly of camera lenses. These defects cause the incident light to diffuse and fail to focus into a sharp image, producing images with a blurry and dispersive appearance [1]. Despite the rapid development of optical design and digital imaging technologies, image degradation caused by optical aberrations still cannot be fully avoided, especially for lightweight, inexpensive lenses and smartphone camera lenses, whose optical aberrations are relatively significant.
In practice, optical aberrations are mitigated by incorporating an array of lenses, using aspheric lenses, and employing special glass materials for lenses. However, the increase in lens types and lens materials poses challenges to the manufacturing process and raises production costs. Therefore, recent works have shifted the genre of aberration corrections from sophisticated lens design to post-processing [2].
Currently, two main technical routes have been proposed for correcting optical aberrations in post-processing. One is the model-driven traditional methods, which assume an image degradation model, use various natural image priors, optimize over multiple iterations to find the degradation kernel, and perform deconvolution to obtain the sharp image. However, traditional methods are not robust enough in dealing with spatially varying degradation. The other is the data-driven deep learning methods, which have recently become increasingly popular. These methods train neural networks on paired data to recover sharp images from degraded images. However, most of them cannot deal with variable-degree aberrations since they are exclusively designed for a specific degree of degradation. In addition, they leverage a single feed-forward autoencoder architecture and usually suffer from information loss during the encoding and decoding process.
To address the information loss problem, we propose a novel aberration correction method based on invertible neural networks (INNs) to learn the transformation from aberration images to aberration-free images, where the invertible design assures that the neural networks can preserve information, especially the details of images [3,4]. Due to the limited nonlinear transformation ability of invertible neural networks [5], we introduce a feature extraction module to improve the non-linear transformation capability of the INNs. In order to process optical aberrations with variable degrees, we propose enhanced conditional encoding modules, which use the degradation degree of the aberration image as the input. This provides our method with the capability to restore sharp images from input images with variable degradation degrees.
Since images captured by cameras are naturally degraded by optical aberrations, large-scale real paired aberration and aberration-free images are unavailable. To mitigate this issue, we establish an imaging simulation process to synthesize realistic degraded images from sharp reference images. The image simulation process incorporates lens parameters of an optical system and is able to perform physics-based ray tracing to simulate the optical aberrations with variable degrees. We leverage this approach to produce large-scale paired datasets for the proposed aberration correction method.
Experimental results show that the proposed method achieves better numerical metrics and visual effects on both synthetic images and real aberration-degraded images. The visual results verify that our method can recover more details by the inference along the forward direction, thanks to the highly invertible design in the architecture. Meanwhile, our method brings the benefit that it can synthesize aberration images from sharp images along the reverse direction. The contributions of this paper can be summarized as follows: • We design an imaging simulation process based on ray tracing and spatial convolution to generate large-scale paired datasets with variable degradation degrees.
• We propose an invertible neural network architecture for optical aberration correction that can largely alleviate the information loss problem and improve image quality.
• We introduce conditional encoding modules for the invertible neural network to deal with varying degrees of optical aberrations.

Optical Aberrations Correction
Due to the inherent optical aberrations of optical systems, the captured image will be degraded. This degradation can hardly be fully avoided by sophisticated optical system design, so recent works turn to post-processing for removing aberrations. The current methods mostly perform the process in two steps: first, estimate the point spread function (PSF) of the target optical system, and then use non-blind deconvolution or deep neural networks to restore the image. Optical aberrations are spatially varying, and the methods for obtaining spatially varying PSFs can be divided into three categories: real shooting-based methods, calibration-based methods, and optical simulation-based methods. In [6], the point spread function was directly measured by imaging a pinhole grid pattern in a dark room. The work in [7] used the frame random mode to calibrate PSFs. The work in [8] calculated the PSFs by raytracing and coherent superposition in a simulation manner. After obtaining PSFs, some methods use a deconvolution process [1,9] to solve the linear inverse problem. The scholars of [10] used a two-step scheme to correct the aberrations of a single image, and then used a convolutional neural network to remove the remaining chromatic aberration in the image. However, deconvolution involves a complex iterative process. Due to the strong fitting ability of deep neural networks, some methods [8,11-13] used autoencoder-based architectures to restore degraded images. The work in [9] designed a PSF-aware neural network, which takes degraded images and PSF images as inputs and generates latent high-quality images by combining a depth prior. The work in [13] proposed a frequency-based adaptive block, which was inserted into the neural network to perform feature-based deconvolution to correct non-uniform blur. However, these networks require calibrated PSFs as input.
The scholars of [12] proposed an end-to-end neural network to remove the aberrations in the input image. However, their architecture, based on a feed-forward autoencoder, is unable to deal with varying degrees of aberrations. In contrast, our method does not need a complex PSF estimation process, and the proposed conditional invertible neural network can correct degradations of variable degrees.

Invertible Neural Networks
The development of invertible neural networks (INNs) can be traced back to nonlinear independent component estimation (NICE), proposed by [5]. It learns a nonlinear bi-directional mapping from input data to a latent space in an unsupervised way, with the forward and reverse processes sharing model parameters. Based on this, RevNet [14] was put forward, which can complete the backpropagation of the network without storing activations, greatly reducing the memory consumption of the model. To deal with image-related tasks, the scholars of [15] introduced convolutional and multi-scale layers into the coupling model to reduce the computing cost and improve the model's regularization ability. The work in [16] built a reversible network architecture, i-RevNet, based on RevNet, which retains all information of the input signal in all intermediate representations except the last layer.
The article also proves that information loss is not a necessary condition when learning can be generalized to the representation of unfamiliar data. The work in [17] proposed an effective reversible 1 × 1 convolution block and Glow, which can synthesize and process large images efficiently and realistically.
Because of their information-lossless property and powerful generation ability, INNs have been used in many image restoration tasks. The work in [18] used an INN to learn the reversible bijective transformation between image downscaling and upscaling to achieve information-lossless image rescaling. The scholars of [3] designed a reversible neural network for denoising tasks: in the forward process, the noisy image is mapped to a low-resolution image and a latent representation; in the reverse process, sampling from a prior distribution replaces the latent representation to discard the noise. Other image restoration tasks using INNs include image decolorization [19], image hiding [20], etc. However, to the best of our knowledge, no prior work has applied INNs to the task of optical aberration correction.

Method
In this section, we first introduce the ray tracing-based simulation method for degraded images in Section 3.1. We then illustrate the overall architecture of our invertible aberration correction neural network in Section 3.2. Finally, in Section 3.3, we elaborate on the composition of the loss function.

Raytracing Based Imaging Simulation
The main problem for supervised deep learning-based aberration correction algorithms is the lack of real paired datasets. Established methods ignore the underlying optical systems and simply synthesize the degradation caused by optical aberrations with a Gaussian degradation kernel, leading to a large gap between the synthetic paired data and the real data. For optical systems with large fields of view and large apertures, the commonly used Gaussian degradation kernel yields inaccurate simulation since the actual point spread functions (PSFs) are spatially varying across the field of view (FOV).
The recent work [8] has proposed using the imaging simulation method without shooting or registration operation to solve the above problems. This method is easy to migrate to different optical systems. We adopt this approach [8] and introduce a distance to the focal plane as one of the simulation inputs to generate synthetic degraded images with varying-degree aberrations.
Our imaging simulation process consists of two steps: first, we calculate patch-wise PSFs with accurate raytracing; second, assuming that the degradation within a local region of a natural image is similar, the sharp image patches are convolved with the patch-wise PSFs to simulate the degradation process. Assuming that the degradation degree of an image patch I(x, y, d) is consistent when shooting at the same object distance, the degradation process can be modeled as follows:

I(x, y, d) = I_sharp(x, y) ⊗ k(x, y, d) + n(x, y),    (1)

where (x, y) indicates the spatial coordinates of the patch I(x, y, d), d is the distance to the focal plane, I_sharp(x, y) is the latent sharp image of I(x, y, d), ⊗ is the convolution operation, and k(x, y, d) is the normalized point spread function, representing the energy diffusion caused by the aberrations of the optical system. n(x, y) models the random noise introduced in the imaging process, which can be approximated by the well-established heteroscedastic Gaussian model [21].
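The patch-wise degradation model of Eq. 1 can be sketched in a few lines of NumPy. The function and variable names here are illustrative, and the noise term is simplified to plain zero-mean Gaussian noise in place of the heteroscedastic model [21]:

```python
import numpy as np

def convolve_same(img, k):
    """2-D 'same' convolution with reflected borders (pure NumPy)."""
    kh, kw = k.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="reflect")
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            # flip the kernel for true convolution (vs. correlation)
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * k[::-1, ::-1])
    return out

def degrade_patch(sharp, psf, noise_sigma=0.0, rng=None):
    """Degradation model of Eq. 1 for one patch: blur with the normalized
    PSF of this patch's FOV and focal distance, then add noise."""
    rng = rng or np.random.default_rng(0)
    psf = psf / psf.sum()          # normalize to keep image energy unchanged
    blurred = convolve_same(sharp, psf)
    return blurred + rng.normal(0.0, noise_sigma, sharp.shape)
```

With a delta-function PSF and zero noise, a patch passes through unchanged, which is a handy sanity check when wiring up the simulation.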
Point spread function calculation. According to the lens parameters of the optical system used for imaging, the wavefront aberrations and the point spread functions of the optical system are calculated by sequential raytracing. Fig. 1 shows the process of raytracing and the FOV-dependent PSFs. We define a ray R as follows:

R(t) = O + t·D,    (2)

where O = (x₀, y₀, z₀) is the starting point of the ray R, D = (d_x, d_y, d_z) is the normalized direction vector of the ray R, and t is the ray marching distance from the starting point O.
The first step of sequential raytracing is to calculate the intersection point of the ray and the surface. For spheric surfaces, the value of t for the intersection point is solved analytically. For aspheric surfaces, we define them with the sagittal height expression as follows:

z = c·r² / (1 + √(1 − c²r²)) + a₂r² + a₄r⁴ + ⋯,    (3)

where z is the longitudinal coordinate of a point on the surface, c is the curvature of the spherical part, r = √(x² + y²) is the distance from the point to the optical axis, and a₂, a₄, … are the coefficients of the higher-order terms. The value of t for intersections can be solved by substituting Eq. 2 into Eq. 3. Since Eq. 3 contains high-order terms, the coordinate of the intersection point can only be calculated numerically by multiple iterations [8].
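The iterative intersection step can be sketched as a Newton iteration on the ray parameter t: substitute R(t) = O + t·D into the sag expression and drive the residual to zero. This is a minimal illustration with hypothetical coefficient values, not the exact solver of [8]:

```python
import numpy as np

def sag(r, c, a2=0.0, a4=0.0):
    """Sagittal height z(r): spherical part plus higher-order terms (Eq. 3)."""
    return c * r**2 / (1.0 + np.sqrt(1.0 - c**2 * r**2)) + a2 * r**2 + a4 * r**4

def intersect(O, D, c, a2=0.0, a4=0.0, t0=0.0, iters=20):
    """Newton iteration for the ray-surface intersection: solve
    f(t) = z(t) - sag(r(t)) = 0 along the ray R(t) = O + t*D (Eq. 2)."""
    O, D = np.asarray(O, float), np.asarray(D, float)
    t, h = t0, 1e-6                       # h: step for numerical derivative
    for _ in range(iters):
        x, y, z = O + t * D
        f = z - sag(np.hypot(x, y), c, a2, a4)
        x2, y2, z2 = O + (t + h) * D
        f2 = z2 - sag(np.hypot(x2, y2), c, a2, a4)
        t -= f * h / (f2 - f)             # Newton step with finite-difference slope
    return O + t * D
```

For a ray parallel to the optical axis the residual is linear in t, so the iteration converges in a single step; curved rays need a few more.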
The second step of sequential raytracing is to calculate the refraction direction of the ray. Given the refractive indices n₁ and n₂ on the two sides of the refracting surface and the incident angle of the incident ray, we use Snell's law to calculate the direction of the refracted ray.
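Refraction is typically computed in vector form. The sketch below assumes unit-length direction and normal vectors, with the normal opposing the incident ray; it returns None on total internal reflection:

```python
import numpy as np

def refract(D, N, n1, n2):
    """Vector form of Snell's law: refract unit direction D at a surface
    with unit normal N, going from refractive index n1 to n2."""
    eta = n1 / n2
    cos_i = -np.dot(N, D)                  # N is assumed to oppose the ray
    k = 1.0 - eta**2 * (1.0 - cos_i**2)
    if k < 0.0:
        return None                        # total internal reflection
    return eta * D + (eta * cos_i - np.sqrt(k)) * N
```

At normal incidence the direction is unchanged; at oblique incidence the transverse component scales by n1/n2, which is exactly Snell's law sin θ_t = (n1/n2) sin θ_i.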
The wavefront aberration is the deviation of the actual wavefront from the ideal wavefront, expressed as an optical path difference. When calculating the optical path difference for a given ray, the ray starts from the object plane, reaches the image plane, and is then reversely traced from the image plane back to the reference sphere at the exit pupil. The complex pupil function can be constructed by combining the phase information of the optical path difference and the amplitude information formed by the exit pupil. The pupil function can be expressed as:

P(u, v) = A(u, v) · exp(i·(2π/λ)·W(u, v)),    (4)

where (u, v) represents the pupil plane coordinates, λ is the wavelength, A(u, v) is the complex amplitude distribution of the exit pupil surface, and W(u, v) is the optical path difference between the ray at the exit pupil and the chief ray.
The point spread function is the spot formed by the rays from a point light source after passing through the optical system. The amplitude spread function is the Fourier transform of the pupil function P(u, v):

h(x', y') = F{P(u, v)},    (5)

where (x', y') represents the image plane coordinates. The point spread function is the squared magnitude of the amplitude spread function h(x', y').

Patch-wise spatial domain convolution. First, we segment the image into 32 × 32 uniform patches. These image patches are respectively convolved with the PSFs of the corresponding central FOVs in the spatial domain to simulate imaging; see Eq. 1 for the specific operations. Then, we splice the degraded image patches together. Finally, we multiply the pixel values of the degraded image by the relative illuminance coefficient at the corresponding FOV, which can be obtained according to the pixel position. It should be noted that this process involves two additional operations: on the one hand, the PSF needs to be normalized in advance to keep the energy of the image unchanged; on the other hand, to ensure that the smoothness of the image is not affected by patch-wise convolution, the edges of the patches need to be interpolated.
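The pupil-to-PSF computation maps directly onto a discrete FFT. The sketch below builds the pupil from an optical-path-difference map and an amplitude mask (both hypothetical inputs here) and normalizes the PSF so that patch-wise convolution conserves image energy:

```python
import numpy as np

def psf_from_pupil(opd, amplitude, wavelength):
    """PSF as the squared magnitude of the Fourier transform of the pupil
    function P(u, v) = A(u, v) * exp(i * 2*pi/lambda * W(u, v)),
    normalized to unit energy."""
    pupil = amplitude * np.exp(1j * 2.0 * np.pi / wavelength * opd)
    # center the pupil at the origin before the FFT, re-center afterwards
    asf = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(pupil)))
    psf = np.abs(asf) ** 2
    return psf / psf.sum()
```

An aberration-free circular pupil (zero optical path difference) yields the familiar Airy-like pattern peaking at the center of the grid.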

Invertible Aberration Correction Architecture
We design a conditional invertible neural network to conduct the aberration correction. Fig. 2 shows the overall architecture, which consists of a feature extraction module and a conditional invertible module to correct optical aberrations of variable degrees.
Feature Extraction Module. The design of an invertible neural network (INN) needs to ensure strict invertibility, so an INN usually has limited nonlinear transformation ability. We therefore add a feature extraction module in front of the INN to improve its nonlinear transformation ability. In the forward process (black arrows), the input degraded image passes through the conditional invertible neural network, composed of a feature extraction module and 12 conditional invertible blocks, to obtain a sharp image G(Y). In the reverse process (red arrows), we feed G(Y) back through the network to obtain the degraded image F(G(Y)). Under the joint forward and reverse processes, the information of the image is preserved as much as possible, making details of the restored image clearer. Fig. 3. The detailed architecture of the feature extraction module. "Conv7-64" means that the convolution kernel size of this layer is 7 × 7 and the number of convolution kernels is 64.
The feature extraction module is based on multi-scale ResBlocks [22], including up-sampling and down-sampling processes. The details are shown in Fig. 3. It should be noted that the weights of the feature extraction module of the forward and reverse processes are not shared.
Conditional Invertible Module. The conditional invertible module is composed of N conditional invertible blocks, where N is set to 12. Each conditional invertible block consists of a squeeze operation, an invertible 1 × 1 convolution, a conditional affine coupling layer, and an unsqueeze operation. These operations are all invertible, so the entire INN is completely invertible. Next, we elaborate on the components of the conditional invertible block.
Squeeze and unsqueeze. The squeeze operation [15] is similar to the convolution operation in a CNN, reducing the spatial size of the feature map and increasing the number of channels to capture correlation and structure over a greater spatial distance. Unlike the convolution operation, the squeeze operation extracts features according to a checkerboard pattern to ensure invertibility, as shown in Fig. 4. The squeeze operation increases the channel dimension while retaining the local correlation of the image. The unsqueeze operation is the inverse of the squeeze operation and recovers the original size of the feature map.

Invertible 1 × 1 convolution. The invertible 1 × 1 convolution [17] is a learnable convolution that fuses information across feature channels, allowing more interaction and fusion between the information from different incoming data flows.
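One common implementation of the squeeze/unsqueeze pair is the space-to-depth rearrangement below (a sketch; the exact checkerboard grouping of Fig. 4 may differ in ordering, but any fixed permutation is equally invertible):

```python
import numpy as np

def squeeze(x):
    """Checkerboard squeeze: (C, H, W) -> (4C, H/2, W/2). Each 2x2 spatial
    block becomes four channels, so no information is discarded."""
    c, h, w = x.shape
    x = x.reshape(c, h // 2, 2, w // 2, 2)
    return x.transpose(0, 2, 4, 1, 3).reshape(4 * c, h // 2, w // 2)

def unsqueeze(y):
    """Exact inverse of squeeze: (4C, H/2, W/2) -> (C, H, W)."""
    c4, h2, w2 = y.shape
    c = c4 // 4
    y = y.reshape(c, 2, 2, h2, w2)
    return y.transpose(0, 3, 1, 4, 2).reshape(c, 2 * h2, 2 * w2)
```

Because the operation is a pure index permutation, `unsqueeze(squeeze(x))` reproduces the input bit-for-bit, which is what makes it usable inside an invertible block.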
Conditional affine coupling layers. The schematic diagram of the proposed layers is shown in Fig. 5. The conditional affine coupling layer [23] adds coded condition variables to the affine coupling layer [5,15] to improve the efficiency of the flow model. Here, the conditional code h represents the aberration degree of the image, specifically referring to the object distance to the focal plane. We use a binary code to encode the distance. The images in our dataset have 101 different distances, so the length of the binary code is 7 bits. For the i-th affine coupling layer, the input uⁱ is divided into u₁ⁱ and u₂ⁱ along the channel direction, and then they undergo the augmented affine transformation [15,18]:

u₁ⁱ⁺¹ = u₁ⁱ ⊙ exp(ψ(u₂ⁱ, h)) + φ(u₂ⁱ, h),
u₂ⁱ⁺¹ = u₂ⁱ ⊙ exp(ρ(u₁ⁱ⁺¹, h)) + η(u₁ⁱ⁺¹, h).    (6)

Eq. 6 corresponds to the forward process, and the outputs [u₁ⁱ⁺¹, u₂ⁱ⁺¹] are concatenated again and passed to the next affine coupling block. In the reverse process, only the addition (+) and multiplication (×) operations are replaced by subtraction (−) and division (/), while the internal transformation functions (φ(·), ψ(·), ρ(·), η(·)) do not need to be invertible and can be represented by arbitrary neural networks. We employ a multi-scale residual concatenated convolutional block, a simplified version of the feature extraction module: compared with Fig. 3, the 64 convolution kernels of the first block ("Conv7-64") are reduced to 32 ("Conv7-32"), and the middle "ResBlock × 4" is reduced to "ResBlock × 2". Given the output uⁱ⁺¹, the corresponding reverse process can be expressed as:

u₂ⁱ = (u₂ⁱ⁺¹ − η(u₁ⁱ⁺¹, h)) / exp(ρ(u₁ⁱ⁺¹, h)),
u₁ⁱ = (u₁ⁱ⁺¹ − φ(u₂ⁱ, h)) / exp(ψ(u₂ⁱ, h)).
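The coupling transformation and the 7-bit distance code can be sketched as follows. The internal transformation functions are stood in for by a seeded random single-layer network, since they need not be invertible themselves; the real model uses multi-scale residual convolutional blocks:

```python
import numpy as np

def net(x, cond, seed):
    """Stand-in for the internal transformation functions: any network works
    here because coupling layers never invert these functions."""
    r = np.random.default_rng(seed)
    w = r.standard_normal((x.shape[-1] + cond.shape[-1], x.shape[-1]))
    return np.tanh(np.concatenate([x, cond], axis=-1) @ w)

def coupling_forward(u1, u2, h):
    """Forward pass of one conditional affine coupling layer (Eq. 6)."""
    v1 = u1 * np.exp(net(u2, h, 1)) + net(u2, h, 2)
    v2 = u2 * np.exp(net(v1, h, 3)) + net(v1, h, 4)
    return v1, v2

def coupling_reverse(v1, v2, h):
    """Reverse pass: same functions, with (+, *) replaced by (-, /)."""
    u2 = (v2 - net(v1, h, 4)) / np.exp(net(v1, h, 3))
    u1 = (v1 - net(u2, h, 2)) / np.exp(net(u2, h, 1))
    return u1, u2

def distance_code(index):
    """7-bit binary code for one of the 101 focal-plane distances (0..100)."""
    return np.array([(index >> b) & 1 for b in range(7)], dtype=float)
```

Running the forward pass followed by the reverse pass recovers the input exactly (up to floating-point error), which is the property the architecture relies on to avoid information loss.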

Loss Function
We optimize the proposed invertible aberration correction neural network end-to-end with the following loss function:

L = λ₁·L_forward + λ₂·L_reverse + λ₃·L_edge + λ₄·L_perceptual,

where λ₁, λ₂, λ₃, λ₄ are hyperparameters that control the importance of the different loss terms. They are empirically set as λ₁ = 1, λ₂ = 0.5, λ₃ = 0.05, and λ₄ = 0.02. G stands for the forward transformation from degraded images to sharp images, F is the inverse process of G, X is the reference sharp image, and Y is the input image with optical aberrations.

Forward Loss. This loss is applied in the forward process to eliminate the optical aberrations of the image, such that the output content is close to the content of the reference:

L_forward = ‖G(Y) − X‖₁.

The ℓ1-norm pixel-level loss is used as it provides better quality than other norms, such as the ℓ2 norm.

Reverse Loss. The reverse loss makes the learning process more stable and increases the robustness of the neural network. G(Y) is the output of the degraded image Y after passing through the forward network, and F(G(Y)) is the degraded image after the reverse process. The reverse loss makes the content of F(G(Y)) close to the initial degraded image Y:

L_reverse = ‖F(G(Y)) − Y‖₁.
Edge Loss. The edge loss takes high-frequency texture and structure information into account and improves the details of the restored images:
L_edge = ‖Δ(G(Y)) − Δ(X)‖₁,

where Δ denotes the Laplacian operator.

Perceptual Loss. The perceptual loss [24] measures the difference between two images by features extracted from the benchmark VGG model [25]. It enhances the perceptual similarity between the generated image and the reference image, thus helping to produce a more realistic image:
L_perceptual = (1 / (C_m·H_m·W_m)) · ‖φ_m(X) − φ_m(G(Y))‖₁,

where m denotes the m-th layer; C_m, H_m, and W_m stand for the number of channels, height, and width of the feature maps, respectively; φ_m(X) represents the feature response of the sharp image X at the m-th layer, and φ_m(G(Y)) represents the feature response of G(Y) at the m-th layer. In this work, we use the tenth (m = 10) convolutional layer of the pretrained VGG-19 network to extract features.
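The four loss terms combine as a weighted sum. Below is a NumPy sketch with the weights given in the text (1, 0.5, 0.05, 0.02), using a placeholder `feat` callable where the paper uses VGG-19 features:

```python
import numpy as np

def laplacian(img):
    """Discrete 5-point Laplacian (the Delta operator in the edge loss)."""
    k = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
    p = np.pad(img, 1, mode="edge")
    return sum(k[a, b] * p[a:a + img.shape[0], b:b + img.shape[1]]
               for a in range(3) for b in range(3))

def total_loss(X, GY, Y, FGY, feat):
    """Weighted sum of the four loss terms. X: sharp reference, GY: forward
    output G(Y), Y: degraded input, FGY: reverse output F(G(Y)),
    feat: stand-in for the VGG-19 feature extractor."""
    l_fwd = np.mean(np.abs(GY - X))      # forward L1 loss
    l_rev = np.mean(np.abs(FGY - Y))     # reverse L1 loss
    l_edge = np.mean(np.abs(laplacian(GY) - laplacian(X)))
    l_perc = np.mean(np.abs(feat(GY) - feat(X)))
    return 1.0 * l_fwd + 0.5 * l_rev + 0.05 * l_edge + 0.02 * l_perc
```

When the restored image, its reverse reconstruction, and the references coincide, every term vanishes and the total loss is exactly zero.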

Experiments
This section first introduces the details of the experimental setup and then qualitatively and quantitatively compares our proposed method with state-of-the-art methods on both synthetic data and real data. Finally, we conduct ablation studies and provide an analysis of the results.

Experimental Settings
We use the DIV2K [26] dataset, consisting of 1000 high-quality 2K resolution images. We randomly select a part of the DIV2K dataset and ISO 12233 chart, and we conduct batch imaging simulation to generate paired synthetic datasets. The target optical system is the GCO 232005 optical lens. In imaging simulation, we set the object distance to the focal plane from −125 mm to 125 mm to construct the synthetic dataset, with an interval of 2.5 mm, corresponding to 101 different degrees of aberrations. For this optical system, when the object distance to the focal plane is more than 80 mm or less than −80 mm, the simulated image can be considered as heavily degraded, which brings great challenges to the aberration correction task.
The synthetic dataset contains 6000 image pairs, which are divided into training, validation, and test sets in a 4:1:1 ratio. We implement the proposed method using PyTorch and train the neural networks on two RTX 5000 GPUs with the Adam [27] optimizer (β₁ = 0.9, β₂ = 0.999) for a total of 150 epochs. The initial learning rate is 1 × 10⁻⁴ and decays by half every 50 epochs. The training patch size is 256 × 256 and the batch size is 8. We employ random cropping, flipping, and rotation to augment the training data as in [28,29]. Generally, it takes around 1.5 days to train a model for 150 epochs.
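The learning-rate schedule described above is a simple step decay, e.g.:

```python
def learning_rate(epoch, base_lr=1e-4, decay_every=50):
    """Step schedule from the training setup: the initial rate 1e-4
    is halved every 50 epochs over the 150-epoch run."""
    return base_lr * 0.5 ** (epoch // decay_every)
```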
We also set up an experimental optical system to capture real aberration data. The optical system consists of the GCO 232005 optical lens and an MER-131-210U3C-L CMOS sensor. As in our simulation, we capture real images by setting the imaging object distance to the focal plane to values ranging from −125 mm to 125 mm with an interval of 5.0 mm. Fig. 6. Qualitative comparisons on the synthetic dataset. The results are produced by DeblurGANv2 [30], FOV-KPN [12], MIMO-UNet [31], MPRNet [32], Stripformer [33], and our method. Here, "Distance" represents the imaging object distance to the focal plane. The first scene is obtained by cropping an image patch from the ISO 12233 test chart. The second and third scenes are from the DIV2K dataset [26].

Evaluation on Synthetic Images
To demonstrate the advantages of the proposed method, we first compare it with five state-of-the-art methods on the synthetic test images: DeblurGANv2 [30], FOV-KPN [12], MIMO-UNet [31], MPRNet [32], and Stripformer [33]. For a fair comparison, all the compared methods adopt the default settings from the original papers. The training/test datasets consist of 4000/1000 degraded images, which are generated by the proposed imaging simulation framework and include 101 degradation degrees. The compared methods are retrained on the synthetic training dataset. We evaluate these methods with commonly used metrics, including Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [34], and Learned Perceptual Image Patch Similarity (LPIPS) [35]. Table 1 and Fig. 6 show the quantitative and qualitative results of our method and the compared methods on the synthetic test dataset, respectively. Our method outperforms the previous best method, MIMO-UNet, by 0.34 dB in PSNR, as shown in Table 1. Fig. 6 demonstrates that our method can recover images with better quality than all the compared methods. Our method yields sharper results with more details, especially in highly textured regions such as the text on the clothes. Furthermore, our method benefits from the invertible design and the combined forward and reverse losses, which effectively avoid unrealistic information and artifacts in the results.

Evaluation on Real Images
In Fig. 7, we compare our method with the state-of-the-art methods on real test images captured by our experimental camera. In addition to the methods mentioned in Section 4.2, we add the model-driven optimization algorithm [36] for comparison. For convenience, we refer to [36] as "Dark Channel Prior". As can be seen from Fig. 7, the proposed method outperforms all the other methods in terms of visual effect. It effectively eliminates the degradation caused by optical aberrations, and the conditional invertible neural network largely retains details of the original image, such as text edges and hair structures. Although DeblurGANv2 recovers images with sharper edges, it introduces unrealistic information and severe noise. It is worth noting that MIMO-UNet [31], which ranks second in quantitative results on the synthetic dataset, cannot handle severe spatially variant degradation. The other methods fail to deal with the optical aberrations well, and the restored images are not sufficiently clear. Overall, our method performs better in resolving optical aberrations on real images.
To further evaluate the improvement of image quality, we analyze the MTF of the restored images. The second scene of Fig. 7 is used because it contains many text edges. We show the MTF curves of the degraded input, the sharp image restored by our method, and the images restored by the other methods in Fig. 8. Our method improves the MTF50 from 0.025 to 0.092 c/p, demonstrating that it generates images with sharp edges. Although DeblurGANv2 produces a higher MTF50 value, its model leads to over-sharpened, unrealistic edges and causes severe noise, which cannot be reflected by the MTF curve. The image restored by Stripformer [33] also suffers from severe noise, which can be seen in the blue background.

Ablation Studies
In this section, we evaluate the impact of every component of our method by ablating different parts of the neural network and comparing them with the complete architecture, as shown in Table 2. We conduct ablation experiments on the synthetic dataset. Details about the dataset can be found in Section 4.1. We use the Adam [27] optimizer with a learning rate of 0.0001 to train for 150 epochs, and the learning rate decreases by half every 50 epochs.
Analysis of the feature extraction module. To verify the performance of the proposed nonlinear feature extraction module, we conduct a corresponding ablation study on the module, as shown in Table 2. Specifically, we train the proposed method with and without the feature extraction module and keep the other training settings the same. Fig. 9 shows the visualizations of the two models evaluated on the test dataset; we can observe that using the feature extraction module effectively improves the image quality.

Analysis of the conditional invertible blocks. When the conditional code is removed, the quantitative scores drop significantly, as shown in the second row of Table 2. This means that the conditional code is crucial for image fidelity and perceptual quality. After removing the entire conditional invertible module and keeping only the feature extraction module, as shown in the third row, the aberration correction ability of the network gets worse, which demonstrates the necessity of the conditional invertible module. Fig. 9. Ablation study on the feature extraction module. This image is from the DIV2K dataset [26].
Analysis of the FOV encoder [12]. The work in [8] found that adding the field of view (FOV) as an additional input can improve model performance. We also try to add an FOV encoder [12] in front of the forward process. However, the resulting performance is lower than that of the proposed method without the FOV encoder, as can be seen in Table 2. Thus, we do not include the FOV encoder in the proposed method.
Analysis of the number of conditional invertible blocks. We verify the effect of different numbers of conditional invertible blocks on the aberration correction performance of our method in Table 3. Reducing the number of conditional invertible blocks leads to artifacts in the restored results, and the PSNR and SSIM also decrease significantly, for example when N = 8. When increasing the number of blocks to 12, the proposed method is able to recover more image details, and the performance of the model is greatly improved. When the number of blocks is further increased to 16, the performance improvement is not obvious. However, an increase in the number of invertible blocks means an increase in the number of parameters and the inference time, so to better balance model performance and efficiency, we use N = 12 as the default option.

Analysis of the proposed loss functions. The loss functions are applied during the training stage to minimize the difference between the restored image and the ground-truth image, as described in Section 3.3. We conduct ablation studies on the different loss functions to verify their impact, and the results are shown in Table 4. We observe that removing any of the reverse loss, edge loss, or perceptual loss leads to worse performance. The reverse loss makes the model more stable, while the edge loss and perceptual loss minimize the difference between two images from the perspective of high-frequency details and perceptual quality, enabling the model to generate sharper images.

Conclusions
In this work, we have proposed an enhanced conditional invertible neural framework to correct variable-degree optical aberrations. The conditional invertible neural network can effectively avoid information loss and restore image details. Meanwhile, to better handle different degrees of aberrations, we embed the degree of degradation into the model as a conditional encoding. Comprehensive experiments verify that our method outperforms compared methods in correcting optical aberrations on both synthetic images and real images. Furthermore, our method performs quite competitively in terms of model size. The proposed method is promising to be embedded into ISP systems to improve imaging quality.