1 Introduction

The field of ophthalmology has seen an increasing interest in vision simulation methods, owing largely to recent advancements in general computing and rendering. A great number of applications exist [1,2,3,4], including the evaluation of visual acuity [5,6,7,8], vision testing [9], studying the mechanisms of the human eye [10,11,12], evaluating the impact of eye diseases on our ability to perform real-world tasks [13], and the visualization of eye diseases in synthetic [14,15,16], virtual reality [17,18,19], and augmented reality [19,20,21] environments.

Personalized vision simulations are enormously useful for the proper planning of invasive vision-correcting processes (such as intraocular lens implants [22] and laser surgery [23]) because of the inherent risk of post-surgery complications posed by these operations. Non-invasive vision-correcting solutions (such as progressive lenses [24,25,26,27], vision-correcting displays [28,29,30,31,32], head-mounted displays [33], and holographic lenses [34]) are also rapidly evolving and benefit greatly from improved vision simulation algorithms.

Head-worn displays are common building blocks of many vision-correcting devices and can cause discomfort over an extended period of use, owing largely to the disparity between the vision of the eye and the synthetic images shown to the user. Accounting for the characteristics of the wearer’s visual system via the inclusion of depth-of-field (DOF) [35,36,37,38], chromatic aberration [39, 40], and the eye’s own visual aberrations [41] can significantly reduce the discomfort experienced by the user. Special-purpose fonts have also been developed to reduce the fatigue from rapid context switches [42], a common issue with augmented reality applications. Fast, accurate, and personalizable vision simulation methods are essential in developing and implementing these features.

Wavefront aberrations are highly impactful for all optical systems, and algorithms for synthetically generating images that properly account for these effects have a long history in computer graphics [43,44,45,46,47]. Such aberrations greatly affect the human eye as well, and thus, numerous solutions exist for simulating visual aberrations [48]. These methods are often hindered by the disregard of peripheral aberrations, computational cost of the simulation, and lack of easy personalization with user-specific information. Recently, Csoba and Kunkli [49] solved the performance and personalization issues at the expense of a long precomputation step. Furthermore, their algorithm ignored peripheral aberrations, which is a common limitation of convolution-based approaches. Peripheral vision is an important source of information for many vision-dependent tasks [50,51,52,53] and can behave in highly unexpected ways when eye diseases are present. As an example, we illustrate relative peripheral hyperopia in Fig. 1, whereby the blurriness of a myopic eye decreases as the angle of incidence increases, causing the sharp region to fall outside the central vision [54, 55].

Fig. 1 Examples of visual aberration simulations generated using our proposed approach for a highly myopic eye, demonstrating the differences between ignoring (left) and simulating (right) peripheral visual aberrations. Eye structure and aberration estimation combined took 0.28 s (left) and 0.66 s (right) on an AMD Ryzen 7 1700 CPU. Rendering took 6.68 ms (left) and 12.38 ms (right) on an NVIDIA TITAN Xp GPU

In this paper, we present our new solution for the interactively personalizable simulation of ocular wavefront aberrations, with arbitrary compositions of low- and high-order aberrations, for central and peripheral vision. We improve upon [49] in several critical ways; our main contributions are:

  • A neural network-based method for estimating the eye structure and its aberrations in various states, which achieves sub-second computational performance, facilitating the interactive personalization of the simulation.

  • An improved PSF interpolation strategy for simulating central visual aberrations, which vastly reduces the memory footprint of the previous process and leads to a fully real-time performance profile.

  • An extension of our new PSF interpolation approach, which facilitates the simulation of peripheral vision across the entire visual field with an approximately real-time performance.

2 Previous work and motivation

2.1 Algorithms with an offline performance profile

Among the earliest works utilizing vision simulation, Camp et al. used paraxial ray tracing with a simplified eye model to efficiently calculate the PSF of the eye and performed convolution with the PSF to simulate vision [56]. Greivenkamp et al. improved the accuracy of this approach by using a full aspherical eye model, exact ray tracing, and a model of the Stiles–Crawford effect via an apodizing filter [7].

To facilitate personalized vision simulations, Barsky utilized a Shack–Hartmann aberrometer to calculate the PSFs from measurement data [57]. Ray tracing was employed with the measured aberration surface to compute a small set of depth-dependent PSFs, and the vision-simulated imagery was acquired by partitioning the input image into depth slices and performing convolution on each slice with the corresponding depth-dependent PSFs. This approach causes visible artifacts along the edges of the depth slices, which Barsky solved using object detection [58].

To simulate peripheral aberrations of progressive addition lenses, Rodríguez Celaya et al. used a small 3D PSF grid (with different axes corresponding to the horizontal angle, vertical angle, and object-space distance) and trilinear interpolation for approximating the per-pixel PSF of the lens [26]. Alternatively, Gonzalez-Utrera utilized a similar 3D PSF grid with Barsky’s depth-dependent image-splitting approach [27]. A critical limitation of both methods is that the simulated peripheral area is highly restricted and the axial resolution of the PSF grid is too small for accurate simulations.

Instead of convolution, Mostafawy et al. utilized the Gullstrand schematic eye model and a flat retina for simulating vision using distributed ray tracing [59]. Personalization was achieved by placing additional lenses in front of the eye model and utilizing several corneal zones with varying refractive profiles. To properly handle peripheral vision, Wu et al. utilized the aspherical Navarro eye model [60] and Dias et al. investigated the impact of retina shape on vision [61]. The contributions of these works were collected by Lian et al. and further extended with other aspects (such as spectral ray tracing [62] for simulating chromatic aberration and stochastic ray-bending to model diffraction effects) in their open-source vision simulation tool called ISET3d [63].

For an even more detailed and personalizable eye structure representation, Fink and Micol modeled the refracting surfaces of the eye using the Zernike circle polynomials and used ray tracing with 2D images to simulate vision [64]. A significant drawback of this method is the high cost of the iterative intersection calculations, which can be solved by utilizing polygon meshes to describe the refracting elements, facilitating the use of much faster, triangle-based intersection computation tools, as demonstrated by Wei et al. [65].

Another severe issue with ray tracing is the high levels of noise stemming from its stochastic nature. To overcome this issue, Vu et al. employed an interpolation-based approach, whereby forward ray tracing and a coarse ray grid were used with a 2D input image to compute a small set of retinal samples, with triangulation and interpolation utilized to fill in the gaps in the grid and obtain a noise-free simulation [66].

2.2 Real-time vision simulation algorithms

The algorithms described thus far are only suitable for offline applications. As an inexpensive alternative to PSF-based convolution, Tang and Xiao utilized Gaussian kernels and a neural network-based per-pixel blurriness map generated with a schematic eye model [67]. This approach facilitates the real-time simulation of peripheral vision, albeit limited to low-order aberrations due to the characteristics of the kernel.

Alternatively, Lima et al. utilized a tree data structure for encoding the list of image-space sample locations on layered input images for several sample points from the pupil disk [68]. This solution facilitates the real-time simulation of low-order aberrations and efficiently handles partial occlusion, a common issue with convolution-based algorithms. Peripheral vision, however, was not considered by this study.

For the real-time simulation of arbitrary central visual aberrations, Csoba and Kunkli utilized a coarsely sampled PSF grid [49], similar to Barsky’s concept. Tiled PSF splatting and a GPU-based per-pixel PSF interpolation method were employed to achieve interactive performance. To derive the unknown aberration coefficients for the entire PSF grid from a single aberration measurement, the authors described a custom parametric eye model and eye structure estimation process, which facilitates the simulation of chromatic aberration, varying pupil sizes, and configurable focus distances but hinders the performance of the system, because the eye estimation process can take hours to complete.

Recently, Xiao et al. used convolutional neural networks and deep learning for simulating the DOF of the human eye [35]. However, their main goal was to reduce the vergence-accommodation conflict when wearing head-mounted displays, and thus, their algorithm is limited to healthy vision.

2.3 Summary and motivation

In summary, despite the significant number of vision simulation algorithms, personalization remains a mostly overlooked aspect of these systems. Although ocular wavefront tomography [69] can be utilized to overcome this issue, it requires a lengthy optimization process and a large number of aberration measurements. Alternatively, to estimate the eye structure from a single on-axis measurement, Csoba and Kunkli [49] used an optimization-based approach with a custom parametric eye model; however, their method is still very time-consuming and often takes several hours to produce an output. To overcome these limitations, we leverage the generalization capabilities of neural networks, which facilitate the interactive estimation of the eye structure and aberration coefficients from a single measurement for producing the PSFs of convolution-based vision simulation algorithms.

Furthermore, there is a lack of high-performance algorithms for simulating peripheral vision, with the existing approaches either being unsuitable for interactive environments or unable to fully reproduce the true aberration structure of the eye. We solve these problems by utilizing the tiled PSF splatting approach of Csoba and Kunkli [49] during the rendering stage and proposing a significantly faster and more memory-efficient PSF interpolation method, facilitating the efficient simulation of arbitrary central and peripheral visual aberrations in highly interactive and real-time environments.

3 Mathematical background of eye optics

In this section, we provide a brief overview of the necessary concepts of human eye optics to describe our new results. The main concepts discussed are visualized in Fig. 2.

Fig. 2 Optical concepts of the human eye discussed in this section. a Cross sectional view of the eye model proposed in [49]. b Plot of a wavefront aberration surface calculated using this eye model. c The PSF computed using the aberration surface shown in (b)

3.1 Structure of the parametric eye model

Despite the existence of numerous schematic human eye models in the scientific literature, most models focus on accurately representing a healthy eye. To handle arbitrary eye conditions and estimate the simulated eye’s physical structure, Csoba and Kunkli constructed a simplified parametric eye model with the aim of balancing the number of model parameters and physical accuracy of the model. Figure 2a displays a cross sectional view of their eye model.

Their parametric model is based on Navarro’s unaccommodated aspherical eye model [70] and uses a spherical surface as the retina, a plane with a variable circular hole as the pupil, and four refracting surfaces to model the cornea and crystalline lens: the front (anterior) and back (posterior) surfaces for both components. These four surfaces were modeled as quadrics of revolution with the following formula:

$$ z\left( {x,y} \right) = \frac{{x^{2} + y^{2} }}{{R_{s} \cdot \left( {1 + \sqrt {1 - \left( {1 + k} \right) \cdot \frac{{x^{2} + y^{2} }}{{R_{s}^{2} }}} } \right)}}, $$
(1)

where \(R_{s}\) is the radius of curvature and \(k\) is the conic constant (asphericity factor), which identifies the surface type: a hyperboloid (\(k < - 1\)), paraboloid (\(k = - 1\)), prolate spheroid (\(- 1 < k < 0\)), sphere (\(k = 0\)), or oblate spheroid (\(k > 0\)).
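
For illustration, the following minimal Python sketch evaluates the sag of such a quadric of revolution; the surface values in the example are placeholders and are not taken from the paper's eye model.

```python
import numpy as np

def conic_sag(x, y, R_s, k):
    """Sag z(x, y) of a quadric of revolution, as in Eq. (1).

    R_s is the apex radius of curvature and k is the conic constant;
    the expression is only valid where the square-root argument is
    non-negative (inside the surface's usable aperture).
    """
    r2 = x ** 2 + y ** 2
    return r2 / (R_s * (1.0 + np.sqrt(1.0 - (1.0 + k) * r2 / R_s ** 2)))

# A prolate spheroid (-1 < k < 0) resembling an anterior corneal surface;
# the numbers are illustrative only.
z = conic_sag(1.0, 0.5, R_s=7.72, k=-0.26)
```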

Additionally, the authors split the radii of curvature of the two corneal surfaces into separate terms for simulating corneal astigmatism and added an offset function to the front surface for modeling its fine-scale irregularities. Thus, their proposed anterior cornea surface can be defined as follows:

$$ z_{c}^{A} \left( {x,y} \right) = z_{c} \left( {x,y} \right) + \Delta \left( {x,y} \right), $$
(2)
$$ z_{c} \left( {x,y} \right) = \frac{{\frac{{x^{2} }}{{R_{x} }} + \frac{{y^{2} }}{{R_{y} }}}}{{1 + \sqrt {1 - \left( {1 + k} \right) \cdot \left( {\frac{{x^{2} }}{{R_{x}^{2} }} + \frac{{\frac{{R_{y} }}{{R_{x} }} \cdot y^{2} }}{{R_{y}^{2} }}} \right)} }} , $$
(3)
$$ \Delta \left( {x,y} \right) = \mathop \sum \limits_{n,m} \alpha_{n}^{m} \cdot U_{n}^{m} \left( {x, y} \right), $$
(4)

where \(z_{c}\) is the astigmatic surface, \(\Delta \) is the offset function, \(R_{x}\) and \(R_{y}\) are the astigmatic radii of curvature, \(U_{n}^{m}\) is the Zernike polynomial of radial order \(n\) and azimuthal order \(m\) (as defined by (2.19) in [71], with \(n \ge \left| m \right| \ge 0\), \(n - \left| m \right|\) even), and \(\alpha_{n}^{m}\) is the expansion coefficient for \(U_{n}^{m}\).

Finally, the authors also added a rotation term (around the optical axis) to the cornea and a set of tilt and decentration parameters (along the x- and y-axis) to the lens.

3.2 Wavefront aberrations and the PSF of the eye

The PSF of the human eye is the projection of an infinitesimal point source on the retina and heavily depends on many parameters, including light wavelength, distance of the point source from the eye, horizontal and vertical incidence angles, pupil size, and focus distance. The terms on-axis and off-axis are generally used to refer to the origin of the point source with respect to the optical axis. Among the main uses of the PSF is the simulation of the eye’s optical performance, whereby the PSF is utilized as a kernel to perform a convolution on the input image.

One way of calculating the PSF is using diffraction theory and wavefront aberrations of the eye. Wavefront aberrations describe the differences in optical path length between a reference wavefront (locus of points with the same phase) and the emerging wavefront as incoming waves pass through the optical system. The wavefront aberrations of the human eye are often described using Zernike polynomials [72]:

$$ W\left( {\rho ,\phi } \right) = \mathop \sum \limits_{n,m} \alpha_{n}^{m} \cdot U_{n}^{m} \left( {\rho ,\phi } \right), $$
(5)

where \(\left( {\rho ,\phi } \right)\) are polar coordinates in the exit pupil plane, \(U_{n}^{m}\) is the circle polynomial of Zernike of radial order \(n\) and azimuthal order \(m\) (with the same definition as in (4)), and \(\alpha_{n}^{m}\) is the aberration coefficient corresponding to \(U_{n}^{m}\). An example of an aberration surface is shown in Fig. 2b.
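
As a rough illustration, the following Python sketch evaluates such an expansion; it assumes the unnormalized Zernike convention with cosine/sine angular factors, which may differ from the normalization of definition (2.19) in [71], so treat the exact scaling as an assumption.

```python
import math
import numpy as np

def zernike_radial(n, m, rho):
    """Radial part R_n^|m|(rho) of the Zernike circle polynomial."""
    m = abs(m)
    R = np.zeros_like(rho, dtype=float)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * math.factorial(n - s)
             / (math.factorial(s)
                * math.factorial((n + m) // 2 - s)
                * math.factorial((n - m) // 2 - s)))
        R += c * rho ** (n - 2 * s)
    return R

def wavefront(rho, phi, coeffs):
    """W(rho, phi) as the Zernike expansion of Eq. (5).

    `coeffs` maps (n, m) index pairs to aberration coefficients; the
    angular factor is cos(m*phi) for m >= 0 and sin(|m|*phi) for m < 0.
    """
    W = np.zeros_like(rho, dtype=float)
    for (n, m), a in coeffs.items():
        angular = np.cos(m * phi) if m >= 0 else np.sin(-m * phi)
        W += a * zernike_radial(n, m, rho) * angular
    return W
```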

To obtain the aberration coefficients of the human eye, a wavefront aberrometer is often employed for measuring the aberrations for a set of parameters (pupil size, focus distance, and incidence angles), with the results often being made available in eye aberration studies. Conversion formulas also exist for simulating common eye conditions (such as myopia, hyperopia, and astigmatism) using corrective spectacle lens prescriptions [73]. Alternatively, ray tracing can be employed if the physical structure of the eye is available, which can be obtained using eye structure estimation from aberration measurements, as demonstrated in [49], where a generalized pattern search (GPS) algorithm was utilized to estimate the physical eye structure from on-axis aberrations.

Given an input set of wavefront aberration coefficients, the extended Nijboer-Zernike (ENZ) diffraction theory [74] can be utilized for efficiently calculating the corresponding PSF at arbitrary defocus. The ENZ theory provides several different formulations of the PSF; for the human eye and arbitrary defocus, the large-defocus scalar formula is suitable:

$$ U\left( {r,\phi ,f} \right) = 2\mathop \sum \limits_{n,m} \beta_{n}^{m} i^{\left| m \right|} V_{n}^{\left| m \right|} \left( {r,f} \right)\exp \left\{ {im\phi } \right\}, $$
(6)

where \(U\) is the PSF, \(\left( {r,\phi } \right)\) are polar coordinates in the image plane (normalized by the diffraction unit \(\lambda /{\text{NA}}\), where \({\text{NA}}\) is the image-side numerical aperture of the system), \(f\) is the defocus parameter (\(f = 0\) at best focus and \(f = \frac{{ - 2\pi u_{0} }}{\lambda }z\), where \(z\) is the amount of focus shift in free space, \(u_{0} = 1 - \sqrt {1 - s_{0}^{2} }\), \(s_{0} = {\text{NA}}/n\), and \(n\) is the refractive index in the image space of the optical system). Additionally, the \(V_{n}^{m}\) functions are basic integrals of the Zernike radial polynomials and the \(\beta_{n}^{m}\) terms are complex expansion coefficients that can be derived from the real-valued \(\alpha_{n}^{m}\) phase-aberration coefficients. The authors of [49] used the Bessel-Bessel series expression for \(V_{n}^{m}\) (as given by (26) in [72]) and the fitting strategy described by (30) in [75] to calculate the \(\beta_{n}^{m}\) values corresponding to a set of \(\alpha_{n}^{m}\) aberration coefficients. An example of a PSF generated using (6) is shown in Fig. 2c.

The ENZ PSF formulation has several benefits. Firstly, it facilitates the computation of the depth-dependent PSF without the need for additional aberration coefficients. Secondly, the \(V_{n}^{m}\) terms can be cached to significantly shorten the PSF computation process. Finally, the \(V_{n}^{m}\) terms can also be rearranged in a GPU-friendly layout for computing large sets of PSFs in a highly efficient, GPU-based manner [76].
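
The following Python sketch illustrates how the field of (6) can be assembled once the \(V_{n}^{\left| m \right|}\) terms are available; the container names (`betas`, `V`) are hypothetical, and computing the basic integrals themselves is outside the scope of the sketch.

```python
import numpy as np

def enz_field(r, phi, betas, V):
    """Complex field U(r, phi, f) assembled as in Eq. (6).

    `betas` maps (n, m) to the complex coefficients beta_n^m, and
    `V[(n, abs(m))]` is assumed to hold the precomputed basic integral
    V_n^|m|(r, f) evaluated on the same (r, f) grid.
    """
    U = np.zeros(np.broadcast(np.asarray(r), np.asarray(phi)).shape, dtype=complex)
    for (n, m), beta in betas.items():
        U = U + beta * (1j ** abs(m)) * V[(n, abs(m))] * np.exp(1j * m * phi)
    return 2.0 * U

# The PSF intensity is the squared magnitude of the field:
# psf = np.abs(enz_field(r, phi, betas, V)) ** 2
```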

4 Our proposed method

An overview of the main steps of our proposed vision simulation approach is presented in Fig. 3, which comprises three main phases: training, precomputation, and rendering. During training, we construct a set of neural networks to estimate the unknown eye structure and aberration coefficients. To simulate vision, we adopt the tile-based rendering approach of Csoba and Kunkli [49] and utilize the trained networks during all subsequent precomputation and rendering steps.

Fig. 3 Overview of our proposed solution for simulating aberrated human vision. A set of neural networks are first trained to estimate the relaxed (\(\rho\)) and focused (\(\hat{\rho}^{(i)}\)) eye structure parameters using the input aberration measurement settings (\(Z, A, \lambda\)), aperture sizes (\(A^{(i)}\)), and focus distances (\(f^{(i)}\)), and another network to compute the aberration coefficients (\(Z^{(i)}\)) of the focused eye for a set of PSF parameters (\(P^{(i)}\)). During precomputation, these networks are used to construct a coarse PSF grid, stored on the GPU. At runtime, vision is simulated using the tiled PSF-splatting approach of [49] and an improved PSF interpolation method proposed in this paper, whereby a 3D PSF texture is constructed from the precomputed PSF grid, and the output vision simulation is generated using tiled convolution with an input color map, depth map, and the per-sample approximation of the PSF

During precomputation, we generate the PSF grid for the online rendering stage using a GPU-based PSF computation approach [76]. We employ the networks trained during the training phase for efficiently estimating the structure of the eye in the relaxed and focused states and computing the aberration coefficients for an input set of PSF parameters (comprising object distance \(d\), angles of incidence \(h\) and \(v\), light wavelength \(\lambda\), aperture diameter \(A\), and focus distance \(f\)).

In the rendering stage, we utilize tiled convolution [49], whereby screen-aligned tiles are constructed out of the pixels of the input color and depth images and traversed for each output pixel to generate the vision-simulated image, utilizing a 3D GPU texture to approximate the PSF with hardware acceleration. We improve this approach with a much more memory-efficient and thus significantly faster sampling scheme for on-axis aberrations and an extension facilitating the fast and efficient rendering of off-axis aberrations.

5 Estimation of eye parameters and wavefront aberrations using neural networks

5.1 Overview of our neural networks

Conceptually, estimating the eye structure and its unknown aberrations requires the following main steps:

  1. Find a set of parameters for the relaxed eye with aberrations closely matching the input aberrations.

  2. Modify the crystalline lens parameters to focus the eye at each user-defined focus distance.

  3. Compute the aberration coefficients using the resulting eye models for each PSF parameter combination.

We constructed a separate network for each of these tasks; the exact network inputs and outputs are summarized in Table 1. First, we built an eye structure estimator and a focused eye estimator network to estimate the relaxed and focused eye parameters from an input set of aberration measurements and focus distances. Next, we created an aberration estimator for computing the Zernike aberration coefficients using the focused eye parameters and input PSF parameters. Finally, we also constructed a discriminator network for computing the on-axis aberrations for an input set of eye parameters, which we used during the training of the eye structure estimator, as described in Sect. 5.4.

Table 1 Inputs and outputs of our neural networks

Although certain networks could theoretically be merged, using multiple smaller networks significantly reduces their individual memory footprints, increases the training accuracy and speed, and keeps the inference costs low. Furthermore, the estimated eye model parameters can be directly observed, providing additional insight into the simulated eye.

5.2 Training data generation

The use of the custom parametric eye model described in Sect. 3.1 necessitated the construction of custom datasets for network training. We generated a different dataset for each estimator because the network inputs and outputs (and the means to obtain them) vary considerably per network.

To further reduce the complexity of the eye model and make its parametrization better suited for neural network learning, we modified the modeling of the cornea. Instead of defining the radius of curvature separately for the anterior and posterior surfaces, we defined the posterior radius of curvature using an anterior-to-posterior ratio parameter and used a dedicated astigmatism term (ratio of \(R_{y}\) to \(R_{x}\)). The resulting parameters are listed in Table 2.

Table 2 Parameters and ranges used to compute our training datasets

5.2.1 Aberration datasets

Because the eye structure estimator, aberration estimator, and discriminator networks are all based on aberrations and eye structure parameters, we used the same approach to generate the training datasets for all three networks. First, we constructed a modified set of eye parameter domains (summarized in Table 2) using the eye population data described in [49]. We then generated a set of samples for each dataset by randomly sampling (with a uniform distribution) the eye parameters \(\rho_{{1{-}45}}\) (comprising the Cornea, Aqueous, Lens, and Eye parameters in Table 2) and other network inputs (the PSF parameters \(A\), \(\lambda\), \(h\), and v). Next, we calculated the corresponding aberration coefficients \(Z_{{2{-}28}}\) (the first six degrees of Zernike polynomials, with the \(Z_{1}\) piston term ignored) using ray tracing. Finally, we formed feature and target pools from the columns of the eye parameters and corresponding aberration coefficients, based on the roles of the columns in the network the dataset is being generated for.
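
A minimal Python sketch of this sampling procedure is given below; the parameter ranges and the `trace_aberrations` routine are stand-ins for Table 2 and the ray-tracing step of Appendix 1, respectively.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_uniform(ranges, n):
    """Uniform random samples for a list of (low, high) parameter ranges."""
    lo = np.array([r[0] for r in ranges])
    hi = np.array([r[1] for r in ranges])
    return lo + (hi - lo) * rng.random((n, len(ranges)))

def build_aberration_dataset(n_samples, eye_ranges, psf_ranges, trace_aberrations):
    """Feature/target pools for one aberration-based training dataset.

    `eye_ranges` stands in for the parameter domains of Table 2,
    `psf_ranges` for the sampled PSF inputs (A, lambda, h, v), and
    `trace_aberrations` for the extended ray-tracing step of Appendix 1,
    returning the Zernike coefficients Z_2..Z_28 for one sample.
    """
    rho = sample_uniform(eye_ranges, n_samples)   # eye parameters rho_1..rho_45
    psf = sample_uniform(psf_ranges, n_samples)   # per-sample PSF parameters
    features = np.hstack([rho, psf])
    targets = np.vstack([trace_aberrations(row) for row in features])
    return features, targets
```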

As noted by the authors of [49], the cost of aberration computation contributed greatly to the length of their eye structure estimation step, because the GPS optimizer computes aberrations for a high number of samples. To reduce the cost of eye reconstruction, the authors skipped the steps where the ray bundles are aligned to the pupil center and the outgoing ray grid is fit to the physical pupil disk. These simplifications introduce inaccuracies into the resulting aberration coefficients, because the non-centered samples alter the distribution of asymmetric coefficients and the incomplete wavefront information causes deviations from the true aberrations. Although the GPS algorithm can recover from noisy sample locations, an inaccurate dataset can substantially hinder the training of neural networks. Furthermore, the authors only considered on-axis input locations, whereas our goal of estimating peripheral aberrations also requires the computation of off-axis aberrations, which further exacerbates the impact of the simplifications outlined above.

To overcome the aforementioned limitations, we extended the aberration computation approach described in [49] with several extra steps, leading to a much more physically correct approach that significantly improves the accuracy of the wavefront aberration computation process. The main steps of our modified algorithm are shown in Fig. 4 and an in-depth description is provided in Appendix 1.

Fig. 4 Visualization of the main steps of our extended aberration computation process used to generate our aberration-based training datasets. First, a single ray is cast through the center of the pupil to find the retinal starting location for reverse ray tracing. An outgoing ray grid is then fit to the physical pupil and traced through the optical system to find the output coordinates. Finally, the ideal and emerging ray locations are used to calculate the local wavefront tilts, from which the resulting aberration coefficients are calculated via least-squares fitting [77]. The full algorithm is provided in Appendix 1

5.2.2 Focused eye parameters dataset

Our approach for generating the training dataset of the focused eye estimator was similar to the one described in Sect. 5.2.1. We generated a set of random input samples (each comprising the 45 eye parameters \(\rho_{{1{-}45}}\), a pupil diameter \(A\), and a diopter-based focus distance \(f\)) and calculated the corresponding changes of the eye parameters (lens diameter \(\Delta_{{L_{D} }}\) and aqueous chamber thickness \(\Delta_{{A_{T} }}\)) that focus the sampled eye models on the object distances of the samples.

To calculate \(\Delta_{{L_{D} }}\) and \(\Delta_{{A_{T} }}\), we used the eye-refocusing method described in [49]. It attempts to mimic the real focusing process of the human eye and works by testing a set of modified eye models with shrunken crystalline lens diameters, adapting the radii of curvature of the lens surfaces accordingly for retaining a constant lens volume [78]. A small grid of rays is traced through each test eye to obtain a set of retinal sample locations for each model, selecting as the result the model with the least spread of the test ray grid.

To ensure that each sample eye in the training dataset is included in the highest number of different focus states possible, we sampled the focus distance separately from the other parameters. A set of samples were generated for both the focus distance and eye model parameters, using linear and uniform random sampling, respectively. To produce the final training dataset, we then obtained the feature vectors by combining the two sets of samples using a Cartesian product and utilized the focus estimation approach outlined above for computing the corresponding target (\(\Delta_{{L_{D} }}\) and \(\Delta_{{A_{T} }}\)) values.
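
The following Python sketch outlines this dataset construction under the above assumptions; the `refocus` routine is a stand-in for the eye-refocusing search described earlier.

```python
import numpy as np
from itertools import product

def build_focus_dataset(eye_samples, focus_distances, refocus):
    """Feature/target pools for the focused eye estimator.

    `eye_samples` holds per-sample eye parameters together with a pupil
    diameter, `focus_distances` is the separately (linearly) sampled set
    of diopter-based focus distances, and `refocus` stands in for the
    eye-refocusing search of [49], returning the lens-diameter and
    aqueous-thickness changes (delta_L_D, delta_A_T).
    """
    features, targets = [], []
    for rho_and_A, f in product(eye_samples, focus_distances):
        features.append(np.concatenate([rho_and_A, [f]]))
        targets.append(refocus(rho_and_A, f))
    return np.asarray(features), np.asarray(targets)
```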

5.3 Neural network structure

Due to the large number of eye parameters, we used a modified version of the residual neural network (ResNet) architecture. ResNet is a model purposefully designed for the efficient training of deep networks and has recently been adopted by Chen et al. for solving deep regression problems [79]. ResNet utilizes fully connected (dense) layers, an activation function to transform the layer outputs, and normalization layers to stabilize and speed up the training procedure. What differentiates ResNet from traditional feedforward networks are the shortcut paths that facilitate the skipping of dense layers, providing ResNet with its outstanding deep network-training performance. An overview of our modified version of the architecture by Chen et al. is shown in Fig. 5.

Fig. 5 Schematic of our modified ResNet architecture. Several hidden blocks are placed between the input and output layers, with each comprising fully connected (dense), Mish activation, and Layer Normalization layers. The residual blocks also contain shortcut paths for facilitating the efficient training of deep networks

The rectified linear unit (ReLU) [80] activation function is commonly used by deep neural networks and was utilized by the ResNet regressor described in [79]. However, the loss of gradient information from clamping negative inputs to zero causes the Dying ReLU phenomenon, which we avoided by using the Mish activation function instead [81]:

$$ f\left( x \right) = x\tanh \left( {\ln \left( {1 + {\text{e}}^{x} } \right)} \right). $$
(7)

We only used Mish in the hidden blocks and employed a linear output in the last network layers. However, we used the \({\text{tanh}}\) function in the output layer of the eye structure estimator because it forces the network outputs into the range of valid population-based eye parameters when the target columns are transformed into the range \(\left[ { - 1, 1} \right]\).

Finally, instead of the Batch Normalization (BN) [82] approach utilized in [79], we used Layer Normalization (LN) [83] based on our experimental results. Furthermore, unlike the original ResNet regressor proposed in [79], we placed the normalization layers both after the activation function and before the shortcut connections, because this placement performed the best in our experiments.
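
A minimal TensorFlow sketch of one such hidden block and regressor is shown below; the layer sizes and the input projection layer (added so that the shortcut dimensions match) are illustrative assumptions rather than the exact configuration used in our networks.

```python
import tensorflow as tf

def mish(x):
    # Mish activation from Eq. (7): x * tanh(softplus(x)).
    return x * tf.math.tanh(tf.math.softplus(x))

def residual_block(x, units):
    """One hidden block: Dense -> Mish -> LayerNorm, with the normalization
    placed after the activation and before the shortcut connection."""
    shortcut = x
    y = tf.keras.layers.Dense(units)(x)
    y = tf.keras.layers.Lambda(mish)(y)
    y = tf.keras.layers.LayerNormalization()(y)
    return tf.keras.layers.Add()([shortcut, y])

def build_regressor(n_inputs, n_outputs, units=256, n_blocks=4, tanh_output=False):
    """ResNet-style regressor; tanh_output=True mimics the eye structure estimator."""
    inputs = tf.keras.Input(shape=(n_inputs,))
    x = tf.keras.layers.Dense(units)(inputs)   # projection so shortcut shapes match
    for _ in range(n_blocks):
        x = residual_block(x, units)
    activation = "tanh" if tanh_output else None
    outputs = tf.keras.layers.Dense(n_outputs, activation=activation)(x)
    return tf.keras.Model(inputs, outputs)
```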

5.4 Discriminator-based eye estimator training

Because the relationship between the eye model parameters and aberration coefficients is highly nonlinear, and because multiple different model parametrizations can induce the same aberrations, simply comparing the true and predicted eye model parameters is insufficient for assessing the quality of an eye estimation. To overcome this issue, we utilized our discriminator network during the training of the eye estimator to compute the proper inputs of the loss function. Discriminators are often employed during the training of neural networks, and although training the discriminator in parallel with the network is a common practice, pretraining the discriminator has been shown to substantially improve and stabilize the training process [84, 85].

Therefore, we also pretrained our discriminator network and proceeded with the training of the eye structure estimator afterward. An overview of our process is shown in Fig. 6. During each training step of the eye estimator, we used the discriminator to compute the aberration coefficients (\(\hat{Z}_{2 - 28}\)) corresponding to the predicted eye parameters (\(\hat{\rho }_{1 - 45}\)) for the input aberration coefficients of the training sample (\(Z_{2 - 28}\)). We then computed the training loss by evaluating the network’s loss function using \(Z_{2 - 28}\) and \(\hat{Z}_{2 - 28}\) as inputs. This approach measures the goodness of the estimation using the functional performance of the eye model instead of the raw model parameters. Consequently, the predicted eye models can reproduce the input aberrations significantly better than a network trained via the naïve approach.
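
The following TensorFlow sketch illustrates one such training step with a frozen, pretrained discriminator; the exact inputs consumed by the discriminator follow Table 1, and the sketch passes only the predicted eye parameters for brevity.

```python
import tensorflow as tf

@tf.function
def eye_estimator_step(estimator, discriminator, optimizer, features, Z_true):
    """One training step of the eye structure estimator (cf. Fig. 6).

    The frozen discriminator maps predicted eye parameters back to
    aberration coefficients, and the loss compares those against the
    measured input aberrations instead of raw eye parameters.
    """
    with tf.GradientTape() as tape:
        rho_pred = estimator(features, training=True)      # predicted eye parameters
        Z_pred = discriminator(rho_pred, training=False)   # aberrations of the predicted eye
        loss = tf.reduce_mean(tf.abs(Z_pred - Z_true))     # mean-absolute-error loss
    grads = tape.gradient(loss, estimator.trainable_variables)
    optimizer.apply_gradients(zip(grads, estimator.trainable_variables))
    return loss
```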

Fig. 6 Our discriminator-based approach for computing the training loss of the eye structure estimator. The discriminator is used for estimating the aberrations of the eye parameters predicted by the network. The training loss is then calculated by evaluating the loss function with the input (target) and predicted aberrations

6 Improved GPU-based kernel interpolation

6.1 On-axis aberrations

The texture-based PSF interpolation approach described in [49] works by storing each precomputed PSF at every possible blur radius and sampling the PSF texture twice per channel. This approach requires six texture samples to approximate a single chromatic PSF value, significantly reducing the overall throughput of the interpolation process. Furthermore, the memory requirements are also considerably high, increasing the cost of sampling and making it non-trivial to extend the layout with off-axis object locations.

To overcome these issues, we used the object-space distance as the z-axis of the 3D PSF texture. This approach substantially reduces the memory footprint and facilitates the computation of a full chromatic PSF sample using only a single texture lookup. Our layout and the computation of a single depth layer are visualized in Fig. 7.

Fig. 7 Our on-axis PSF texture layout for a single PSF wavelength. Each object distance \(d_{k}\) in the precomputed PSF grid is assigned a depth layer. Next, transitional layers are inserted between neighboring slices, using the differences in blur sizes for determining the number of layers inserted. Each layer is computed by interpolating the corresponding PSFs of the grid along the blur size and object distance axes

Each 2D layer of our PSF texture corresponds to a distinct object distance and is constructed by computing the interpolated blur radius and neighboring precomputed PSFs along the blur size and object distance axes for each channel. The PSFs are also aligned to the center of the texture, with empty regions filled with zeros and the texture dimensions computed from the maximum possible PSF sizes.

To sample the PSF texture for an output pixel \(p_{o}\) and sample pixel \(p_{s}\), we calculate the 2D sample coordinates using the image-space coordinates of \(p_{s}\) relative to \(p_{o}\) and utilize them as an offset from the texture center. The object-space depth of \(p_{s}\) is then used to calculate the third sample coordinate based on the smallest and largest PSF object distances. Because the texture is padded with zeros, samples falling outside the PSF will naturally not contribute to the output, which is critical for properly handling the varying per-channel blur radii stemming from chromatic aberration.
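
The following Python sketch illustrates this coordinate computation; it assumes a simple linear depth mapping for brevity, whereas the non-uniform layer placement of Sect. 6.2 requires a per-layer lookup instead, and the variable names are illustrative.

```python
def on_axis_psf_texcoord(p_o, p_s, depth_s, tex_size, d_min, d_max):
    """Normalized 3D coordinates for sampling the on-axis PSF texture.

    The image-space offset of the sample pixel p_s from the output pixel
    p_o becomes an offset from the texture center, and the sample's
    object-space depth selects the slice between the nearest (d_min) and
    farthest (d_max) PSF object distances.
    """
    width, height, _layers = tex_size
    u = 0.5 + (p_s[0] - p_o[0]) / width
    v = 0.5 + (p_s[1] - p_o[1]) / height
    w = (depth_s - d_min) / (d_max - d_min)
    return u, v, min(max(w, 0.0), 1.0)
```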

6.2 Non-uniform depth sampling

The sampling of the object-space depth parameter is crucial when constructing the texture layers. A dense sampling causes excessive texture sizes, wasting GPU memory and negatively impacting performance. However, a large distance between two slices can lead to blur size differences exceeding one, producing visible artifacts. To avoid these problems, we employ non-uniform sampling, whereby we take a set of layers corresponding to the object-space depths of the PSF grid and insert a varying number of layers in-between. The number of transitional layers is computed as:

$$ L_{d}^{k} = \lceil | {R_{k + 1} - R_{k} } | \rceil - 1, $$
(8)

where \(d_{k}\) is the \(k\) th object distance in the precomputed PSF grid, \(L_{d}^{k}\) is the number of transitional layers inserted after the layer corresponding to \(d_{k}\), and \(R_{k}\) is the largest blur radius (across all wavelengths) of the precomputed PSFs at \(d_{k}\).

This sampling scheme guarantees that the total number of depth slices is minimal and ensures that the blur size difference between two neighboring slices never exceeds one, resulting in fast and artifact-free PSF interpolations. The layers in Fig. 7 were computed using this approach, with \(L_{d}^{k} = 4\).
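
A small Python illustration of (8) is given below; the example blur radii are arbitrary, and the clamp to zero is an added guard for neighboring slices whose blur radii differ by less than one pixel.

```python
import math

def transitional_layer_count(R_k, R_k1):
    """Transitional layers inserted between d_k and d_{k+1}, per Eq. (8)."""
    return max(math.ceil(abs(R_k1 - R_k)) - 1, 0)

# E.g., blur radii of 3.2 and 8.0 pixels give ceil(4.8) - 1 = 4 transitional
# layers, so neighboring slices never differ by more than one pixel of blur.
print(transitional_layer_count(3.2, 8.0))  # 4
```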

6.3 Off-axis aberrations

A significant benefit of our on-axis texture layout is that the reduced sample counts and memory footprint facilitate its extension with off-axis PSFs by using the on-axis texture as a block corresponding to a single horizontal and vertical angle. Multiple blocks are simply laid out horizontally and vertically for multiple input directions. We start with one layer corresponding to each unique combination of precomputed incidence angles and object distance (denoted \(\left( {h_{i} ,v_{j} ,d_{k} } \right)\)) and place transitional layers in-between along the three axes, with the number of layers inserted determined as follows:

$$ L_{h}^{i} = \mathop {\max }\limits_{{\begin{subarray}{*{20}c} {j = 0, \ldots ,N_{v} - 1} \\ {k = 0, \ldots ,N_{d} - 1} \\ \end{subarray} }} L\left( {R_{i,j,k} , R_{i + 1,j,k} } \right), $$
(9)
$$ L_{v}^{j} = \mathop {\max }\limits_{{\begin{subarray}{*{20}c} {i = 0, \ldots ,N_{h} - 1} \\ {k = 0, \ldots ,N_{d} - 1} \\ \end{subarray} }} L\left( {R_{i,j,k} , R_{i,j + 1,k} } \right), $$
(10)
$$ L_{d}^{k} = \mathop {\max }\limits_{{\begin{subarray}{*{20}c} {i = 0, \ldots ,N_{h} - 1} \\ {j = 0, \ldots ,N_{v} - 1} \\ \end{subarray} }} L\left( {R_{i,j,k} , R_{i,j,k + 1} } \right), $$
(11)
$$ L\left( {R_{1} ,R_{2} } \right) = \left\lceil {\left| {R_{1} - R_{2} } \right|} \right\rceil - 1, $$
(12)

where \(R_{i,j,k}\) is the largest blur radius (across all wavelengths) of the precomputed PSFs at \(\left( {h_{i} ,v_{j} ,d_{k} } \right)\), \(N_{h}\), \(N_{v}\), and \(N_{d}\) are the number of horizontal, vertical, and depth samples in the PSF grid, and \(L_{h}^{i}\), \(L_{v}^{j}\), and \(L_{d}^{k}\) are, respectively, the numbers of transitional layers inserted along the horizontal, vertical, and depth axes after the layer at \(\left( {h_{i} ,v_{j} ,d_{k} } \right)\). This layout is visualized in Fig. 8 with \(L_{h}^{i} = 2\), \(L_{v}^{j} = 1\), and \(L_{d}^{k} = 1\).

Fig. 8 Our off-axis PSF texture layout, an extension of our on-axis layout shown in Fig. 7, for a single wavelength. A layer is assigned to each unique PSF parameter combination \(\left( {h_{i} ,v_{j} ,d_{k} } \right)\). Transitional layers are then inserted along each axis and the neighboring precomputed PSFs are interpolated to produce the final texture values

Aberrations at off-axis object locations are often much larger, considerably increasing the sizes of their PSFs. Combined with the increased number of texture layers, the PSF texture dimensions grow rapidly, making the total memory requirements prohibitive in many scenarios. To alleviate this issue, we apply a layer reduction term to \(L\) in (12):

$$ L^{\prime}\left( {R_{1} ,R_{2} } \right) = \left( {s \cdot L\left( {R_{1} ,R_{2} } \right)} \right)^{ - p} \cdot L\left( {R_{1} ,R_{2} } \right), $$
(13)

where \(s\) and \(p\) are user-controllable parameters. Alternatively, a fixed number of transitional layers can also be used (\(L^{\prime}\left( {R_{1} ,R_{2} } \right) = C\), with \(C\) a configurable parameter), but our dynamic formulation in (13) is much more robust to coarse PSF parameter sampling and large blur size differences.
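
The following Python sketch applies (13) to the layer count of (12); rounding the reduced count down to an integer is our assumption here, and \(s = 1\), \(p = 1/2\) correspond to the quality-mode setting used in Sect. 7.2.2.

```python
import math

def reduced_layer_count(R1, R2, s=1.0, p=0.5):
    """Layer count of Eq. (12) with the reduction term of Eq. (13) applied."""
    L = max(math.ceil(abs(R1 - R2)) - 1, 0)
    if L == 0:
        return 0                       # nothing to reduce
    return int((s * L) ** (-p) * L)    # e.g. L = 16, s = 1, p = 1/2 -> 4
```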

To obtain an interpolated off-axis chromatic PSF sample, we determine the closest two horizontal and vertical blocks, take a sample from all four blocks using the on-axis sampling strategy described in Sect. 6.1, and produce the output using a bilinear interpolation of the four samples, based on the fractional block indices corresponding to the sample.

We note that continuously storing blocks of PSF pixels is a viable alternative, whereby each block corresponds to a single PSF pixel and the number of blocks equals the largest PSF size. Although this layout facilitates the off-axis PSF computation using a single hardware-accelerated texture sample, it displayed highly suboptimal memory access patterns in our experiments, significantly degrading the rendering performance. Furthermore, the fragment-merging step of tiled PSF splatting causes fractional image-space sample locations, which necessitates the manual sampling and interpolation of the PSF across the visual field, further reducing the overall rendering throughput. Consequently, we consider this layout unsuitable for practical applications.

6.4 Variable pupil sizes and focus distances

To facilitate the on-the-fly change of the pupil size and focus distance parameters, Csoba and Kunkli extended their PSF grid with the corresponding dimensions and sampling [49] and constructed the PSF texture in each frame using the current pupil and focus settings. Because the inclusion of the incidence angles extends the PSF parameter space by two additional dimensions, computing a single pixel of our per-frame off-axis PSF texture requires 192 samples (64 per channel) from the precomputed PSF grid, which enormously increases the cost of the per-frame PSF construction process.

To overcome this problem, during precomputation, we construct a GPU texture for each combination of pupil size and focus distance sample using the strategy outlined in Sect. 6.3. To construct the per-frame texture, we sample the four relevant neighboring textures (selected using the current pupil and focus settings of the simulation) and use bilinear interpolation to compute the PSF pixel. Finally, to keep the memory costs tractable, we exclude transitional blocks from the caches and per-frame texture when utilizing dynamic eye settings. Consequently, the precomputed and per-frame PSF textures share the same dimensions, facilitating the sampling of the caches using a single texture lookup, significantly reducing the length of per-frame PSF texture preparation.
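
The following Python sketch illustrates the per-frame blending of the cached textures; in practice this step runs on the GPU, and the NumPy version only demonstrates the interpolation logic with illustrative names.

```python
import numpy as np

def per_frame_psf_texture(caches, pupil_sizes, focus_distances, A, f):
    """Bilinear blend of the four cached PSF textures nearest to the current
    pupil size A and focus distance f (cf. Sect. 6.4).

    `caches[i][j]` is the texture precomputed for pupil_sizes[i] and
    focus_distances[j]; all cached textures share the same dimensions,
    so their texels can be blended directly.
    """
    i = int(np.clip(np.searchsorted(pupil_sizes, A) - 1, 0, len(pupil_sizes) - 2))
    j = int(np.clip(np.searchsorted(focus_distances, f) - 1, 0, len(focus_distances) - 2))
    wa = (A - pupil_sizes[i]) / (pupil_sizes[i + 1] - pupil_sizes[i])
    wf = (f - focus_distances[j]) / (focus_distances[j + 1] - focus_distances[j])
    return ((1 - wa) * (1 - wf) * caches[i][j]
            + wa * (1 - wf) * caches[i + 1][j]
            + (1 - wa) * wf * caches[i][j + 1]
            + wa * wf * caches[i + 1][j + 1])
```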

7 Results

7.1 Estimators

We first evaluate our new neural network-based estimators. We used the Python programming language for data generation, the TensorFlow [86] programming library to construct and train our neural networks, the C++ binding of TensorFlow (with a CPU backend) to utilize the trained networks in our rendering framework, and the patternsearch optimizer of MATLAB [87] to implement the previous GPS-based eye estimation method of Csoba and Kunkli [49] for comparing our results against. Finally, we used an AMD Ryzen 7 1700 3.00 GHz CPU for data generation and an NVIDIA TITAN Xp graphics card for network training.

7.1.1 Training data generation

We generated 2,000,000 samples for the aberration-based datasets, which took approximately five days per on-axis dataset (discriminator and eye structure estimator) and seven days for the off-axis dataset. Regarding the focused eye estimator, we generated 40 focus distance and 25,000 eye parameter samples, resulting in a dataset of 1,000,000 samples and taking approximately six days to generate. The statistical properties of the datasets are provided in Appendix 2.

Our experiments with smaller datasets displayed a considerable decrease in network accuracies, and thus, increasing the size of the training datasets would likely improve the precision of our estimators. Our largest dataset consumes about 1.1 GB of memory; therefore, storage is not a limiting factor. However, reducing the length of the data generation procedure would be necessary for increasing the dataset sizes while keeping the computational costs tractable.

7.1.2 Network training

Before training, we split the training datasets into training (85%) and validation sets (15%) using uniform random sampling. The feature and target columns were then standardized as \(\hat{x} = \left( {x - \overline{x}} \right)/\sigma_{x}\), where \(x\) and \(\hat{x}\) are the original and standardized columns, and \(\overline{x}\) and \(\sigma_{x}\) are the mean and standard deviation of \(x\) across all samples. However, to ensure that predicted eye parameters are valid and because of the \({\text{tanh}}\) output function of the eye estimator, we normalized the target columns of the eye estimator into the range \(\left[ { - 1, 1} \right]\) (as described in Sect. 5.3) as \(\hat{x} = - 1 + 2\left( {x - x_{{{\text{min}}}} } \right)/\left( {x_{{{\text{max}}}} - x_{{{\text{min}}}} } \right)\), where \(x_{{{\text{min}}}}\) and \(x_{{{\text{max}}}}\) denote, respectively, the minimum and maximum values of \(x\) across all samples.
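
For reference, the two per-column scalings can be sketched in Python as follows, operating on the dataset matrices column-wise:

```python
import numpy as np

def standardize(x):
    """Zero-mean, unit-variance scaling applied to most feature/target columns."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def to_unit_range(x):
    """[-1, 1] scaling used for the eye structure estimator targets (tanh output)."""
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return -1.0 + 2.0 * (x - x_min) / (x_max - x_min)
```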

To train the neural networks, we employed the Rectified Adam optimizer [88], a modified version of the commonly used Adam algorithm [89]. We also utilized the lookahead mechanism [90] for improving the training speed and weight decay [91] for regularization. Each network was trained for 30 epochs using mini-batches of size 1024 and exponentially decaying learning rates and weight decay rates. For the training loss, we utilized the mean absolute error (MAE) function:

$$ {\text{MAE}}\left( {y,\hat{y}} \right) = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {\hat{y}_{i} - y_{i} } \right|, $$
(14)

where \(y_{i}\) and \(\hat{y}_{i}\) are, respectively, the \(i\) th elements of the true and predicted output vectors. Finally, training took two and a half hours (discriminator), six hours (eye estimator), three hours (aberration estimator), and one hour (focus estimator).

7.1.3 Ablation study

We tested the following network configurations for evaluating the accuracy of our estimators and training process:

  • FFN: A feedforward network with hidden blocks comprising a dense, BN, and ReLU layer.

  • ResNet [79]: The ResNet regressor proposed by Chen et al., with ReLU activations, BN normalization, and the placement of the BN layers following the model in [79].

  • ResNet (ours): Our modified ResNet architecture proposed in Sect. 5.3, with Mish activation, LN normalization, and our modified LN layer placement.

The networks were constructed such that the numbers of network parameters are similar in each network corresponding to the same estimator, which ensures that our results demonstrate the differences stemming from the network structures.

The results, summarized in Table 3, clearly demonstrate that our modified ResNet model performed the best for all estimators. Despite the ResNet regressor by Chen et al. being more accurate than the classic feedforward approach, our modified ResNet architecture showed significant further improvements. Additionally, unlike some of the tested network configurations, all the estimators built using our proposed architecture were sufficiently accurate for our eye structure and aberration estimation purposes, demonstrating the benefits of our proposed network configuration.

Table 3 Properties related to the structure and training of our neural networks

In general, off-axis aberration estimation was the most challenging problem, owing largely to aberrations being significantly larger for off-axis object locations, as demonstrated by the statistical properties of the datasets in Appendix 2. The MAE differences between the discriminator and aberration estimator are consistent with this observation.

Finally, we also trained the eye structure estimator without the discriminator for evaluating our proposed discriminator-based training approach. We used the discriminator on the trained network to obtain the MAE of the aberrations on the validation dataset. The result, included in Table 3, demonstrates that the MAE of the network trained using our proposed approach (0.009) is an order of magnitude smaller than that of the straightforward approach (0.076), verifying that our proposed loss computation method is vital for estimating the physical eye structure with sufficient accuracy.

7.1.4 Computational performance

We measured the computation times of the eye structure, focus, and aberration estimation steps using the previous and our new approach. We used several different conditions with distinct aberration profiles; emmetropia, myopia, and astigmatism for eyes with only low-order aberrations, and keratoconus, cataract, and post-LASIK surgery to evaluate arbitrary aberration compositions. For PSF parameter sampling, we used \(N_{\lambda } = 3\) wavelengths, \(N_{A} = 6\) pupil sizes, and \(N_{f} = 7\) focus distances for focus estimation and on-axis aberrations, and \(N_{h} = 19\) horizontal angles, \(N_{v} = 11\) vertical angles, a relaxed eye, and a 5 mm pupil for off-axis aberrations.

The results, shown in Table 4, demonstrate the suitability of our approach for interactive applications, with the combined length of the computations consistently staying in the sub-second regime. The most time-consuming step was aberration estimation, resulting from the large number of network evaluations. Compared to the previous approach, our new solution was at least three orders of magnitude faster in all computations, with the largest difference displayed in eye structure estimation. The cost of the previous process stems from the large number of aberration computations, which is fully circumvented by our new approach, whereby the eye structure is estimated using a single neural network evaluation.

Table 4 Computation times (in seconds) of the eye and aberration estimation steps using the previous and our proposed approach for various eye conditions, with the number of invocations shown in parentheses

7.1.5 Estimation accuracy

Although our main goal was to reduce computational costs, we also evaluated the six eye conditions described in Sect. 7.1.4 to verify that our approach retains an accuracy similar to the previous method and to test it on realistic inputs. Our results, listed in Table 5, contain the mean absolute errors produced by the previous and our proposed approaches, along with the mean absolute target values as reference.

Table 5 Mean absolute target values and estimation errors produced by the previous and our new eye and aberration estimation procedures for the different eye conditions evaluated in this paper

Our proposed approach achieved very high accuracy in these realistic scenarios, with the eye estimator showing similar accuracy as the previous solution, despite the substantially shorter computation times. Furthermore, our neural network-based approach even outperformed the previous method in one scenario (cataract), demonstrating the robustness of our proposed solution. In the other tasks, the previous method formed the baseline and thus produced no errors. For our approach, the most challenging problem was estimating focus, owing largely to the nonlinearities of the eye parameter changes that comprise the network targets. However, our estimators achieved high accuracies, making our proposed approach perfectly suitable for real-world applications, especially considering the significant reduction in computational costs compared to the previous method.

7.2 Rendering

In the remainder of the section, we evaluate our improved on-axis and new off-axis rendering approach. To this end, we utilized the six eye conditions described in Sect. 7.1.4 and a 50° vertical field-of-view (FOV) with a 1280 × 720 resolution. The input images were rendered using a rasterization-based pipeline and pinhole camera model, with vision simulation performed as a post-processing filter. Alternatively, other image-forming methods (such as ray tracing or RGB-D cameras) could be used to obtain the input images, provided that per-pixel color and depth information is available.

Regarding the input test scenes, we utilized a simple setup (Primitives) and a more realistic test scene (San Miguel) to evaluate our proposed solution, which correspond to different application scenarios and display different overall characteristics. Simulations for other input scenes and eye configurations are also available as supplementary material.

7.2.1 On-axis aberrations

We implemented the tiled PSF-splatting approach described in [49] using the C++ programming language and OpenGL graphics library [92]. We utilized a GPU-based implementation [76] of the ENZ approach [74] to compute the PSFs and our new neural network-based solution to obtain the aberration coefficients; we used \(N_{d} = 33\) object distances, \(N_{A} = 6\) aperture diameters, \(N_{f} = 7\) focus distances, and \(N_{{\uplambda }} = 3\) wavelengths to construct the grid, resulting in 4,158 unique PSFs and taking approximately fifteen seconds to compute.

To compare our results against, we implemented the original PSF texture layout proposed in [49] as well as our new, significantly improved approach. We also implemented Barsky’s layered algorithm [58], using \(N_{d} = 41\) object distances, \(N_{{\uplambda }} = 3\) wavelengths, and the ENZ PSF model. Finally, to produce the ground-truth reference simulations, we utilized a CPU-based method that uses the dense PSF grid, whereby each input pixel was assigned to a depth-based bin and convolved with the corresponding PSF. The binning process resulted in 595 (Primitives) and 419 (San Miguel) depth bins, taking approximately eleven minutes to compute each image. The simulations are shown in Figs. 9 and 10.

Fig. 9 On-axis simulations for the three eye conditions comprising only low-order aberrations

Fig. 10 On-axis simulations for the three eye conditions comprising mixed aberrations

We first compare the per-frame PSF texture sizes and texture computation times of the layout proposed in [49] and our improved approach. The results are summarized in Table 6. As demonstrated by the results, our depth-based texture layout significantly reduces the number of texture layers, with our dynamic layer insertion strategy ensuring that the number of layers is minimal. Consequently, the memory footprint of the texture is also greatly reduced, significantly shortening the per-frame computation costs.

Table 6 Properties of the previous and our proposed PSF texture layouts

Next, we measured the total rendering times for Barsky’s method [58] as well as the tiled approach with the previous [49] and our new PSF interpolation strategies. The results, summarized in Table 7, clearly demonstrate that our proposed method significantly outperforms both previous approaches. Since Barsky’s algorithm relies on CPU-based convolution and object detection, its computation time is not suitable for interactive applications. However, our proposed solution achieves a large speedup compared to the previous tile-based approach as well, with the rendering performance more than doubled in all cases, owing mainly to the reduced memory footprint and texture sample counts. Critically, our proposed method also facilitates the simulation of arbitrary eye conditions in fully real-time environments, a task impossible with the previous approaches.

Table 7 Total rendering times of the two previous approaches and our proposed on-axis aberration simulation method

We also evaluated the accuracy of all approaches by computing the peak signal-to-noise ratio (PSNR) for all outputs using the ground-truth simulations as references. The results, summarized in Table 8, verify that our proposed approach achieves very low error levels and accurately approximates the true PSF. Additionally, our proposed method significantly outperformed Barsky’s method in all scenarios, despite the lower number of depth-dependent PSFs used in the coarse grid of our approach. This behavior can be attributed mostly to the lack of proper per-pixel PSF approximation, and the handling of partial occlusion being limited to the edges of depth slices in Barsky’s algorithm. Finally, our proposed texture layout caused no meaningful increase in the errors compared to the tile-based approach of Csoba and Kunkli either, despite the substantial reduction in memory requirements and computational costs. Our new method essentially works by discarding all unused data from the PSF texture, leading to no significant loss of information.

Table 8 PSNRs for the simulations generated using the two previous and our proposed on-axis approaches

Finally, we also highlighted a few areas with the largest deviations from the references in Figs. 9 and 10 for all outputs. As can be seen, our proposed approach approximates the ground-truth references with high accuracy, producing nearly identical outputs. Although the previous method by Csoba and Kunkli is capable of the same accuracy, our new method achieves these results at significantly lower memory and computational costs. The limitations of Barsky’s approach are also apparent, as the inaccurate convolution kernels and the lack of per-pixel treatment of partial occlusion led to visible deviations from the references.

7.2.2 Off-axis aberrations

We extended our implementation of the tiled PSF-splatting approach described in Sect. 7.2.1 with our proposed off-axis PSF layout. Due to the significantly larger amount of PSF data, we considered two distinct use cases (the total PSF counts given below are verified in the short arithmetic check after the list):

  • quality mode: The PSF grid was sampled using \(N_{h} = 31\), \(N_{v} = 21\), \(N_{d} = 9\), \(N_{A} = 1\), \(N_{f} = 1\), and \(N_{{\uplambda }} = 3\) samples, resulting in 17,577 total PSFs and taking approximately two minutes to compute. Transitional layers were utilized in the per-frame PSF texture using (13), with \(s = 1\) and \(p = 1/2\) chosen empirically, providing a good balance between quality and texture memory.

  • dynamic mode: The PSF grid was sampled with \(N_{h} = 23\), \(N_{v} = 13\), \(N_{d} = 9\), \(N_{A} = 5\), \(N_{f} = 5\), and \(N_{{\uplambda }} = 3\) samples, resulting in 201,825 total PSFs and taking approximately twenty minutes to compute. No transitional layers were used in the per-frame PSF texture.
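The total PSF counts above follow directly from the product of the per-dimension sample counts, assuming the grid is a full Cartesian product of the listed samples:

```python
from math import prod

# quality mode: 31 * 21 * 9 * 1 * 1 * 3 = 17,577 PSFs
quality = prod([31, 21, 9, 1, 1, 3])

# dynamic mode: 23 * 13 * 9 * 5 * 5 * 3 = 201,825 PSFs
dynamic = prod([23, 13, 9, 5, 5, 3])

assert quality == 17577 and dynamic == 201825
```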

For comparison, we implemented the algorithm of Rodríguez Celaya et al. [26]. We constructed a PSF grid using the ENZ PSF model with \(N_{h} = 5\), \(N_{v} = 3\), \(N_{d} = 2\), and \(N_{{\uplambda }} = 3\) samples (covering the near and far planes of the input over a 50° vertical FOV) and performed convolution with per-pixel PSF interpolation to obtain the output. We also implemented the layered algorithm by Gonzalez-Utrera [27] with \(N_{h} = 11\), \(N_{v} = 11\), \(N_{d} = 3\), and \(N_{{\uplambda }} = 3\) PSF grid samples. The input images were split into three depth-based layers, which were convolved using a per-pixel interpolation of the off-axis PSFs and composited to obtain the outputs. Finally, to produce the ground-truth reference simulations, we used a CPU-based method operating on the dense PSF grid, whereby each input pixel was assigned to a unique bin (based on depth and incidence angles) and convolved with the corresponding PSF. The binning process resulted in 194,243 (Primitives) and 127,548 (San Miguel) PSF bins, with the total computation taking several hours per image. The simulations are shown in Figs. 11 and 12.
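Both comparison methods rely on per-pixel interpolation of a coarse off-axis PSF grid. The sketch below illustrates this general idea for the two angular dimensions only (the actual methods also interpolate over depth and wavelength); the function and parameter names are hypothetical and are not taken from the cited implementations.

```python
import numpy as np

def interpolate_psf(psf_grid, h_angles, v_angles, pixel_h, pixel_v):
    """Bilinear per-pixel PSF interpolation over a coarse angular grid (sketch).

    psf_grid           : (N_h, N_v, K, K) array of precomputed PSF kernels
    h_angles, v_angles : sorted sample angles of the grid, in degrees
    pixel_h, pixel_v   : incidence angles of the current pixel, in degrees
    """
    # locate the enclosing grid cell and the interpolation weights
    i = np.clip(np.searchsorted(h_angles, pixel_h) - 1, 0, len(h_angles) - 2)
    j = np.clip(np.searchsorted(v_angles, pixel_v) - 1, 0, len(v_angles) - 2)
    th = (pixel_h - h_angles[i]) / (h_angles[i + 1] - h_angles[i])
    tv = (pixel_v - v_angles[j]) / (v_angles[j + 1] - v_angles[j])

    psf = ((1 - th) * (1 - tv) * psf_grid[i, j]
           + th * (1 - tv) * psf_grid[i + 1, j]
           + (1 - th) * tv * psf_grid[i, j + 1]
           + th * tv * psf_grid[i + 1, j + 1])
    return psf / psf.sum()   # renormalize so the kernel preserves energy
```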

Fig. 11 Off-axis simulations for the three eye conditions comprising only low-order aberrations

Fig. 12 Off-axis simulations for the three eye conditions comprising mixed aberrations

First, we evaluate the properties of the per-frame PSF texture, which are summarized in Table 9. The per-frame PSF texture is the only memory cost of the quality setup; in the dynamic setup it is smaller, owing to the smaller PSF grid and the omission of transitional layers. The dynamic setup, however, also requires several cached PSF textures, which constitute its main memory bottleneck. Even so, this represents the worst-case memory requirement and demonstrates that our approach remains suitable for commodity graphics hardware. Moreover, the PSF grid sampling can be reduced further to shrink the memory footprint of the cache. Finally, because the aperture and focus parameters are static in the quality setup, the per-frame texture-building step can be skipped entirely, whereas it remains an extra cost in the dynamic setup.

Table 9 Per-frame PSF texture properties for the two off-axis configurations and various eye conditions evaluated in this paper

Next, we measured the total rendering times of the previous algorithms and of the two setups of our proposed approach. The results, summarized in Table 10, demonstrate that both existing approaches are highly inadequate for interactive environments, with the computations taking at least one hour per image and varying considerably with the input eye condition. Our new approach, in contrast, exhibits performance suitable even for real-time applications, with only a very small difference between the two setups. Furthermore, its rendering times are much more consistent across eye conditions, with the input scene being the main differentiating factor. Overall, Primitives proved more challenging, owing largely to the higher number of memory transactions needed to access the per-frame PSF texture, as indicated by the higher number of PSF bins generated by the ground-truth algorithm. Critically, our proposed method also facilitates the interactive change of eye settings, which was impossible with the previous approaches.

Table 10 Total rendering times and PSNRs of the two previous approaches and the two setups of our proposed off-axis aberration simulation method

We also evaluated the accuracy of all methods against the ground-truth reference images, with the resulting PSNRs summarized in Table 10. The values obtained with our approach indicate that it is well suited for real-world applications, with the quality setup consistently outperforming the dynamic setup, owing to the significantly larger amount of PSF information in its per-frame PSF texture. Compared to the previous methods, our proposed approach achieved significantly higher accuracy, owing mostly to its ability to utilize substantially more PSF information, properly handle per-pixel partial occlusion, and much more accurately approximate the true, per-pixel PSF of the simulated eye.

A few areas with the largest mismatches between the reference images and our simulations are also highlighted in Figs. 11 and 12, using the quality setup of our approach. These regions clearly demonstrate that the previous methods are unable to faithfully reproduce the true blurriness levels and PSF shapes of the simulated eye, with object boundaries also visibly distorted in many regions. Both configurations of our new approach, in contrast, closely approximate the ground-truth simulations and avoid all these artifacts. The varying sharpness over the visual field is also faithfully replicated by our approach, which is critical for properly evaluating the performance of the simulated eye and for studying the effects of conditions such as relative peripheral hyperopia (outlined in Sect. 1). The previous techniques cannot properly handle such conditions and thus misrepresent the true optical performance of the simulated eyes.

Finally, videos rendered with our approach are available as supplementary material; they demonstrate the stability and lack of interpolation artifacts under motion and provide further examples of the outputs of our method.

8 Conclusion

In this paper, we presented an efficient solution for simulating the central and peripheral visual aberrations of the human eye. We first proposed a novel, neural network-based method for estimating the physical structure and unknown aberration coefficients of the simulated eye. Our new method is substantially faster than the previous approach and operates in the sub-second regime, facilitating the fully interactive exploration of visual aberrations. We also described an improved PSF interpolation strategy for a tiled PSF-splatting approach to efficiently compute the per-pixel PSFs, and extended it to peripheral vision. We demonstrated that our proposed solution significantly outperforms previous convolution-based approaches both in rendering speed and in memory requirements, without sacrificing rendering quality. Critically, our new solution also facilitates the simulation of off-axis aberrations with high accuracy and an approximately real-time performance profile, which substantially increases the plausibility of the resulting vision simulations and was previously impossible with such low computational costs.

Regarding future research directions, other neural network architectures and dataset generation strategies could lead to further improvements in accuracy and computation times, and recent advancements in hardware-accelerated ray tracing could facilitate the efficient generation of much larger datasets. Extending the eye estimator with multiple input aberration measurements (such as off-axis aberrations) could be beneficial as well. Experimenting with different eye elements (such as gradient-index lenses and intraocular lens implants), which our rendering method naturally supports, is straightforward and could further increase the applicability of our model. Regarding rendering, reducing the GPU memory usage of the precomputed off-axis PSF cache would make it easier to support dynamic pupil and focus changes. Furthermore, our approach could be adapted to other optical systems, providing even more application areas. Finally, experimenting with neural network-based solutions for the full simulation of aberrated vision could be a promising future research direction, with the DeepFocus method of Xiao et al. [35] being a great starting point.