Neural Distortion Fields for Spatial Calibration of Wide Field-of-View Near-Eye Displays

We propose a spatial calibration method for wide Field-of-View (FoV) Near-Eye Displays (NEDs) with complex image distortions. Image distortions in NEDs can destroy the reality of virtual objects and cause sickness. To achieve distortion-free images in NEDs, it is necessary to establish a pixel-by-pixel correspondence between the viewpoint and the displayed image. Designing compact, wide-FoV NEDs requires complex optical designs. In such designs, the displayed images are subject to gaze-contingent, non-linear geometric distortions, which are difficult to represent with explicit geometric models and computationally expensive to optimize. To solve these problems, we propose the Neural Distortion Field (NDF), a fully connected deep neural network that implicitly represents display surfaces that are complexly distorted in space. NDF takes a spatial position and gaze direction as input and outputs the display pixel coordinate and its intensity as perceived in the input gaze direction. We synthesize the distortion map from a novel viewpoint by querying points on the ray from the viewpoint and computing a weighted sum to project the output display coordinates into an image. Experiments showed that NDF calibrates an augmented reality NED with 90$^{\circ}$ FoV to about 3.23 pixels (5.8 arcmin) of median error using only 8 training viewpoints. Additionally, we confirmed that NDF calibrates more accurately than non-linear polynomial fitting, especially around the center of the FoV.


Introduction
Near-eye displays (NEDs) provide an immersive visual experience in virtual reality (VR) and augmented reality (AR) by overlaying virtual images directly onto the user's view. Improving the NED design to enhance its performance inevitably faces various trade-offs between optical design and image quality. Achieving a wide field of view (FoV), a key requirement for NEDs to enhance immersion [1], tends to compromise the form factor [2] or introduce image distortion from the image-expanding optics. Modern NEDs employ a variety of off-axis optics for wide FoV, such as curved beam splitters [3][4][5], holographic optical elements (HOEs) [6,7], and polarization-based pancake optics [8,9]. While they improve the FoV and potentially reduce the display form factor, these complex, off-axis optics cause non-linear, viewpoint-dependent image distortion (Fig. 1, a). Due to this distortion, a rectangular image on the display appears distorted, as if projected onto a curved surface, from the viewpoint. Especially with wide-FoV NEDs, large image distortion occurs at the periphery of the display, causing the image to constantly sway as the eyes rotate and move, a phenomenon called pupil swim [10]. This image distortion and pupil swim can break the reality of virtual objects and, in the worst case, cause severe headaches and nausea.
This paper focuses on modeling this non-linear, viewpoint-dependent image distortion on wide-FoV NEDs. Let D and I be the coordinate systems of the display and the retinal image (or viewpoint camera image), respectively. To formulate the image distortion, we want to know a map function that indicates which coordinates u_D ∈ R^2 on the display image appear at which coordinates u_I ∈ R^2 in the retinal image, as shown in Fig. 1 (b). Especially for wide-FoV NEDs, this map varies not only with u_I but also with the translation and rotation of the eye. In this paper, we denote an eye pose by a 6D vector p = [v, t] ∈ R^6 determined from the 3D rotation vector v ∈ R^3 and the position vector t ∈ R^3. Using these notations, the mapping function we focus on, f : R^8 → R^2, is denoted as

$$u_D = f(u_I, \mathbf{p}). \tag{1}$$

Theoretically, we can estimate f from the optical prescription of the NED. However, deformation of the optical system and its assembly is inevitable due to the manufacturing process and aging. Thus, calibration of f is necessary for practical use. In the case of an NED with a typical beam splitter, the light emitted from each display pixel is perceived as a point light source at the viewpoint (Fig. 2, a). Hence, conventional work [11] recovers f by estimating the 3D position of the point source of each display pixel using triangulation, then projecting it into a retinal image at a novel viewpoint. However, in the case of wide-FoV NEDs, the virtual light sources of each display pixel are distributed in space due to the complex optics of the projection system (Fig. 2, b). Therefore, accurate mapping with wide-FoV NEDs requires estimating the light field formed by the complex optical design.
Furthermore, in wide-FoV NEDs, image distortion changes dynamically with the viewpoint, making image distortion correction more challenging. Modeling and correcting static image distortion in VR-NEDs is a well-established technique [12,13]. For dynamic image distortion correction, the mainstream approaches extend the polynomial models for static image distortion correction to handle translations and rotations of the eye. Although some studies [14][15][16] have dealt with eye translation, their number of coefficients is insufficient to represent dynamic image distortion. Moreover, these studies do not take into account image distortions due to eye rotation, except for the method that approximates the light ray field directly with a Gaussian polynomial fitting [17].
On the other hand, ray-tracing-based approaches [5,10] model image distortion by simulating the multi-stage refraction and reflection of the light rays passing through the optical system. Although these methods achieve high accuracy, they require substantial computation time and are not practical for interactive VR/AR applications. As a hybrid approach, concurrent with our work, Guan et al. applied non-linear dimensionality reduction to pre-traced light rays in a lens design application to simulate viewpoint-dependent image distortion in real time [18].
In contrast, we propose the implicit representation model for the viewpoint-dependent image distortion, which is completely different from the previous approaches. Our Neural Distortion Field (NDF) learns a distortion map directly from a set of observed images without explicitly simulating the light field or optical aberration as polynomial models. NDF is an extension of the Neural Radiance Fields (NeRF) [19], a neural network-based representation model developed for novel view synthesis from multi-view color images. NeRF implicitly learns viewpoint-dependent light reflections and refractions and synthesizes novel-view images. Similarly, NDF is a neural network-based representation of the behavior of light rays passing from each pixel of a display through NED optics. By using volumetric rendering for NDF representation, we can synthesize a non-linear, viewpoint-dependent distortion map from a novel viewpoint.
The key novelty of our work is in applying NeRF's implicit, neural-network-based view synthesis method to image distortion correction. In the field of holography, Neural Holography [20,21] implicitly represents optical misalignment, wave propagation, and display characteristics, and has significantly improved the quality of displayed images and processing time. Similarly, our method aims to provide another solution based on implicit function representation for the problem of image distortion modeling in NEDs, providing advantages such as improved processing time and accuracy. Furthermore, we expect that NDF can be incorporated into existing ray-tracing-based image distortion correction and extended to a hybrid distortion representation model, similar to [18]. As a proof of concept and a preliminary step toward such a model, in this paper we evaluate the accuracy of image distortion reproduction by a fully implicit NDF model.

Contributions.
Our main contributions include the following: • We propose NDF, a neural network model that can implicitly learn complex, viewpoint-dependent image distortion maps of NEDs directly from observed images.
• Experiments using an off-the-shelf wide-FoV AR-NED show that NDF can simulate image distortion as accurately as or better than conventional non-linear polynomial mapping.
• We discuss improvements of NDF for applications in wide FoV HMDs and other optics designs and provide future research directions on image distortion correction with implicit representation models.

Methods
In this section, we describe the basic NDF pipeline. Fig. 3 shows an overview of the pipeline.

Neural Distortion Field for Distortion Map Representation
First, we describe our NDF representation. Briefly, when we look at a point from a certain direction, NDF returns the display pixel coordinate from which the perceived light ray originates and its intensity. NDF is represented as a multi-layer perceptron (MLP) F_Θ whose inputs are 5D coordinates (a spatial position x = [x, y, z]^T ∈ R^3 and a viewing direction (θ, φ)) and whose outputs are the display coordinates u_D and the intensity σ of the light source. In practice, we express the viewing direction as a 3D Cartesian unit vector d ∈ R^3, i.e., F_Θ : R^6 → R^3. Note that later, in Sec. 2.3, we encode the input position x into a higher-dimensional vector γ(x) ∈ R^D, i.e., F_Θ : R^{D+3} → R^3.

We consider the ray connecting the eye position t and each pixel u_I on the retinal image (Fig. 3, a). This ray is denoted as r(s) = t + s d. Since d ∈ R^3 is the direction of the ray, which moves in conjunction with the eye rotation v, the ray r(s) is essentially determined by u_I and p = [v, t]. When we sample a set of positions and viewing directions (r(s), d) along the ray as inputs to the NDF, F_Θ outputs the display coordinate and the intensity (u_D, σ) (Fig. 3, b).

[Fig. 2 caption: (Right) When we use a curved mirror to expand the FoV, the perceived pixel gradually deviates from a point light source. For this simulation, we used a 2D ray optics simulator [22]. (b) Diagram of the relationship between the rays from the viewpoint and the perceived light sources of the NED in the wide-FoV case of (a). In this case, we can model the perceived light sources as multiple translucent curved displays in space.]
Qualitatively, as the light from the microdisplay passes through the optical system, reflections and refractions create numerous transparent display surfaces in space, as shown in Fig. 3 (b). From this, NDF can be regarded as implicitly learning these multiple, translucent display surfaces formed in space.
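To make the ray parameterization concrete, the following sketch builds the ray r(s) = t + s·d for a retinal pixel given an eye pose p = [v, t] (not the authors' code; the pinhole intrinsics, the Rodrigues conversion, and the function names are our assumptions):

```python
import numpy as np

def rotation_from_vector(v):
    """Rodrigues' formula: 3D rotation vector v -> 3x3 rotation matrix."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return np.eye(3)
    kx, ky, kz = v / theta
    K = np.array([[0.0, -kz, ky],
                  [kz, 0.0, -kx],
                  [-ky, kx, 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def ray(u_i, v, t, focal=1000.0, center=(960.0, 540.0)):
    """Build r(s) = t + s * d for retinal pixel u_i and eye pose p = [v, t].
    The pinhole intrinsics (focal, center) are illustrative placeholders."""
    d_cam = np.array([(u_i[0] - center[0]) / focal,
                      (u_i[1] - center[1]) / focal,
                      1.0])
    d = rotation_from_vector(v) @ d_cam   # ray direction rotates with the eye
    d = d / np.linalg.norm(d)
    return (lambda s: t + s * d), d
```

Sampling `(ray(...)[0](s_k), d)` for several depths s_k then yields the NDF inputs.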

Distortion Map Reconstruction from Neural Distortion Field
NDF outputs the display coordinates u_D and their intensity σ, which are the source of the light perceived at (r(s), d). By computing a weighted sum of the outputs along the ray r(s), we can estimate the display pixel that is the source of the light perceived at each pixel u_I on the retinal image (Fig. 3, c). ū_D denotes this weighted sum of the display pixel coordinates.
Here, we sample N points along the ray r(s), indexed as {s_k}_{k=1}^{N} in order of proximity to the viewpoint. The NDF outputs {(u_{D,k}, σ_k)}_{k=1}^{N} from these sampling points as input. Using the outputs, we calculate ū_D as

$$\bar{u}_D(\mathbf{r}) = \sum_{k=1}^{N} T_k \left(1 - \exp(-\sigma_k \delta_k)\right) u_{D,k}, \qquad T_k = \exp\Big(-\sum_{j=1}^{k-1} \sigma_j \delta_j\Big), \tag{2}$$

where δ_k = s_{k+1} − s_k is the distance between adjacent samples. Note that, from Sec. 2.1, since the ray r is determined by u_I and p, Eq. (2) satisfies the form of Eq. (1).
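The weighted sum along a ray can be computed with NeRF-style volume-rendering weights (a minimal numpy sketch; we assume the standard NeRF transmittance weighting, and the function and variable names are ours):

```python
import numpy as np

def render_display_coord(u_d, sigma, s):
    """Weighted sum of per-sample display coordinates along one ray.
    u_d: (N, 2) display coordinates, sigma: (N,) intensities, s: (N+1,) sample depths."""
    delta = np.diff(s)                     # distances between adjacent samples
    alpha = 1.0 - np.exp(-sigma * delta)   # per-sample opacity
    # transmittance T_k = exp(-sum_{j<k} sigma_j * delta_j) = prod_{j<k} (1 - alpha_j)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    w = trans * alpha                      # rendering weights
    return (w[:, None] * u_d).sum(axis=0), w
```

An opaque sample close to the viewpoint receives nearly all of the weight, so the rendered coordinate collapses to that sample's display pixel.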

Optimizing Neural Distortion Field
By applying Eq. (2) to the entire field of view, the display coordinate system is mapped as a 2D manifold on the retinal image (Fig. 3, d). To train NDF, we back-propagate the difference between the ground-truth maps obtained from several viewpoints and the map synthesized from Eq. (2). Let u*_D(r) denote the ground truth of the display coordinates for each ray r. By definition, the number of rays r in a single retinal image is equal to the number of pixels in it. In practice, we randomly sample a batch of rays r from each pixel at each optimization iteration, then compute the total squared loss L:

$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left\| \bar{u}_D(\mathbf{r}) - u^{*}_D(\mathbf{r}) \right\|_2^2, \tag{3}$$

where R denotes the set of randomly sampled rays. The remainder of this subsection introduces improvements to more accurately simulate image distortion: positional encoding (Sec. 2.3.1) and deviation-map learning (Sec. 2.3.2).

[Fig. 3 caption fragment: The output coordinates of the NDF represent which display pixels reach the eye at the input position and viewing direction. (c) We sum the outputs of the NDF at each point on the ray, weighted by the intensity, and (d) estimate the subpixel-wise display coordinates perceived at the eye coordinate that constitutes the ray. During training, we compute the loss between the estimated coordinates and the ground truth, and the loss is back-propagated to NDF.]

Positional Encoding
Instead of training the NDF using Eq. (3) directly, we introduce a technique called positional encoding, which is also used in the original NeRF, to help the neural network capture higher-order image distortions. We encode the input position r(s) ∈ R^3 into a higher-dimensional vector γ(r(s)) ∈ R^D. Positional encoding is generally represented by a combination of trigonometric functions [23], similar to a Fourier transform:

$$\gamma(x) = \left[\sin(2^{0}\pi x), \cos(2^{0}\pi x), \ldots, \sin(2^{L-1}\pi x), \cos(2^{L-1}\pi x)\right], \tag{4}$$

applied to each of the three coordinates, so that D = 6L for L frequency bands. By introducing the encoding, we redefine the MLP function as F_Θ : R^{D+3} → R^3 and the total squared loss L as:

$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left\| \bar{u}_D(\mathbf{r}; \gamma) - u^{*}_D(\mathbf{r}) \right\|_2^2, \tag{5}$$

where ū_D(r; γ) is computed by Eq. (2) with the encoded inputs (γ(r(s_k)), d). Note that, compared to the original NeRF, which targets natural images, NDF deals with distorted image coordinates that vary relatively smoothly in space. Thus, we are interested in the impact of encoding to high frequencies in NDF. In later experiments (Sec. 5.5), we evaluate the accuracy with different L of the positional encoding in NDF.
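The frequency encoding can be sketched as follows (our numpy version of the standard NeRF-style encoding; for a 3D input and L frequency bands it yields a 6L-dimensional vector):

```python
import numpy as np

def positional_encoding(x, L):
    """Map each coordinate of x to [sin(2^l * pi * x), cos(2^l * pi * x)], l = 0..L-1."""
    freqs = (2.0 ** np.arange(L)) * np.pi      # (L,) frequency bands
    angles = x[..., None] * freqs              # (..., 3, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)      # (..., 6L)
```

With L = 16 as in the experiments, a 3D position maps to a 96-dimensional input vector.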

Learning of Deviation Map from a Reference Viewpoint
In general, we normalize the raw data to the range [0, 1] to promote the training of neural networks. In our NDF, the output of the neural network is image coordinates. For example, for a Full HD display, the range of raw output values in the horizontal direction is [0, 1920]. In this case, we would have to multiply the output value of the neural network by approximately 2.0 × 10^3. This operation allows very small rounding errors (< 0.001) in the neural network to significantly affect the final results (< 2 pixels). In the case of NeRF, even if the color changes slightly due to scaling, there is no significant perceptual difference. In the case of NDF, however, this difference appears as a perceptually significant distortion.
To avoid this, we set a reference eye pose p̂ near the center of the eyebox, and we use the measured display coordinates û_D at the reference viewpoint p̂ as the reference map. Then, we train the neural network F_Θ using the deviation Δu_D = u_D − û_D instead of the raw u_D. In our training dataset (Sec. 4), the range of Δu_D is [−41.0, 39.5]. Thus, the scaling factor is 80.5, and we can reduce the effect of rounding errors in the neural network to 1/25 of the case using raw data.
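The scaling argument can be made concrete with a small sketch (the helper names are ours; the deviation range follows the text):

```python
import numpy as np

# Deviation range observed in the training set (Sec. 4)
dev_min, dev_max = -41.0, 39.5
scale = dev_max - dev_min          # 80.5, versus ~1920 for raw coordinates

def normalize_deviation(u_d, u_ref):
    """Train on the deviation from the reference map, squashed to [0, 1]."""
    return (u_d - u_ref - dev_min) / scale

def denormalize_deviation(y, u_ref):
    return y * scale + dev_min + u_ref

# A rounding error of 1e-3 in normalized units now costs ~0.08 px,
# versus ~1.9 px if raw [0, 1920] coordinates were rescaled instead.
```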

Implementation
To demonstrate our concept of implicit distortion map generation, we implemented NDF on top of the mip-NeRF framework [24] implemented in JAX [25]. mip-NeRF streamlines NeRF rendering by extending the NeRF query to an expectation over a spatial region rather than a point, resulting in highly accurate and fast image reconstruction with fewer parameters. Note that we currently select mip-NeRF based on the ease of implementation, training speed, and accuracy. Hence, although performance could be improved by building on other NeRF frameworks, the underlying NDF concept (Sec. 2) remains unchanged.
Sampling Strategy on mip-NeRF. Instead of sampling individual points on a ray, mip-NeRF samples a conical frustum connecting the viewpoint position and the pixel area. As a result, mip-NeRF reduces unpleasant aliasing artifacts and improves the detail representation capability of NeRF. With this improvement, the conical frustum around x is modeled as a multivariate Gaussian distribution, and the mean value E[γ(x)] within the frustum is used as the (integrated) positional encoding.
Architecture. As the intensity network, we use an MLP with eight fully connected ReLU layers of 256 channels each. Then, we connect another MLP with four fully connected ReLU layers of 128 channels each as the coordinate network in the latter stage. The neural network architecture we chose uses the same configuration as NeRF [19], on which this work is based. The original NeRF and derivative studies have adopted the same network architectures for controlled experiments, and our paper follows this convention. In NeRF, increasing the number of layers and channels beyond this does not result in significant improvements in accuracy. To accommodate NDF, we change the dimension of the output layer from the 3D color c to the 2D coordinate Δu_D in the mip-NeRF code base.
In the original NeRF, to reduce the influence of the view direction on the output intensity σ, the direction E[d] is input at a later stage of the network, after σ has been extracted from the former stage. We also adopt this two-stage architecture for NDF, because we consider that the directivity of the light emitted from a display does not change significantly with minute changes in angle.
In mip-NeRF, the activation functions used to generate the color c (in NDF, the map Δu_D) and the intensity σ were sigmoid and SoftPlus, respectively. There are several candidates for these activation functions. As the activation function for the color (in NeRF) or coordinate (in NDF) output, mip-NeRF used sigmoid to constrain the output value c to the [0, 1] floating-point RGB color space. Instead, we consider a piecewise-linear function such as ReLU appropriate, because NDF outputs the coordinate value Δu_D. Also, as the activation function for the intensity output, the original NeRF uses SoftPlus. Instead, we consider sigmoid a possible candidate, because it would be better to adopt a stochastic model, considering that the light emitted from each point of the display gradually disperses from 100%. Based on these hypotheses, we evaluate the impact of different activation functions on accuracy in Sec. 5.5.
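As an illustration of this two-stage layout and the activation choices discussed here, a numpy forward-pass sketch (random weights, not the trained model; layer sizes follow the text, all other names are ours) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)
softplus = lambda a: np.log1p(np.exp(a))

def dense(m, n):
    return rng.normal(0.0, 0.1, (m, n)), np.zeros(n)

def mlp(sizes):
    return [dense(m, n) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, h):
    for W, b in layers:
        h = relu(h @ W + b)
    return h

D = 96                                    # encoded position (L = 16 -> 6L dims)
trunk = mlp([D] + [256] * 8)              # 8 fully connected ReLU layers, 256 ch. each
W_sig, b_sig = dense(256, 1)              # intensity head (position features only)
coord_net = mlp([256 + 3] + [128] * 4)    # direction d is injected here (late fusion)
W_u, b_u = dense(128, 2)                  # 2D coordinate-deviation head

def ndf(gamma_x, d):
    feat = forward(trunk, gamma_x)
    sigma = softplus(feat @ W_sig + b_sig)[0]         # intensity (SoftPlus)
    h = forward(coord_net, np.concatenate([feat, d]))
    delta_u = relu(h @ W_u + b_u)                     # coordinates (ReLU)
    return delta_u, sigma
```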
Training. We train NDF with Adam [26], a batch size of 1024, and a learning rate annealed logarithmically from η_0 = 5 · 10^−4 to η_f = 5 · 10^−6. We train all NDF models up to 5.0 × 10^5 iterations, at which point the training error no longer decreases on a logarithmic scale. Later, in Sec. 5.5, we evaluate the relationship between the number of training iterations and accuracy in detail.
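The logarithmic annealing can be written as log-space interpolation (our sketch; the exact schedule in the mip-NeRF code base may differ, e.g., by a warmup phase):

```python
import numpy as np

def lr_at(step, num_steps=500_000, lr_init=5e-4, lr_final=5e-6):
    """Log-linear interpolation from lr_init to lr_final over training."""
    t = np.clip(step / num_steps, 0.0, 1.0)
    return float(np.exp((1.0 - t) * np.log(lr_init) + t * np.log(lr_final)))
```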
Our network takes about 4 hours to train on an NVIDIA RTX 3090 GPU and about 20 seconds to generate a whole distortion map. The map generation could be accelerated to real time with more recent NeRF architectures, as discussed later in Sec. 6.

Data Acquisition
To compare NDF with other mapping estimation methods, we acquire a dataset using a commercial wide-FoV AR-NED. Sec. 4.1 describes the hardware setup for capturing NEDs from different viewpoints. Then, Sec. 4.2 describes the viewpoint camera locations and sampling intervals for training and testing. Finally, Sec. 4.3 describes how to obtain the correspondence between viewpoint image coordinates and display coordinates at each viewpoint.

Hardware Setup
Figure 4 (a) shows the hardware setup of our experiment. We use a Meta 2 (Meta Company, 90° FoV) as a wide-FoV AR-NED with curved beam combiners, a Dell U2718Q as a background display, and two Blackfly S Color 12.3 MP USB3 cameras as the viewpoint camera and the world camera, respectively. We mount the AR-NED and the world camera on a composite translation stage, which moves in the x-, y-, and z-directions. We fix the positions of the viewpoint camera and the background display with 3D-printed jigs and move the OST-HMD with respect to them. In other words, the viewpoint camera position is translated relative to the NED, and the world camera position is treated as the origin. To prevent the background display from reflecting the room lights, we cover the entire setup with black cloth.

Viewpoint Camera Positions
Before obtaining the coordinate transformation map between the HMD and the viewpoint camera, we set the measurement viewpoint positions for both training and testing. Fig. 5 shows the viewpoint camera positions. For training, we capture data from 125 viewpoints at the vertices of the grid cells of an eyebox cube divided into 4^3 cells, as shown in Figure 5 (a). Each grid cell is 3 mm on a side, so the entire eyebox cube is 12 mm on a side. We use these 5^3 = 125 datasets as training data. In the experiment, we also evaluated the accuracy of each method on datasets with wider intervals (i.e., fewer viewpoints). With a spacing of 6 mm, the number of training viewpoints is 3^3 = 27. With a spacing of 12 mm, we use only the 2^3 = 8 viewpoints that form the corners of the eyebox as training data.
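The three training grids can be generated with a simple sketch (positions in mm relative to one eyebox corner; the function name is ours):

```python
import numpy as np

def eyebox_grid(spacing_mm, size_mm=12.0):
    """Vertices of a cubic grid with the given spacing inside a 12 mm eyebox cube."""
    axis = np.arange(0.0, size_mm + 1e-9, spacing_mm)
    return np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), -1).reshape(-1, 3)
```

Spacings of 3, 6, and 12 mm yield the 125-, 27-, and 8-viewpoint training sets, respectively.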
We capture data from 48 viewpoints for testing, as shown in Fig. 5 (b). We sample 12 points from each diagonal of the 12 mm eyebox cube and use these 48 points as viewpoint positions for testing. This sampling for testing is based on [16] and is designed so that the distribution of test points covers the entire eyebox as uniformly as possible.
To detect the pose of the viewpoint camera, we display a 38.85 mm AR marker with a 4×4 binary pattern on the background display. Then, we obtain the viewpoint camera pose p as a relative pose with the world camera as the origin. Since we manually adjust the translation stage position, there is a slight discrepancy between the ideal viewpoint positions described above and the actual measurement positions. However, this slight discrepancy does not affect the training process, because we train the MLP with the viewpoint poses obtained from the actual measurements.

Obtaining Map at Each Viewpoint
At each viewpoint, we obtain the correspondence between the coordinate system on the viewpoint camera and the display coordinate system. Let N_t be the number of training viewpoints and {p*_i = [v*_i, t*_i]}_{i=1}^{N_t} be the set of eye poses at the training viewpoints. At each training viewpoint p*_i, we obtain the ground truth of the mapping function from u_I to u_D, denoted φ*_i : R^2 → R^2. This φ*_i can also be regarded as f given the eye pose p*_i, i.e.,

$$u_D = \varphi^{*}_i(u_I) = f(u_I, \mathbf{p}^{*}_i). \tag{6}$$
To establish the set of maps at the training viewpoints {φ*_i}_{i=1}^{N_t}, we first display gray-code pattern images and capture them with the viewpoint camera. Then, we obtain the discrete correspondences between viewpoint coordinates and display coordinates from the gray-code images as a look-up table (LUT). At viewpoint p*_i, M_i denotes the number of pairs for which we can obtain a correspondence between the coordinate systems, and {(u_I^j, u_D^j)}_{j=1}^{M_i} denotes the LUT of coordinate pairs.
Then we apply Gaussian kernel regression to interpolate this LUT as a continuous function φ*_i [17]. We express φ*_i as a Gaussian polynomial model:

$$\varphi^{*}_i(u_I) = A^{T} k(u_I), \qquad k(u_I) = \left[k_1(u_I), \cdots, k_{N_k}(u_I)\right]^{T}, \tag{7}$$

where k is the Gaussian radial basis vector with k_l(u_I) = exp(−‖u_I − c_l‖² / (2h²)), N_k is the number of basis functions, h is the kernel width, {c_l} are the centers of the Gaussian kernels (randomly chosen from {u_I^j}), and A = [a_1, · · · , a_l, · · · , a_{N_k}]^T is an N_k × 2 coefficient matrix. We then determine A using the regularized least-squares estimator:

$$A = \left(\Phi^{T}\Phi + \lambda I\right)^{-1} \Phi^{T} U_D, \tag{8}$$

where Φ is an M_i × N_k design matrix defined as [Φ]_{jl} = k_l(u_I^j), λ is the regularization parameter, I is an N_k × N_k identity matrix, and U_D = [u_D^1, · · · , u_D^j, · · · , u_D^{M_i}]^T. We implement Eq. (8) using MATLAB R2022a. We repeat the above operations for all training viewpoints {p*_i}_{i=1}^{N_t} to obtain the set of ground-truth maps {φ*_i}_{i=1}^{N_t}. Additionally, we obtain the ground-truth maps for all test viewpoints for evaluation.
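This regularized radial-basis-function regression can be sketched in numpy (a compact stand-in for the MATLAB implementation; the kernel width, basis count, regularization value, and synthetic deviation data are our placeholders):

```python
import numpy as np

def fit_rbf_map(u_i, u_d, n_basis=50, width=100.0, lam=1e-3, seed=0):
    """Fit phi*: u_I -> u_D as A^T k(u_I) via regularized least squares."""
    rng = np.random.default_rng(seed)
    centers = u_i[rng.choice(len(u_i), n_basis, replace=False)]  # kernel centers c_l
    def k(u):
        # Gaussian radial basis vector, shape (..., n_basis)
        d2 = ((u[..., None, :] - centers) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * width ** 2))
    Phi = k(u_i)                                                 # M x N_k design matrix
    A = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_basis), Phi.T @ u_d)
    return lambda u: k(u) @ A                                    # continuous phi*(u_I)
```

Fitting a smooth synthetic distortion field and evaluating the returned callable at the training points reproduces the LUT to sub-pixel mean error.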

Experiments
Using the acquired dataset, we compared the accuracy of NDF with other map interpolation methods (Sec. 5.1) and evaluated the performance of NDF. We applied each method to the dataset and generated maps at the test viewpoints. Then, we quantitatively evaluated the reprojection error with respect to the ground truth (Sec. 5.2). After that, we evaluated the reproducibility of dynamic image distortion with respect to the position in the FoV (Sec. 5.3) and the spatial distribution of the test viewpoints (Sec. 5.4), respectively. Finally, we evaluated the difference in accuracy when changing the network configuration of NDF (Sec. 5.5).

Interpolation Methods for Comparison
First, prior to the evaluation, we briefly describe each of the other interpolation methods being compared. The problem set in this paper is to interpolate the map f : R^8 → R^2 at a novel viewpoint p from the ground-truth maps {φ*_i}_{i=1}^{N_t} at the training viewpoints (Eq. (6)). We implemented three interpolation methods in addition to NDF: (i) 3D reconstruction, (ii) linear interpolation, and (iii) Gaussian (non-linear) polynomial interpolation. Note that, except for the 3D-reconstruction-based interpolation, we ignore the eye rotation v. In other words, we train f̂(u_I, t) : R^5 → R^2 instead of the complete f.

(i) 3D Reconstruction of Virtual Display Surface.
Assuming that each pixel u_D on the display image forms a virtual display surface in space, we recover the 3D point of each pixel by triangulation and bundle adjustment [11]. Then we estimate the map by re-projecting the reconstructed 3D surface onto the image plane u_I at the new viewpoint p. As discussed in Sec. 1, this model assumes each pixel is a point light source.
(ii) Linear Interpolation. We take the 8 viewpoints on the cubic grid containing the new viewpoint position t from the training data (Fig. 4, b). Then, we estimate f̂(u_I, t) by tri-linear interpolation of φ*_i(u_I) at the 8 vertices of the cubic grid.
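Tri-linear blending of the eight corner maps can be sketched as follows (assuming the corner maps are evaluated at a common u_I; all names are ours):

```python
import numpy as np

def trilinear_map(t, corners, maps):
    """Blend 8 corner map values phi*_i(u_I) by the eye position t inside a grid cube.
    corners: (2, 2, 2, 3) corner positions; maps: (2, 2, 2, 2) display coords."""
    lo, hi = corners[0, 0, 0], corners[1, 1, 1]
    w = (t - lo) / (hi - lo)                      # fractional position in the cube
    out = 0.0
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                wt = ((w[0] if i else 1 - w[0]) *
                      (w[1] if j else 1 - w[1]) *
                      (w[2] if k else 1 - w[2]))
                out = out + wt * maps[i, j, k]
    return out
```

For a map that varies linearly with the viewpoint, this blend reproduces the map exactly; the experiments show how far the real distortion deviates from that assumption.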

Reprojection Error between Distortion Models
We trained maps using each method with N_t = 8, 27, 125 training viewpoints and calculated the reprojection error of the output display coordinates of each pixel at the 48 test viewpoints. We used only pixels within the area where the AR-NED can be seen in each viewpoint image for the error calculation. In addition to the per-pixel error, we calculated the angular error from the viewpoint at each pixel. The NDFs in this subsection were trained with positional encoding L = 16 and with ReLU and SoftPlus as the output coordinate and intensity activation functions, respectively. Figure 6 shows the pixel errors and median angular errors for the different numbers of training viewpoints and interpolation methods. At N_t = 8, the mean errors are (i) 18.86 pixels (31.96 arcmin), (ii) 15.60 pixels (24.86 arcmin), (iii) 3.64 pixels (5.99 arcmin), and (iv) 3.23 pixels (5.79 arcmin). From the results, we confirm that NDF can recover maps with accuracy comparable to non-linear polynomial-based fitting with very few training viewpoints. Also, the deteriorating accuracy of the (i) 3D-reconstruction-based method confirms that the spatially distributed model of light sources assumed by (iv) NDF correctly approximates the optical model of the wide-FoV NED.
From Fig. 6, while the non-linear polynomial-based method does not change significantly as the number of training viewpoints increases, NDF shows an improvement in accuracy: (iii) 2.63 pixels (5.38 arcmin) vs. (iv) 3.13 pixels (5.42 arcmin) at N_t = 27, and (iii) 3.61 pixels (6.46 arcmin) vs. (iv) 2.17 pixels (4.25 arcmin) at N_t = 125. Moreover, in all cases of N_t = 8, 27, 125, (iii) Gaussian polynomial fitting has a larger error variance across the test viewpoints than (iv) NDF. From this result, we surmise that in the case of polynomial-based explicit optimization, the more viewpoints are trained on, the more the fit overfits the map at the center of the eyebox. In contrast, in NDF, the implicit function representation performs the optimization uniformly. We quantitatively analyze this difference in error distribution between test viewpoint positions in Sec. 5.4.

Error Distribution in the Field of View
We analyzed the distribution of reprojection errors perceived in the FoV by comparing (iii) Gaussian polynomial fitting and (iv) NDF. Fig. 7 shows the difference between the ground truth and the actual transformation of the display coordinate system using the estimated distortion map. From Fig. 7 (b), the map estimated with Gaussian fitting is accurate vertically but has relatively large horizontal deviations. In contrast, the map estimated by NDF (Fig. 7 (c)) shows a uniform fit in both the horizontal and vertical directions near the center of the FoV, although the error is larger than Gaussian fitting at the periphery of the FoV.
To evaluate the error distribution within the FoV, we calculated the pixel-wise average of the reprojection error over all test viewpoint images, as shown in Fig. 8. While the Gaussian fitting does not have a smooth error distribution, NDF has smaller errors from the center to the lower right of the FoV, and the pixels with the largest errors are concentrated only at the periphery of the FoV. We also confirmed that the errors of NDF were better for 55% of the total FoV pixels. This result shows that NDF learns the distortion of the target AR-NED well.
The error tends to be larger at the periphery of the FoV in NDF. This is likely because NDF learns not only the map but also its intensity pixel-wise, i.e., the shape of the FoV. As a result, NDF cannot cope with abrupt changes of intensity at the periphery, resulting in large errors in the weighted sum (Eq. (2)). This problem could be addressed by combining NDF with an explicit model, e.g., explicitly defining the display surface in space in advance and sampling the surrounding area with NDF. Such a combination of NDF and explicit models to improve accuracy is further discussed in Sec. 6.

Error Distribution depending on Viewpoint Position
Next, we evaluated the accuracy of the reconstruction of the distortion map with respect to changes in viewpoint p. With Gaussian fitting and NDF trained on 8 viewpoints, we calculated the per-pixel reprojection error at each test viewpoint position corresponding to Fig. 5 (b). Fig. 9 (a) shows the median of the reprojection error over all pixels of the field of view at each test viewpoint position. To further clarify the difference between the two methods, Fig. 9 (b) shows the distribution of the NDF reprojection error minus the Gaussian one, as in Fig. 8 (c). From Fig. 9 (b), NDF shows better results at viewpoints far from the center of the eyebox. This means that Gaussian fitting overfits the training data near the center of the eyebox, while NDF reproduces the distribution of the distortion map uniformly across the entire eyebox. From this result, we confirmed that NDF can reproduce the distortion map robustly even when the viewpoint position changes.

Error Analysis of Different Network Architecture
Finally, we evaluate the effects of different network parameters on accuracy. It is known that the number of layers and channels in the network has little effect on accuracy, while the number of dimensions of the positional encoding, the activation functions, and the number of training steps significantly impact accuracy [23,27]. Thus, we trained NDFs under different conditions while varying these parameters and evaluated their accuracy on the training dataset (N_t = 8).

Training step.
We compared networks trained with different numbers of training steps from 1.0 × 10^5 to 5.0 × 10^5. During the experiment, the encoding dimension was fixed at L = 16, and the activation functions of the coordinate MLP and the intensity MLP were fixed to ReLU and SoftPlus, respectively. Figure 10 (a) shows the error of NDF at different numbers of training steps. As the number of steps increases in increments of 1.0 × 10^5, the mean error changes as follows: {3.61, 3.78, 3.75, 3.72, 3.23} pixels. The results show that the minimum error decreases as the number of training steps increases. However, there is little change in the median error across all viewpoints, confirming that the accuracy of NDF is as good as the Gaussian fitting even at 1.0 × 10^5 training steps. This indicates that NDF is already able to represent image distortion well in the early stages of training. One possible reason is that the image distortion reproduced by NDF is simpler in structure than the natural images targeted by NeRF.

Input dimension of Positional Encoding.
We compared networks trained with different encoding dimensions (6 and 16). We fixed the number of training steps at $5.0 \times 10^5$. Fig. 10 (b, i.) and (b, ii.) show the training results with the dimension of the positional encoding set to 16 and 6, respectively. The mean errors of (b, i.) and (b, ii.) are 3.23 and 3.62 pixels, respectively. We previously assumed that reducing the number of encoding dimensions would not change the accuracy of NDF, as discussed in Sec. 2.3.1. However, as with the original NeRF, increasing the number of encoding dimensions reduced the error. This can be attributed to the fact that the current NDF learns not only the distortion of the image but also the range of the FoV in which the display is visible, which introduces higher frequencies at the periphery of the NED.
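For reference, the NeRF-style positional encoding whose dimension is varied here maps each input coordinate to sine/cosine features at exponentially spaced frequencies. This is a generic sketch of that standard encoding, not the paper's exact implementation:

```python
import numpy as np

def positional_encoding(x, num_freqs=16):
    """NeRF-style positional encoding.

    x: (..., D) input (e.g. a 3D position or gaze direction).
    num_freqs: the encoding dimension varied in the experiment (6 vs 16).
    Returns (..., 2 * D * num_freqs) sin/cos features.
    """
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # (num_freqs,)
    angles = x[..., None] * freqs                        # (..., D, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)

p = np.array([0.1, -0.2, 0.3])
print(positional_encoding(p, num_freqs=16).shape)  # (96,)
print(positional_encoding(p, num_freqs=6).shape)   # (36,)
```

Higher `num_freqs` lets the MLP fit higher-frequency variation, such as the sharp visibility boundary at the FoV periphery discussed above.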

Selection of Activation Functions.
Finally, we trained the network with different combinations of output activation functions for the coordinate MLP (ReLU or Sigmoid) and the intensity MLP (SoftPlus or Sigmoid). Fig. 10 (b, i.), (b, iii.), and (b, iv.) show the errors when varying the combination of output activation functions. The mean errors of (b, i.), (b, iii.), and (b, iv.) are 3.23, 3.60, and 5.36 pixels, respectively. As expected in Sec. 3, using ReLU as the output activation function of the coordinate MLP greatly improved the accuracy. In contrast, changing the activation function of the intensity MLP from SoftPlus to Sigmoid did not significantly affect the accuracy. This result indicates that the virtual display surfaces of the wide-FoV NEDs targeted in this paper form multiple images through repeated multistage reflections and refractions.
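The two output heads compared above can be sketched as follows. The function and weight names are hypothetical, and the single linear layer per head is a simplification of the actual MLPs; the point is only that the output activation is swappable, and that a sigmoid coordinate head squashes its output into (0, 1) while ReLU leaves the positive range unbounded.

```python
import numpy as np

def relu(z):     return np.maximum(z, 0.0)
def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def softplus(z): return np.log1p(np.exp(z))

def output_heads(features, W_uv, W_i, coord_act=relu, intensity_act=softplus):
    """Hypothetical final layers of NDF: a coordinate head producing the 2D
    display pixel coordinate u and an intensity head, each with a
    configurable output activation as in the ablation."""
    uv = coord_act(features @ W_uv)             # display coordinate u
    intensity = intensity_act(features @ W_i)   # perceived intensity
    return uv, intensity

feat = np.array([0.5, -1.0, 2.0])
W_uv = np.ones((3, 2))
W_i = np.ones((3, 1))
uv, inten = output_heads(feat, W_uv, W_i)             # ReLU + SoftPlus
uv_sig, _ = output_heads(feat, W_uv, W_i, coord_act=sigmoid)
```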

Limitation and Future Work
From the experiments, NDF can synthesize novel-view distortion maps for the wide-FoV AR-NED with accuracy equal to or better than explicit polynomial fitting models. Our fully implicit, MLP-based approach is completely different from existing approaches to distortion correction of NEDs. While the current NDF model is still rough around the edges, it offers many possibilities for improvement and future research. This section discusses both these limitations and potential research directions.

Real Time Dynamic Distortion Correction.
As mentioned in Sec. 1, dynamic distortion correction is one of the most important open issues for NEDs. In particular, real-time distortion correction in response to eye tracking is required to make dynamic adjustments imperceptibly fast for the user. Our NDF is compatible with real-time distortion map generation. Thanks to its neural network-based architecture, NDF can be GPU-accelerated. Moreover, since NDF has almost the same configuration as NeRF, acceleration methods proposed in the NeRF literature can be applied to NDF almost directly. For example, InstantNeRF [28] uses hash tables to adapt multi-resolution positional encoding to GPU computation, enabling the generation of 1920 × 1080 pixel images in tens of milliseconds. Since our primary goal in this paper is to verify the NDF concept, and InstantNeRF is implemented on a customized CUDA kernel that is hard to modify, we have not yet implemented NDF on its architecture. In theory, however, NDF can run on such a real-time framework.
Combination with Explicit Models. In this paper, we defined NDF as a fully implicit model that assumes no a priori optical model. However, since NDF is essentially a ray-casting-based method, it can be extended to a hybrid model that combines NDF with conventional distortion correction methods that trace rays through explicitly defined optical models. In the field of NeRF, methods have been proposed that recover both the 3D shape and the viewpoint-dependent texture of an object with high accuracy by sampling points intensively close to the object surface [29,30]. In the same way, by sampling points intensively near the focal plane of a roughly defined NED optical design, NDF could improve the distortion map's accuracy while fine-tuning the actual optical properties of the NED.
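Such a hybrid sampling strategy could look like the following sketch. All names and the mixing scheme are illustrative assumptions, not a method from the paper: it simply mixes uniform ray samples with samples clustered around an approximately known focal-plane depth.

```python
import numpy as np

def sample_near_plane(t_near, t_far, t_focal, n_samples, spread=0.05, rng=None):
    """Illustrative hybrid sampling: draw ray depths t concentrated around a
    roughly known focal-plane depth t_focal, plus coarse uniform coverage.

    spread scales the Gaussian width relative to the ray segment length.
    """
    rng = rng or np.random.default_rng(0)
    n_coarse = n_samples // 2
    coarse = rng.uniform(t_near, t_far, n_coarse)           # cover the whole ray
    fine = rng.normal(t_focal, spread * (t_far - t_near),   # cluster at the plane
                      n_samples - n_coarse)
    t = np.clip(np.concatenate([coarse, fine]), t_near, t_far)
    return np.sort(t)

t = sample_near_plane(0.1, 2.0, t_focal=1.0, n_samples=64)
print(t.min() >= 0.1 and t.max() <= 2.0)  # True
```

In a full pipeline, the coarse/fine split would be driven by the explicit optical model's predicted focal surface rather than a fixed depth.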

Correction of Chromatic Aberration and Viewpoint-Dependent Blur.
By extending the number of output dimensions, NDF could be used to calibrate various pixel-wise, viewpoint-dependent properties, such as chromatic aberration [31] and viewpoint-dependent blur [32]. Although increasing the number of output dimensions may make training harder to converge, it would also allow the correlations among these properties to be captured, for example, as an implicitly learned, viewpoint-dependent color-mixing matrix for chromatic aberration.
Applying NDF to Other Severely Distorted Optics. Although this paper only discusses the application of NDF to wide-FoV NEDs, we expect that NDF can be applied to other non-smooth, severely distorted optical systems, just as NeRF can be applied to images with abrupt changes in adjacent pixel values. For example, NDF may be applied to aerial displays that form images using a special beam combiner [33], or used to acquire correspondences between 3D scenes and image coordinates in dynamic projection mapping.

Conclusion
We proposed NDF, an MLP-based distortion map generation method for wide-FoV NEDs. NDF implicitly learns virtual display surfaces as light-source distributions in a viewpoint-dependent space, a concept complementary to explicit, geometric optics models. Experiments show that NDF can synthesize distortion maps with an error of about 5.8 arcmin using only 8 training viewpoints, which is competitive with non-linear polynomial fitting. We also confirmed that NDF produces maps with better accuracy around the center of the FoV, and that the accuracy improves as the number of training viewpoints increases. NDF has the potential for higher accuracy through combination with explicit optical models, and for real-time distortion correction with GPU optimization. We hope that our new approach will facilitate subsequent research and contribute to immersive virtual experiences that combine a wide field of view with perfect spatial consistency.