Towards real-time photorealistic 3D holography with deep neural networks

The ability to present three-dimensional (3D) scenes with continuous depth sensation has a profound impact on virtual and augmented reality, human–computer interaction, education and training. Computer-generated holography (CGH) enables high-spatio-angular-resolution 3D projection via numerical simulation of diffraction and interference 1. Yet, existing physically based methods fail to produce holograms with both per-pixel focal control and accurate occlusion 2,3. The computationally taxing Fresnel diffraction simulation further places an explicit trade-off between image quality and runtime, making dynamic holography impractical 4. Here we demonstrate a deep-learning-based CGH pipeline capable of synthesizing a photorealistic colour 3D hologram from a single RGB-depth image in real time. Our convolutional neural network (CNN) is extremely memory efficient (below 620 kilobytes) and runs at 60 hertz for a resolution of 1,920 × 1,080 pixels on a single consumer-grade graphics processing unit. Leveraging low-power on-device artificial intelligence acceleration chips, our CNN also runs interactively on mobile (iPhone 11 Pro at 1.1 hertz) and edge (Google Edge TPU at 2.0 hertz) devices, promising real-time performance in future-generation virtual and augmented-reality mobile headsets. We enable this pipeline by introducing a large-scale CGH dataset (MIT-CGH-4K) with 4,000 pairs of RGB-depth images and corresponding 3D holograms. Our CNN is trained with differentiable wave-based loss functions 5 and physically approximates Fresnel diffraction. With an anti-aliasing phase-only encoding method, we experimentally demonstrate speckle-free, natural-looking, high-resolution 3D holograms.
Our learning-based approach and the Fresnel hologram dataset will help to unlock the full potential of holography and enable applications in metasurface design 6,7, optical and acoustic tweezer-based microscopic manipulation 8–10, holographic microscopy 11 and single-exposure volumetric 3D printing 12,13.

Holography is the process of encoding a light field 14 as an interference pattern of variations in phase and amplitude. When properly lit, a hologram diffracts incident light into an accurate reproduction of the original light field, producing a true-to-life recreation of the recorded three-dimensional (3D) objects 1. The reconstructed 3D scene presents accurate monocular and binocular depth cues, which are difficult to achieve simultaneously in traditional displays. Yet, creating photorealistic computer-generated holograms (CGHs) power-efficiently and in real time remains an unsolved challenge in computational physics. The primary challenge is the tremendous computational cost of performing Fresnel diffraction simulation for every object point in a continuous 3D space. This remains true despite extensive efforts to design various digital scene representations 3,15–18 and algorithms for the detection of light occlusions 19.
The challenging task of efficient Fresnel diffraction simulation has been tackled by explicitly trading physical accuracy for computational speed. Hand-crafted numerical approximations based on look-up tables of precomputed elemental fringes 20–22, multilayer depth discretization 23–25, holographic stereograms 26–29, wavefront recording planes (alternatively, intermediate ray-sampling planes) 30,31 and horizontal/vertical-parallax-only modelling 32 were introduced at the cost of compromised image quality. Harnessing rapid advances in graphics processing unit (GPU) computing, the non-approximative point-based method (PBM) recently produced colour and textured scenes with per-pixel focal control at a speed of seconds per frame 2. Yet, PBM simulates Fresnel diffraction independently for every scene point, and thus does not model occlusion. This prevents accurate recreation of complex 3D scenes, where the foreground is severely contaminated by ringing artefacts from the unoccluded background (Extended Data Fig. 1d). The lack of occlusion is partially addressed by light-field rendering 3,29,33. However, this approach incurs substantial rendering and data-storage overhead, and the occlusion is accurate only within a small segment (holographic element) of the entire hologram. Adding a per-ray visibility test during Fresnel diffraction simulation ideally resolves the problem, yet the additional cost of the occlusion test, neighbour-point access and conditional branching slows down the computation. This quality-speed trade-off is a trait shared by all existing physically based approaches and fundamentally limits the practical deployment of dynamic holographic displays.
We resolve this dilemma with a physics-guided deep-learning approach, dubbed tensor holography. Tensor holography avoids the explicit approximation of Fresnel diffraction and occlusion, but imposes underlying physics to train a convolutional neural network (CNN) as an efficient proxy for both. It exploits the fact that propagating a wave field to different distances is equivalent to convolving the same wave field with Fresnel zone plates of different frequencies.
As the zone plates are radially symmetric and derived from a single basis function using different propagation distances, our network accurately approximates them through successive application of a set of learned 3 × 3 convolution kernels. This reduces diffraction simulation from spatially varying large-kernel convolutions to a set of separable and spatially invariant convolutions, which run orders of magnitude faster on GPUs and application-specific integrated circuits (ASICs) for accelerated CNN inference. Our network further leverages nonlinear activation (that is, ReLU, the rectified linear unit 34) in the CNN to handle occlusion. The nonlinear activation selectively distributes intermediate results produced through forward propagation, thus stopping the propagation of occluded wavefronts. We note that although the mathematical model of the CNN is appealing, the absence of a large-scale Fresnel hologram dataset and an effective training methodology impeded the development of any learning-based approach. Despite the recent successful adoption of CNNs for phase retrieval 35–37 and for recovering in-focus images or extended depth-of-field images from optically recorded digital holograms 38–40, Fresnel hologram synthesis, as an inverse problem, is more challenging and demands a carefully tailored dataset and CNN design. So far, the potential suitability of CNNs for the hologram synthesis task has been demonstrated only for 2D images positioned at a fixed depth 41,42 and for post-compression 43.
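The propagation-as-convolution fact that this argument rests on can be sketched numerically. The following is a minimal NumPy illustration (our own sketch, not the paper's code): it builds the paraxial Fresnel zone-plate kernel for a given propagation distance and applies it as an FFT-based convolution, so a single point source produces its zone plate as a subhologram.

```python
import numpy as np

def fresnel_kernel(n, pitch, wavelength, d):
    """Paraxial Fresnel zone plate (impulse response), up to a constant factor."""
    coords = (np.arange(n) - n // 2) * pitch
    x, y = np.meshgrid(coords, coords)
    k = 2 * np.pi / wavelength
    return np.exp(1j * k * (x**2 + y**2) / (2 * d))

def propagate(field, pitch, wavelength, d):
    """Free-space propagation of a sampled wave field as an FFT convolution."""
    h = fresnel_kernel(field.shape[0], pitch, wavelength, d)
    return np.fft.fftshift(
        np.fft.ifft2(np.fft.fft2(field) * np.fft.fft2(np.fft.ifftshift(h))))

# a single on-axis point source yields its zone plate as a subhologram
field = np.zeros((256, 256), dtype=complex)
field[128, 128] = 1.0
hologram = propagate(field, pitch=8e-6, wavelength=638e-9, d=6e-3)
```

The CNN's role, in this view, is to replace the large, distance-dependent kernel above with repeated learned 3 × 3 convolutions.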

Hologram dataset of tensor holography
To facilitate training CNNs for this task, we introduce a large-scale Fresnel hologram dataset, MIT-CGH-4K, consisting of 4,000 pairs of RGB-depth (RGB-D) images and corresponding 3D holograms. Our dataset is created with three important features to enable CNNs to learn photorealistic 3D holograms. First, the 3D scenes used for rendering the RGB-D images are constructed with high complexity and large variations in colour, geometry, shading, texture and occlusion, to help the CNN generalize to both computer-rendered and real-world captured RGB-D test inputs. This is achieved by a custom random scene generator (Fig. 1a), which assembles a scene by randomly sampling 200-250 triangle meshes with repetition from a pool of over 50 meshes, and assigning each mesh a random texture from a pool of over 60,000 textures from publicly available texture-synthesis datasets 44,45 with augmentation (see Methods for more rendering details). Second, the pixel depth distribution of the resulting RGB-D images is statistically uniform across the entire view frustum. This is crucial for preventing the learned CNN from biasing towards frequently occurring depths and producing poor results at sparsely populated ones. To ensure this property, we derived a closed-form probability density function (PDF) for arranging triangle meshes along the depth axis (z axis):

where z_near and z_far are the distances from the camera to the near and far planes of the view frustum, C is the number of meshes in the scene and α is a scaling factor calibrated via experimentation. This PDF distributes meshes exponentially along the z axis (Fig. 1a, top) such that the pixel depth distribution in the resulting RGB-D images is statistically uniform (Fig. 1a, bottom; see Methods for the derivation and a comparison with existing RGB-D datasets). Here we set z_near and z_far to 0.15 m and 10 m, respectively, to accommodate a wide range of focal distances (approximately a 6.6-diopter range for the depth of field). Third, the holograms computed from the RGB-D images can precisely focus each pixel to the location defined by the depth image and properly handle occlusion. This is accomplished by our occlusion-aware point-based method (OA-PBM).
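Whether a set of rendered depth maps actually satisfies this uniformity property is easy to check empirically. A minimal NumPy sketch (`depth_uniformity` is our own hypothetical helper, not part of the released dataset tooling):

```python
import numpy as np

def depth_uniformity(depth_maps, z_near=0.15, z_far=10.0, bins=20):
    """Max deviation of the pooled pixel-depth histogram from a uniform one
    (0 = perfectly uniform across the view frustum)."""
    z = np.concatenate([d.ravel() for d in depth_maps])
    z = z[(z >= z_near) & (z <= z_far)]
    hist, _ = np.histogram(z, bins=bins, range=(z_near, z_far))
    p = hist / hist.sum()
    return np.abs(p - 1.0 / bins).max()

# uniformly distributed depths score near zero; depths piled at the frustum
# ends (the bias noted for DeepFocus in Extended Data Fig. 2) score high
rng = np.random.default_rng(0)
uniform = [rng.uniform(0.15, 10.0, size=(64, 64)) for _ in range(10)]
biased = [rng.choice([0.2, 9.9], size=(64, 64)) for _ in range(10)]
```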
The OA-PBM augments the PBM with occlusion detection. Instead of processing each 3D point independently, the OA-PBM reconstructs a triangle surface mesh from the RGB-D image and performs ray casting from each vertex (point) to the hologram plane (Fig. 1b). Wavefronts carried by the rays intersecting the surface mesh are excluded from hologram computation to account for foreground occlusion. In practice, a point light source is often used to magnify the hologram for an extended field of view (Extended Data Fig. 3a); thus, the OA-PBM implements configurable illumination geometry to support ray casting towards spatially varying diffraction cones. Figure 2b visualizes a focal stack refocused from the OA-PBM-computed holograms, in which clean occlusion boundaries are formed and little to no background light leaks into the foreground (see Methods for a comparison with PBM results and OA-PBM implementation details).
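As a rough illustration of the OA-PBM's visibility test, the sketch below implements a drastically simplified 1-D analogue under collimated illumination: a height-field depth map stands in for the reconstructed triangle mesh, and a point's contribution is dropped whenever the straight ray from that point to a hologram sample passes behind a nearer surface. This is our own toy version, not the paper's mesh-based ray caster.

```python
import numpy as np

def oa_pbm_1d(amp, depth, pitch, wavelength, n_steps=64):
    """Toy 1-D occlusion-aware point-based method (collimated illumination).
    Hologram samples lie at z = 0; scene points sit at z = depth[i] > 0."""
    n = len(amp)
    x = np.arange(n) * pitch
    k = 2 * np.pi / wavelength
    holo = np.zeros(n, dtype=complex)
    t = np.linspace(0.0, 1.0, n_steps, endpoint=False)[1:]  # samples along each ray
    for i in range(n):
        if amp[i] == 0.0:
            continue
        for u in range(n):
            xs = x[i] + t * (x[u] - x[i])      # lateral positions along the ray
            zs = depth[i] * (1.0 - t)          # ray depth shrinks towards the hologram
            j = np.clip(np.round(xs / pitch).astype(int), 0, n - 1)
            if np.any(depth[j] < zs - 1e-9):   # a nearer surface blocks this ray
                continue
            r = np.hypot(x[u] - x[i], depth[i])
            holo[u] += amp[i] * np.exp(1j * k * r) / r
    return holo
```

A background point behind a foreground pixel then contributes no wavefront to hologram samples whose rays cross the foreground, which is exactly the background leakage the OA-PBM removes.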
Combining the random scene generator and the OA-PBM, we rendered our dataset at wavelengths of 450 nm, 520 nm and 638 nm to match the RGB lasers deployed in our experimental prototype. The MIT-CGH-4K dataset is also rendered for multiple spatial light modulator (SLM) resolutions (see Methods for details) and will be made publicly available.

Neural network of tensor holography
Our CNN model is a fully convolutional residual network. It receives a four-channel RGB-D image and predicts a colour hologram as a six-channel image (RGB amplitude and RGB phase), which can be used to drive three optically combined SLMs, or one SLM in a time-multiplexed manner, to achieve full-colour holography. The network has a skip connection that feeds the input RGB-D image directly to the penultimate residual block, and has no pooling layers, to preserve high-frequency details (see Fig. 1c for a scheme of the network architecture; see Methods for performance analysis and comparisons with other architectures). Let W be the width of the maximum subhologram (Fresnel zone plate), produced by the object points farthest from the hologram. We note that the receptive field aggregated from all convolution layers should be at least W for the network to physically accurately predict the target hologram. Yet, W varies with the relative position between the hologram plane and the 3D volume, and can often reach hundreds of pixels (see Methods for the derivation), requiring many convolution layers and slowing inference. To address this issue, we apply a pre-processing step that computes an intermediate representation (the midpoint hologram), which reduces the effective W and losslessly recovers the target hologram.
The midpoint hologram is an application of the wavefront recording plane 30 . It propagates the target hologram to the centre of the view frustum to optimally minimize the distance to any scene point, thus reducing the effective W. The calculation follows the two steps shown in Extended Data Fig. 3. First, the diverging frustum V induced by the point light source is mathematically converted to an analogous collimated frustum V′ using the thin-lens formula describing the magnification of the laser beam (see Methods for calculation details).
The change of representation simplifies the simulation of depth-of-field images perceived in V into free-space propagation of the target hologram to the remapped depth in V′. Let d′_near and d′_far be the remapped distances from the hologram to the near and far clipping planes of V′. Propagating the target hologram to the centre of V′, that is, by d′_mid = (d′_near + d′_far)/2, yields the midpoint hologram and reduces the maximum propagation distance to any scene point to (d′_far − d′_near)/2. The reduction is a result of eliminating the free-space propagation shared by all the points, and the target hologram can be exactly recovered by propagating the midpoint hologram back for a distance d′_mid. In our rendering configuration, where the collimated frustum V′ has a 6-mm optical path length, using the midpoint hologram as the CNN's learning objective reduces the minimally required number of convolution layers to 15.
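Under the configuration stated in the Methods (hologram co-located with a 30-mm eyepiece, collimated frustum from 24 mm to 30 mm), the remapping and the midpoint reduction amount to a few lines; the helper names below are our own:

```python
def remap_to_collimated(d, f):
    """Thin-lens remapping 1/d' = 1/d + 1/f between the diverging frustum V
    (depth d from the hologram) and the collimated frustum V' (depth d')."""
    return 1.0 / (1.0 / d + 1.0 / f)

def midpoint_distances(d_near, d_far):
    """Midpoint-hologram plane in V' and the maximum residual propagation."""
    return 0.5 * (d_near + d_far), 0.5 * (d_far - d_near)

# a 6-mm collimated path from 24 mm to 30 mm: midpoint at 27 mm, so no scene
# point is farther than 3 mm from the midpoint hologram
d_mid, d_max = midpoint_distances(24e-3, 30e-3)
```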
We introduce two wave-based loss functions to train the CNN to accurately approximate the midpoint hologram and learn Fresnel diffraction. The first loss function serves as a data fidelity measure: it computes the ℓ2 distance between the predicted and target amplitudes, plus the ℓ2 norm of the phase-corrected difference between the predicted and target phases, PC(φ̃_mid, φ_mid) − ⟨PC(φ̃_mid, φ_mid)⟩, where ⟨⋅⟩ denotes the mean and ||⋅||_p denotes the ℓp vector norm applied on a vectorized matrix output. The phase correction PC computes the signed shortest angular distance in polar coordinates and subtracts the global phase offset, which exerts no impact on the intensity of the reconstructed 3D image.
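In code, this data-fidelity term reduces to an amplitude ℓ2 term plus an ℓ2 term on the wrapped phase difference with its mean removed. A minimal NumPy sketch (our own reading of the loss; the paper's exact weighting may differ):

```python
import numpy as np

def phase_corrected_l2(a_pred, phi_pred, a_true, phi_true):
    """l2 on amplitude plus l2 on the phase-corrected phase difference:
    the signed shortest angular distance with the global offset removed."""
    amp_term = np.linalg.norm((a_pred - a_true).ravel())
    d = np.angle(np.exp(1j * (phi_pred - phi_true)))  # wrap to (-pi, pi]
    d = d - d.mean()                                  # drop the global phase offset
    return amp_term + np.linalg.norm(d.ravel())
```

A prediction that differs from the target only by a global phase shift incurs (numerically) zero loss, mirroring the fact that such a shift does not change the reconstructed intensity.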
The second loss function measures the perceptual quality of the reconstructed 3D scene observed by a viewer. As angular-spectrum-method (ASM)-based wave propagation is a differentiable operation, the loss is modelled as a combination of the ℓ1 distance and the total variation of a dynamic focal stack, reconstructed at two sets of focal distances that vary per training iteration (one set sampled from fixed depth bins and one sampled at random; see Methods). The random sampling within each bin prevents overfitting to stationary depths, enabling the CNN to learn true 3D holograms. An attention mask directs the CNN to focus on reconstructing in-focus features in each depth-of-field image. Figure 2f validates the effectiveness of each training-loss component through an ablation study.
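ASM propagation itself is a pair of FFTs and a frequency-domain transfer function, which is why the whole focal stack is differentiable end to end in an autodiff framework. Below is a NumPy sketch of the propagation and an ℓ1-only version of the focal-stack comparison (attention mask and total-variation terms omitted; function names are our own):

```python
import numpy as np

def asm_propagate(field, pitch, wavelength, d):
    """Textbook angular spectrum propagation (evanescent components dropped)."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=pitch)
    fyy, fxx = np.meshgrid(fx, fx, indexing="ij")
    arg = (1.0 / wavelength) ** 2 - fxx**2 - fyy**2
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    return np.fft.ifft2(np.fft.fft2(field) * np.exp(1j * kz * d) * (arg > 0))

def focal_stack_loss(holo_pred, holo_true, pitch, wavelength, distances):
    """l1 between depth-of-field intensities at a set of focal distances."""
    loss = 0.0
    for d in distances:
        ip = np.abs(asm_propagate(holo_pred, pitch, wavelength, d)) ** 2
        it = np.abs(asm_propagate(holo_true, pitch, wavelength, d)) ** 2
        loss += np.abs(ip - it).mean()
    return loss / len(distances)
```

Resampling the list of `distances` every iteration gives the "dynamic" focal stack described above.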
Our CNN was trained on an NVIDIA Tesla V100 GPU for 84 h (see Methods for model parameters and training details). The trained model generalizes well to computer-rendered (Fig. 2a, Extended Data Fig. 5) and real-world captured (Fig. 2c, Extended Data Fig. 6) RGB-D inputs, and to standard test patterns (Fig. 2e, Extended Data Fig. 4). The simulated focal sweep of CNN-predicted 3D holograms can be found in Supplementary Videos 1, 2 and 6. Compared with the reference OA-PBM holograms, the CNN predictions are both perceptually similar (Fig. 2b) and numerically close (Fig. 2d, f). Evaluated on a single distance-point target, the output from a CNN with sufficient model capacity faithfully approximates a Fresnel zone plate (Fig. 2g), under the low-rank solution space restricted by a set of successively applied 3 × 3 convolution kernels. When all algorithms are implemented on a GPU with the CNN in NVIDIA TensorRT, and the OA-PBM and PBM in NVIDIA CUDA, the mini CNN achieves more than two orders of magnitude speed-up (Fig. 2d) over the OA-PBM and runs in real time (60 Hz) on a single NVIDIA Titan RTX GPU. As our end-to-end learning pipeline completely avoids logically complex ray-triangle intersection operations, it runs efficiently on low-power ASICs for accelerated CNN inference. In Supplementary Video 5, we demonstrate interactive mobile hologram computation on an iPhone 11 Pro, leveraging the A13 Bionic chip's neural engine. Our model has an extremely low memory footprint of only 617 KB at Float32 precision and 315 KB at Float16 precision. At Int8 precision, it runs at 2 Hz on a single Google Edge TPU. All reported runtime performance is evaluated on inputs with a resolution of 1,920 × 1,080 pixels.

Display prototype of tensor holography
We have built a phase-only holographic display prototype (see Fig. 3a for a scheme and Extended Data Fig. 8 for the physical setup) to experimentally validate our CNN. The prototype uses a HOLOEYE PLUTO-2-VIS-014 reflective SLM with a resolution of 1,920 × 1,080 pixels and a pixel pitch of 8 μm (see Methods for prototype details). The colour image is obtained field sequentially 48. To encode a CNN-predicted complex hologram into a phase-only hologram, we introduce an anti-aliasing double phase method (AA-DPM), which produces artefact-free 3D images around high-frequency objects and occlusion boundaries (see Methods for algorithm details and a comparison with the original double phase method (DPM) 49,50). In Fig. 3b, we demonstrate speckle-free, high-resolution and high-contrast 2D projection, where the fluff of the berries is sharply reconstructed. Fig. 3c presents the corresponding experimentally captured 3D results.

Discussion
Our results present evidence that CNNs can perform real-time, photorealistic 3D CGH synthesis from a single RGB-D image, a task traditionally considered beyond the capabilities of existing computational devices. Our multi-resolution, large-scale Fresnel hologram dataset, created by the tailored random scene generator and the OA-PBM, will enable a wide range of conventional image-related applications to be transferred to holography: examples include super-resolution, compression, semantic editing of holograms and foveation-guided holographic rendering. Ultimately, it provides a testbed for both commercial and academic research fields that will benefit from real-time, high-resolution CGH, for example, consumer holographic displays for virtual and augmented reality, hologram-based single-shot volumetric 3D printing, optical trapping with a substantially increased number of foci, and real-time simulation for holographic microscopy. Tensor holography itself can be further improved by directly learning phase-only holograms to discover an optimal encoding, avoiding explicit complex-to-phase-only conversion. In addition, although the RGB-D input is inexpensive to compute and memory efficient, it provides an accurate 3D depiction from only a single perspective. Thus, extending our pipeline to support true volumetric 3D input (voxel grids, dense light fields and general point clouds) could expedite the synthesis of holograms that support view-dependent effects and observation under large baseline movement (see Methods for an expanded discussion). Finally, the rapid development of ASICs will soon make high-frame-rate tensor holography viable on mobile devices, enabling untethered true-3D viewing experiences and substantially lowering the cost and barrier to entry for holographic content creation.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-020-03152-0.

OA-PBM
The OA-PBM assumes a general holographic display setting, where the RGB-D image is rendered with perspective projection and the hologram is illuminated by a point source of light co-located with the camera. This includes support for collimated illumination, a special case in which the point light source is located at infinity and the rendering projection is orthographic. During ray casting, every object point defined by the RGB-D image produces a subhologram at the hologram plane. The maximum spatial extent of a subhologram is dictated by the grating equation: the SLM pixel pitch Δ bounds the maximum diffraction angle θ_max via sin θ_max = λ/(2Δ), so a point at distance d from the hologram plane produces a subhologram of width at most 2d tan θ_max.

Random scene generator

Modelling the depths z_1, z_2, …, z_C′ of the C′ meshes that effectively determine the pixel depth distribution gives a closed-form solution to their PDF (equation (12)). Although it is required by definition that C′ ∈ ℤ+, where ℤ+ denotes the set of positive integers, equation (12) extrapolates to any positive real number no less than 1 for C′. In practice, calculating an average C′ for the entire frame is non-trivial, as meshes of varying shapes and sizes are placed at random x-y positions and scaled stochastically. Nevertheless, C′ is typically much smaller than the total number of meshes C, and is well modelled by a scaling factor α such that C′ = C/α. Equation (1) is thus obtained by applying this relation to equation (12). On the basis of experimentation, we find that setting α = 50 results in a sufficiently uniform pixel depth distribution for 200 ≤ C ≤ 250. Extended Data Fig. 2 compares the resulting RGB-D images and histograms of pixel depth between our dataset and the DeepFocus dataset. The depth distribution of the DeepFocus dataset is unevenly biased towards the front and rear ends of the view frustum, owing to both an unoptimized object depth distribution and sparse scene coverage that leads to overly exposed backgrounds.
We generated 4,000 random scenes using the random scene generator. To support application of important image processing and rendering algorithms such as super-resolution and foveation-guided rendering to holography, we rendered holograms for both 8 μm and 16 μm pixel pitch SLMs. The image resolution was chosen to be 384 × 384 pixels and 192 × 192 pixels, respectively, to match the physical size of the resultant holograms and enable training on commonly available GPUs. We note that as the CNN is fully convolutional, as long as the pixel pitch remains the same, the trained model can be used to infer RGB-D inputs of an arbitrary spatial resolution at test time.
Finally, we acknowledge that an RGB-D image records only the 3D scene perceived from the observer's current viewpoint; it is not a complete description of the 3D scene with both occluded and non-occluded objects. It is therefore not an ideal input for creating holograms that are intended to remain static but be viewed by an untracked viewer (for motion parallax under large baseline movement) or simultaneously by multiple persons. However, with real-time performance first enabled by our CNN on RGB-D input, this limitation is not a concern for interactive applications, particularly when eye position is tracked: new holograms can be computed on demand from the updated scene, viewpoint or user input, providing an experience as though the volumetric 3D scene were simultaneously reconstructed. This is especially true for virtual and augmented-reality headsets, where six-degrees-of-freedom positional tracking has become omnipresent and the correct viewpoint of a complex 3D scene can always be delivered to a moving user by updating the holograms to reflect the change of view.
However, the low rendering cost and memory overhead of the RGB-D representation are key attributes that enable practical real-time applications. Volumetric 3D representations (dense point clouds, voxel grids, light fields) at the same spatial resolution generally consume orders of magnitude more data. The increased rendering, memory, input/output and data-streaming costs alone make them much less practical for real-time applications on current graphics hardware (for example, a 1080p light-field video with only 8 × 8 views already carries four times the data of an 8K video), not counting the proportionally increased hologram computation cost, which dominates the total cost. Moreover, the additional points (objects) offered by these representations are either occluded or outside the frame of the current viewpoint; consequently, they contribute little to no wavefront to the perceived 3D image of the current view. Beyond computer graphics, the RGB-D image is readily available from low-cost RGB-D sensors such as the Microsoft Kinect or the integrated sensors of modern mobile phones. This further facilitates the use of real-world captured data, whereas high-resolution full 3D scanning of real-world-sized environments is much less accessible and requires specialized, high-cost imaging devices. Thus, the RGB-D representation strikes a balance between image quality and practicality for interactive applications.

CNN model architecture, training, evaluation and comparisons
Our network architecture consists of only residual blocks and a skip connection from the input to the penultimate residual block. The architecture is similar to DeepFocus 51, a fully convolutional neural network designed for synthesizing image content for varifocal, multifocal and light-field head-mounted displays. Yet, our architecture ablates its volume-preserving interleaving and de-interleaving layers. The interleaving layer reduces the spatial dimension of an input tensor by rearranging non-overlapping spatial blocks into the depth channel, and the de-interleaving layer reverts the operation. A high interleaving rate reduces the network capacity and trades lower image quality for faster runtime. In practice, we compared three different network miniaturization methods in Extended Data Fig. 4b: (1) reduce the number of convolution layers; (2) use a high interleaving rate; and (3) reduce the number of filters per convolution layer. At equal runtime, approach 1 (using fewer convolution layers) produces the highest image quality for our task. Approach 3 results in the lowest image quality because the CNN model contains the fewest filters (240 filters for approach 3 compared with 360 and 1,440 filters for approaches 1 and 2, respectively), while approach 2 is inferior to approach 1 mainly because neighbouring pixels are scattered across channels, making it much more difficult to reason about their interactions. This is particularly harmful when the CNN has to learn how different Fresnel zone kernels should cancel out to produce a smooth phase distribution. Given this observation, we ablate the interleaving and de-interleaving layers in favour of both performance and model simplicity.
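For reference, the (de-)interleaving operation ablated here is a plain space-to-depth rearrangement; the short sketch below (our own illustration) makes visible why neighbouring pixels end up scattered across channels:

```python
import numpy as np

def interleave(x, r):
    """Space-to-depth: rearrange non-overlapping r x r blocks into channels."""
    h, w, c = x.shape
    return (x.reshape(h // r, r, w // r, r, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(h // r, w // r, r * r * c))

def deinterleave(x, r):
    """Depth-to-space: exact inverse of interleave."""
    h, w, c = x.shape
    c_out = c // (r * r)
    return (x.reshape(h, w, r, r, c_out)
             .transpose(0, 2, 1, 3, 4)
             .reshape(h * r, w * r, c_out))
```

For r = 2, the four pixels of each 2 × 2 block land in different channels of a single spatial location, so a subsequent convolution sees them only through the channel dimension.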
All convolution layers in our network use 3 × 3 convolution filters. The number of minimally required convolution layers depends on the maximal spatial extent of the subhologram. Quantitatively, successive application of x convolution layers results in an effective convolution of width 3 + (x − 1) × 2. Setting 3 + (x − 1) × 2 equal to the maximum subhologram width W yields (W − 3)/2 + 1 minimally required convolution layers. In Extended Data Fig. 3, we demonstrate the calculation of the midpoint hologram, which reduces the effective maximum subhologram size by relocating the hologram plane. First, the holographic display magnified by the point light source is unmagnified to its collimated-illumination counterpart. The original view frustum V and the unmagnified view frustum V′ are related by the thin-lens equation 1/d′ = 1/d + 1/f, where f, d and d′ are, respectively, the distance between the point light source and the hologram, between the hologram and a point in V, and between the hologram and the same point mapped to V′. Then, the target hologram is propagated to the centre of the unmagnified view frustum V′ following equation (2). As the resulting midpoint hologram depends only on the thickness of the 3D volume, it leads to a substantial reduction of W when the hologram plane is far from the 3D volume. For example, in our rendering setting, we assume a 30-mm eyepiece magnifies a collimated frustum located between 24 mm and 30 mm away, effectively resulting in a magnified frustum that covers from 0.15 m to infinity for an observer one focal length behind the eyepiece. If the hologram plane is co-located with the eyepiece (30 mm to the far clipping plane), substituting the midpoint hologram for the target hologram reduces the maximum subhologram width by ten times, from 300 pixels to 30 pixels, resulting in a minimum of 15 convolution layers.
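These numbers follow directly from the grating equation and the receptive-field formula; a small sketch (our own helper names) reproduces them:

```python
import math
import numpy as np

def max_subhologram_px(d, pitch, wavelength):
    """Subhologram width in pixels for a point at distance d, using the
    grating-equation bound sin(theta_max) = wavelength / (2 * pitch)."""
    theta = np.arcsin(wavelength / (2 * pitch))
    return 2 * d * np.tan(theta) / pitch

def min_conv_layers(w_px):
    """x layers of 3x3 convolutions give an effective width 3 + (x - 1) * 2."""
    return math.ceil((w_px - 3) / 2) + 1

# 8-um pitch, 638-nm red: a point 30 mm away -> ~300-pixel subhologram;
# after midpoint relocation (3 mm at most) -> ~30 pixels -> 15 layers
w_full = max_subhologram_px(30e-3, 8e-6, 638e-9)
layers = min_conv_layers(round(max_subhologram_px(3e-3, 8e-6, 638e-9)))
```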
In practice, we find using fewer convolution layers than the theoretical minimum only moderately degrades the image quality (Fig. 2d). This is because the use of the phase initialization of Maimone et al. 2 allows the target phase pattern to be mostly occupied by low-frequency features and absent from Fresnel-zone-plate-like high-frequency patterns. Thus, even with reduced effective convolution kernel size, such features are still sufficiently easy to reproduce.
We reiterate that the midpoint hologram is an application of the wavefront recording plane (WRP) 30 as a pre-processing step. In physically based methods, the WRP is introduced as an intermediate ray-sampling plane placed either inside 52 or outside 30,53 the point cloud to reduce the wave propagation distance, and thus the subhologram size, during Fresnel diffraction integration. The application of multiple WRPs has also been combined with precomputed propagation kernels to further accelerate the runtime at the price of sacrificing accurate per-pixel focal control 19,54. For fairness, the GPU runtimes reported for the OA-PBM and PBM baselines in Fig. 2d were accelerated by placing the WRP at the plane that corresponds to the centre of the collimated frustum.
Our CNN is trained on 384 × 384-pixel RGB-D image and hologram pairs. We use a batch size of 2, ReLU activation, an attention scale β = 0.35, T = 200 depth bins, and dynamic focal stack sizes k_fix = 15 and k_float = 5 for training. We train the CNN for 1,000 epochs using the Adam 55 optimizer at a constant learning rate of 1 × 10−4. The dataset is partitioned into 3,800, 100 and 100 samples for training, testing and validation. Extended Data Fig. 4a quantitatively compares the performance of our CNN with U-Net 56 and Dilated-Net 57, both popular CNN architectures for image synthesis tasks. When the capacity of the other two models is configured for the same inference time, our network achieves the highest performance. The superiority comes from the more consistent and repetitive architecture of our CNN. Specifically, it avoids the use of pooling and transposed convolution layers to contract and expand the spatial dimensions of intermediate tensors; thus, the high-frequency features of Fresnel zone kernels are more easily constructed and preserved during forward propagation.
In Extended Data Fig. 4c, we evaluate our CNN on variants of two standard test patterns (USAF-1951 and the RCA Indian-head) made by the authors. The CNN-predicted holograms reproduce patterns only a few pixels wide, as shown by the magnified in-focus insets. In Extended Data Figs. 5, 6, we show four additional complex scenes (two computer-rendered and two real-world captured) and the CNN-predicted holograms.

AA-DPM
The double phase method encodes an amplitude-normalized complex hologram A e^{iφ} ∈ ℂ^{M×N} (0 ≤ A ≤ 1) into a sum of two phase-only holograms at half of the normalized maximum amplitude: A e^{iφ} = ½(e^{iφ_a} + e^{iφ_b}), where φ_a = φ + arccos(A) and φ_b = φ − arccos(A). There are many different methods to merge the two decomposed phase-only holograms into a single phase-only hologram. The original DPM 50 uses a checkerboard mask to select interleaving phase values from the two phase-only holograms. Maimone et al. 2 first discard every other pixel of the input complex hologram along one spatial axis and then arrange the two decomposed phase values along the same axis in a checkerboard pattern. The latter method produces visually comparable results but halves the complexity of the hologram calculation by avoiding computation at unused locations. Nevertheless, for complex 3D scenes, both methods produce severe artefacts around high-frequency objects and occlusion boundaries (Extended Data Fig. 7, left), because the high-frequency phase alterations present in these regions become under-sampled owing to the interleaved sampling pattern and the disposal of every other pixel. Although these artefacts can be partially suppressed by closing the aperture and cutting the high-frequency signal in the Fourier domain, this leads to substantial blurring. As sampling is inevitable, we instead borrow techniques from traditional image subsampling and introduce the AA-DPM. Specifically, we first convolve the complex hologram with a Gaussian kernel G(W_G, σ) to obtain a low-pass-filtered complex hologram Ā e^{iφ̄} ∈ ℂ^{M×N}: Ā e^{iφ̄} = G(W_G, σ) ∗ A e^{iφ}, where ∗ denotes a 2D convolution operator, W_G is the width of the 2D Gaussian kernel and σ is the standard deviation of the Gaussian distribution. In practice, we find that setting W_G no greater than 5 and σ between 0.5 and 1.5 is generally sufficient for both the rendered and captured 3D scenes used in this paper, while the exact σ can be fine-tuned based on the image statistics of the content.
For flat 2D images, σ can be further tuned down to achieve sharper results. The slightly blurred Āe^{iϕ̄} avoids aliasing during sampling and allows the Fourier filter (aperture) to be opened wide, thus resulting in a sharp and artefact-free 3D image. We also add a global phase offset to Āe^{iϕ̄} to centre the mean phase around half of the full phase-shift range of the SLM (3π in our case). The decomposed phase values ϕ̄ ± cos⁻¹(Ā) are then arranged in a checkerboard pattern to form the final phase-only hologram.
This alternating sampling pattern yields a high-frequency, phase-only hologram, which can diffract light as effectively as a random hologram, but without producing speckle noise. Extended Data Fig. 7 compares the depth-of-field images simulated for the AA-DPM and DPM, where the AA-DPM produces artefact-free images in regions with high-spatial-frequency details and around occlusion boundaries. The AA-DPM can be efficiently implemented on a GPU as two gather operations, taking less than 1 ms to convert a 1,920 × 1,080-pixel complex hologram on a single NVIDIA TITAN RTX GPU.
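The AA-DPM pipeline above (Gaussian low-pass filter, double phase decomposition, checkerboard interleaving) can be sketched in NumPy as follows. This is an illustrative reference implementation, not the authors' GPU code: the function names are our own, the checkerboard arrangement follows the full-resolution DPM variant described above, and the default phase offset of π is only a placeholder for centring the phase on a particular SLM's range.

```python
import numpy as np

def gaussian_kernel2d(width, sigma):
    """Normalised width x width Gaussian kernel G(W_G, sigma)."""
    x = np.arange(width) - (width - 1) / 2
    g = np.exp(-x**2 / (2 * sigma**2))
    k = np.outer(g, g)
    return k / k.sum()

def conv2_same(field, kernel):
    """'Same'-size 2D convolution with edge replication (small odd kernels)."""
    kh, kw = kernel.shape
    padded = np.pad(field, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode='edge')
    out = np.zeros_like(field)
    h, w = field.shape
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def aa_dpm(holo, width=5, sigma=1.0, phase_offset=np.pi):
    """Anti-aliasing double phase encoding (sketch).

    holo: amplitude-normalised complex hologram A e^{i phi}, 0 <= A <= 1.
    Returns a single real-valued phase-only hologram.
    """
    # Low-pass filter the complex hologram: Abar e^{i phibar} = G * (A e^{i phi}).
    blurred = conv2_same(holo, gaussian_kernel2d(width, sigma))
    A = np.clip(np.abs(blurred), 0.0, 1.0)
    phi = np.angle(blurred) + phase_offset  # global offset centres the mean phase

    # Double phase decomposition: A e^{i phi} = (e^{i phi1} + e^{i phi2}) / 2.
    theta = np.arccos(A)
    phi1, phi2 = phi - theta, phi + theta

    # Interleave the two phase-only holograms in a checkerboard pattern.
    checker = (np.indices(holo.shape).sum(axis=0) % 2) == 0
    return np.where(checker, phi1, phi2)
```

On a GPU, the final interleaving step maps naturally onto the two gather operations mentioned above, one per checkerboard parity.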

Holographic display prototype
Our display prototype (Extended Data Fig. 8) uses a Fisba RGBeam fibre-coupled laser and a single HOLOEYE PLUTO-2-VIS-014 liquid-crystal-on-silicon reflective phase-only SLM with a resolution of 1,920 × 1,080 pixels and a pitch of 8 μm. The laser consists of three precisely aligned diodes operating at 450 nm, 520 nm and 638 nm, and provides per-diode power control. The prototype is constructed and aligned using a Thorlabs 30-mm and 60-mm cage system and components. The fibre-coupled laser is mounted using a ferrule connector/physical contact adaptor, placed at a distance that results in an ideal diverging beam (adjustable based on the desired field of view) and linearly polarized to the x axis (horizontal) to match the incident polarization required by the SLM. A plate beam splitter mounted on a 30-mm cage cube platform splits the beam and directs it towards the SLM. After SLM modulation, the reconstructed aerial 3D image is imaged by an achromatic doublet with a 60-mm focal length. An aperture stop is placed about one focal length behind the doublet (the Fourier plane) to block higher-order diffractions. The radius of its opening is set to match the extent of the blue beam's first-order diffraction. We emphasize that this should be the maximum radius, as opening it further includes second-order diffraction from the blue beam. A 30-mm to 60-mm cage plate adaptor is then used to widen the optical path and an eyepiece is mounted to create the final retinal image. In this work, a Sony A7 Mark III mirrorless camera with a resolution of 6,000 × 4,000 pixels and a Sony 16-35 mm f/2.8 GM lens is paired to photograph and record video of the display (except Supplementary Video 4). Colour reconstruction is obtained field sequentially with a maximum frame rate of 20 Hz, limited by the SLM's 60-Hz refresh rate. A Labjack U3 USB DAQ is deployed to send field-sequential signals and synchronize the display of colour-matched phase-only holograms.
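The aperture sizing rule above can be checked with a back-of-envelope calculation: an SLM with pixel pitch p diffracts the first order out to a half-angle of roughly λ/(2p), so at the Fourier plane of a lens with focal length f the first-order extent is r ≈ fλ/(2p) under the small-angle approximation. The snippet below is our own illustrative estimate using the quoted prototype parameters; the exact mechanical aperture in the setup may differ.

```python
import math

def first_order_radius(wavelength_m, pixel_pitch_m, focal_length_m):
    """Half-extent of the SLM's first diffraction order at the Fourier
    plane of a lens: r = f * lambda / (2 * p), small-angle approximation."""
    return focal_length_m * wavelength_m / (2 * pixel_pitch_m)

# Prototype parameters: 8-um SLM pitch, 60-mm doublet, 450-nm blue diode.
r_blue = first_order_radius(450e-9, 8e-6, 60e-3)
print(f"{r_blue * 1e3:.2f} mm")  # ~1.69 mm
```

Because diffraction orders repeat at multiples of fλ/p, the second order of the blue beam begins immediately beyond this radius, which is why opening the stop any further admits it.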
Each hologram is quantized to 8 bits to match the bit depth of the SLM. For the results shown in Fig. 3b, Extended Data Figs. 9, 10a, we used a Meade Series 5000 21-mm MWA eyepiece. For the results shown in Fig. 3c, d, Supplementary Videos 3, 4, Extended Data Fig. 10b, we used an Explore Scientific 32-mm eyepiece. Photographs were captured by exposing each colour channel for 1 s. The long exposure time improves the signal-to-noise ratio and colour accuracy. Supplementary Video 3 was captured at 4K/30 Hz and downsampled to 1080P. Supplementary Video 4 was captured by a Panasonic GH5 mirrorless camera with a Lumix 10-25 mm f/1.7 lens at 4K/60 Hz (a colour frame rate of 20 Hz) and downsampled to 1080P. No post-capture sharpening, denoising or despeckling was applied to the captured videos and photographs. Finally, our setup can be further miniaturized to an eyeglass form factor, as demonstrated by Maimone et al.².
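The 8-bit quantization step mentioned above amounts to wrapping each phase value and mapping it onto the SLM's discrete levels. The sketch below is illustrative only: it assumes a linear look-up table over a 2π phase range, whereas a real SLM (including the PLUTO-2 used here, whose usable range exceeds 2π) is driven through a calibrated, wavelength-dependent look-up table.

```python
import numpy as np

def quantize_phase(phi, bits=8, phase_range=2 * np.pi):
    """Wrap phase into [0, phase_range) and quantize to 2**bits levels.
    Assumes a linear SLM look-up table over phase_range (illustrative)."""
    levels = 2 ** bits
    wrapped = np.mod(phi, phase_range)
    # Round to the nearest level; clamp the top edge so 2*pi maps to level 255.
    return np.minimum(np.round(wrapped / phase_range * levels),
                      levels - 1).astype(np.uint8)
```

The resulting uint8 array can be sent to the SLM as a grey-scale frame, one per colour field in the field-sequential scheme described above.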

Data availability
Our hologram dataset (MIT-CGH-4K) and the trained CNN model will be made publicly available (on GitHub) along with the paper.

Code availability
The code to evaluate the trained CNN model will be made publicly available (on GitHub) along with the paper. Additional code is available from the corresponding authors upon reasonable request.