Toward the next-generation VR/AR optics: a review of holographic near-eye displays from a human-centric perspective

Wearable near-eye displays for virtual and augmented reality (VR/AR) have seen enormous growth in recent years. While researchers are exploiting a plethora of techniques to create life-like three-dimensional (3D) objects, there is a lack of awareness of the role of human perception in guiding hardware development. An ultimate VR/AR headset must integrate the display, sensors, and processors in a compact enclosure that people can comfortably wear for a long time, while delivering a superior immersive experience and user-friendly human–computer interaction. Compared with other 3D displays, the holographic display has unique advantages in providing natural depth cues and correcting eye aberrations. It therefore holds great promise to be the enabling technology for next-generation VR/AR devices. In this review, we survey the recent progress in holographic near-eye displays from a human-centric perspective.


INTRODUCTION
The near-eye display is the enabling platform for virtual reality (VR) and augmented reality (AR) [1], holding great promise to revolutionize healthcare, communication, entertainment, education, manufacturing, and beyond. An ideal near-eye display must be able to provide a high-resolution image within a large field of view (FOV) while supporting accommodation cues and a large eyebox with a compact form factor. However, we are still a fair distance away from this goal because of various tradeoffs among the resolution, depth cues, FOV, eyebox, and form factor. Alleviating these tradeoffs, therefore, has opened an intriguing avenue for developing next-generation near-eye displays.
The key requirement for a near-eye display system is to present a natural-looking three-dimensional (3D) image for a realistic and comfortable viewing experience. Early-stage techniques, such as binocular displays [2,3], provide 3D vision through stereoscopy.
With binocular vision, humans have a horizontal FOV of almost 200°, of which 120° is binocularly overlapped. Figure 2 shows the horizontal extent of the angular regions of the human binocular vision system [17]. The vertical field of vision is approximately 130°. Adapting to the eye's natural field of vision is vital for an immersive experience. For example, there is a consensus that the minimum required FOVs for VR and AR displays are 100° and 20°, respectively [18]. Within the eye's visual field, the photoreceptor density varies significantly between the central and peripheral areas [19]. The ability of the eye to resolve small features is referred to as visual acuity, and it is highly dependent on the photoreceptor density. The central area of the retina, known as the fovea, has the highest photoreceptor density, spanning a FOV of 5.2°. Outside the fovea, the visual acuity drops dramatically. This non-uniform distribution of photoreceptors is a result of natural evolution, matching the on-/off-axis resolution of the eye lens. Therefore, to efficiently utilize the display pixels, a near-eye display should be optimized to provide a varying resolution that matches the visual acuity across the retina.
As a 3D vision system, the human visual system relies on depth cues, which can be categorized into physiological and psychological cues. Physiological depth cues include accommodation, vergence, and motion parallax. Vergence is the rotation of the two eyes in opposite directions so that their lines of sight converge on the object, while accommodation is the ability to adjust the optical power of the crystalline lens so that the eye can focus at different distances. For near-eye displays, it is crucial to present images at locations where the vergence distance equals the accommodation distance. Otherwise, the mismatch between the two, known as the vergence-accommodation conflict (VAC), will induce eye fatigue and discomfort [20]. Psychological depth cues arise from our experience of the world, such as the linear perspective, occlusion, shading, shadows, and texture gradients embedded in a 2D image [21].
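The mismatch above is commonly quantified in diopters, as the difference between the reciprocal vergence and accommodation distances. The short sketch below works through one case; the distances are illustrative assumptions of ours, not values from the cited studies.

```python
# Quantifying the vergence-accommodation conflict (VAC) in diopters.
# Illustrative example: a stereoscopic display with a fixed focal plane at
# 2 m renders an object whose binocular disparity drives vergence to 0.5 m.
focal_plane_m = 2.0                     # accommodation distance (fixed)
vergence_m = 0.5                        # vergence distance (content-driven)
mismatch_D = abs(1 / vergence_m - 1 / focal_plane_m)
print(f"VAC mismatch: {mismatch_D:.1f} D")   # 1.5 D
```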
Last, the human eye constantly moves to acquire, fixate on, and track visual stimuli. To ensure the displayed image is always visible, the exit pupil of the display system, known as the eyebox, must be larger than the eye movement range, which is ~12 mm. For a display system, the product of the eyebox and FOV is proportional to the space-bandwidth product of the device, which is finite. Therefore, increasing the eyebox reduces the FOV and vice versa. Although the device's space-bandwidth product can be improved through complicated optics, doing so often compromises the form factor. To alleviate this problem, common strategies include duplicating the exit pupil into an array [22-24] and employing eye-tracking devices [25].
Among 3D display technologies, the holographic display stands out by enabling a small form factor because of its simple optics: the key components consist of only a coherent illumination source and a spatial light modulator (SLM). The SLM is essentially a pixelated display that can modulate the incident light's amplitude or phase by superimposing a computer-generated hologram (CGH) onto the wavefront.
For AR near-eye displays, the device must integrate an optical combiner, and its form factor must be accounted for. Most early-stage holographic near-eye displays use a beam splitter as an optical combiner, which combines the reflected light from the SLM and the transmitted light from real-world objects [Fig. 3(a)]. The modulated light from the SLM can be guided directly to the eye pupil through free-space propagation [26] or pass through an additional 4f system for hologram relay [Fig. 3(b)] [27] or frequency filtering [28,29]. Ooi et al. also embedded multiple beam splitters into a single element for see-through realization with a large eyebox in electro-holography glasses [30].
Despite being widely used in proof-of-concept demonstrations, the beam splitter is not an ideal choice regarding the form factor because of its cubic shape. As an alternative, the holographic optical element (HOE) is thin and flat, yet it functions like a beam splitter. As a volume hologram, the HOE modulates only the Bragg-matched incident light while leaving the Bragg-mismatched light as is. When implemented in a holographic near-eye display, the HOE redirects the Bragg-matched light from the SLM to the eye pupil and transmits the light from the real-world scene without adding optical power. Moreover, the ability to record diverse wavefront functions into the HOE allows it to replace many conventional optical components, such as lenses and gratings, further reducing the volume of the device.
The first implementation of the HOE in a near-eye display was reported by Ando et al. in 1999 [33]. In a holographic near-eye display, the HOE is usually deployed in an off-axis configuration [Fig. 4(a)], where the light from the SLM impinges on the HOE at an oblique incident angle. The HOE can be fabricated as a multi-functional device, integrating the functions of mirrors and lenses. For example, the HOE has been employed as an elliptically curved reflector [31,34] and a beam expander [35]. Curved/bent HOEs have also been reported to reduce the form factor and to improve the FOV and image quality [36]. To further flatten the optics and miniaturize the system, Martinez et al. used a combination of a HOE and a planar waveguide to replace the off-axis illumination module [37]. A typical setup is shown in Fig. 4(b), where the light is coupled into and out of the waveguide through HOEs [32,38,39]. Notably, conventional HOEs are fabricated by two-beam interference [40]. Limited by the small refractive index modulation induced in the holographic recording material, conventional HOEs work only for monochromatic light within a narrow spectral and angular bandwidth. Recent advances in metasurfaces [41,42] and liquid crystals [43,44] provide alternative solutions. For example, Huang et al. demonstrated a multicolor and polarization-selective all-dielectric near-eye display system by using a metasurface structure as the CGH to demultiplex light of different wavelengths and polarizations [41,42].

B. Visual Comfort
1. Eyebox
In lensless holographic near-eye displays, the active display area of the SLM determines the exit pupil of the system and thereby the eyebox. However, to add eye relief, we must axially shift the exit pupil away from the SLM plane. We can use a 4f relay for this purpose [Fig. 5(a)] and magnify the pupil. In this configuration, the CGH can be computed directly using numerical diffraction propagation algorithms, such as the Fresnel integral or angular spectrum method [47]. However, the use of the optical relay increases the device's form factor. Alternatively, we can rely on the CGH alone to shift the exit pupil [Fig. 5(b)]. In this case, we can use a double-step diffraction algorithm to compute the CGH, where the desired eyebox location serves as an intermediate plane to relay the calculated hologram.
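As a concrete illustration of the computation path above, the sketch below implements a band-limited angular spectrum propagator and uses it in the double-step manner just described: the target field is propagated to the desired eyebox plane first, then back to the SLM plane. This is a minimal sketch; the function names, sampling parameters, distances, and the flat placeholder target are our own assumptions, not the pipeline of [47].

```python
import numpy as np

def angular_spectrum(field, wavelength, pitch, z):
    """Propagate a sampled complex field by distance z (meters) using the
    angular spectrum method; evanescent components are discarded."""
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=pitch)          # spatial frequencies (1/m)
    fy = np.fft.fftfreq(ny, d=pitch)
    FX, FY = np.meshgrid(fx, fy)
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    kz = 2 * np.pi / wavelength * np.sqrt(np.maximum(arg, 0.0))
    H = np.exp(1j * kz * z) * (arg > 0)       # transfer function
    return np.fft.ifft2(np.fft.fft2(field) * H)

# Double-step idea: back-propagate the target field to the eyebox plane,
# then from the eyebox plane to the SLM plane to obtain the hologram.
wavelength, pitch = 532e-9, 8e-6
target = np.ones((1024, 1024), dtype=complex)          # placeholder image
at_eyebox = angular_spectrum(target, wavelength, pitch, -30e-3)
hologram = np.angle(angular_spectrum(at_eyebox, wavelength, pitch, -20e-3))
```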
For a comfortable viewing experience, the near-eye display must provide an eyebox that is larger than the eye movement range. Active pupil tracking and passive pupil replication are the two main strategies. A representative schematic of active pupil tracking is shown in Fig. 5(c). According to the detected eye position, a micro-electro-mechanical-system (MEMS) mirror changes the incident angle of light on the SLM, adding a linear phase to the diffracted wave. After being focused by another HOE, the wavefront converges at the exit pupil, which dynamically follows the eye's movement [45]. While the active method requires additional components to track the eye pupil's movement, the passive method directly replicates the exit pupil into an array, thereby effectively expanding the eyebox. Park and Kim [48] demonstrated expansion of the eyebox along the horizontal axis by multiplexing three different converging beams on a photopolymer film. Later, they extended this work to the vertical axis by leveraging the high-order diffractions of the SLM [Fig. 5(d)] [46]. Jeong et al. developed another passive eyebox expansion method using a custom HOE fabricated by holographic printing [49]. The resultant method expands the eyebox along both the horizontal and vertical axes while maintaining a large FOV (50°). The passive eyebox expansion method using holographic modulation has also been implemented in Maxwellian displays [50,51]. Conventional Maxwellian display optics uses refractive lenses to reduce the effective exit pupil to a pinhole, thereby rendering an all-in-focus image. In contrast, a holographic Maxwellian display replaces the refractive lenses with holograms, modulating the wavefront into an array of focused "pinholes" for eyebox expansion. The duplication of a pinhole-shaped pupil was accomplished by multiplexing concave mirrors into a single waveguide HOE [Fig. 6(a)] [50] or by numerically encoding the hologram with multiple off-axis converging spherical waves [Fig. 6(b)] [51].

2. Speckle Noise
Holographic displays commonly use lasers as the illumination source due to the coherence requirement. However, the use of lasers induces speckle noise, a grainy intensity pattern superimposed on the image. To suppress the speckle noise, we can adopt three strategies: superposition, constructive shaping of the spatial coherence, and reduction of the temporal coherence.
In the superposition-based method, we first calculate a series of holograms for the same 3D object by adding statistically independent random phases to the complex amplitude, followed by sequentially displaying them on the SLM within the eye's response time [Fig. 7(a)]. Due to the spatial randomness, the average contrast of the summed speckles in the reconstructed 3D image is reduced, leading to improved image quality [52]. This method has been used primarily in layer-based 3D models [27,53], where the calculation of the holograms is computationally efficient. Alternatively, we can use high-speed speckle averaging devices, such as a rotating or vibrating diffuser. Figure 7(b) shows an example of superposition by placing a rotating diffuser in front of the laser source. In this case, the speckle patterns are averaged within the SLM's frame time at the expense of an increased form factor.
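The contrast reduction from averaging statistically independent speckle patterns follows the familiar 1/√N law, which the minimal simulation below reproduces. The frame count and image size are arbitrary choices of ours, not values from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruct_with_random_phase(amplitude):
    """One holographic reconstruction: random initial phase -> far field."""
    phase = rng.uniform(0, 2 * np.pi, amplitude.shape)
    field = np.fft.fft2(amplitude * np.exp(1j * phase))
    return np.abs(field) ** 2

amp = np.ones((256, 256))            # flat target, so any texture is speckle
frames = [reconstruct_with_random_phase(amp) for _ in range(24)]

def contrast(img):
    return img.std() / img.mean()    # speckle contrast, ~1 for full speckle

print("single frame :", contrast(frames[0]))               # ~1.0
print("24-frame avg :", contrast(np.mean(frames, axis=0))) # ~1/sqrt(24)
```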
The speckle noise is caused by destructive spatial interference among the overlapped imaged point spread functions [54]. We can alleviate this problem by actively shaping the spatial coherence. Within this category, the most important method is complex amplitude modulation, which introduces constructive interference to the hologram. Rather than imposing random phases on the wavefront, complex amplitude modulation uses a "smooth" phase map, such as a uniform phase, to produce constructive interference among the overlapped imaged spots. To synthesize a desired complex amplitude field using a phase-only or amplitude-only CGH, we can use analytic multiplexing [55,56], double-phase decomposition [57,58], double-amplitude decomposition [26,29], optimization-enabled phase retrieval [59-61], or neural holography [62]. Figure 7(c) shows an example where the complex amplitude wavefront is analytically decomposed into two phase-only holograms loaded into different zones of an SLM. The system then uses a holographic grating filter to combine the holograms and reconstruct complex amplitude 3D images [29]. The major drawback of this method is the requirement that the complex wavefront contain only low frequencies, resulting in a small numerical aperture at the object side. This increases the depth of field and thereby weakens the focus cue. As an alternative solution, Makowski et al. introduced a spatial separation between adjacent image pixels to avoid overlaps in the image space, thereby eliminating spurious interference [54].
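Among the encodings listed above, double-phase decomposition is simple enough to sketch in a few lines: any complex field with normalized amplitude is the sum of two unit-amplitude phasors. The sketch below is a generic illustration of that identity, not necessarily the exact configuration of [57,58].

```python
import numpy as np

def double_phase(complex_field):
    """Decompose a complex field into two phase-only holograms.

    Uses exp(i*t1) + exp(i*t2) = 2*cos((t1-t2)/2) * exp(i*(t1+t2)/2),
    so t1,2 = phase +/- arccos(amplitude) reproduces the field (times 2).
    """
    amp = np.abs(complex_field)
    amp = amp / amp.max()                 # normalize so arccos is defined
    phs = np.angle(complex_field)
    delta = np.arccos(amp)
    return phs + delta, phs - delta       # two phase-only CGHs

rng = np.random.default_rng(1)
shape = (512, 512)
field = rng.random(shape) * np.exp(1j * rng.uniform(0, 2 * np.pi, shape))
h1, h2 = double_phase(field)
recovered = 0.5 * (np.exp(1j * h1) + np.exp(1j * h2))
print("max error:", np.max(np.abs(recovered - field / np.abs(field).max())))
```

In practice the two phase holograms are often interleaved (e.g., in a checkerboard) on a single SLM; the snippet above only verifies the decomposition identity itself.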
Last, using a light source with low temporal coherence can also suppress the speckle noise. Partially coherent light sources, such as a superluminescent light-emitting diode (sLED) and a micro LED (mLED), are usually employed for this purpose [63]. Also, a spatial filter can be applied to an incoherent light source to shape its spatial coherence while leaving it temporally incoherent. As an example, the spatially filtered LED light source has been demonstrated in holographic displays to reduce the speckle noise [64,65]. Recently, Olwal et al. extended this method to an LED array and demonstrated high-quality 3D holography [66]. The drawbacks of using a partially coherent light source include reduced image sharpness and a shortened depth range. However, the latter can be compensated for by reconstructing the holographic image at a small distance from the SLM, followed by using either a tunable lens [67] or an eyepiece [68] to extend the depth range.
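The speckle-suppression benefit of these sources comes from their short coherence length, roughly λ²/Δλ. The quick calculation below uses typical order-of-magnitude linewidths, which are our assumptions rather than specifications from the cited references.

```python
# Coherence length L_c ≈ λ² / Δλ: a quick check of why broadband sources
# suppress speckle. Linewidths below are typical orders of magnitude only.
for name, lam, dlam in [("single-mode laser", 532e-9, 1e-12),
                        ("sLED",              650e-9, 20e-9),
                        ("LED",               530e-9, 35e-9)]:
    print(f"{name:18s} L_c ≈ {lam**2 / dlam * 1e6:12.1f} µm")
```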
3. Accommodation
The accommodation cue in holographic displays can be accurately addressed by computing the CGH based on physical optical propagation and interference. In physically based CGH calculation, the virtual 3D object is digitally represented by a wavefront from point emitters [31,69] or polygonal tiles [70]. These two models usually require dense point-cloud or mesh sampling to reproduce a continuous and smooth depth cue. Although there are many methods to accelerate the CGH calculation from point-cloud or polygon-based 3D models, such physically based CGHs face challenges in processing the enormous data in real time. Moreover, occlusion is often not easily implemented in physically based CGHs, limiting virtual 3D objects to simple geometries.
The advance of computer graphics has led to two image-rendering models that can help a holographic display produce the accommodation cue more efficiently: the layer-based model and the stereogram-based model. The layer-based model renders a 3D object as multiple depth layers, followed by propagating the layer-associated wavefronts to the hologram through fast-Fourier-transform (FFT)-based diffraction [71-73]. To render a continuous 3D object with finite depth layers, Akeley et al. developed a depth-weighted blending (or depth-fused) algorithm [74,75]. This algorithm distributes a point's intensity between the neighboring depth planes in proportion to the point's dioptric proximity to each plane along the viewer's line of sight (Fig. 8), using a linear [76] or nonlinear [77] blending model; a minimal example follows below. Other rendering methods include using an optimization algorithm to compute the layer contents that best match the imaged scene at the retina when the eye focuses at different distances [78], or decomposing a color 3D scene into multiple binary images for a digital micromirror device (DMD)-based display [79]. As an alternative way to create continuous depth views from multiplane images, we can also display high-density stacks of depth/focal planes from CGHs using high-speed focus-tunable optics [80,81].
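A minimal version of the linear blending rule is sketched below; the plane placement and the point depth are arbitrary illustrative values.

```python
import numpy as np

def linear_blend_weights(point_diopters, plane_diopters):
    """Linear depth-weighted blending: split a point's intensity between the
    two depth planes bracketing it, weighted by dioptric proximity."""
    planes = np.asarray(plane_diopters)          # must be sorted ascending
    hi = np.searchsorted(planes, point_diopters)
    lo = max(hi - 1, 0)
    hi = min(hi, len(planes) - 1)
    if lo == hi:                                 # outside the plane stack
        return {lo: 1.0}
    t = (point_diopters - planes[lo]) / (planes[hi] - planes[lo])
    return {lo: 1.0 - t, hi: t}                  # plane index -> weight

# A point at 1.6 D rendered on planes at 0.5, 1.0, 2.0, 3.0 diopters:
print(linear_blend_weights(1.6, [0.5, 1.0, 2.0, 3.0]))  # {1: 0.4, 2: 0.6}
```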
Despite being computationally efficient, the layer-based model has difficulty rendering view-dependent visual effects, such as occlusion, shading, reflectance, refraction, and transparency. In contrast, the holographic stereogram model can provide both the accommodation cue and view-dependent visual effects. As illustrated in Fig. 9(a), this model first calculates the light field of a 3D object using a ray-based propagation method. Then it computes the CGH by converting the light field into a complex wavefront [82]. Simply put, the CGH is spatially partitioned into small holographic elements, referred to as "hogels," that direct light rays (plane waves) in varied directions to form the corresponding view images. Like the light field display, the holographic stereogram requires a choice of hogel size, imposing a hard tradeoff between spatial and angular resolutions. Notably, this tradeoff has recently been mitigated by two non-hogel-based methods. The first method encodes a compressed light field into the hologram [83]. The second method uses an overlap-add stereogram (OLAS) algorithm to convert the dense light field data into a hologram [Fig. 9(a)] [84,85], enabling more efficient computation and improving image quality.
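The hogel construction described above maps each spatial block's ray distribution to a local angular spectrum, which an inverse FFT converts into a wavefront patch. The sketch below assumes a deliberately simplified light-field parameterization (one square grid of directions per hogel) and a random initial phase; it is an illustration of the idea rather than the algorithm of [82].

```python
import numpy as np

def stereogram_cgh(light_field, hogel=16):
    """Basic hogel stereogram: each block's ray intensities become the local
    angular spectrum, realized by one inverse FFT per hogel.

    light_field: array (Hy, Hx, hogel, hogel) giving, for each hogel, the
    intensity of rays over a hogel x hogel grid of discrete directions
    (an assumed, simplified parameterization).
    """
    Hy, Hx, _, _ = light_field.shape
    cgh = np.zeros((Hy * hogel, Hx * hogel), dtype=complex)
    rng = np.random.default_rng(0)
    for iy in range(Hy):
        for ix in range(Hx):
            rays = np.sqrt(light_field[iy, ix])               # amplitudes
            rays = rays * np.exp(1j * 2 * np.pi * rng.random(rays.shape))
            block = np.fft.ifft2(np.fft.ifftshift(rays))      # local patch
            cgh[iy*hogel:(iy+1)*hogel, ix*hogel:(ix+1)*hogel] = block
    return np.angle(cgh)                                      # phase-only

lf = np.random.default_rng(1).random((32, 32, 16, 16))
cgh = stereogram_cgh(lf)   # 512 x 512 phase-only hologram
```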
The Maxwellian display is another strategy to mitigate the VAC by completely removing the accommodation cues [87,88]. Because the display needs to render only an all-in-focus image, the computational cost is minimized. In a holographic Maxwellian near-eye display, the complex hologram is a superposition of a real target image and a spherical phase. The light emitted from this hologram enters the eye pupil and forms an all-in-focus image on the retina, as depicted in Fig. 9(b) [51,89]. The flexible phase modulation of the CGH allows correcting wavefront errors in the pupil to produce Maxwellian retinal images for astigmatic eyes [89]. Lee et al. also developed a multi-functional system that can switch between the holographic 3D view and the Maxwellian view using a switchable HOE [90] or display both simultaneously using temporal multiplexing [91]. Later, they further improved this method by using a foveated image rendering technique [92].
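The hologram construction for such a display is compact enough to write down directly: the target image's amplitude multiplied by a converging spherical phase focused at the eye pupil. The parameters below are illustrative, and the pupil-steering comment reflects the eyebox-expansion idea discussed earlier rather than a specific cited implementation.

```python
import numpy as np

# Minimal sketch of a holographic Maxwellian hologram: the target image's
# amplitude times a converging spherical phase that focuses all light
# through a pinhole-like spot at the eye pupil. Illustrative parameters.
wavelength, pitch, f = 532e-9, 8e-6, 50e-3     # focus at the pupil plane
n = 1024
x = (np.arange(n) - n / 2) * pitch
X, Y = np.meshgrid(x, x)
spherical = np.exp(-1j * np.pi * (X**2 + Y**2) / (wavelength * f))
target = np.random.default_rng(0).random((n, n))    # stand-in image
maxwellian_cgh = np.sqrt(target) * spherical        # complex hologram
# Shifting the quadratic phase center (X -> X - dx) steers the focal spot,
# which is how multiplexed copies of this phase expand the eyebox.
```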

4. Full-Color Display
Displaying colors is challenging for holographic near-eye displays because the CGH is sensitive to the wavelength. There are two methods for full-color reproduction: temporal division [86,93] and spatial division [94,95]. Temporal division calculates three sub-CGHs for the RGB components of a 3D object, followed by displaying them sequentially on a single SLM. Accordingly, an RGB laser source illuminates each sub-CGH at the corresponding wavelength [28,31,93]. Figure 10(a) shows a typical setup where each sub-CGH is sequentially illuminated by the RGB laser source. Double-phase encoding and frequency grating filtering are employed for complex amplitude modulation in each color channel [86]. In contrast, spatial division simultaneously displays three CGHs for the RGB colors on three SLMs [94] or on different zones of a single SLM [95], illuminated by lasers of the corresponding wavelengths. The reconstructed RGB holographic images are then optically combined and projected onto the retina. However, due to the use of multiple SLMs and lasers, the resultant systems generally suffer from a large form factor. To address this problem, Yang et al. developed a compact color rainbow holographic display [96], where only a single encoded hologram is displayed on the SLM under white LED illumination. A slit spatial filter is then used in the frequency domain to extract the RGB colors.
For AR applications, several full-color waveguide HOEs have been developed to combine virtual and real-world images. To enable the transmission of multiple wavelengths, we can record a multilayer structure in the HOE, with each layer responding to a different color [97]. Alternatively, a metasurface component can be used for color mixing [41]. An example is shown in Fig. 10(b). The RGB illumination light is coupled into the waveguide through a single-period grating at different incident angles. After being transmitted to the eye side and coupled out through a binary metasurface CGH, the light recombines and forms a multicolor holographic image in the far field.

A. Field of View
The FOV is defined as the angular extent over which the displayed object spans. In holographic displays, the SLM (loaded with a Fresnel or Fourier hologram) is generally located at the exit pupil of the display system, so the diffraction angle $\theta_{\max}$ of the SLM determines the maximum size of the holographic image and, therefore, the system's FOV. Given the SLM pixel size $p$ and the cone angle of the incident beam $\theta_{\mathrm{in}}$, the maximum diffraction angle can be calculated as [98]

$$\theta_{\max} = \sin^{-1}\!\left(\frac{\lambda}{2p} + 2\sin\frac{\theta_{\mathrm{in}}}{2}\right) \approx \sin^{-1}\!\left(\frac{\lambda}{2p}\right) + \theta_{\mathrm{in}}. \tag{1}$$

If we consider plane-wave illumination, we can simplify Eq. (1) as $\theta_{\max} = \sin^{-1}(\lambda/2p)$. To create an eye relief for comfortable wearing, a typical near-eye display usually uses an eyepiece to relay the wavefront on the SLM to the eye pupil, as shown in Fig. 11(a). Given a distance $d_{\mathrm{slm}}$ between the SLM and the eyepiece, the eye relief $d_{\mathrm{eye}}$ can be calculated as $d_{\mathrm{eye}} = |1/(1/f - 1/d_{\mathrm{slm}})|$, where $f$ is the focal length of the eyepiece. The FOV can then be calculated as the diffraction angle of the relayed SLM through the eyepiece,

$$\mathrm{FOV} = 2\sin^{-1}\!\left(\frac{\lambda}{2Mp}\right), \tag{2}$$

where $M = d_{\mathrm{eye}}/d_{\mathrm{slm}}$ is the magnification of the eyepiece.
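Plugging typical numbers into Eqs. (1) and (2) makes the scale of the problem concrete. The pixel sizes match the range quoted below, while the focal length and SLM distance are arbitrary illustrative choices of ours.

```python
import numpy as np

# Numbers behind Eqs. (1)-(2): diffraction angle and relayed FOV for a
# typical SLM under plane-wave illumination. Illustrative parameters only.
lam = 532e-9
for p in [3e-6, 8e-6, 12e-6]:
    theta_max = np.degrees(np.arcsin(lam / (2 * p)))
    print(f"pitch {p*1e6:4.1f} um -> theta_max = {theta_max:4.2f} deg")

f, d_slm, p = 30e-3, 45e-3, 8e-6             # eyepiece focal length, SLM dist
d_eye = abs(1 / (1 / f - 1 / d_slm))         # eye relief (image distance)
M = d_eye / d_slm                            # magnification of the relay
fov = 2 * np.degrees(np.arcsin(lam / (2 * M * p)))
print(f"d_eye = {d_eye*1e3:.0f} mm, M = {M:.2f}, FOV = {fov:.2f} deg")
```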
The typical pixel size of an SLM varies from 3 μm to 12 μm. Therefore, the diffraction angle $\theta_{\max}$ is generally less than 5° under plane-wave illumination. To expand the FOV, we can use two strategies to increase the diffraction angle for a given SLM: pupil relay and spherical-wave illumination [68]. The first strategy uses a 4f relay system to de-magnify the SLM [Fig. 11(b)]. The imaged pixel size becomes smaller, leading to a larger diffraction angle $\theta_{\max}$ according to Eq. (1) and, therefore, a larger FOV. The second strategy uses a spherical wavefront to illuminate the SLM [Fig. 11(c)] [99], increasing the diffraction angle and thereby the FOV. In another implementation, Su et al. used an off-axis holographic lens to replace the spherical-wave illumination, introducing the quadratic wavefront modulation after the SLM [100].
Despite being simple to implement, both strategies above sacrifice the eyebox size because of the finite space-bandwidth product. To alleviate this tradeoff, we can use multiple SLMs in a planar or curved configuration [101-107]. However, this method cannot be readily applied to the near-eye display because of the large form factor. A more practical approach is to use an SLM with a reduced pixel size, thereby allowing more pixels to be packed in the same area. Further reducing the pixel size of current liquid-crystal-on-silicon (LCoS)- or DMD-based SLMs is challenging due to manufacturing constraints [108-110]. In contrast, the dielectric metasurface holds great promise in this aspect: it can encode CGHs with a pixel size at the subwavelength scale (~300 nm), therefore enabling a much larger FOV (~60°) than current systems. Moreover, the metasurface hologram can control the polarization state of light at the pixel level [111], allowing more degrees of freedom for increasing the space-bandwidth product. As a novel technique, dielectric metasurface holography still faces many challenges, such as the lack of algorithms for nanoscale and vectorial diffraction, a high-cost and time-consuming fabrication process, and the inability to dynamically modulate the wavefront.
As an alternative solution, temporal multiplexing can be employed to increase the FOV at the expense of a reduced frame rate. Figure 12(a) illustrates a representative system using a temporal division and spatial tiling (TDST) technique [98]. Two CGHs with spatially separated diffractions are displayed on the SLM sequentially. The output images are then tiled on a curved surface, increasing the horizontal FOV by a factor of four. A similar method that uses resonant scanners has also been reported [112,113]. Figure 12(b) shows another temporal-multiplexing setup that utilizes the high-order diffractions of an SLM [114]. Li et al. tiled different diffraction orders of the SLM in the horizontal direction and passed one specific order at a time through a synchronized electro-shutter, demonstrating a three-fold improvement in the horizontal FOV.

B. Resolution and Foveated Rendering
Most current commercial near-eye displays provide an angular resolution of 10-15 pixels per degree (ppd), whereas the acuity of a normal adult is about 60 ppd. The resulting pixelated image jeopardizes the immersive experience. The foveated display technique has been exploited to improve the resolution in the central vision (~±5°) while rendering the peripheral areas with fewer pixels. When combined with eye tracking, the foveated display can provide a large FOV with an increased perceived resolution. A common implementation is to use multiple display panels to create images of varied resolutions [115,116]. The foveal and peripheral display regions can also be dynamically driven by gaze tracking using a traveling micro-display and a HOE [117].
In holographic near-eye displays, foveated image rendering has been used primarily to reduce the computational load [118-123]. Figures 13(a) and 13(b) illustrate examples of using foveated rendering to generate CGHs for the point-cloud- [118-120], polygon- (mesh-) [121], and multiplane-based models [123]. All these implementations render a high-resolution image only for the fovea while reducing the resolution in peripheral areas, thereby significantly reducing the computation time for the CGH. To render a foveal image that follows the eye's gaze angle, we must update the CGH accordingly. Figures 13(c) and 13(d) show the difference in image resolution when the eye gazes at different locations for the polygon- [121] and multiplane-based [123] models, respectively.
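The common computational core of these schemes is a sampling-density map that falls off with retinal eccentricity. The sketch below assumes a simple hyperbolic acuity falloff and an arbitrary pixels-to-degrees mapping; the cited works each use their own falloff models.

```python
import numpy as np

def acuity_falloff(eccentricity_deg, fovea_deg=2.6, slope=0.3):
    """Relative target resolution vs. retinal eccentricity: full resolution
    inside the fovea (half-angle ~2.6 deg, matching the ~5.2 deg span in the
    text), hyperbolic falloff outside. The slope is an assumed constant."""
    e = np.maximum(np.abs(eccentricity_deg) - fovea_deg, 0.0)
    return 1.0 / (1.0 + slope * e)

# Map display pixels to eccentricity (gaze at center, ~0.02 deg per pixel).
h, w = 1080, 1920
yy, xx = np.mgrid[0:h, 0:w]
ecc = 0.02 * np.hypot(xx - w / 2, yy - h / 2)
density = acuity_falloff(ecc)      # 1.0 at the fovea, ~0.1 at ~30 deg
# A CGH renderer can subsample object points in proportion to 'density',
# which is the essence of the foveated schemes cited above.
```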

Gesture and Haptic Interaction
Human-computer interaction is indispensable for enhancing the immersive experience. Gesture and haptic feedback have been widely applied in VR systems to let a user experience virtual objects in the displayed environment, providing a more realistic sensation that mimics the physical interaction process [124]. While haptic techniques rely on wearable devices such as haptic gloves, gesture input manipulates virtual objects in real time through hand and finger movements, which can be detected by a motion sensor. A state-of-the-art interactive bench-top holographic display was reported by Yamada et al. [125]. They used the CGH to display a holographic 3D image while employing a motion sensor (Leap Motion Inc.) to detect hand and finger gestures with high accuracy. However, real-time CGH display with gesture interaction has yet to be explored in holographic near-eye displays.

Real-Time CGH Calculation
For current holographic near-eye displays, CGHs are usually calculated offline due to the enormous 3D data involved. For real-time interaction, fast CGH calculation is critical, which can be achieved through fast algorithms and hardware improvements.
To calculate the CGH, we commonly use algorithms based on the point-cloud, polygon, ray, and multiplane models. In the point-cloud model, the 3D object is rendered as a large number of point sources. The calculation of the corresponding CGH can be accelerated using lookup-table (LUT) [126-128] and wavefront recording plane (WRP) methods [129]. Yet, it still requires significant storage and a high data transfer rate. The polygon-based model depicts the 3D object as an aggregate of small planar polygons [70]. It is faster than the point-cloud model because it lowers the sampling rate and uses FFT operations [130,131]. However, the reconstructed image suffers from fringe artifacts [132]. The ray-based model is faster than the point-cloud and polygon models in CGH calculation [133,134]. Nonetheless, the capture or rendering of the light field incurs additional computational cost. So far, the multiplane model has been considered the best option for real-time interactive holographic near-eye displays because it involves only a finite number of FFT operations between the planes and thereby offers the fastest CGH calculation speed [27]. Notably, a recent study showed that machine learning can further boost CGH calculation for the multiplane model, enabling real-time generation of full-color holographic images at 1080p resolution [62].
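A minimal sketch of the multiplane model illustrates why it is fast: one FFT-based propagation per layer, summed at the SLM plane. The angular-spectrum propagator, naive phase-only encoding, and random initial phases below are generic choices of ours, not the exact pipeline of [27] or [62].

```python
import numpy as np

def asm_kernel(shape, pitch, wavelength, z):
    """Angular-spectrum transfer function for free-space propagation by z."""
    fy = np.fft.fftfreq(shape[0], d=pitch)
    fx = np.fft.fftfreq(shape[1], d=pitch)
    FX, FY = np.meshgrid(fx, fy)
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    kz = 2 * np.pi / wavelength * np.sqrt(np.maximum(arg, 0.0))
    return np.exp(1j * kz * z) * (arg > 0)   # evanescent waves dropped

def multiplane_cgh(layers, depths, pitch=8e-6, wavelength=532e-9):
    """Sum each depth layer's wavefront at the SLM plane (z = 0).
    One FFT pair per layer is what makes the multiplane model fast."""
    rng = np.random.default_rng(0)
    field = np.zeros(layers[0].shape, dtype=complex)
    for layer, z in zip(layers, depths):
        init = np.sqrt(layer) * np.exp(1j * 2 * np.pi * rng.random(layer.shape))
        H = asm_kernel(layer.shape, pitch, wavelength, -z)  # back-propagate
        field += np.fft.ifft2(np.fft.fft2(init) * H)
    return np.angle(field)   # naive phase-only encoding of the sum

layers = [np.random.default_rng(i).random((512, 512)) for i in range(3)]
cgh = multiplane_cgh(layers, depths=[0.10, 0.15, 0.20])   # depths in meters
```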
On the other hand, advances in hardware, such as the deployment of field-programmable gate arrays (FPGAs) and graphics processing units (GPUs), can also increase the CGH calculation speed [135]. Recently, a powerful FPGA-based computer named "HORN" was developed for fast CGH calculation [136-138]. The HORN can execute CGH algorithms several thousand times faster than personal computers. For example, it can calculate a hologram of 1024 × 1024 pixels from 1024 rendered multiplane images in just 0.77 s [139]. The practical use of FPGA-based approaches is hampered by the long R&D cycle of FPGA board design. Alternatively, GPUs can be used for the same purpose. For instance, NVIDIA GeForce series GPUs can calculate a CGH of 6400 × 3072 pixels from a 3D object consisting of 2048 points in 55 ms [140]. As another example, using the layer-based CGH algorithm, Gilles et al. demonstrated real-time (24 fps) calculation of a CGH of 3840 × 2160 pixels [141]. We expect that high-performance GPUs will be the enabling tool for real-time interactive holographic displays. Although we herein discuss only offline computing, real-time inline processing can be made possible by the synergistic integration of novel algorithms and hardware.

OUTLOOK
We envision that the next-generation holographic near-eye displays will be as compact as regular eyeglasses, meeting the computing needs for all-day use. The ability to create a comfortable and immersive viewing experience is the key. As a rule of thumb, we must adapt the hardware design to the human visual system at both low and high levels. Low-level vision refers to the psychophysical process in which the eye acquires visual stimuli, and it involves both the physical optics of the eye and the anatomical structure of the retina. In contrast, high-level vision refers to the psychometric process in which the brain interprets the image. It describes the signal processing in the visual cortex, such as perception, pattern recognition, and feature extraction. Figure 14 shows the hierarchical structure of signal processing in the human visual system.

A. Adaptation to Low-Level Vision
To adapt to low-level vision, we must match the FOV, resolution, and exit pupil of the display system with the eye optics. For holographic near-eye displays, the major challenge lies in the tradeoff between the FOV and the exit pupil (eyebox): the product of these two factors is limited by the total number of pixels of the SLM. Based on the analysis in Section 4.A, under plane-wave illumination ($\theta_{\mathrm{in}} = 0$) and the paraxial approximation ($\sin^{-1} x \approx x$), the product of the FOV and eyebox along one dimension can be derived as

$$\mathrm{FOV} \times \mathrm{eyebox} = N\lambda, \tag{3}$$

where $N$ is the number of pixels of the SLM in one dimension. Figure 15 shows the relationship between the FOV and eyebox when using SLMs of varied pixel counts according to Eq. (3) ($\lambda$ = 532 nm). In general, the larger the format of the SLM, the better the balance we can reach between the FOV and the eyebox. Conventional holographic displays are based on phase-only SLMs. However, the manufacturing cost of large-format phase-only SLMs is high, restricting their practical use in consumer products. In contrast, the amplitude-only SLM, such as an LCD or a DMD, is much more cost-efficient. Also, recent studies demonstrated that a complex wavefront can be effectively encoded into an amplitude-only CGH, allowing it to be displayed on an amplitude-only SLM [53,123,142]. Therefore, we envision that amplitude-only large-format SLMs will become mainstream devices in future holographic near-eye displays. For example, with a state-of-the-art 8K DMD [143,144], we can potentially achieve an 85° horizontal FOV with a 3 mm × 3 mm eyebox. An alternative direction is to exploit superresolution by mounting masks with fine structures on a regular SLM, thereby increasing the diffraction angle without compromising the total diffraction area. For example, the use of a diffusive medium [145], a random phase mask [146,147], and a non-periodic photon sieve [148] have been reported for this purpose. However, these additional masks complicate image reconstruction and degrade image quality. To solve this problem, we can explore machine-learning-based methods, which have been demonstrated to be effective in complex-media-based coherent imaging systems [149,150].
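Equation (3) is easy to evaluate for concrete hardware. The check below assumes a 7680-pixel horizontal resolution for an "8K-class" device; the result lands in the same ballpark as the ~85° figure quoted above.

```python
import numpy as np

# Eq. (3) tradeoff: FOV (rad) x eyebox (m) = N * lambda. Check what an
# 8K-class SLM (assumed N = 7680 horizontal pixels) allows at 532 nm.
lam, N = 532e-9, 7680
for eyebox_mm in [1, 3, 5, 10]:
    fov = np.degrees(N * lam / (eyebox_mm * 1e-3))
    print(f"eyebox {eyebox_mm:2d} mm -> FOV ≈ {fov:5.1f} deg")
# 3 mm -> ~78 deg, the same ballpark as the ~85 deg figure quoted above.
```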
The resolution provided by a holographic display is determined by the modulation transfer function (MTF) of the eye at a specific pupil size (i.e., the smaller of the display's exit pupil and the eye pupil). Adapting to the eye's native resolution, therefore, means rendering an image with a varied resolution that matches the radially varying MTF of the eye. The foveated technique has been demonstrated to be highly effective in this regard, significantly reducing the computational cost and thereby holding promise for enabling real-time human-computer interaction. Notably, as demonstrated in a recent work [151], the efficiency and quality of foveated rendering can be further improved by deep-learning-based methods. In addition, studies show that the human eye has a lower resolution in response to blue than to red and green [152,153]. Future holographic near-eye displays can take this fact into consideration when rendering different resolutions for the RGB channels.
For eyes with refractive errors such as myopia and hyperopia, adapting to the eye's native resolution also means correcting for the defocus/astigmatism of the eye lens. Currently, most holographic near-eye displays do not provide this function, and users with visual impairment must wear extra prescription glasses. Holographic near-eye displays have a unique advantage in correcting for eye aberrations by simply adding a compensating phase map to the hologram, a direction that is yet to be explored.
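For pure defocus, the compensating phase map is just a quadratic phase of the corrective power, as sketched below; the −2 D prescription and all other parameters are illustrative assumptions of ours.

```python
import numpy as np

# Sketch of folding a spectacle prescription into the hologram: multiply
# the CGH by the quadratic phase of the corrective lens. A myopic eye with
# a -2 D prescription corresponds to a diverging correction (f = -0.5 m).
# All parameters are illustrative.
wavelength, pitch, n = 532e-9, 8e-6, 1024
x = (np.arange(n) - n / 2) * pitch
X, Y = np.meshgrid(x, x)
prescription_D = -2.0                   # spectacle prescription (diopters)
f_corr = 1.0 / prescription_D           # corrective focal length (m)
correction = np.exp(-1j * np.pi * (X**2 + Y**2) / (wavelength * f_corr))
cgh = np.ones((n, n), dtype=complex)    # stand-in hologram
corrected_phase = np.angle(cgh * correction)   # phase map sent to the SLM
```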

B. Adaptation to High-Level Vision
Adaptation to high-level vision requires a thorough understanding of human perception, particularly regarding the way signals are processed in the visual cortex. Although high-level vision is equally important, there is a lack of awareness of the role it plays in visual perception. An interesting fact is that a "perfect" image from the perspective of low-level vision may not be a "perfect" image from the perspective of high-level vision. For example, M. Banks' group reported that the chromatic aberration induced by the eye optics improves the realism perceived by the brain [154]. The same group also revealed that the blurred point spread function caused by the aberrated crystalline lens, which is considered a downside of low-level vision, actually helps the brain interpret the image to form 3D perception [155]. As another example, Wetzstein's group recently showed that rendering ocular parallax, a small depth-dependent image shift on our retina, improves perceptual realism in VR displays [156]. To reflect this principle in holographic near-eye displays, we must optimize the hardware for "perceptual realism" rather than "photorealism." For instance, to create perfect visual stimuli from the perspective of high-level vision, we must reproduce the "desired" aberrations in the reconstructed image when calculating the CGH (Fig. 16). Despite extensive study, there are still many unknowns regarding human perception. Further investigation of this area, therefore, requires synergistic efforts from both optical engineers and neuroscientists.

C. Challenges and Perspectives
Despite decades of engineering effort, the challenge of developing real-time, high-quality, and true (i.e., pixel-precise) 3D holographic displays remains. This challenge stems from two problems: first, the limited space-bandwidth product offered by current SLMs and, second, the lack of fast algorithms that can generate high-quality holograms from 3D point clouds, meshes, or light fields in real time.
The limited space-bandwidth product imposes a fundamental tradeoff between the FOV or image size and the eyebox or viewing zone size in near-eye and direct-view displays. Current 1080p or 4K SLM resolutions are adequate for near-eye display applications, but they are far from being able to support direct-view displays. The latter application requires large-scale panels with sub-micrometer-sized phase-modulating pixels or coherent emitter arrays, which are out of reach for today's optoelectronic devices. However, progress has been made with solid-state LiDAR systems for autonomous driving. These LiDAR systems are optimized for beam steering applications, but similar to holographic displays, they require an array of coherent sources. Thus, we hope that efforts to continue pushing the resolution and size of coherent arrays, for example, via advances in laser or photonic waveguide technology, will eventually translate into display applications.
Improving CGH algorithms has been the goal of much recent work. Yet, achieving high-quality true 3D holographic image synthesis at real-time frame rates has been a challenging and unsolved problem for decades. While conventional display technologies, such as liquid crystal or organic LED displays, can directly show a target image by setting their pixels' states to match those of the image, holographic displays cannot. Holographic displays must generate a visible image indirectly, through the interference pattern formed with a reference wave at some distance in front of the SLM; when using a phase-only SLM, yet another layer of indirection is added to the computation. This also requires ultra-precise and automatic calibration of holographic displays. Moreover, to the best of our knowledge, no existing algorithm has been demonstrated to generate, transmit, and convert 3D image data into a hologram offering true 3D display capabilities. One of the most promising approaches is to combine methodology developed for modern artificial intelligence (AI) with the physical optics of holography; such AI-driven CGH techniques show promise in solving many of these long-standing challenges [62].

CONCLUSION
The demands for AR/VR devices have been ever increasing. The ability to create a comfortable and immersive viewing experience is critical for translating the technology from lab-based research to the consumer market. Holographic near-eye displays have unique advantages in addressing this unmet need by providing accurate per-pixel focal control. We presented a human-centric framework to review the advances in this field, and we hope such a perspective can inspire new thinking and evoke an awareness of the role of the human visual system in guiding future hardware design.

Fig. 4. Holographic optical element (HOE) as an optical combiner. (a) Off-axis HOE geometry (included here by permission from [31], 2017, ACM). (b) Waveguide geometry (reproduced with permission from [32], 2015, OSA).
Fig. 5. Eyebox relay through (a) a 4f system and (b) the CGH. Eyebox expansion through (c) pupil tracking (included here by permission from [45], 2019, ACM) and (d) replication (reproduced with permission from [46], 2020, OSA).
Fig. 6. Eyebox expansion in a Maxwellian display using holographic modulation through (a) a HOE (reproduced with permission from [50], 2018, OSA) and (b) encoding the CGH with multiplexed off-axis plane waves (reproduced with permission from [51], 2019, Springer Nature).
Fig. 7. Speckle noise suppression by (a) superposition of multiple CGHs [53], (b) a rotating diffuser, and (c) complex amplitude modulation using a single CGH (reproduced with permission from [29], 2017, OSA).
Fig. 10. Color holographic near-eye display using (a) time division (reproduced with permission from [86], 2019, OSA) and (b) a metasurface HOE (reproduced with permission from [41], 2019, OSA).
Fig. 12. Enlarging the FOV through (a) temporal division and spatial tiling (reproduced with permission from [98], 2013, OSA) and (b) temporal division and diffraction-order tiling [114].
Fig. 13. Foveated image rendering of (a) point-cloud- and polygon-based models (reproduced with permission from [121], 2019, OSA) and (b) the multiplane-based model [123]. The foveal content changes according to the eye gaze angle in the (c) polygon-based model (reproduced with permission from [121], 2019, OSA) and (d) multiplane-based model [123].