Dual focal plane augmented reality interactive display with gaze-tracker

Stereoscopic augmented reality (AR) displays have a fixed focus plane and suffer from visual discomfort due to vergence-accommodation conflict (VAC). In this study, we demonstrated a biocular (i.e. common optics for the two eyes, with the same image shown to both) two-focal-plane AR system with a real-time gaze tracker, which provides a novel interactive experience. To mitigate VAC, we propose a see-through near-eye display mechanism that generates two separate virtual image planes at arm's-length depths (i.e. 25 cm and 50 cm). Our optical system generates virtual images by relaying two liquid crystal displays (LCDs) through a beam splitter and a Fresnel lens. While the system is limited to two depths and discontinuity occurs in the virtual scene, it provides correct focus cues and a natural blur effect at the corresponding depths. This allows the user to distinguish virtual information through the accommodative response of the eye, even when the virtual objects overlap and partially occlude each other in the axial direction. The system also provides correct motion parallax cues within the movement range of the user without any need for sophisticated head trackers. A road scene simulation is realized as a convenient use-case of the proposed display: a large monitor creates a background scene and the content rendered on the LCDs is augmented onto it. The field-of-view (FOV) is 60 × 36 degrees and the eye-box is larger than 100 mm, which is comfortable enough for two-eye viewing. The system includes a single-camera pupil and gaze tracker, which selects the correct depth plane based on the shift in the interpupillary distance with the user's convergence angle. The rendered content can be distributed to both depth planes and updated interactively based on the detected gaze depth.


Introduction
There has been rapid growth in the development of three-dimensional (3D) virtual and augmented reality (AR) displays in recent years. Inward rotation of the eyes (convergence) and focus control (accommodation) are neurally coupled in natural vision [1]. In conventional see-through near-eye displays, a pair of parallax images for the two eyes are rendered on flat displays so that a virtual image is created at different depths according to the amount of binocular disparity, using separate optics for the two eyes. In this case, a conflict occurs between the vergence action of the eyes and the accommodation distance, known as vergence-accommodation conflict (VAC) [2]. Many recent studies show that VAC is one of the most significant causes of visual discomfort [3] and that it can even lead to errors in the perception of scene geometry [4]. Displaying AR content without visual discomfort is a challenging task from a technological perspective. Visual discomfort, including headaches, eye strain, and motion sickness, associated with available head-mounted displays (HMDs) creates a less than desirable viewing experience. Mimicking natural vision requires rendering correct focus cues together with the natural blurring of images [5]. Several solutions are provided in the literature to mitigate the VAC problem in conventional HMDs. Methods that attempt to minimize the visual discomfort associated with VAC can be categorized [6] as Maxwellian view displays [7], vari-focal plane displays [8], multifocal plane (MFP) displays [9], integral imaging-based displays [10], computational multilayer displays [11], and computational holographic displays [12]. While all these methods reduce visual discomfort, they bring their own drawbacks, such as optical system complexity, narrow FOV, and small eye-box.
Computational holographic displays are the only solution that can provide all the natural depth cues and visual comfort, but at the cost of high computational requirements. The recently launched Magic Leap 1 product is a wearable binocular headset that uses 6 layers of waveguides as optical relays in order to provide two depth planes [13]. Despite sophisticated depth sensors and pupil trackers, the system cannot render information at different depths simultaneously and cannot render objects at arm's length, since the depths of its focal planes are 1 m and 3 m [14]. Thus, even this sophisticated system cannot provide comfortable interaction with objects at arm's length due to severe VAC and visual discomfort. The entire scene is rendered in one depth plane, which is selected based on the gaze information computed using the pupil trackers.
We designed and implemented a dual-focal-plane biocular AR prototype using two separate displays rendered at 25 cm and 50 cm distances to allow for an interactive wide-FOV display with an integrated gaze tracker. Biocular means both eyes share the same optics for viewing, while binocular means separate optics are used for each eye. In our proposed biocular display, the two eyes are presented with an identical image, so there should be no conflict between the cues for accommodation and vergence. It is important to note that biocular displays are still three-dimensional, but because all elements of the image are presented with no binocular disparity, the perception is of a flat three-dimensional image located at a specified depth [15]. Our system has an advantage over fixed-focus AR systems since two focal planes are simultaneously present without the conflict between vergence and accommodation. Furthermore, the biocular implementation provides full motion parallax within the movement range of the user (about 5 cm in both horizontal and vertical directions), which significantly enhances the 3D feeling and removes the complexity and inaccuracies related to the head trackers needed in binocular displays. This second point is especially worth emphasizing, because during interaction with arm's-length objects, even minor inaccuracies of head trackers lead to noticeable vibrations and jumps in object positions, creating disturbing effects. Mitigating such effects results in a significant increase in the implementation complexity of head-tracking units. Section 2 describes the system design and optical simulations, Section 3 gives experimental results for the display, and Section 4 is dedicated to the integrated pupil and gaze tracker and the interactive display demonstration.

System design and simulations
In this section, we discuss the design process of a spatially-multiplexed dual-focal-plane biocular augmented reality prototype. To eliminate the error in the perception of scene geometry in displays, correct focus cues must be enabled at different depth levels. The depth resolution of a human is estimated as 1/7 D (D: diopters) due to the limited depth of field of the eye [16]. Thus, adjacent depth planes must be separated by 1/7 D for a continuous depth of field, and the range of human accommodation is at most 4 diopters for a typical near point. In this case, a display with the full accommodative range requires 28 image planes [17]. However, such an implementation is not feasible due to the lack of available transparent display technologies and the high computational power requirement. Furthermore, stacking a large number of displays is not practical. Therefore, we designed a dual-focal-plane see-through display, since it allows users to distinguish virtual objects at different depths through the accommodation action of the eye. While it causes a discontinuity for continuous-depth scenes, it is a feasible application in terms of form factor and it presents a novel experience for users. In today's consumer HMD applications, interaction at arm's length without VAC is a challenging task. Thus, the distances to the two depth planes are selected as 25 cm (4 D) and 50 cm (2 D).
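The plane-count arithmetic above can be verified in two lines (a sketch using only the constants quoted in the text):

```python
# Check of the focal-plane count quoted above, from the numbers in the text:
# ~4 D accommodation range and 1/7 D depth resolution of the human eye.
ACCOMMODATION_RANGE_D = 4.0      # diopters (typical near point ~25 cm)
DEPTH_RESOLUTION_D = 1.0 / 7.0   # diopters (limited depth of field of the eye)

planes_needed = round(ACCOMMODATION_RANGE_D / DEPTH_RESOLUTION_D)
print(planes_needed)  # 28 -> a seamless accommodative range needs 28 planes

# The two planes chosen here instead sit at arm's length:
print(1.0 / 0.25, 1.0 / 0.50)  # 4.0 D (25 cm) and 2.0 D (50 cm)
```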
The system provides correct focus cues in two focal planes with a natural blur effect. Although the proposed display does not require focus-tunable lenses or dynamic components, depth perception for the dual focal planes is easily observable, as shown in the experimental results section.

Design
The dual focal planes are acquired by using two separate planar LCDs placed at different optical distances with respect to a single magnifier Fresnel lens. Both LCDs are viewed in full by both eyes, so the virtual images do not contain any parallax. Therefore, hardware-wise, our proposal is less demanding than conventional parallax-based approaches. Optical distances are optimized in ZEMAX to minimize the spot radius at different field points of the image. Virtual content is separated into two sections based on its depth information. The section closer to the viewer's eye appears at 25 cm (4 D). The rear section is displayed at 50 cm (2 D). Images are superimposed onto the real world through a planar beam splitter. A camera is placed facing the eyes to track the user's convergence depth. Figure 1 illustrates the schematic drawing of the system. In Fig. 1, the far display (sketched in solid red) is placed 125 mm from the lens. The image of the far display (shown in transparent red) is formed 50 cm away from the viewer's eye. The physical distance between the near display (the solid blue display in Fig. 1) and the lens is 86 mm. The distance of the image of the near display to the user's eye is 25 cm. The distance between the eyes and the combiner is 40 mm. The combiner is placed 33 mm below the lens. The camera is used to track the user's gaze depth.
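These distances can be cross-checked with a first-order thin-lens estimate. This is a sketch only: the constants come from the text, and the deviation from the prototype's 86 mm and 125 mm display positions reflects the fact that the real Fresnel lens is not an ideal thin lens and the distances were optimized in ZEMAX for spot size.

```python
# Paraxial thin-lens estimate of where each LCD should sit to form a
# virtual image at the target depth from the eye.  Illustrative only:
# the prototype's positions (86 mm and 125 mm) come from ZEMAX optimization.
F_MM = 152.4           # Fresnel lens focal length (6")
EYE_TO_LENS_MM = 73.0  # 40 mm eye-to-combiner + 33 mm combiner-to-lens path

def object_distance_mm(target_depth_from_eye_mm):
    """Object distance that places a virtual image at the target depth,
    using 1/d_o = 1/f + 1/D with D measured from the lens."""
    d_img = target_depth_from_eye_mm - EYE_TO_LENS_MM  # virtual image to lens
    return F_MM * d_img / (F_MM + d_img)

print(round(object_distance_mm(250), 1))  # near plane: ~81.9 mm (prototype: 86 mm)
print(round(object_distance_mm(500), 1))  # far plane: ~112.3 mm (prototype: 125 mm)
```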

We use two Topfoison TF60010A 1440 × 2560 6.0" TFT LCD panels with a refresh rate of 60 Hz to display the rendered images. These displays can be driven simultaneously by a single portable computer. As a magnifier, a 6" (152.4 mm) focal-length acrylic Fresnel lens with 1.5 mm thickness from Edmund Optics is used. We used a 50/50 flat combiner with a thickness of 2 mm in front of the user's eyes. We intentionally chose a very thin beam splitter to eliminate ghost artifacts when augmenting the content displayed on the LCDs.
We use the Unity 3D game engine and the C# programming language to render the virtual scene. Two virtual cameras are placed at the same virtual 3D position inside Unity 3D with different culling masks. Each virtual camera renders the scene with the associated culling mask and displays the rendered content on the corresponding LCD. Virtual objects that are supposed to be nearer than 33 cm (dioptric depth greater than 3 D) are rendered on the near display, and virtual objects farther than 33 cm (depth less than 3 D) are rendered on the far display. In this way, the rendered objects are distributed over one of the available fixed virtual depth planes.
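The plane-assignment rule can be sketched as follows. The actual implementation uses Unity culling masks in C#; this is a minimal Python sketch of the routing logic, with the function name illustrative and only the 3 D split taken from the text.

```python
def assign_display(depth_m, split_diopters=3.0):
    """Route a virtual object to one of the two fixed depth planes.

    Objects with dioptric depth above the 3 D split (nearer than ~33 cm)
    go to the near plane (25 cm / 4 D); the rest go to the far plane
    (50 cm / 2 D).
    """
    depth_d = 1.0 / depth_m  # metric depth -> diopters
    return "near" if depth_d > split_diopters else "far"

print(assign_display(0.30))  # 3.33 D -> near
print(assign_display(0.45))  # 2.22 D -> far
```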

Optical simulations
In this section, we evaluate several attributes of our display through simulations. All simulations are performed in the ZEMAX optical design software. In all simulations, we used a ZEMAX model of the Fresnel lens from Edmund Optics' official website [18]. The 3D optical layout of the setup is shown in Fig. 2. In Fig. 2, the solid blue lines represent the rays emanating from the near display. The rays emanating from the far display are illustrated in solid red, as evident in Fig. 2. Eye-box size is defined as the space within which the user's head can be decentered while a horizontal and a vertical FOV of at least 10 degrees remains available for both eyes [19]. Based on this definition, the designed AR display has an eye-box of 105 mm in the horizontal axis and 66 mm in the vertical axis.
For analyzing the optical distortion of the system, we assume that one of the user's eyes is at the center of the eye-box. The distortion value of a single field point, in percent, is determined as distortion(%) = 100 × (y_chief − y_ref) / y_ref, where y_chief is the chief ray height and y_ref is the reference ray height for the undistorted ray, as provided by ZEMAX. Distortion values are recorded for distinct points in the available instantaneous FOV. Then, we assume the eyes are in the natural position, which means each eye is shifted by IPD/2 or −IPD/2 along the horizontal axis. The simulation results of the optical distortion are shown in Fig. 3. The spot diagram of the system in Fig. 4 shows the spot size variation for red, green, and blue wavelengths at 9 different field points at the center, edges, and corners to cover the entire FOV. The RMS spot radius varies from 367 µm to 667 µm, while the diffraction-limited Airy spot diameter is 164 µm for the 25 cm virtual image distance. This corresponds to an angular resolution of roughly 1.5-3.0 cycles/degree. As observed in Fig. 3 and Fig. 4, the proposed system suffers from optical distortion and chromatic aberration. Since the full images on both LCDs are seen by both eyes, distortion and chromatic aberration compensation is not performed, i.e. any correction that improves the right-eye image quality would result in an equal degree of degradation in the left-eye image quality along the horizontal axis. While the angular resolution is about 10× below the retinal resolution limit and the maximum optical distortion is 11%, the experimental setup has good image quality and effectively demonstrates the natural blurring effect, as discussed below.
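Both figures of merit can be reproduced with a short calculation. This is a sketch: the cycles/degree estimate assumes the common rule of thumb that one resolvable cycle spans about two RMS spot diameters, which is not stated explicitly in the text.

```python
import math

def distortion_percent(y_chief, y_ref):
    # ZEMAX-style distortion of a single field point, in percent
    return 100.0 * (y_chief - y_ref) / y_ref

def cycles_per_degree(rms_spot_radius_um, image_distance_cm):
    # Cutoff estimate: 1 / (2 * spot diameter expressed in degrees),
    # i.e. one resolvable cycle spans roughly two spot diameters.
    diameter_m = 2.0 * rms_spot_radius_um * 1e-6
    angle_deg = math.degrees(diameter_m / (image_distance_cm * 1e-2))
    return 1.0 / (2.0 * angle_deg)

print(round(cycles_per_degree(367, 25), 2))  # best field point: ~2.97 cyc/deg
print(round(cycles_per_degree(667, 25), 2))  # worst field point: ~1.64 cyc/deg
```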

Experimental results
In this section, we present the experimental results of the proposed system. We designed a road scene as a convenient use-case of the proposed AR display. A large LCD monitor (Dell 21.5", 1920 × 1080 pixels) at a distance of 50 cm (2 D) is used as the background scene display. The virtual scene augmented on the background is rendered in Unity 3D. We take two video outputs simultaneously from Unity 3D and display them on the LCDs synchronously. Furthermore, a chinrest is placed to fix the head position for the gaze-tracking process. A photo of the implemented bench-top prototype hardware is shown in Fig. 5.
While Fig. 6 illustrates the content shown on both LCDs for two frames, Fig. 7 shows the experimental results captured with a camera. The frames shown in Fig. 6 are displayed by the proposed system, and a background scene is displayed using a monitor. The scenes provided by the display are captured with a Nikon D5300 DSLR camera. The camera aperture was set to f/10 with an exposure time of 1/800 seconds.
As shown in Fig. 7, the implemented display provides correct depth cues. The user can distinguish the superimposed virtual objects, even when they are in the same line of sight, due to the clear depth perception.

Gaze tracker for interaction
We also added a gaze tracker to interact with the virtual objects. The system uses a single camera (Goldmaster V-52 webcam) with a resolution of 640 × 480 pixels and does not require any infra-red illumination. The gaze tracker utilizes feature-based estimation [20] to make it less sensitive to variations in illumination and viewpoint. Furthermore, we aimed for an easy calibration process. The calibration process assumes a fixed head position. The output of the algorithm is a single binary value, indicating the display at which the user converges for the central region of the virtual scene.
To clarify the assumptions in the gaze-tracking algorithm, two schematic drawings are provided in Fig. 8. A geometrical model is developed to estimate the convergence depth of a user, as shown in Fig. 8(a). Assuming the IPD of the user is 65 mm and using the geometry shown in Fig. 8(a), the rotation angle difference of each eye between the two convergence depths and the resulting shift in the camera-measured interpupillary distance are computed as ΔIPD = 2·r_eye·Δθ ≈ 2·r_eye·(IPD/2)·(1/d_near − 1/d_far) ≈ 1.56 mm,
where r_eye, the radius of the eye, is assumed to be 12 mm. The horizontal 640 pixels of the camera correspond to 128 mm for the fixed head and camera position in our setup; therefore the shift ΔIPD corresponds to approximately 8 pixels. Since the 2-diopter shift from 50 cm to 25 cm corresponds to 8 pixels, the camera limits the minimum measurable gaze shift to ∼0.25 diopters. The real limit comes from noise and the algorithms. The accuracy can be improved by using a camera with a smaller FOV, a larger pixel count, and frame averaging.
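The numbers above follow directly from the geometric model (a small-angle sketch; all constants are taken from the text):

```python
IPD_MM = 65.0            # assumed interpupillary distance
R_EYE_MM = 12.0          # assumed eyeball radius
MM_PER_PX = 128.0 / 640  # camera scale at the fixed head position

# Small-angle rotation difference between converging at 25 cm and 50 cm
d_theta = (IPD_MM / 2.0) * (1.0 / 250.0 - 1.0 / 500.0)  # radians

d_ipd_mm = 2.0 * R_EYE_MM * d_theta  # shift of the measured IPD
d_ipd_px = d_ipd_mm / MM_PER_PX      # as seen by the 640-px-wide camera

print(round(d_ipd_mm, 2))  # ~1.56 mm
print(round(d_ipd_px, 1))  # ~7.8 pixels for the 2 D depth change
```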
The main steps of the calibration, gaze tracking, and interaction process are shown in Fig. 9(a). The user looks at the near display for a while. The tracking algorithm finds the locations of the eye centers. The average of the eye-center locations is recorded and IPD_N is extracted in terms of pixels. Afterward, the same process is repeated for the far display and IPD_F is recorded for the user. After acquiring the near and far IPDs, a threshold is determined for robust decision making in the gaze-tracking system. Determining the convergence depth requires robust pupil-center estimation for each eye and accurate IPD calculation in each frame. The details of the iris tracking and eye-center estimation algorithm are shown in Fig. 9(b).
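The calibration and decision steps can be sketched as below. The text does not specify how the threshold is derived from IPD_N and IPD_F, so the midpoint rule here is an assumption, and the function names are illustrative.

```python
import statistics

def calibrate_threshold(ipd_near_px, ipd_far_px):
    """Average the per-frame IPD samples collected while the user fixates
    each display, then place the decision threshold midway between them.
    (Midpoint rule is an assumption; the paper only states that a
    threshold is determined from IPD_N and IPD_F.)"""
    ipd_n = statistics.mean(ipd_near_px)  # IPD_N, in pixels
    ipd_f = statistics.mean(ipd_far_px)   # IPD_F, in pixels
    return (ipd_n + ipd_f) / 2.0

def gaze_plane(ipd_px, threshold):
    # Converging on the near plane rotates the pupils inward,
    # shrinking the camera-measured IPD.
    return "near" if ipd_px < threshold else "far"

threshold = calibrate_threshold([300.1, 299.8, 300.3], [308.0, 307.7, 308.2])
print(gaze_plane(301.0, threshold))  # near
print(gaze_plane(307.5, threshold))  # far
```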
For each frame, we extract a region of interest (ROI) for each eye using cascade classifiers, as illustrated in Fig. 10(a). Full-color and individual color-channel images were tried, and the red channel gave the best results in our experiments. The red-channel image, presented in Fig. 10(b), is segmented into two classes, iris and background, through thresholding with an optimized threshold value. After morphological operations on the segmented image, the rough eye-center position is determined, as presented in Fig. 10(c). The resulting image is transformed into polar coordinates around the initial estimate of the eye center, as shown in Fig. 10(d). In this case, the iris fills the left side of the polar-transformed image. Strong vertical edge points, indicated in the region between the dashed red lines in Fig. 10(d), correspond to the accurate radial edges of the iris. Figure 10(e) demonstrates the image transformed back into Cartesian coordinates. In this figure, white points correspond to the accurate radial edges [21] of the iris found in the previous step. The best-fitting ellipse and its center are acquired through a direct least-squares method [22], as shown in Fig. 10(e). The ellipse's center corresponds to the accurate 2D eye-center position, and the centers are projected onto the original frame as illustrated in Fig. 10(f).
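The thresholding-plus-centroid step (Figs. 10(b)-(c)) can be sketched as below. This is a simplified illustration only: the actual pipeline also includes the cascade-classifier ROI, morphological cleanup, the polar transform, and the ellipse fit, all omitted here. The function name and the nested-list image representation are illustrative.

```python
def rough_eye_center(gray, threshold):
    """Rough eye-center estimate from a (red-channel) eye ROI:
    pixels darker than the threshold are taken as iris, and the
    centroid of those foreground pixels is returned as (x, y)."""
    xs = ys = n = 0
    for y, row in enumerate(gray):
        for x, value in enumerate(row):
            if value < threshold:  # iris pixels are darker than sclera/skin
                xs += x
                ys += y
                n += 1
    return (xs / n, ys / n) if n else None

# Tiny synthetic ROI: a dark 3x3 "iris" centered at (4, 4) on a bright field
roi = [[200] * 9 for _ in range(9)]
for y in range(3, 6):
    for x in range(3, 6):
        roi[y][x] = 30
print(rough_eye_center(roi, threshold=100))  # (4.0, 4.0)
```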
The gaze position of the user is estimated at each frame. We used the information of the gaze depth plane to interact with 3D virtual objects placed on the different planes of convergence in real time. To demonstrate the interaction, we created a proof-of-concept game using the Unity3D game engine [23]. The gaze-tracker script is developed in the Python programming language utilizing the OpenCV library [24]. To establish communication between the Unity3D game engine and the gaze-tracker script, we used the NodeJS Mosca library as the MQTT broker [25]. We send binary gaze-depth information via a specific topic to the broker. Then, the broker sets the gaze-state flag in Unity3D. The gaze-state flag is used in a C# script to change the virtual content on the proper display. Screenshots of the game together with the acquired IPD values are illustrated in Fig. 11. As evident in Fig. 11, two texts are displayed at different depths (i.e. on different displays), and the color of the text the user gazes at changes automatically.
The gaze-tracking system was tried by 4 different users. The IPD distribution for convergence on the near and far display for these users is presented in Fig. 12(a). The distribution in each interval is represented by a box computed from the data extracted from 180 frames. On each box, the central red line indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points, and the outliers are plotted individually using the red '+' symbol [26]. Thresholds are determined so that the system can decide which display the user is looking at. The threshold value adapts between users automatically, since the users have different nominal IPDs. Then, we repeated the gaze-tracking and decision-making experiment 10 times by asking User-4 to look at the near and the far display alternately. We recorded the IPD value at each frame while the user converged on the near or far display, and each interval consists of 60 frames. The results are shown in Fig. 12(b). The pre-determined IPD threshold value, found in the calibration process, is illustrated by the dashed red line. It can be clearly observed that we acquired a robust gaze-tracker system with only a few outliers. Our implemented gaze tracker is optimized for a limited number of people under certain illumination conditions. Thus, the results might differ in different illumination conditions and for people with extreme IPDs or different eye shapes. Furthermore, the robustness of the system needs to be studied further against head orientation and other variations.


Conclusion
Nearly all AR displays available in the market are fixed-focus stereoscopic displays, and all of them avoid rendering objects closer than 60 cm to limit VAC. In this study, we proposed and demonstrated a simple dual-focal-plane AR display prototype to render objects within arm's length. We implemented a fast and accurate gaze tracker using a single camera to find whether the user is gazing at the near or the far display. The presented display reduces VAC by letting the user perceive objects in two different depth planes. We achieved an overlapping biocular FOV of 60.0 degrees horizontally and 35.5 degrees vertically with an eye-box size of 105 mm in the horizontal axis and 66 mm in the vertical axis. Optical distortion analysis realized in ZEMAX shows that the maximum distortion does not exceed 11% at the corners of the image for the natural position of the eyes. Depth perception and the natural blur effect can be clearly observed in the experiments for the two different focal planes. Furthermore, the gaze tracker is integrated with the rendering system to automatically change the content based on the gaze distance. The experimental results show that the proposed AR display system is a promising basis for studying wide-FOV, gaze-tracker-based interactive AR displays. The proposed system constitutes a convenient table-top display especially targeted at 3D visualization and interaction with virtual objects within arm's length, and may be useful in medical, gaming, simulation, training, and design applications, and as a vision research tool. Currently, the system provides full motion parallax and a VAC-free visual experience, but the virtual content is restricted to two depth planes. Using time-multiplexing schemes in combination with fast switchable liquid crystal lenses, the number of virtual displays can be increased. Furthermore, the system can be miniaturized in terms of form factor for use as a head-worn display, or modified to provide larger eye relief and larger focal-plane depths for use as a head-up display.

Fig. 1 .
Fig. 1. Schematic illustration of the optical layout.

Fig. 2 .
Fig. 2. 3-D optical layout of the setup (blue rays: near display; red rays: far display).

Figure 3(a) illustrates the distortion values of different field points while the eye is located at the center of the eye-box. In Fig. 3(b), the eyes are shifted by ±32.5 mm and the distortion values of the different field points along the FOV are reported for each eye. It is evident that the image distortion does not exceed 11% in the entire FOV when the eyes are in their natural position. The distortion value increases at the edges of the eye-box due to the limited clear aperture of the Fresnel lens.


Fig. 3 .
Fig. 3. a) The distortion values of different field points while the eye is at the center of the eye-box. b) The distortion values of different field points while the eyes are in the natural position (i.e. off-centered by IPD/2 = 32.5 mm).

Fig. 4 .
Fig. 4. The spot radius sizes for 9 sample points, selected as the center, edge, and corner points of the display to cover the entire FOV. In this ZEMAX simulation, the eye is shifted by IPD/2 and the wavelengths are chosen to span the visible spectrum (red 656.3 nm, green 587.6 nm, blue 486.1 nm). Positions of the field points in the FOV are shown at the bottom right.

Fig. 5 .
Fig. 5. The prototype hardware, including 2 LCDs for the two focal planes and a camera for simultaneously tracking the two eyes and computing the gaze distance.

Figure 8(b) shows the IPD shift between the near and far focus planes. Ellipses represent the left and right eyes, and the eye-pupil positions are color-coded: blue dots indicate the positions of the pupils when the user converges at the near display, while the eye-center locations when gazing at the far display are illustrated by red dots. Based on the model in Fig. 8(a), the IPD difference is ΔIPD = 2Δx ≈ 2 × r_eye × Δθ = 1.56 mm.

Fig. 9 .
Fig. 9. a) Detailed steps of the calibration, gaze-tracking, and interaction process. b) Details of the iris-tracking and eye-center estimation algorithm.

Fig. 10 .
Fig. 10. Steps of the iris tracking and eye-center estimation: a) eye ROI extraction, b) red-channel image, c) rough eye center after thresholding and morphological operations, d) polar-transformed image with the radial iris edges between the dashed lines, e) best-fitting ellipse to the radial edge points in Cartesian coordinates, f) estimated eye centers projected onto the original frame.

Fig. 11 .
Fig. 11. Interaction with gaze tracking. The user's gaze is at the a) near display and b) far display. Based on the user's gaze, the text color changes automatically.

Fig. 12 .
Fig. 12. a) Measured IPD distribution of 4 users while looking at the near display and the far display subsequently during the calibration process. Data are extracted from 180 frames in each interval. The red dashed line shows the selected threshold value for each user. b) Repeated gaze-tracking results for User-4 when the user alternately focuses on the near and then the far display. Each box consists of IPD measurements extracted from 60 frames. The threshold value found in the calibration process is illustrated by the dashed line.