High-speed switchable lens enables the development of a volumetric stereoscopic display

Stereoscopic displays present different images to the two eyes and thereby create a compelling three-dimensional (3D) sensation. They are being developed for numerous applications including cinema, television, virtual prototyping, and medical imaging. However, stereoscopic displays cause perceptual distortions, performance decrements, and visual fatigue. These problems occur because some of the presented depth cues (i.e., perspective and binocular disparity) specify the intended 3D scene while focus cues (blur and accommodation) specify the fixed distance of the display itself. We have developed a stereoscopic display that circumvents these problems. It consists of a fast switchable lens synchronized to the display such that focus cues are nearly correct. The system has great potential for both basic vision research and display applications.


Introduction
Pictorial displays of three-dimensional (3D) information have widespread use in our society. Adding stereoscopic information (i.e., presenting slightly different images to the two eyes) to such displays yields a compelling 3D sensation and this has proven useful for medical imaging [1,2], cinema [3], television [4], and many other applications.
Despite the clear advantages of stereoscopic displays, there are some well-known problems [5,6]. Figure 1 illustrates the differences between viewing the real world and viewing a conventional stereoscopic display. In natural viewing, images arrive at the eyes with varying binocular disparity, so as the viewer looks from one point to another they must adjust the eyes' vergence (the angle between the lines of sight; Fig. 1a). The distance at which the lines of sight intersect is the vergence distance. The viewer also adjusts the focal power of the lens in each eye (i.e., accommodates) appropriately for the fixated part of the scene (i.e., where the eyes are looking). The distance to which the eye must be focused to create a sharp retinal image is the focal distance. Variations in focal distance create differences in image sharpness (Fig. 1c). Vergence and accommodation responses are neurally coupled: that is, changes in vergence drive changes in accommodation (vergence accommodation) and changes in accommodation drive changes in vergence (accommodative vergence) [7,8]. Vergence-accommodation coupling is advantageous in natural viewing because vergence and focal distances are nearly always identical. In conventional stereoscopic displays, images have binocular disparity, thereby stimulating changes in vergence as happens in natural viewing, but the focal distance remains fixed at the display distance. Thus, the natural correlation between vergence and focal distance is disrupted (Fig. 1b,d) and this causes several problems. 1) Perceptual distortions occur due to the conflicting disparity and focus information [9]. 2) Difficulties in simultaneously fusing and focusing a stimulus occur because the viewer must now adjust vergence and accommodation to different distances [10]; if accommodation is accurate, the viewer will see the object clearly, but may see double images; if vergence is accurate, the viewer will see one fused object, but it may be blurred.
3) Visual discomfort and fatigue occur as the viewer attempts to adjust vergence and accommodation appropriately [5,10,11]; Fig. 1e shows the range of vergence-accommodation conflicts that can be handled without discomfort: conflicts large enough to cause discomfort are commonplace with near viewing [12,13].
Fig. 1. Vergence distance and accommodation distance in natural viewing and with conventional stereoscopic displays. a) Plan view of a viewer and two objects in the natural environment. The viewer is fixating the far object and not the near object. The lines of sight to the far object intersect at the object. The distance to the intersection is the vergence distance. The distance to which the eye must be focused to form a sharp retinal image of the object is the focal distance; the distance to which the eyes are focused is the accommodation distance. b) Simulation of a conventional stereoscopic display of the same pair of objects. The display screen is at the same distance as the simulated far object so the vergence and focal distance of the image of the far object are the same as in a. However, the near object is presented on the display screen so its focal distance is no longer equal to the vergence distance, resulting in vergence-accommodation conflict and incorrect blur. c) Photograph of two objects like the ones depicted in a with the camera focused on the far object. Note the blurred image of the near object and the nearer parts of the ground plane. d) Photograph of two objects in which the focal distance is effectively the same as in a conventional stereoscopic display. Note the sharp image of the near object and the ground plane. e) Plot showing range of comfortable vergence-accommodative stimuli. The abscissa represents the simulated distance, and the ordinate represents the focal distance.
Stimuli that fall within the red zone will be comfortable to fuse and focus [10,12].
Because of these problems, there has been increasing interest in creating stereoscopic displays that minimize the conflict between simulated distance cues and focus cues. Several approaches have been taken to constructing such displays, but they fall into two categories: 1) wave-front reconstructing displays and 2) volumetric displays. To date, none of these approaches are widely used due to some significant limitations. Wave-front reconstructing displays, such as holograms, present correct focus information but require extraordinary resolution, computation, and optics that make them currently impractical [14]. Volumetric displays present scene illumination as a volume of light sources and have been implemented as a swept-volume display by projecting images onto a rotating display screen [15], and with a stack of liquid-crystal panels [16]. Each illumination point naturally provides correct disparity and focus cues, so the displays do not require knowledge of the viewer's gaze direction or accommodative state. However, they prevent correct handling of view-dependent lighting effects such as specular highlights and occlusions for more than a single point. Furthermore, these displays require a huge number of addressable voxels, which limits their spatial and temporal resolution, and they have a restricted workspace. By restricting the viewing position, these displays become fixed-viewpoint volumetric displays, which have several distinct advantages over multi-viewpoint volumetric displays [17]. By fixing the viewpoint, the graphics engineer can separate the simulated 3D scene into a 2D projection and a depth associated with each pixel. The 2D resolution of the human visual system is approximately 50 cpd [18]; by industry standards the 2D resolution of an adequate display system is about half that value.
The focal-depth resolution of the visual system is not nearly so great: viewers can under optimal conditions discriminate changes in focal distance of ~1/3 D [19,20], so the focal-depth resolution of an adequate display can be relatively coarse, whereas a multi-viewpoint display requires high resolution in all three dimensions. Thus the number of voxels that must be computed for a fixed-viewpoint display is a small fraction of that needed for a multi-viewpoint display. Presenting the light sources at different focal distances in fixed-viewpoint, volumetric displays has been done in various ways: using a deformable mirror to change the focal distance of parts of the image [21], a set of three displays combined at the viewer's eyes via beam splitters [17], a translating micro-display [22], a translating lens between the viewer and display [23], and a non-translating lens that changes focal power [24,25]. The deformable mirror is an interesting solution, but a solution based on transmissive optics is more desirable if the device is to be miniaturized to be made wearable. The translating micro-display, deformable mirror, and translating lens require mechanical movements that greatly limit the size of the workspace and the speed of changes in focal distance. It would be very challenging, if not impossible, to miniaturize any of these designs sufficiently to produce a practical, wearable device.
Here we describe a fixed-viewpoint, volumetric display that represents a significant advancement. The display presents the standard 3D depth cues including disparity, occlusion and perspective in the fashion that conventional displays do, but it also presents correct or nearly correct focus cues. A stationary, switchable lens is placed in front of the eye and is synchronized to the graphic display such that each depth region in the simulated scene is presented when the lens is in the appropriate state. In this way, we construct a temporally multiplexed image with correct or nearly correct focus cues. Liu et al. [24] and Suyama et al. [25] both employed a similar approach using variable focal length lenses, but both these displays were limited by the time response of the lens. The highest frequency either of these lenses could achieve is ~50-60Hz, so with N focal states, the frame rate becomes ~50/N Hz. As we will show, a useful system requires at least four focal states, which with the liquid-lens system yields a frame rate of 12.5Hz, and this would produce quite unacceptable flicker and motion artifacts. Our system is intrinsically much faster and thereby enables the construction of a more useful, compact, flexible, and flicker-free display. Our system has two drawbacks: the user must wear active glasses (although all other systems that we are aware of also require active optics), and it requires polarized light, so some light is wasted. However, the results in the following sections (including the video) demonstrate a useful real-time stereo display that provides nearly correct focus cues. We also present data on the required number of depth planes in this type of display; the data are relevant to both our technology and other ways (referenced above) of implementing focus-correct stereo displays.

System information
The key technical innovation is the high-speed, switchable lens schematized in Fig. 2. The refracting element is a fixed birefringent lens. Birefringent materials have two refractive indices (ordinary and extraordinary) depending on the polarization of the incident light, so the lens has two focal lengths that are selected with a polarization modulator. If the lens is arranged such that the extraordinary axis is vertical and the ordinary axis is horizontal, incoming vertically polarized light is focused at a distance corresponding to the extraordinary refractive index. If the light's polarization axis is rotated to horizontal before the lens, it is focused at the distance corresponding to the ordinary index. We use ferroelectric liquid-crystal modulators (FLCs) [26] to switch the polarization orientation. They act like half-wave plates whose optical axis can be rotated through ~45°. The incident polarization is therefore either aligned with or at 45° to the optical axis and hence the output polarization is rotated by either 0° or 90°. The switching between focal lengths can occur very quickly (<1ms). More focal lengths are achievable by stacking lenses and polarization modulators. With N devices, the system produces 2^N focal lengths. We have constructed a system with stacks of two devices, thereby achieving four focal states. The concept of a birefringent lens was described in the patent literature [27], and realised as a single lens with two states [28]. Our novel contribution is to use a four-state system to create a stereoscopic display with nearly correct focus cues.
The lens material is calcite, which has the advantages of transparency, high birefringence (0.172) in the visible light range, and machinability. The lenses are plano-convex with a diameter of 20mm. The convex surfaces have radii of curvature of 143.3 and 286.7mm, so the four focal powers are 5.09, 5.69, 6.29, and 6.89 diopters (D), and the separations are 0.6D. A fixed glass lens (not shown) allows adjustment of the whole focal range. The number and dioptric separation of the focal states are important design features. With N focal states and average separations of ∆, the workspace is:
$$(N - 1)\,\Delta \qquad (1)$$
which in our current system is 1.8D. The separations ∆ can be unequal, which might be advantageous for some applications. The degree to which the system produces retinal images similar to those created in natural viewing depends critically on the dioptric separation between the focal states, which we discuss in Section 3. We have constructed two display systems: one uses two lens assemblies and one CRT and presents separate images to the two eyes in a time-sequential fashion (Fig. 3c,d); the other uses two CRTs and lens assemblies, a pair for each eye (Fig. 3a,b), and presents images simultaneously to the two eyes. Both systems are able to present all the standard cues in modern computer-graphic images, including view-dependent lighting and binocular disparity. The lens assemblies change focal state with each refresh of the CRT(s). Each image to be displayed is split into four depth zones corresponding to different ranges of distances in the simulated scene (Fig. 2e). The presentation of each of the four zones is synchronized with the lens system. Thus, when the most distant parts of the scene are displayed, the lens system is switched to its shortest focal length so that the eyes have to accommodate far to create sharp retinal images.
When nearer parts are displayed, the lens system is switched to longer focal lengths so that the eye must accommodate to closer distances to create sharp images. The system thereby creates a digital approximation to the light field the eyes normally encounter in viewing 3D scenes. We do not have to know where the viewer's eye is focused to create appropriate focus cues. If the viewer accommodates far, the distant parts of the displayed scene are sharply focused on the retinas and the near parts are blurred. If the viewer accommodates near, distant parts are blurred and near parts are sharp. In this way, focus cues (blur in the retinal image and accommodation) are nearly correct.
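The four focal powers quoted above can be checked with a short thin-lens calculation. The sketch below is illustrative only: the refractive indices are assumed textbook values for calcite near 589 nm (the text gives only the birefringence, 0.172), lens thickness and element spacing are ignored, and the function names are hypothetical.

```python
from itertools import product

# Assumed principal indices of calcite near 589 nm (not stated in the text);
# their difference matches the quoted birefringence of 0.172.
N_ORDINARY = 1.658
N_EXTRAORDINARY = 1.486


def plano_convex_power(index, radius_m):
    """Thin-lens power (diopters) of a plano-convex lens: P = (n - 1) / R."""
    return (index - 1.0) / radius_m


def stack_powers(radii_m):
    """Total power for every polarization state of a stack of birefringent lenses.

    Each lens contributes either its extraordinary or its ordinary power,
    so a stack of N lenses yields 2**N combinations.
    """
    per_lens = [
        (plano_convex_power(N_EXTRAORDINARY, r), plano_convex_power(N_ORDINARY, r))
        for r in radii_m
    ]
    return sorted(sum(choice) for choice in product(*per_lens))


# Radii of curvature from the text, converted to meters.
powers = stack_powers([0.1433, 0.2867])
workspace = powers[-1] - powers[0]
print([round(p, 2) for p in powers], round(workspace, 2))
```

Under these assumptions the computed powers agree with the quoted values to within rounding, and the difference between the extreme states reproduces the 1.8D workspace.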
For all but the very unlikely case that the distance of a point in the simulated scene coincides exactly with the distance of one of the focal planes, a rule is required to assign image intensities to focal planes. We use depth-weighted blending [17] in which the image intensity at each focal plane is weighted according to the dioptric distance of the point from that plane. For a simulated point of intensity $I_s$ at dioptric distance $D_s$, the intensities $I_n$ and $I_f$ assigned to the nearer and farther focal planes are:
$$I_n = \left[1 - \frac{D_n - D_s}{D_n - D_f}\right] I_s, \qquad I_f = \left[\frac{D_n - D_s}{D_n - D_f}\right] I_s$$
where $D_n$ and $D_f$ are the dioptric distances of the nearer and farther planes. Thus, pixels representing an object at the dioptric midpoint between two focal planes are illuminated at half intensity on the two planes. The corresponding pixels on the two planes lie along a line of sight, so they sum in the retinal image to form an approximation of the image that would occur when viewing a real object at that distance. The sum of intensities is constant for all simulated distances (i.e., $I_s = I_n + I_f$). The depth-weighted blending algorithm is crucial to simulating continuous 3D scenes without visible discontinuities between focal planes. We evaluate the display's and algorithm's effectiveness in creating natural retinal images in Section 3.
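The blending rule can be sketched in a few lines. This is a minimal illustration rather than the rendering code used in the display: the function name, the plane distances in the usage example, and the choice to clamp points outside the workspace to the nearest plane are all assumptions.

```python
def split_intensity(i_s, d_s, planes):
    """Distribute intensity i_s of a point at dioptric distance d_s across focal planes.

    `planes` lists the focal-plane distances in diopters. The result is a list of
    intensities aligned with the planes sorted from farthest (smallest diopter
    value) to nearest. Points outside the workspace are clamped to the nearest
    plane, which is an assumption of this sketch.
    """
    ordered = sorted(planes)
    out = [0.0] * len(ordered)
    if d_s <= ordered[0]:
        out[0] = i_s
        return out
    if d_s >= ordered[-1]:
        out[-1] = i_s
        return out
    for k in range(len(ordered) - 1):
        d_f, d_n = ordered[k], ordered[k + 1]  # farther and nearer bracketing planes
        if d_f <= d_s <= d_n:
            w_far = (d_n - d_s) / (d_n - d_f)  # weight on the farther plane
            out[k] = w_far * i_s
            out[k + 1] = (1.0 - w_far) * i_s
            return out
    return out


# Hypothetical plane distances with the 0.6D separation of the current system.
example_planes = [1.0, 1.6, 2.2, 2.8]
midpoint = split_intensity(1.0, 1.3, example_planes)
on_plane = split_intensity(1.0, 2.2, example_planes)
print(midpoint, on_plane)
```

A point at the dioptric midpoint of two planes is split equally between them, a point lying on a plane is drawn on that plane only, and the intensities always sum to the original value.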
As mentioned earlier, we have constructed two stereoscopic display systems. In both cases, the speed limitation is the display, a cathode-ray tube (CRT) running at 180Hz with 800×600 resolution. One system, shown in Fig. 3a,b, contains two CRTs and lens systems, one for each eye. Images are delivered to the eyes via front-surface mirrors. With the CRT frame rate at 180Hz, each of the four focal states is presented at 45Hz per eye. Flicker is barely visible. The other system, shown in Fig. 3c,d, uses one CRT that presents separate images to the two eyes in a time-sequential fashion. Liquid-crystal shutter glasses [29] alternately open and block the light path to the left and right eyes in synchrony with images on the CRT. At the 180Hz frame rate, focal states are presented at 22.5Hz per eye, producing fairly noticeable flicker. Because the speed limitation is the CRT, faster display technologies, such as DLPs and OLEDs, will eventually allow higher presentation rates and more focal states without visible flicker.
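The presentation rates above follow from dividing the display refresh rate among the focal states and, for the single-CRT system, between the two eyes. A small sketch of that arithmetic, with a hypothetical function name:

```python
def focal_state_rate(display_hz, n_focal_states, time_sequential_stereo=False):
    """Rate (Hz) at which each focal state is refreshed for one eye.

    The display refresh is shared among the focal states; if a single display
    also alternates between the two eyes, the rate per eye is halved again.
    """
    eyes_sharing_display = 2 if time_sequential_stereo else 1
    return display_hz / (n_focal_states * eyes_sharing_display)


two_crt_rate = focal_state_rate(180, 4)                               # one CRT per eye
one_crt_rate = focal_state_rate(180, 4, time_sequential_stereo=True)  # shutter glasses
liquid_lens_rate = focal_state_rate(50, 4)                            # ~50Hz variable lens
print(two_crt_rate, one_crt_rate, liquid_lens_rate)
```

The same function shows why a faster display matters: doubling the refresh rate doubles the rate available to each focal state, or allows twice as many focal states at the same rate.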

Performance evaluation
Two important considerations are the optical quality of the switchable lens assembly and how well the assembly simulates stimuli in-between focal planes. As a rough measure of optical quality, we took still photographs and videos through the system. Figure 4 shows still photographs of Russian dolls (from near to far, they are respectively Stalin, Brezhnev, Gorbachev, and Yeltsin). The lens assembly was focused successively to each of its four focal lengths in the four pictures. The optical quality of the images is subjectively good and the blur patterns are qualitatively correct for the various focal states. A video demonstration of conventional displays and the switchable display is shown in Fig. 5.
To examine optical quality quantitatively, we measured the modulation transfer function (MTF) of the birefringent lens system. Figure 6 shows the results for the four focal states. For on-axis imaging, the MTFs of our lens are excellent: for example, transfer at 26 cpd is ~0.6-0.8 depending on focal state. The MTFs are similar to that of a high-quality commercial lens [30] that we also assessed. Our lens system has not yet been optimized to minimize field or chromatic aberration, so even better quality is attainable.
The retinal images created by a volumetric display are certainly a much closer approximation to the images created by the real world than are the images created by conventional stereoscopic displays. Indeed, a volumetric display can in principle produce retinal images that are perceptually indistinguishable from the images generated by real scenes. To see how close our display comes to achieving that, we compared the retinal images formed by viewing objects at different simulated distances in the display with the retinal images formed by viewing real objects at different distances.
Fig. 5. The video shows images captured through the system with the lens assembly in each of its four focal states. The simulated scene consists of four letter-acuity charts placed on a ground plane at different distances from the viewer. The first segment of the video shows the effect of refocusing the camera used to record the video when the switchable lens is inactive and the display becomes a conventional display; in this case, all the charts are a fixed focal distance from the camera, so they go in and out of focus simultaneously. The second segment of the video shows the effect of refocusing the camera when the switchable lens is activated; the various charts go in and out of focus separately. The third and fourth segments show an object moving in depth. In the third, the switchable lens is inactivated so the object remains equally focused as it moves in depth. In the fourth, the lens is activated, so the object goes in and out of focus appropriately.
Fig. 6. Modulation transfer of the switchable lens system. Modulation transfer is plotted as a function of spatial frequency for the four focal states of the system. We printed test patterns of high-contrast square-wave gratings that ranged from 5 to 63 cpd. To measure the MTFs, we used a Canon 20D dSLR camera with a Canon 50mm, f/1.8 prime lens (set at f/4.0).
The direct measurements determined the MTF of the camera plus any attenuation in the printing process. Then with the same camera, we photographed the same test patterns at the same magnification through the switchable lens system. We set the system in one of its four states and made the measurements, and then repeated this for the other focal states. The plotted MTFs are the MTFs for imaging through the switchable lens system divided by the MTFs for imaging with the camera alone. Error bars represent standard deviations. There is some variation in modulation transfer across the four focal states, but the MTFs are similar to that of a high-quality digital camera.
We implemented depth-weighted blending for these calculations. Figure 7 plots modulation transfer (retinal contrast/incident contrast) for a wide variety of situations. The dioptric separation D between focal planes is plotted on the abscissa of each panel. The simulated distance D s of the object is plotted on the ordinate as a proportion of the distance between focal planes. The left, middle, and right columns represent the results for sinusoidal gratings of 3, 6, and 18 cpd, respectively. The upper and lower rows represent the results respectively for an aberration-free, diffraction-limited eye and for a typical human eye. We show results for both types of optics to make the point that the higher-order aberrations of the typical human eye make it easier to construct a display that produces retinal images indistinguishable from those produced by natural viewing. In each panel, color represents modulation transfer, red representing a transfer of 1 and dark blue a transfer of 0. Modulation transfer is maximized when object position is 0 or 1 because those distances correspond to cases in which the image is on one plane only and our simulation is perfect. When object position is 0.5, image intensity is distributed equally between the bracketing near and far planes, so the retinal images are approximations of the images formed in real-world viewing. The results for the typical eye show that small separations of ~0.4D are required to produce retinal-image contrasts at 18 cpd that are within 30% of the retinal contrast produced in natural viewing, a value that would be perceptually indistinguishable [31]. With such small separations, the workspace would be quite constrained. For example, with two lenses (and therefore four focal states), it would be only 1.2D (Eq. (1)). Fortunately, the perception of blur and control of accommodation are driven primarily by medium spatial frequencies (4-8 cpd) [32-34].
Furthermore, the contrasts of natural scenes are proportional to the reciprocal of spatial frequency [35], so such scenes contain little contrast above 4-8 cpd. Thus, the effectiveness of a volumetric display should be evaluated by the modulation transfer at 4-8 cpd. The lower middle panel in Fig. 7 shows that a plane separation of ~0.77D produces retinal-image contrasts at 6 cpd that are within 30% of real-world images and would therefore be indistinguishable [31]. With two lenses (producing four focal states), a volumetric display with depth-weighted blending should produce an excellent approximation to the real world within a workspace of 2.3D. With three lenses (eight focal states), such a display should produce the same excellent approximation within a workspace of 5.4D. This would suffice for producing a workspace extending from 18 cm to infinity.
Fig. 7. Retinal-image contrast with the multi-focal display for different types of optics, spatial frequencies, and separations of the focal planes. Each panel plots plane separation in diopters on the abscissa and the position of the simulated object on the ordinate. Color (see color bar) represents modulation transfer (contrast in the retinal image divided by incident contrast). In constructing the figure, we assumed that the eye has accommodated precisely to the simulated distance. The upper row shows the calculations for diffraction-limited optics with a 4-mm aperture. The lower row shows the calculations for typical human optics (left eye of author DMH) with a 4-mm pupil; the transfer properties were determined by wave-front measurements with a Shack-Hartmann sensor. The left, middle, and right columns show the results for spatial frequencies of 3, 6, and 18 cpd, respectively.
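The workspace figures quoted in this section can be reproduced with a short calculation. The sketch assumes that N stacked lenses give 2^N focal states and that the workspace is the number of gaps between adjacent states times the plane separation, which is consistent with the 1.8D quoted for the current system; the function names are hypothetical, and the near limit assumes the farthest focal plane is placed at optical infinity (0D).

```python
def workspace_diopters(n_lenses, separation_d):
    """Workspace in diopters: (number of focal states - 1) * plane separation."""
    n_states = 2 ** n_lenses
    return (n_states - 1) * separation_d


def near_limit_cm(workspace_d):
    """Nearest simulated distance (cm), assuming the farthest plane is at 0D."""
    return 100.0 / workspace_d


current = workspace_diopters(2, 0.6)     # current system, 0.6D separation
fine = workspace_diopters(2, 0.4)        # separation needed at 18 cpd
two_lens = workspace_diopters(2, 0.77)   # separation needed at 6 cpd
three_lens = workspace_diopters(3, 0.77)
print(current, fine, two_lens, three_lens, near_limit_cm(three_lens))
```

With a 0.77D separation this gives roughly 2.3D for two lenses and 5.4D for three, and a near limit of a little under 19 cm for the three-lens case.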

Conclusions
We have developed a new stereoscopic display system that produces realistic blur to drive accommodation while also producing the appropriate cues to drive vergence and generate high-quality 3D percepts without discomfort and fatigue. The key technical development is the high-speed, switchable lens integrated with the computer display.
The system provides opportunities to conduct basic vision research while maintaining correct or nearly correct blur and accommodation. The technology also has several potentially important applications for situations like diagnostic medical imaging and surgery in which correct depth perception is critical. To realize those applications, the system would have to be miniaturized so that it could be worn like spectacles. This can be achieved by combining the birefringent lens and liquid crystals into a single unit. With head-tracking, the viewer would be free to move with respect to the display.