Deep learning-based incoherent holographic camera enabling acquisition of real-world holograms for holographic streaming system

While recent research has shown that holographic displays can present photorealistic 3D holograms in real time, the difficulty of acquiring high-quality real-world holograms has limited the realization of holographic streaming systems. Incoherent holographic cameras, which record holograms under daylight conditions, are suitable candidates for real-world acquisition, as they avoid the safety issues associated with the use of lasers; however, these cameras suffer from severe noise due to the optical imperfections of such systems. In this work, we develop a deep learning-based incoherent holographic camera system that can deliver visually enhanced holograms in real time. A neural network filters the noise in the captured holograms while maintaining a complex-valued hologram format throughout the whole process. Enabled by the computational efficiency of the proposed filtering strategy, we demonstrate a holographic streaming system integrating a holographic camera and a holographic display, with the aim of developing the ultimate holographic ecosystem of the future.

Here, the additional phase terms δ/2 and −δ/2 are introduced by the GP lens [2]. Setting ψ_p = U(x, y, z_p − z_h; M_p x_s, M_p y_s) and ψ_n = U(x, y, −(z_n + z_h); M_n x_s, M_n y_s), the intensity map I_c at the camera sensor plane can be formulated as

$$I_c(x, y; x_s, y_s, z_s) = |E_c|^2 = \left|\psi_p e^{i\delta/2} + \psi_n e^{-i\delta/2}\right|^2. \tag{S4}$$

Capturing I_c using conventional 2D cameras does not provide sufficient information to reconstruct the optical field E_c. The combination of the linear polarizer and single-shot polarization-dependent acquisition at the polarized sensor produces four different geometric phases, δ = 0, π/2, π, 3π/2 [2], yielding the phase-shifting images

$$I_{c,0} = \psi_p\psi_p^* + \psi_n\psi_n^* + \psi_p\psi_n^* + \psi_p^*\psi_n \tag{S5}$$
$$I_{c,1} = \psi_p\psi_p^* + \psi_n\psi_n^* + i\psi_p\psi_n^* - i\psi_p^*\psi_n \tag{S6}$$
$$I_{c,2} = \psi_p\psi_p^* + \psi_n\psi_n^* - \psi_p\psi_n^* - \psi_p^*\psi_n \tag{S7}$$
$$I_{c,3} = \psi_p\psi_p^* + \psi_n\psi_n^* - i\psi_p\psi_n^* + i\psi_p^*\psi_n \tag{S8}$$

The complex hologram H_i is then reconstructed from these four phase-shifting images as

$$H_i = (I_{c,0} - I_{c,2}) - i(I_{c,1} - I_{c,3}) = 4\,\psi_p\psi_n^*, \tag{S9}$$

where

$$z_r = \frac{(z_n + z_h)(z_p - z_h)}{z_n + z_p}, \qquad M' = \frac{M_p(z_n + z_h) - M_n(z_p - z_h)}{z_n + z_p}. \tag{S10}$$

The hologram H_i is the impulse response function for the original physical point source placed at (x_s, y_s, z_s), and it effectively represents the spherical wave originating from the virtual point source positioned at (M′x_s, M′y_s, z_r). This implies that the captured hologram encodes a converted depth of the real-world scene instead of the actual physical depth. When the target object is placed at the physical depth z_s, the image must be reconstructed at z_r using d-ASM according to Eq. (S10). Only the physical depths are mentioned in the main manuscript to avoid confusion.
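For illustration, a minimal Python sketch of the depth conversion in Eq. (S10) is given below; the function names are ours, and the numerical values are placeholders rather than calibrated system parameters.

```python
def converted_depth(z_p, z_n, z_h):
    """Reconstruction distance z_r of Eq. (S10); all inputs share one unit."""
    return (z_n + z_h) * (z_p - z_h) / (z_n + z_p)

def effective_magnification(M_p, M_n, z_p, z_n, z_h):
    """Effective transverse magnification M' of Eq. (S10)."""
    return (M_p * (z_n + z_h) - M_n * (z_p - z_h)) / (z_n + z_p)

# Example with placeholder values (millimeters):
z_r = converted_depth(z_p=640.0, z_n=280.0, z_h=8.0)
```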

Hologram formation from 3D objects
A 3D object can be described as a collection of point sources whose contributions are summed incoherently under incoherent illumination. Therefore, the intensity map at the sensor for an arbitrary 3D scene can be described as

$$I(x, y) = \iiint I_s(x_s, y_s, z_s)\, I_c(x, y; x_s, y_s, z_s)\, dx_s\, dy_s\, dz_s, \tag{S11}$$

where I_s(x_s, y_s, z_s) represents the intensity of the point source at (x_s, y_s, z_s). Again, the acquisition of this intensity map using the polarized sensor produces four phase-shifting images,

$$I_k(x, y) = \iiint I_s(x_s, y_s, z_s)\, I_{c,k}(x, y; x_s, y_s, z_s)\, dx_s\, dy_s\, dz_s, \tag{S12}$$

where k = 0, 1, 2, 3. Finally, the hologram H of the 3D object is reconstructed as

$$H = (I_0 - I_2) - i(I_1 - I_3). \tag{S13}$$
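The reconstruction in Eq. (S13) is a per-pixel combination of the four raw images, so it maps directly onto array arithmetic. Below is a minimal NumPy sketch under our sign convention; depending on the conjugation convention, the imaginary part may enter with the opposite sign, which only swaps the roles of the reconstructed and twin images.

```python
import numpy as np

def hologram_from_phase_shifts(I0, I1, I2, I3):
    """Four-step phase-shifting combination of Eq. (S13).

    I0..I3 are the intensity images recorded at geometric phases
    0, pi/2, pi, and 3pi/2. The DC and twin-image terms cancel,
    leaving the complex hologram H (up to a constant factor).
    """
    I0, I1, I2, I3 = (np.asarray(I, dtype=np.float64) for I in (I0, I1, I2, I3))
    return (I0 - I2) - 1j * (I1 - I3)
```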

Image quality degradation
Severe image quality degradation, such as color mismatches, contrast reductions, and scattered noise patterns, can be observed in the GP-SIDH system. In this section, we discuss the various sources of noise. The spatial variance of the impulse response functions is the primary cause of image degradation. Figs. S2b and S2e present the real parts of the impulse response functions for LED 1 (Fig. S2a) and LED 2 (Fig. S2d) captured by our GP-SIDH setup, respectively. LED 1 is positioned at the center of the field of view (FoV), and its measured impulse response function fills the entire sensor plane. For the off-axis point source (LED 2), which has the same brightness as LED 1, the captured Fresnel zone pattern is shifted accordingly. A degraded diffraction efficiency is observed for LED 2, as both holograms are normalized by the same maximum value. If the focal images are reconstructed and normalized by their own maximum intensities, clear focuses are obtained in both cases (Figs. S2c and S2f). A problem arises when the two LEDs are captured simultaneously (Fig. S2g). The signal in the captured hologram (Fig. S2h) is dominated by LED 1, as the diffraction efficiency is lower for the off-axis point source. This results in a lower intensity for LED 2 in the reconstructed focal image (Fig. S2i). Based on this observation, image degradation toward the outer edge of the FoV can be expected in general cases. To confirm this, we capture the hologram (Fig. S2k) of a grid test pattern (Fig. S2j) and reconstruct the focal image (Fig. S2l). The signal clearly degrades toward the boundaries, where the outer rows and columns are almost invisible. Resolving this spatial variance of the impulse response functions is particularly difficult because it cannot be compensated by a simple calibration map: the contributions from multiple point sources are intertwined in a content-dependent manner, so compensating for the spatial variance of the system is not straightforward.
GP lens aberrations are another cause of image quality degradation. The GP lens is fabricated by the photoalignment of liquid crystal layers using a holographic interference technique. In this process, the lens is designed as a half-wave retarder to achieve maximum efficiency at a particular wavelength, which induces chromatic aberrations [3,4]. The wavelength dependence of the GP lens also affects its phase modulation efficiency and focal length: incomplete phase modulation produces additional bias noise, and the focal length variation changes the reconstruction distance depending on the wavelength.
The shot noise of the polarized sensor contributes to the system noise, and the limited light efficiency caused by the polarization filters also results in a low SNR. Moreover, the presence of high-reflectance materials degrades the neighboring signals and effectively reduces the SNR, as the impulse response functions from multiple point sources overlap and are incoherently summed.
One possible approach to the aforementioned problems is to model the imperfections of the impulse response functions, the aberrations of the GP lens, the camera noise, and the camera-specific calibration parameters based on recently developed camera-in-the-loop approaches [5-8], which have successfully demonstrated exceptional quality improvements for displayed holograms. Instead, we tackle the image degradation issue using a fully convolutional neural network, expecting that the spatially variant features of the impulse response functions and the hardware-specific calibrations can be learned and handled by the network. Conceptually, we propose that a neural network can serve as a postprocessing filter for hologram data, just as many image processing filters are available for 2D images.

Matching captured holograms with target images

In this section, we describe the procedure for matching the captured hologram with the target image displayed on the tablet display. Our initial experiments indicate that the effective ROI of the system is smaller than 600 × 600 pixels. The image quality already drops significantly outside the central 400 × 400 region, but we aim to enhance the image quality up to 600 × 600. At each depth, we display a 5 × 5 grid pattern as the calibration pattern on the tablet display, as shown in Fig. S3a. We adjust the size of the grid pattern so that the whole pattern in the reconstructed focal image (Fig. S3b) spans approximately 600 × 600 pixels; we therefore observe a larger calibration pattern when the display is placed farther from the camera. Due to image degradation, it is difficult to locate the four outermost corners of the calibration pattern in the focal image. Thus, we locate the positions of the four inner corners (indicated by yellow arrows) and extrapolate the positions of the four outermost corners. Using this information, we warp the calibration pattern image to a 600 × 600 rectangular area, as shown in Fig. S3c. We perform homography warping using Kornia, a differentiable computer vision library for PyTorch. When we capture holograms for the training dataset, high-resolution target images are resized to the size of the calibration pattern and displayed inside the corresponding area. The final warped images derived from the captured holograms are then compared to the resized 600 × 600 target images to compute the image loss.
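For reference, a minimal sketch of the Kornia-based homography warping is given below. The corner coordinates are placeholders for the (extrapolated) outermost corner positions in the focal image, not values from our calibration.

```python
import torch
import kornia.geometry.transform as KGT

# Extrapolated outermost corners of the calibration pattern in the focal
# image (pixel coordinates, placeholder values), ordered TL, TR, BR, BL.
src = torch.tensor([[[310., 295.], [1705., 312.], [1690., 1698.], [302., 1684.]]])
# Target 600 x 600 rectangular area used for the image loss.
dst = torch.tensor([[[0., 0.], [599., 0.], [599., 599.], [0., 599.]]])

homography = KGT.get_perspective_transform(src, dst)  # (1, 3, 3) matrix

focal_image = torch.rand(1, 3, 2048, 2048)  # placeholder reconstructed focal image
warped = KGT.warp_perspective(focal_image, homography, dsize=(600, 600))
```

Because the warp is differentiable, the image loss computed on the warped output can backpropagate through this step during training.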

System parameters of the GP-SIDH system
In this section, we theoretically derive the lateral and axial resolutions under the beam overlap conditions and then experimentally confirm them based on our system parameters. Finally, we derive the field of view (FoV) of the GP-SIDH system. Before discussing the lateral and axial resolutions, we briefly show that the GP-SIDH system cannot be configured to satisfy the perfect beam overlap condition. This does not impact our system design, because we intentionally use the partial beam overlap condition to increase the FoV. Fig. S4 shows the image formation diagram of the GP-SIDH system. Here, z_s denotes the distance between the source object and the GP lens; r_GP represents the beam radius at the GP lens; and z_p and z_n denote the locations of the virtual images formed by the positive and negative focal lengths of the GP lens, respectively. In order to satisfy the perfect overlap condition as in FINCH, the propagation angle θ_p induced by the positive focal length should be greater than the propagation angle θ_n induced by the negative focal length.
Therefore, we can check the sign of the angle difference instead:

$$\theta_p - \theta_n = \frac{r_{GP}}{|z_p|} - \frac{r_{GP}}{|z_n|}. \tag{S14}$$

According to the lens formula, the virtual image distances are

$$|z_p| = \frac{z_s f_{GP}}{f_{GP} - z_s}, \tag{S15}$$
$$|z_n| = \frac{z_s f_{GP}}{f_{GP} + z_s}. \tag{S16}$$

Then, we obtain the following relationship:

$$\theta_p - \theta_n = r_{GP}\,\frac{(f_{GP} - z_s) - (f_{GP} + z_s)}{z_s f_{GP}} \tag{S17}$$
$$= -\frac{2\, r_{GP}}{f_{GP}} < 0. \tag{S18}$$

Since both z_s and f_GP are positive values, r_GP/|z_p| − r_GP/|z_n| < 0 and thus θ_p − θ_n < 0. Therefore, the perfect overlap condition cannot be met with GP-SIDH systems. In the following sections, we consider only the partial overlap condition in our GP-SIDH system.
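A quick numeric check of the sign of θ_p − θ_n, using the lens-formula expressions above (our reconstruction of the derivation); f_GP and z_s are the system parameters quoted below, while r_GP is a placeholder aperture radius.

```python
f_GP, z_s, r_GP = 1000.0, 390.0, 25.4  # mm

z_p = z_s * f_GP / (f_GP - z_s)  # virtual image distance, positive focal length
z_n = z_s * f_GP / (f_GP + z_s)  # virtual image distance, negative focal length

theta_diff = r_GP / z_p - r_GP / z_n
assert abs(theta_diff - (-2.0 * r_GP / f_GP)) < 1e-12  # matches -2 r_GP / f_GP
print(theta_diff)  # negative: the perfect overlap condition cannot be met
```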

Theoretical derivation of lateral and axial resolutions of the GP-SIDH system
Here, we theoretically derive the lateral and axial resolutions of the GP-SIDH system. The representative system configuration is shown in Fig. S5. In our GP-SIDH system, f_GP = 1000 mm and the target depth range is z_s ∈ [300 mm, 480 mm]; therefore, we consider the case f_GP > z_s. In conventional digital holography, the effective lateral and axial resolutions are given as follows [9]:

$$R_{lateral} = \frac{0.61\,\lambda}{NA}\, M_T, \tag{S19}$$
$$R_{axial} = \frac{2\lambda}{NA^2}\, M_A, \tag{S20}$$

where M_T is the transverse magnification, M_A is the axial magnification, and NA is the numerical aperture of the system. The NA of the captured incoherent hologram is

$$NA = \frac{r_h}{z_r}, \tag{S21}$$

where r_h is the hologram radius and z_r is the reconstruction distance (not shown in Fig. S5), which is given as

$$z_r = \frac{(z_n + z_h)(z_p - z_h)}{z_n + z_p}. \tag{S22}$$

Here, z_h denotes the distance between the GP lens and the image sensor. Therefore, the detailed forms of the resolutions can be expressed as

$$R_{lateral} = \frac{0.61\,\lambda\, z_r}{r_h}\, M_T, \tag{S23}$$
$$R_{axial} = 2\lambda \left(\frac{z_r}{r_h}\right)^2 M_A. \tag{S24}$$

As the beam overlap is proportional to z_h, we study the dependence of the lateral and axial resolutions on z_h. We set z_s to 390 mm, which is the center of the depth range of DeepIHC; the corresponding reconstruction depth z_r then becomes 530 mm. In order to fully determine the lateral and axial resolutions, we also need to compute the hologram radius r_h, which is determined by the following criterion:

$$r_h = \min(r_{OPL}, r_{Nyquist}, r_{CMOS}), \tag{S25}$$

where r_OPL, r_Nyquist, and r_CMOS represent the maximum hologram radii limited by the optical path difference, the sensor sampling frequency, and the size of the CMOS sensor, respectively. Firstly, r_OPL is determined by the interference formation condition, as shown in Fig. S5.
For a given r_OPL, the corresponding beam radii r_p and r_n at the GP lens follow from similar triangles as

$$r_p = r_{OPL}\,\frac{z_p}{z_p - z_h}, \tag{S26}$$
$$r_n = r_{OPL}\,\frac{z_n}{z_n + z_h}, \tag{S27}$$

where z_p and z_n denote the locations of the virtual images formed by the positive and negative focal lengths of the GP lens, respectively.
Therefore, the optical path difference between the two beams can be derived using a simple geometric consideration, and interference fringes form only where this difference is smaller than the coherence length of the illumination:

$$\Delta OPL(r_{OPL}) \le \frac{\bar{\lambda}^2}{\Delta\lambda}. \tag{S29}$$

r_OPL is the maximum value that satisfies the condition in Eq. (S29). Assuming an illumination wavelength of 550 nm, the 100 nm spectral width of our system provides an r_OPL of 65 mm. However, this derivation assumes that r_p and r_n are not limited by the aperture size of the GP lens. When the GP lens is the limiting factor, r_OPL is computed using the following formula:

$$r_{OPL} = r_{GP}\,\frac{z_n + z_h}{z_n}. \tag{S30}$$

The GP lens used in our system has a 2-inch diameter; therefore, the aperture size of the GP lens is the limiting factor of r_OPL, and the final value is 26 mm. Secondly, r_Nyquist describes the limitation posed by the sampling rate of the image sensor, which can be derived from the following relationship between the pixel pitch Δx of the image sensor and the central wavelength λ:

$$r_{Nyquist} = \frac{\lambda\, z_r}{2\Delta x}. \tag{S31}$$

The center wavelength of 550 nm and the pixel pitch of 3.45 µm of our system provide an r_Nyquist of 40 mm. Lastly, r_CMOS = 7 mm in our GP-SIDH system, and the final hologram radius is determined as

$$r_h = \min(r_{OPL}, r_{Nyquist}, r_{CMOS}) = \min(26\ \mathrm{mm}, 40\ \mathrm{mm}, 7\ \mathrm{mm}) = 7\ \mathrm{mm}. \tag{S32}$$

Therefore, the hologram radius is currently limited by the size of the image sensor. This indicates that the GP-SIDH system has room to enhance its lateral and axial resolutions by employing a synthetic aperture strategy.
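The numbers above can be reproduced with a few lines of Python. The aperture-limited radius follows our reconstructed Eq. (S30) and the Nyquist limit our Eq. (S31); treat both as sketches of the derivation rather than verified system code.

```python
wavelength = 550e-6   # central wavelength, mm
dx = 3.45e-3          # sensor pixel pitch, mm
z_r = 530.0           # reconstruction distance, mm
z_h = 8.0             # GP lens to sensor distance, mm
r_GP = 25.4           # GP lens aperture radius (2-inch diameter), mm
z_n = 390.0 * 1000.0 / (390.0 + 1000.0)  # virtual image distance, mm

r_OPL = r_GP * (z_n + z_h) / z_n        # aperture-limited radius, ~26 mm
r_Nyquist = wavelength * z_r / (2 * dx) # sampling limit, ~42 mm (quoted as 40 mm)
r_CMOS = 7.0                            # sensor-size limit, mm

r_h = min(r_OPL, r_Nyquist, r_CMOS)     # Eq. (S32): 7 mm
```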
Using the computed hologram radius, we obtain the effective lateral and axial resolutions as a function of z_h in Fig. S6. As the distance z_h between the GP lens and the image sensor increases, the beam overlap also increases, and the lateral and axial resolutions improve. However, an increased z_h leads to a reduced FoV (see Sec. 2.3); we use the lp/mm unit to show the opposite trends of the resolutions and the FoV. Therefore, we use a z_h of 8 mm (indicated by the black dashed line) in our system as a compromise between the FoV and the resolutions. For this setting, the lateral and axial resolutions are calculated as 2.5 mm and 444 mm, respectively. In the following section, we also experimentally confirm the computed resolution values. The axial resolution may seem large; however, our analysis shows that a large axial resolution does not imply that objects separated by a distance smaller than the axial resolution cannot be differentiated. We can still observe a clear defocus effect within a depth range smaller than the axial resolution; we therefore discuss the implication of the axial resolution in the context of 3D imaging.

Experimental confirmation of lateral and axial resolutions of the GP-SIDH system

Lateral resolution
In order to validate the theoretical value of the lateral resolution, we conducted two experiments using two point sources and a resolution target. The theoretical derivation of the lateral resolution is based on the point spread functions of point light sources. To experimentally reproduce this setting, we generate two point sources and set their lateral separation using a beam splitter. We test three lateral separation values of 0 mm, 2 mm, and 3 mm and reconstruct the images and corresponding intensity profiles for each case, as shown in Fig. S7. The lateral separation of 2 mm exhibits an intensity profile (Fig. S7e) similar to that of the Rayleigh criterion, and the lateral separation of 3 mm (Fig. S7f) shows that the two point sources are already well resolvable. Therefore, we can conclude that the Rayleigh resolution is approximately 2 mm, which matches the theoretical value of 2.5 mm well. We also confirmed the lateral resolution by imaging the USAF resolution target, as shown in Fig. S7g. The minimum resolvable pattern, indicated by the white rectangle, was 0.353 lp/mm, which converts to a lateral resolution of 2 mm.

Axial resolution
We measured the axial resolution based on the full width at half maximum (FWHM) of the axial profiles of point light sources, following the approach in Ref. [10]. We captured the hologram of a point source placed at z_s = 390 mm and reconstructed the axial intensity profile (orange solid line) as a function of z_r, as shown in Fig. S8a. We also simulated a hologram for a point source placed at the same position and obtained an almost identical axial intensity profile (blue solid line) in Fig. S8a. We express both intensity profiles as a function of z_s in Fig. S8b. The FWHMs of the experimental and simulated results are 428 mm and 482 mm, respectively. Therefore, the validation results of the experiment and the simulation match the theoretical axial resolution of 444 mm reasonably well.
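The FWHM used here can be extracted from a sampled axial profile with a simple linear-interpolation routine; the following is a generic sketch, not our measurement code.

```python
import numpy as np

def fwhm(z, profile):
    """Full width at half maximum of a single-peaked profile sampled at z."""
    z = np.asarray(z, dtype=float)
    profile = np.asarray(profile, dtype=float)
    half = profile.max() / 2.0
    above = np.where(profile >= half)[0]
    i0, i1 = above[0], above[-1]

    def cross(i_lo, i_hi):
        # Linearly interpolate the half-maximum crossing between two samples.
        z0, z1, p0, p1 = z[i_lo], z[i_hi], profile[i_lo], profile[i_hi]
        return z0 + (half - p0) * (z1 - z0) / (p1 - p0)

    left = cross(i0 - 1, i0) if i0 > 0 else z[0]
    right = cross(i1, i1 + 1) if i1 < len(z) - 1 else z[-1]
    return right - left
```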

Implication of axial resolution in the context of 3D cameras
Although we confirmed the good match between the theoretical predictions and the experimental results, the axial resolution of 444 mm seems large compared to the 180 mm depth range of DeepIHC. We note that the axial resolution of the system does not fully describe an important aspect of a 3D camera system: the amount of defocus blur. When we use the GP-SIDH system as a daily-use camera and show the acquired holograms on holographic displays, the essential role of the captured holograms is to provide a visually noticeable defocus effect rather than to enable the quantification of the exact axial separation between objects. Our system provides such a defocus capability, as demonstrated in Fig. S9. We placed the same image at three different depths, namely, 300 mm, 390 mm, and 480 mm, and acquired the corresponding holograms using DeepIHC. The top, middle, and bottom rows present the reconstructed images from the DeepIHC holograms of the target at 300 mm, 390 mm, and 480 mm, respectively. Although the depth separation between the targets is well below the axial resolution, we can still clearly see the defocus effect and easily pinpoint the best-focused plane.
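The focal stacks in Fig. S9 are obtained by numerically propagating the captured hologram to a sweep of reconstruction distances. A generic angular-spectrum propagation sketch is given below; the d-ASM used in our pipeline may differ in detail.

```python
import numpy as np

def asm_propagate(field, wavelength, dx, z):
    """Angular spectrum propagation of a complex field over distance z.

    field is a square complex array with pixel pitch dx; wavelength,
    dx, and z must share the same unit. Evanescent components are dropped.
    """
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=dx)
    FX, FY = np.meshgrid(fx, fx)
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    H = np.exp(1j * kz * z) * (arg > 0)  # transfer function of free space
    return np.fft.ifft2(np.fft.fft2(field) * H)
```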
We also examine the simulated axial intensity profiles of point sources placed between 300 mm and 480 mm with a separation of 30 mm, as shown in Fig. S10. This depth configuration matches the seven equally spaced depth planes used in our training dataset capture. Although the separation is well below the FWHM, a defocus blur can still be observed within the 180 mm depth range. We found that the axial resolution is not an ideal measure for a 3D camera; therefore, a proper quantification of depth resolution should be investigated. This will require consideration of the amount of blur required for daily-use 3D cameras and an understanding of the depth perception of camera and display users. We believe studying such issues is beyond the scope of our paper, and we leave further investigation as future work.

Field of view
We optimize the configuration of the incoherent holographic camera for capturing life-sized objects by enlarging the FoV. The primary way to achieve a large FoV in GP-SIDH systems is to exploit the partial beam overlap condition shown in Fig. S11b, in contrast to the perfect beam overlap condition typically employed in FINCH systems, shown in Fig. S11a. In both configurations, the FoV is determined by the hologram radius r_h and the distance z_h between the wavefront division device and the image sensor: the FoV grows as r_h increases and as z_h decreases. The partial beam overlap condition in GP-SIDH systems therefore provides a reduced z_h and an increased r_h, resulting in an expanded FoV. The partial beam overlap is not an ideal condition because it leads to degradation of the lateral resolution [9]. However, considering that our main goal is to capture life-sized objects and that there is an inevitable trade-off between the FoV and the lateral resolution, we decided to increase the FoV at the expense of the lateral resolution.
Although it may look as if the expansion of the FoV can be achieved simply by placing the image sensor closer to the wavefront division device, the actual modification from the system in Fig. S11a to the system in Fig. S11b can be made only if two important conditions are satisfied:

Condition 1: Reducing z_h and increasing r_h are physically plausible.
Condition 2: The captured holograms should provide sufficient lateral and axial resolution; otherwise, there is no benefit to using incoherent holographic cameras over conventional 2D cameras.
We found that the GP lens plays an important role in fulfilling these conditions. Regarding Condition 1, the GP lens easily satisfies it: (1) the GP lens works in the transmission geometry, unlike the LCoS SLM, which typically works in the reflection geometry, so z_h can be reduced down to a few millimeters; and (2) the GP lens can be fabricated large enough that its aperture size does not limit the hologram radius r_h.
The validation of Condition 2 requires more careful consideration of the focal lengths of the wavefront division devices. In the following, we show that the positive and negative focal lengths of the GP lens are a key property for achieving reasonable lateral and axial resolutions in the system configuration of Fig. S11b. As shown in Sec. 2.1, the hologram radius is typically limited by the sensor size; therefore, the reconstruction distance z_r is the crucial factor that determines the lateral and axial resolutions of the system. The formulation of z_r in Eq. (S22) is generalized for two arbitrary focal lengths f_1 and f_2 as follows [11]:

$$z_r = \frac{(z_h - z_1)(z_h - z_2)}{z_1 - z_2}, \qquad z_i = \frac{z_s f_i}{z_s - f_i}\ (i = 1, 2), \tag{S34}$$

where z_1 and z_2 are the signed image distances produced by the two focal lengths, following the sign convention of Eq. (S22). We examine the best resolution condition based on three representative types of focal length pairs induced by existing wavefront division devices, as shown in Figs. S12a-c. The first type (Fig. S12a) represents birefringent lenses [12], where the optical power difference between f_1 and f_2 is typically within 10%. The second type (Fig. S12b) corresponds to LCoS SLMs [13], which have low f-numbers due to the small diffraction angle and aperture size. The third type (Fig. S12c) represents the GP lens used in our GP-SIDH system, which produces negative and positive focal lengths of the same magnitude. As the parametric space of Eq. (S34) can be huge, we fix z_h to 8 mm, which is a reasonable setting for achieving a large FoV in the configuration presented in Fig. S11b. We also set the object position z_s to 390 mm, which is the center of the target depth range; the resulting lateral and axial resolutions are summarized in Table S1. Although the exact values of the resolutions can vary depending on the focal length settings, Table S1 indicates that the GP lens is a promising option for achieving increased lateral and axial resolutions when the system is optimized to have a large FoV. We also visually examine example holograms for the three cases in Figs. S12d-f. The holograms show that higher spatial frequencies can be captured using the GP lens compared to the cases where a birefringent lens or an LCoS SLM is used, indicating that higher lateral and axial resolutions can be obtained with the GP lens.
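Under our reconstructed form of Eq. (S34), the three focal-length pair types can be compared numerically with a short script; the birefringent-lens and LCoS focal lengths below are illustrative placeholders, not parameters taken from Refs. [12,13].

```python
def image_distance(z_s, f):
    """Signed image distance from the lens formula, as used in Eq. (S34)."""
    return z_s * f / (z_s - f)

def reconstruction_distance(z_s, z_h, f1, f2):
    """Generalized reconstruction distance z_r of Eq. (S34)."""
    z1, z2 = image_distance(z_s, f1), image_distance(z_s, f2)
    return (z_h - z1) * (z_h - z2) / (z1 - z2)

z_s, z_h = 390.0, 8.0  # mm, matching the setting in the text
pairs = {
    "birefringent lens (placeholder)": (1000.0, 1100.0),
    "LCoS SLM (placeholder)": (250.0, 300.0),
    "GP lens": (1000.0, -1000.0),
}
for name, (f1, f2) in pairs.items():
    z_r = reconstruction_distance(z_s, z_h, f1, f2)
    print(f"{name}: |z_r| = {abs(z_r):.0f} mm")
```

Since NA = r_h/z_r at a fixed hologram radius (Eq. (S21)), the smallest |z_r|, obtained here for the symmetric GP lens pair, yields the largest NA and thus the finest lateral and axial resolutions (Eqs. (S23)-(S24)).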

Validation of denoising images with unseen objects
We test whether objects that do not appear in the training dataset can be denoised properly during the validation stage, as shown in Fig. S14. We confirmed that none of our training images, selected from the DIV2K dataset, contain objects similar to the letters 'POLICE', the ancient statue, the red car, or the yellow helmet that appear in the validation target images in Fig. S14a. Nevertheless, DeepIHC still provides an average PSNR enhancement of 10.6 dB in the images reconstructed from the filtered holograms (Fig. S14c).

Fig. S14 Validation of denoising images with unseen objects. a Validation target images containing objects that did not appear in the training dataset. b Images reconstructed at the corresponding target object depths from the raw holograms. c Images reconstructed from the filtered holograms.

Effect of conditioning the phase map
Given that the image loss is computed only at the target focus plane, it is interesting to test whether providing depth information helps the network. An additional depth constraint is tested in the form of a complex field loss [14]. For the hologram captured for the target image I_i at depth d_i, the phase of the hologram at d_i is assumed to be uniform, with an offset proportional to the distance from the central plane. The ground-truth amplitude A_i and phase ϕ_i are set as

$$A_i = \sqrt{I_i}, \qquad \phi_i = \alpha\,(d_i - d_c), \tag{S35}$$

where d_c denotes the central plane depth and α is the proportionality constant. For the reconstructed hologram H_recon = Ã e^{iφ̃}, we additionally consider the complex field loss, following the approach in Ref. [14]:

$$l_{comp} = \left\lVert \tilde{A}\, e^{i\tilde{\varphi}} - A\, e^{i[\phi + \bar{\delta}(\tilde{\varphi}, \phi)]} \right\rVert_2, \tag{S36}$$

where δ̄(φ̃, ϕ) denotes the mean phase offset between the reconstructed and ground-truth phase maps.
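A minimal PyTorch sketch of this loss is shown below, assuming δ̄ is computed as the circular mean of the per-pixel phase difference (our reading of the phase-offset-invariant loss of Ref. [14]); the function name and argument layout are ours.

```python
import torch

def complex_field_loss(amp_recon, phase_recon, amp_gt, phase_gt):
    """Phase-offset-invariant complex field loss, cf. Eq. (S36).

    The global phase offset between the reconstruction and the ground
    truth (assumed here to be the circular mean of the phase difference)
    is absorbed before the complex fields are compared.
    """
    diff = phase_recon - phase_gt
    offset = torch.atan2(torch.sin(diff).mean(), torch.cos(diff).mean())
    recon = torch.polar(amp_recon, phase_recon)
    target = torch.polar(amp_gt, phase_gt + offset)
    return torch.linalg.vector_norm(recon - target)
```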
Figs. S15d-f present the reconstructed images obtained using the network trained with the additional complex field loss. A slight degradation in image quality is observed for the flat 2D object in Fig. S15d, which is acceptable. However, the network fails to filter the hologram for the miniature house scene and generates noticeable artifacts in Figs. S15e and S15f. It can therefore be inferred that leaving the phase unconstrained leads to better handling of the multi-depth configuration. However, the possibility that this particular complex field loss is inadequate for incoherent holograms cannot be excluded; better training strategies should be investigated in future work.