Phase added sub-stereograms for accelerating computer generated holography

: Phase-added stereograms are a form of sparse computer generated holograms, subdividing the hologram in small Fourier transformed blocks and updating a single coeﬃcient per block and per point-spread function. Unfortunately, these algorithms’ computational performance is often bottlenecked by the relatively high memory requirements. We propose a technique to partition the 3D point cloud into cells using time-frequency analysis, grouping the aﬀected coeﬃcients into subsets that improve caching and minimize memory requirements. This results in signiﬁcant acceleration of phase added stereogram algorithms without aﬀecting render quality, enabling real-time CGH for driving holographic displays for more complex and detailed scenes than previously possible. We report a 30-fold speedup over the base implementation, achieving real-time speeds of 80ms per million points per megapixel on a single GPU.


Introduction
Computer generated holography (CGH) comprises a set of techniques to numerically simulate diffraction of light for various applications in holography and beyond.CGH is computationally challenging because of the nature of diffraction, since every 3D scene element can potentially affect all hologram pixels.It is especially important for holographic displays which require high-resolution holograms to be calculated at video rates.
Holographic displays are often cited as the ultimate 3D display technology [1], because in principle holograms can display virtual objects which are indistinguishable from real objects: they account for all human visual cues, such as parallax, occlusion, stereopsis, eye-focusing and no exhibit accommodation-vergence conflict.Many types of CGH algorithms exist [1,2].Depending on the CGH method, features such as computational speed, visual realism, supported graphical effects and scene geometry might be traded off against each other.They can be classified as point cloud methods [3][4][5][6][7][8], layer-based methods [9][10][11], polygon methods [12][13][14][15], RGB+depth methods [16], ray-based methods [17][18][19] and sparse CGH [20][21][22].This paper will focus on "phase-added stereograms" (PAS) [23][24][25][26].They are a particular form of sparse CGH, relying on the fact that point spread functions (PSF) can be accurately approximated by only a few coefficients in a well-chosen transform domain.That way, the number of needed coefficient updates will be much lower than the hologram pixel count, speeding up calculations considerably while assuring minimal quality loss.The modus operandi of PAS is to divide the hologram into equally sized blocks and approximating the PSF in every block by a single plane wave.This corresponds to adding a single Fourier coefficient per block.Most of the effort pertaining to optimizing PAS has been dedicated to get the most accurate hologram approximation with the smallest number of coefficients.However, calculation speed is not only determined by the calculation complexity, but by bound memory throughput as well.In the PAS framework, the needed memory bandwidth per coefficient is the limiting factor over the required computational cost per coefficient.This will limit the the number of independently working threads, the usability of the cache and preclude several forms of specialized implementation with low memory or complexity requirements, such as many types of FPGAs.
To address this challenge, we propose a new algorithm for computing PAS more efficiently.By partitioning the point cloud into our newly proposed cell construction, we guarantee that all points within a cell only affect a tiny number of coefficients per PAS block, called a "sub-stereogram".This makes the algorithm cache-friendly and can substantially speeds up most PAS variants.
The paper is structured as follows: in Section 2 we first derive the optimal way for computing PAS coefficients in the Fresnel approximation regime.Then, in Section 3 we model the cell geometries by means of time-frequency analysis, characterize the tiling topology in 2D and 3D space, and model an efficient algorithm for binning the points into the different cells.In section 4, we experimentally validate the proposed solution quantitatively and qualitatively.Finally, we conclude in section 5.

Phase-added stereograms modelled with dictionary elements
In point based CGH we approximate objects by a discrete set of 3D points, which create a linear combination of point-spread functions (PSF).The Fresnel approximation of a single PSF P, laterally displaced with a translation (δ, ) and at a depth z from the hologram plane is given by where i is the imaginary unit, λ is the wavelength and a ∈ C is the point amplitude.This expression can be used to calculate a hologram H by summing over a collection of Q points P j , j ∈ {1, . . ., Q}, each with their respective coordinates (δ j , j , z j ) and amplitudes a j : Directly computing this sum is very computationally demanding, updating every hologram pixel Q times.To accelerate computations, the CGH can be computed in a transform domain where PSFs are sparse, such that only a few coefficients need to be updated to achieve a good approximation of the reference PSF.

Phase-added stereogram as a dictionary model
We will focus on phase-added stereograms, which subdivide the holograms into blocks of B × B pixels, each described by their local Fourier domain.This was chosen because PSFs have been shown to be sparse in short time Fourier transform domains, more so than e.g.wavelets [21].Every block is represented by a S × S coefficients corresponding to plane waves pieces with different frequencies, cf. Figure 1.When S>B, we have a variant of the accurate PAS method [25], effectively using an overcomplete dictionary D mn : namely, every block is represented by a linear combination of D mn , where m and n indicate the horizontal and vertical frequency, respectively.For every PSF, the goal is to locally approximate a PSF piece as closely as possible in every block by choosing the best matching D mn and assigning the right coefficient value.As S gets larger, we sample Fourier space more finely and get more D mn to choose from, so we can better approximate the carrier frequency of the PSF in any given block, increasing the quality of the PAS.This problem amounts to minimizing the energy difference between the target quadratic chirp (i.e. the PSF) as expressed in (1) and the planar wave within the block boundaries [−A, +A] by solving for the right frequencies m, n and phase delay ϕ: with half block size A = Bp 2 .If we want to solve for other blocks not centered at the origin, we can translate the system equivalently by modifying δ and , so this is a generic problem formulation.Note that we aim to minimize the difference between two phase-only functions, i.e. functions f : R → C where |f (x, y)| = 1, ∀x, y ∈ R. Thus, we can use the Euclidean distance between two points on the unit complex circle to express the distance as a function of their phase difference: For well-chosen m, n, we can assume that the phase difference between the plane wave and the target PSF chirp is small over the block region.We can therefore use the Taylor approximation cos t ≈ 1 − t 2 2 valid for small values of |t|.This makes (6) equivalent to solving resulting in the sought phase delay (and omitting the index j) where the term containing A 2 can be ignored, because it will cause the same phase delay across all blocks.For a general block centered at position (M x , M y ), we get: where [[.]] is the rounding operator.Repeating this process for every point in the point cloud and summing all computed coefficients will result in the Fourier domain representation of the hologram.We obtain the final PAS after applying a single IFFT per block and cropping the B × B top-left corner of samples, as shown in Fig. 1.Because of the cropping and frequency oversampling of the FFT we need the added term B 2S (m + n) to obtain a correct phase for ϕ in (9).

Modelling sub-stereograms
There are a total of S 2 B 2 N 2 coefficients for a hologram of N × N pixels.Since this large data structure does not fit in local caches, direct random access to update these coefficients is relatively slow.It is not straightforward to use multiple compute threads per PAS block, because threads interfere with each other requiring atomic access to variables.Hence, multi-level caching is needed for improved throughput and large amounts of memory are required, precluding its use in simpler hardware topologies such as many FPGAs.Even if such complex hardware is available, the maximum computation gain that can be obtained will still be inferior to the proposed solution; this has been analyzed in detail in [27].
The goal is to partition the point cloud into smaller groups called cells, each only affecting a small subset of the total number of coefficients.That way, these small number of coefficients can be cached for significantly improved computation throughput.Every thread is responsible for its own PAS block, so that it will never interfere with coefficients accessed by other threads (cf.Fig. 2).In the conventional approach, a block typically does not fit in cache and memory access patterns are unbounded and different across blocks, thus requiring storage in global memory.This will bottleneck the calculations.In the proposed method, only a "sub-stereogram" is active per cell.Once the sub-stereogram is fully computed, only then the data is transferred to global memory, which will considerably reduce the required global memory accesses from once per point to once per cell.In (a), the cache cannot be used for storing coefficients effectively because of space limitations, so every coefficient will have to be written to the slower global memory directly.In (b), the sub-block can be kept in cache, allowing for many fast coefficient updates.Only when a cell has been fully processed, the global memory needs to be accessed for updating the sub-stereogram.

Lozenge cells
Suppose we only want to update a K × K sub-block of the dictionary coefficients per S × S block where K<S, cf. Figure 2(b), which we will designate as a "sub-stereogram".We want to answer the following question: what is the spatial region of which the contained 3D points will solely affect the considered sub-stereogram?
This can be derived with time-frequency analysis (a.k.a.space-frequency, phase space or Wigner space).It portrays n-dimensional signals with a 2n-dimensional representation simultaneously describing time (or space) and frequency.For a one-dimensional PSF signal, this space can be drawn as a chart, where the horizontal axis shows the evolution of the signal over space and across PAS blocks, and the vertical axis shows the frequency value.The instantaneous frequency ν(x) of a (one-dimensional) PSF P(x) is given by where ∠ is the complex argument (or angle).The curve ν(x) is a line with a slope inversely proportional to z and displaced by a lateral translation δ.Thus, this curve tells us what PAS Fourier coefficient should be updated for a 3D point at a given spatial coordinate x.
The size K of a sub-block will dictate the maximal available bandwidth at any given position.Since the signal frequency is proportional to the incidence angle, the set of supported incidence angles will form a cone, as exemplified by the red cones in Fig. 3(b).We now consider a parallelogram-shaped region with a height given by K shown in Fig. 3(a).By taking the intersection of all corresponding cones, we obtain a quadrangle looking like a distorted rhombus, which we will call a "lozenge cell".Every point in 3D space will have a resulting time-frequency curve lying fully in the parallelogram if and only if it is found in the lozenge cell.Because the time-frequency curves of Fresnel modelled PSFs are lines, they are fully characterized by two points in the time-frequency diagram.We can therefore guarantee that a 3D point satisfies this condition if its PSF has an instantaneous frequency within two specific frequency bands at the extremities of the hologram.Given the irregular shape of lozenge cells, a new question arises: how do we efficiently partition any 3D point cloud into the minimal number of cells?

Partitioning space into lozenge cells
We can seamlessly tile space with lozenge cells in the 2D case using the construction shown in Fig. 4(a).We subdivide the total supported hologram viewing angle into cones at both hologram edges depending on the ratio K S , indexed by i x and j x , respectively.Every cell is thus uniquely characterized by its coordinates (i x , j x ).The lozenge cells are stacked in rows, whose index is given by r x = i x + j x , consisting of r x + 1 cells per row.Each cell in the same row also has the same minimum and maximum z values given by d[r x ] and d[r x + 2], respectively, with where F is the reciprocal of the number of chosen conic subdivision, which can be set to F = K S .Note that the union of all cells, marked in blue in Fig. 4(a), will precisely form the "aliasing-free" cone [1].Points outside of the cone will cause aliasing, inducing frequencies in the hologram surpassing the Shannon bound.We will now generalize the calculations to 3D space.Because the Fresnel PSF formulation is separable, every point now has to satisfy four conditions, one for each of the four edges of the hologram.For every hologram side, we obtain a wedge-shaped region whose edge is coincident with the hologram edge, and whose opening angle and orientation is given by the position and width of the active sub-block's frequency band.The intersection of these four wedges results in an anisotropically distorted octahedron, as shown in Fig. 5.
Unfortunately, we cannot seamlessly tile 3D space with these shapes, because it is mathematically impossible to seamlessly tile space with octahedrons.We therefore will have some redundancy if we want to fully cover 3D space, i.e. some lozenge cells will inevitably overlap.
A first solution is to use the Cartesian product of the 2D lozenge cell tiling for x and y; namely, identify and combine the cell indices for the xz-plane and yz-plane independently.However, this is redundant because many intersections will be empty: most lozenge cell are bounded in z, given by the layer values in (11); if the z-ranges of the two 2D lozenge cells in both planes do not overlap, then no points can simultaneously lie in both cells at once.
We eliminate the empty cells by taking into account the observation that every 2D lozenge cell in the xz-plane extruded along the y-axis will only intersect 2D cells in up to 3 rows in the yz-plane, provided the same frequency band partitioning is used for both dimensions.This is illustrated by the green lozenge cells in Fig. 4(b).To partition the point cloud, we need an efficient way to determine in what lozenge cell any given point (δ, , z) lies.This can be computed as follows, using the floor operator • We can linearize every valid tuple (i x , i y , j x , j y ) uniquely into a single number : = r x • (r 2 x + r x + 1) + 3i x • (r x + 1) + 1 2 r y • (r y + 1) + i y ; r x = i x + j x ; r y = i y + j y ; (13) simplifying the point storage structure into a linear array of point lists.

Experiments
The CGH can be viewed in an optical setup by encoding the computed pattern on a Spatial Light Modulator (SLM) or printed on e.g. a glass plate [28].This is done by quantizing the amplitude and/or phase of the hologram depending on the modulation capabilities of the display device; most used SLMs in electro-holography are "phase-only" or "kinoform".The object can be viewed by directly illuminating the display with a coherent light source, typically using a planar wave for the reference beam.More info on point-based CGH display setups can be found in [26,29].We investigate two aspects of the proposed algorithm in this section: (1) the impact of the parameter selection on quality and speed, and (2) measurements of the gains in calculation speed on a detailed CGH example, including numerical reconstructions from several viewpoints.

Algorithm parameter configuration
The main algorithm parameters to be chosen are the hologram subdivision block size B, the coefficient block size S and the sub-block size K for the sub-stereogram computation.
The values of B and S are inherited from the accurate PAS method [25], and will affect the approximation accuracy.We measure the approximation error using the the normalized mean-square error (NMSE): NMSE(P, P) := P − P 2 P 2 (14) where P is the reference PSF, P is the approximated PSF using the proposed PAS method and • 2 is the Euclidean norm.We report the error in Table 1 for different values of B and S, averaged over 100 random point instances.The accuracy of the PAS approximation will always improve when B decreases or S increases (when keeping the other parameter constant, respectively), because they are a superset of the possible PAS with higher B or lower S.The best choice of B and S depend on the use case, i.e. what the quality requirements are for a given application.This principle is studied as well in [25,26].The memory requirements are related by the total number of coefficients S 2 B 2 N 2 , and the number of calculations for coefficient updates is proportional to N 2 B 2 .The optimal sub-block size K will be device dependent.Larger K means fewer and bigger lozenge cells, but requires more memory.Generally, the goal is to have the largest K possible that still fits in cache.We chose K = 4 for our hardware configuration; more details on the CUDA implementation can be found in [27].

Measuring calculation speed
We calculated a hologram with a wavelength of λ = 633 nm and a pixel pitch of p = 2 µm.The resolution was 16384 × 16384 pixels, totalling to 2.56 • 10 8 pixels.The hologram was subdivided in blocks of size B = 32, with dictionary block sizes of S = 64.The sub-stereogram sub-block size was set to K = 4.
The scene was composed of the Bi-plane point cloud, consisting of 1 million points with associated intensities for the point color.The plane was centered laterally to match the hologram center, and displaced to be 20 cm from the hologram plane.The dimensions of the plane along the main axes are 33.8 × 13.2 × 23.5 mm.
The algorithm ran on a machine with an Intel Xeon E5-2687W v4 processor 3Ghz, 256 GB of RAM, a NVIDIA TITAN RTX GPU with a Windows 10 OS.The algorithm was implemented in C++17 with CUDA 10.1 for the calculations done on the GPU, enabling CUDA compute capability 7.5.
The conventional PAS implementation took 711.7 s, while the proposed solution took 20.6 s, resulting in a 34.5-fold speedup.This amounts to average speeds of 80.5 ms per 10 6 points per megapixel.The distribution of the points into the lozenge cells only took about 29 ms, so it has negligible overhead.The algorithms produce identical output (up to round-off errors due to the non-associativity of floating point addition) and thus have no quality difference.
If we take a typical high-end commercially available SLM with 4K resolution (3840×2160 pixels) and a point cloud of 5 • 10 4 points, we achieve a video rate of 30 frames per second.As a comparison, [29] reports 15 frames per second for a spatial light modulator of 1920×1080 pixels for an object represented by 6500 point cloud sources on a state-of-the-art FPGA hardware implementation of a CGH computer; this amounts to 4.95s per million points per megapixel.Our proposed system is thus 62 times faster; however, an important caveat is that both the hardware and the PSF approximation algorithms are different for this system.
Four different reconstructed views for the generated hologram are shown in Fig. 6, and compared with the brute-force ray-tracing implementation using (2) for the ground truth.

Conclusion
We propose a novel algorithm to significantly accelerate PAS calculations without quality reduction.The algorithm is compatible with most variants of PAS that use the FFT.The paper brings novel insight into how the partition of space maps to FFT coefficients, which may lead to faster CGH algorithms beyond PAS.The algorithm is especially suited for customized hardware solutions such as FPGA because of its much lower memory requirements and dependencies.We sped up PAS calculations with a factor of over 30.In doing so, we hope to bring high resolution real time video holography a step closer to realization.

Fig. 1 .
Fig. 1.Diagram of accurate phase-added stereogram blocks.On the left, the S × S block of coefficients modelling plane wave pieces, one coefficient is updated per 3D point.Two example coefficients are lit up, showing the corresponding dictionary elements.After accumulating all the coefficients, the block is transformed with an IFFT and the top-left B × B corner is cropped and added to the final hologram H.

Fig. 2 .
Fig. 2.Diagram of the memory access model, per thread.The arrow thickness indicates the amount of needed memory accesses, the red blocks indicate the last accessed coefficients as an example.In (a), the cache cannot be used for storing coefficients effectively because of space limitations, so every coefficient will have to be written to the slower global memory directly.In (b), the sub-block can be kept in cache, allowing for many fast coefficient updates.Only when a cell has been fully processed, the global memory needs to be accessed for updating the sub-stereogram.

Fig. 3 .
Fig. 3. Example of a parallelogram-shaped region in the time-frequency domain (a), and how it maps to scene space (b).Note how points in one space map to lines in the other space and vice versa.The red bandwidth intervals in (a) map to cones in (b), whose intersection forms a lozenge cell.The union of all points in the lozenge cell map to the bundle of all possible lines lying fully in the parallelogram region in (b).Three example points are shown: the dark blue and green points are in the lozenge cell, while the orange point is outside, causing its corresponding time-frequency line to (partially) lie outside of the parallelogram.

Fig. 4 .
Fig. 4. Diagram of the 3D point cloud subdivision by combining two 2D lozenge cell partitionings.Every cell in the x-z plane is characterized by its coordinates i x and j x (a).The row index is given by the sum of the coordinates, as shown for the orange cells for the example r x = 2. Every lozenge cell in (a) can have an intersection with a cell in (b), so long as they have an overlap along z, as shown by the green cells.

Fig. 5 .
Fig. 5. Diagram of 3D lozenge cell shapes.(a) The cone of allowed incidence angles (red) at any given hologram (gray) edge will form a wedge shape.(b) The intersection of all four wedge regions will form a lozenge cell (purple).(c) It's shaped as a distorted octahedron, which cannot be used to tile 3D space without overlaps.

Fig. 6 .
Fig. 6.Multiple numerical reconstructions of the hologram showing all combinations of left/right views, refocusing at the front/back of the plane, computed with ray-tracing or with PAS.Note that the proposed algorithm will produce exactly the same result as standard PAS.

Table 1 . NMSE values for generated PAS of a PSF for different values of B and S. The values were calculated for PSFs with 1024
× 1024 pixels, p = 4 µm, and λ = 633