Modern hardware-accelerated point-based holography

This paper presents a novel method for accelerating the computationally intensive process of point-based holography using consumer-grade hardware. By leveraging the parallel processing capabilities of graphics processing units (GPUs) and implementing optimization techniques, the proposed method significantly reduces the time required to generate complex holograms. A comprehensive analysis, including benchmarks and comparative studies, demonstrates the efficiency and effectiveness of this approach. Our findings offer promising implications for real-time applications in virtual reality and other fields that require rapid and accurate holographic rendering.


Introduction
Holography, as introduced by Dennis Gabor almost a century ago, describes the process of recording the superimposed angular reflectance or transmission distributions caused by local diffraction of an incident wave at the surface of an object [1]. Traditional analog holography has its digital counterpart in a class of algorithms that create phase and/or amplitude distributions of electromagnetic waves emitted from objects and captured in a plane. This computational process of generating hologram information from a given scene is referred to as Computer Generated Holography (CGH). Since its first conceptualization in 1966 by Brown and Lohmann [2], and its analytical formulation the year after by Lohmann and Paris [3], many improvements have been made in the recording and reconstruction of these digital holograms, including efficient ways to simulate the recording process on computers. This advent of CGH has been further motivated by its numerous applications in fields such as medical imaging, telecommunications, and entertainment [4][5][6][7].
Among the many applications are near-eye displays for virtual and augmented reality. These can exploit holographic effects to reduce the size and weight of such displays while offering capabilities like full light-field generation that were not possible before. Further important fields for CGH are head-up displays and holographic microscopy, among others.
In order to present digital content as holograms, it needs to be converted into such a representation by essentially simulating the process described by Gabor [1]. This process is very time consuming, even though many acceleration methods have been presented to improve its performance. Most of these acceleration techniques are based on the assumption that the view angle of the displays is limited, not only due to their pixel pitch but also due to their proximity to the viewer. In Computer Graphics (specifically, in rendering), spatially varying reflectance models are often described by bidirectional reflectance distribution functions, which describe the angle-dependent ratio of reflected light to incoming light. Even though these functions are usually modelled analytically, lately the research community has worked towards deriving them directly from the microscopic structure of surfaces, described as height fields or surface-normal distributions [8]. If these structures become small enough, diffraction effects need to be taken into account as well [9]. These methods open up the possibility to model large, visually coherent holographic effects due to surface structures in photo-realistic rendering.
While its advantages and potential are clear, one of the major drawbacks of CGH is its expensive computation in both time and memory. Researchers have already proposed simple techniques, well-known in the Computer Science field, to mitigate these disadvantages and optimize the generation. For example, the work of Gupta et al. demonstrated that even quite basic techniques have a significant impact in terms of computational efficiency [10]. More sophisticated optimizations have been proposed [11,12] but significantly increase the implementation effort. With the goal of addressing these problems in mind, in this paper we exploit the capabilities of the latest generation of hardware and software when applied to one of the most promising and flexible techniques for CGH: point-based methods. Sahin and colleagues recently published an overview of the different CGH techniques, along with their major advantages and pitfalls regarding quality and computational efficiency [13]. The authors divide CGH methods into two broad categories depending on whether the actual 3D positional information of the scene is available (i.e., wavefront-based) or whether, instead, they use incoherently captured 2D images of the scene (i.e., ray-based). The former is characterized by simulating the diffraction process in order to calculate the 2D distribution in the hologram plane of the wavefront resulting from a 3D scene under coherent illumination. Within this first category, there are several subgroups which differ in how the scene's 3D information is represented, namely: point-cloud-, polygon-, and layer-based [13]. The method presented in this paper belongs to the first subgroup of the wavefront-based methods.
These point-based methods were first introduced by James P. Waters [14], and their core idea is to model a 3D object as an agglomeration of self-emitting point light sources. The wave emitted by the object can then be calculated by superimposing the independent spherical waves emitted by each of these points.
In contrast to other simulation methods, the sparsity of the information in point-based methods can be exploited to increase performance. However, computing correct occlusion remains a challenge [15,16] because each source point creates a point spread function that can potentially reach every single point on the hologram plane. While there are methods capable of achieving high quality results (via the use of approximations, raycasting, or a combination of the latter with layer-based methods [15][16][17]), they all introduce a significant overhead.
More recently, there have been a couple of approaches that try to maximize quality either through expensive optimization techniques [18,19] or sophisticated lighting algorithms [17]. A very promising method by Shi et al. [20] utilizes machine learning to directly calculate the hologram, which enables high quality results in real time. However, they rely on an expensive point-based holography approach for the calculation of their training data. These optimization-based algorithms are very promising but not suitable for real-time use. To develop a performant algorithm, GPU acceleration is crucial. However, this incurs further complications when it comes to efficient thread scheduling.
One of the advantages of point-based holography is the conceptual simplicity of its calculations compared to other holographic methods. The detail of the hologram depends on the number of source points used and the precision of the calculations, i.e., higher numbers of points can lead to more detailed holograms. One major advantage is that these points can be distributed non-uniformly, allowing for fine-grained control over the amount (or lack) of detail in certain parts of the reconstructed hologram. While simpler than capturing a full wavefront, point-based holography can still be computationally intensive, especially when the number of source points increases, due to its worse computational scaling compared to layer-based methods.
The reason for this increase in required computational resources is that the contribution from each source point has to be calculated separately for every point in the hologram, which can involve a significant amount of computation, especially for complex or large objects. However, this approach is also highly parallelizable, meaning that it can be efficiently implemented on parallel computing hardware like Graphics Processing Units (GPUs).
Our approach involves scattering source points non-uniformly on the surface of objects using a multi-viewport grid inspired by deferred rendering [21]. We evaluate different point scattering methods to test the impact they have on the overall hologram quality, and choose the point density according to the amount of light reaching individual parts of the object. Additionally, we approximate the point spread function of individual points using triangle fans that are created procedurally on the GPU using the new mesh shading pipeline introduced by Nvidia. Finally, we use hardware-accelerated ray tracing to occlude individual segments of the triangle fans.
One of the key contributions of our method is that it is easy to implement, as the complexity of efficiently scheduling the workload on the GPU is handled by the rasterization and blending hardware. Furthermore, the occlusion computation has the potential to improve performance, as it reduces the workload of rasterization and blending while simultaneously improving the quality of the generated hologram. In this paper, we present a comprehensive evaluation of our method, comparing it to other state-of-the-art techniques. The results demonstrate that our point-based holography method produces high-quality holographic images with reduced computational complexity, making it a promising candidate for practical applications.

Method
Our CGH method is a point-based method which is based on scalar diffraction theory [22]. As explained in detail in the appendix of [13], we use the first Rayleigh-Sommerfeld (RS) solution for the Huygens-Fresnel principle due to its simplicity. Its validity is assured by the fact that the size and distance of our object scene are both much larger than the wavelength. We closely follow the approach of other, similar methods (e.g., [12]).
The scene objects are located around the origin of a Cartesian coordinate system. We calculate the field U_z(x, y) in the Wavefront Recording Plane (WRP) at a distance z along the optical axis as a superposition of spherical waves originating from N point sources on the surfaces of the 3D scene objects:

U_z(x, y) = Σ_{n=1}^{N} (U_n / r_n) exp(i k r_n),   (1)

with the distance r_n between the point n and the respective coordinates (x, y, z) on the WRP, the complex field U_n of the source point, and the wave number k = 2π/λ. The superposition of the individual wavefronts creates an interference pattern that is then recorded in the WRP. The objects are assumed to be perfectly diffuse reflectors. This means that the amplitude of the object points follows Lambert's law, and a random value is chosen as the initial phase of each individual source point. The RS solution was developed for a single planar object field perpendicular to the optical axis [22]. This solution is generalized by the point-based method to a scene consisting of curved 3D surfaces. The method thus does not take into account diffraction effects resulting from occlusion. We handle occlusion by ray casting based on geometrical optics instead.
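As an illustration, this superposition of spherical waves can be evaluated directly with NumPy. This is a minimal CPU sketch with illustrative names, not the GPU implementation described later:

```python
import numpy as np

def wrp_field(points, amplitudes, phases, z, res, pitch, wavelength):
    """Superpose spherical waves from point sources on a res x res WRP at distance z."""
    k = 2 * np.pi / wavelength
    coords = (np.arange(res) - res / 2) * pitch      # pixel centers on the WRP
    X, Y = np.meshgrid(coords, coords)
    U = np.zeros((res, res), dtype=np.complex128)
    for (px, py, pz), a, phi in zip(points, amplitudes, phases):
        r = np.sqrt((X - px) ** 2 + (Y - py) ** 2 + (z - pz) ** 2)
        U += a * np.exp(1j * (k * r + phi)) / r      # spherical wave from point n
    return U
```

This brute-force evaluation costs O(N · resolution²), which is exactly the scaling problem the hardware-accelerated approach in this paper addresses.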
Equation (1) is evaluated for every point on the WRP, and the Angular Spectrum Method (ASM) is used to propagate this wavefront to the actual hologram plane afterwards. This two-step process is known to be computationally efficient [13]. Our processing steps are explained in the following subsections.
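The ASM propagation step can be sketched in a few lines of NumPy. This is a basic, unoptimized version for illustration only; evanescent frequency components are simply suppressed, and band limiting as used in practice is omitted:

```python
import numpy as np

def angular_spectrum_propagate(U, pitch, wavelength, distance):
    """Propagate a complex field U by `distance` using the Angular Spectrum Method."""
    n = U.shape[0]
    fx = np.fft.fftfreq(n, d=pitch)                  # spatial frequencies
    FX, FY = np.meshgrid(fx, fx)
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0.0))   # longitudinal wave number
    H = np.exp(1j * kz * distance) * (arg > 0)       # transfer function, evanescent cut
    return np.fft.ifft2(np.fft.fft2(U) * H)
```

A sanity check: a plane wave propagated by d only acquires the phase factor exp(i·2π·d/λ).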

Point distributions and scattering
Using the point-based method has the advantage that the computational effort can be focused on the areas where it is needed the most, since points can be arbitrarily distributed. However, triangle-based geometry is the most common geometric representation in computer graphics, as it is very flexible, very fast, and easy to handle on modern graphics hardware. Furthermore, it is easy to represent solid geometry with triangles, which also simplifies computing occlusion between different geometric objects. Using this common geometric representation has the additional benefit that we can build upon a vast amount of image synthesis research. For these reasons, we choose triangle-based geometry as our input of choice. We then scatter points on this geometry in a way that offers the best trade-off between both representations.

Scattering
We use two fundamental approaches to scattering points on the surface of objects. Points are either scattered according to a 2D distribution that is derived from the perspective of the target plane or, alternatively, on the surface of the triangle mesh itself. Both approaches have their advantages and disadvantages. While scattering points directly on triangle meshes results in a static point cloud that can be reused indefinitely as long as the source geometry doesn't change, it is also a wasteful approach, since many points are placed in areas that end up not being visible. Scattering points on the surface based on a 2D-to-3D projection step makes the result sensitive to the resolution of the 2D scene representation, and it also means that the point cloud needs to be recomputed every time either the scene geometry or the target plane moves. We discuss both approaches in detail in the following paragraphs.
Triangle Mesh (3D) Scattering: In our first method, we scatter source points uniformly and randomly on the triangular mesh. The area A_i of triangle i with the corner points a_i, b_i, and c_i is given by

A_i = (1/2) |(b_i − a_i) × (c_i − a_i)|,   (2)

and the total area A of all M triangles by

A = Σ_{i=1}^{M} A_i.   (3)

Using these quantities, we define M parameters based on the relative cumulative sum of the triangle areas:

s_i = (1/A) Σ_{j=1}^{i} A_j.   (4)

In order to scatter all N source points uniformly on the triangle mesh, we generate a random tuple (ξ_n, χ_n, η_n) for each point n. All tuple elements are randomly and uniformly distributed in the range [0, 1). The element ξ_n is used to select a certain triangle i by the mapping

i = min { i | ξ_n < s_i },   (5)

which makes sure that the number of points on each of the M triangles is proportional to its surface area. The elements χ_n, η_n are used to define a point p_n in the respective triangle:

p_n = a_i + u_n (b_i − a_i) + v_n (c_i − a_i),   (6)

with the relative coordinates u_n = 1 − √χ_n and v_n = η_n √χ_n, which yield a uniform distribution within the triangle.

2D Scattering: This point scattering algorithm works in three steps: First, we rasterize the target scene from the point of view of the target wavefront plane. This can be the destination hologram plane or a suitable WRP [23]. To account for parallax effects, we rasterize a grid of viewpoints. From these viewpoints, we record all necessary geometric information, as well as diffuse shading information, as shown in Fig. 1. This information defines a mapping from 2D coordinates to 3D coordinates. We can thus reduce our sampling procedure to 2D. Accordingly, we only need two random variables χ, η instead of three.
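The triangle-mesh (3D) scattering described above can be sketched in NumPy as follows (a minimal sketch with illustrative names, using the standard area-weighted triangle selection and uniform barycentric warp):

```python
import numpy as np

def scatter_on_mesh(tri, n, rng):
    """Uniformly scatter n points on a triangle mesh (tri: M x 3 x 3 corner array)."""
    a, b, c = tri[:, 0], tri[:, 1], tri[:, 2]
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)  # triangle areas
    s = np.cumsum(areas) / areas.sum()            # relative cumulative area sums
    xi, chi, eta = rng.random((3, n))
    i = np.searchsorted(s, xi)                    # area-proportional triangle pick
    u = 1.0 - np.sqrt(chi)                        # relative coordinates for a
    v = eta * np.sqrt(chi)                        # uniform in-triangle sample
    return a[i] + u[:, None] * (b[i] - a[i]) + v[:, None] * (c[i] - a[i])
```

Because u + v ≤ 1 by construction, every sample lies inside its selected triangle.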
Mapping these 2D coordinates to 3D is done by simply looking up the 3D positional coordinate that has been previously recorded in these images. We can then either uniformly and randomly distribute the points over these 2D images or, alternatively, place points proportional to the brightness values from the recorded shading information. We refer to this approach as the Importance Sampling (IS) variant. Importance sampling is a well known method in the field of Monte Carlo integration [24]. It is usually used to distribute samples across an integrand proportionally to its value to reduce variance in the estimates. By distributing points according to shading brightness (which results in more points in brighter areas and fewer points in darker areas), we exploit the contrast sensitivity of the human visual system in relation to luminance. This is a well known effect that is utilized in applications such as video compression algorithms (e.g., [25]) and Monte Carlo image synthesis systems [24].
Both of the aforementioned approaches have the advantage that points are only placed on the parts of the surface that are actually visible from the target plane. While sampling in the uniform scattering approach is quick and efficient, points might not necessarily be placed on geometry, which reduces efficiency. In contrast, the IS variant makes sure that every point falls on valid geometric information, and the point density is vastly improved.
To place the points with IS, we use an approach very similar to how we place them on the triangle mesh, but instead of computing the cumulative sum from the triangle areas, we compute it from the shading information. Here, we split the computation into two steps. First, the cumulative sum of each row of the shading buffer (Fig. 1(c)) is computed. Analogous to Eq. (4), we end up with a unique mapping for each row. After this step, the last column of each row contains the sum of the shading values of that row. Second, we compute the cumulative sum over this last column and derive a mapping for the row selection from it.
The mapping of random values to 2D image coordinates now consists of two steps. With the two random variables χ and η, we first select a row with χ and then a column within that row with η.
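The two-step row/column lookup above can be sketched as follows (illustrative names; a minimal CPU version of the brightness-proportional IS, assuming a non-negative brightness buffer):

```python
import numpy as np

def importance_sample_2d(brightness, n, rng):
    """Draw n pixel coordinates distributed proportionally to `brightness`."""
    row_cum = np.cumsum(brightness, axis=1)        # per-row cumulative sums
    row_tot = row_cum[:, -1]                       # last column: per-row totals
    marg_cum = np.cumsum(row_tot)                  # cumulative sum over the rows
    chi = rng.random(n) * marg_cum[-1]
    rows = np.searchsorted(marg_cum, chi)          # row selection with chi
    eta = rng.random(n) * row_tot[rows]
    cols = np.array([np.searchsorted(row_cum[r], e) for r, e in zip(rows, eta)])
    return rows, cols
```

Samples concentrate exactly where the shading buffer is bright; zero-brightness pixels are never selected.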
Both of these 2D scattering approaches, with and without IS, make any additional occlusion computation almost unnecessary. In Section 4 we show that our particular occlusion computation can still be beneficial even in this case.

Point distributions
As explained above, the source points are scattered on the surface of the scene objects using either the three uniform random variables ξ, χ, and η (3D mesh scattering) or the two variables χ and η (2D scattering). The points are used to approximate the integral of all emitted light from the surface of the scene objects. Thus, the random scattering of source points can be interpreted as a Monte Carlo approximation of the continuous light field resulting from the diffusely reflected light of the scene objects.
In Computer Graphics research, and Image Synthesis in particular, it is well known that pseudo-random numbers are not best suited to Monte Carlo integration due to their unbounded frequencies [26,27], which means that points can end up both very close together and very far apart. A very high local point density can also lead to stronger speckle noise. We propose four different strategies for distributing points in 2D across the recorded information from Fig. 1(c); Fig. 2 shows examples with 100 points each. Note that we test these distributions only for the 2D scattering case because of its reduced dimensionality; Halton sampling is known to perform poorly in higher dimensions. In Section 4 we show that grid sampling offers good quality but is prone to artifacts in some areas. Thus, we limit the 3D scattering to pseudo-random scattering, as this is the most flexible sampling method.
Regular Grid Distribution (Fig. 2(a)): Distributing the points on a regular grid guarantees an even distribution of points and avoids undersampling and oversampling of individual parts of the object. However, the regular grid structure becomes very apparent in reconstructed holograms, which is why we discard this method.
Hexagonal Grid Distribution (Fig. 2(b)): We hypothesize that we can minimize speckle noise by placing points on a hexagonal grid, which can be trivially derived from a regular grid by defining a skewed coordinate system that rotates one axis to the diagonal.
Pseudo-Random Distribution (Fig. 2(c)): A pseudo-random distribution based on a hash function avoids any grid-like artifacts and is fast and easy to compute on the GPU, but the geometry is unevenly sampled.
Halton Distribution (Fig. 2(d)): The quasi-random distribution based on the 2,3-Halton sequence, as proposed by [28], is an example of a low-discrepancy sequence, which tries to maximize the minimum distance between individual points, thus sampling the surface more evenly while still avoiding any perceivable artifacts.
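For reference, the 2,3-Halton sequence can be generated with the standard radical-inverse construction (a minimal Python sketch):

```python
def halton(index, base):
    """Radical inverse of `index` in the given base (one Halton coordinate)."""
    f, result = 1.0, 0.0
    while index > 0:
        f /= base
        result += f * (index % base)   # next digit, scaled into [0, 1)
        index //= base
    return result

def halton_2_3(n):
    """First n points of the 2,3-Halton low-discrepancy sequence in [0, 1)^2."""
    return [(halton(i, 2), halton(i, 3)) for i in range(1, n + 1)]
```

The first few 1D values in base 2 are 1/2, 1/4, 3/4, 1/8, ..., which progressively fill the unit interval.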

Hardware rasterization
The fundamental idea of Point Based Holography is to compute the superposition of the projected phase and amplitude distribution from many point sources.
We limit these distributions per point to a maximum angle θ_max, which is given by the grating equation

sin θ_max = λ / (2p),

with the pixel pitch p. At a given distance z of a point to the WRP, this leads to a projected disk with radius

R = z tan θ_max.

Here we make use of the graphics hardware in an unusual way. GPUs are traditionally specialized triangle rasterization and blending systems. We believe that this is a good fit for hologram computation, since for each source point we need to rasterize the circular cone base on the WRP and blend (accumulate) all cone bases together. This avoids the complex thread scheduling and synchronization implementations that would otherwise be necessary. We therefore represent the wavefront emitted from a single point as a triangular fan disk on the WRP. This is useful because triangles are handled very efficiently by the GPU. See Section 3.2.1 for more details and a full motivation for this choice.
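As a quick numeric sketch (illustrative helper name, assuming the common grating-equation form sin θ_max = λ/(2p) for pixel pitch p), the projected disk radius follows directly from the maximum diffraction angle:

```python
import math

def disk_radius(z, wavelength, pitch):
    """Radius of the projected wave disk on the WRP for a point at distance z."""
    theta_max = math.asin(wavelength / (2 * pitch))  # grating equation
    return z * math.tan(theta_max)                   # geometric cone base radius
```

For λ = 532 nm, p = 7.5 µm, and z = 1 cm, the disk radius is roughly 0.35 mm, i.e., the triangle fans stay small relative to the WRP.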

Occlusion
Since the object surface mesh consists of triangular geometry, we use ray casting for occlusion computation. Ray casting algorithms are highly optimized for detecting intersections with triangles, making them efficient for this type of geometry, and they scale well with the complexity of the scene. Modern GPUs are highly efficient at parallel processing, making them ideal for the computationally intensive task of ray casting. Some of the latest GPUs with dedicated ray tracing cores are designed specifically to handle ray-based rendering techniques, including ray casting. These cores are optimized for tasks like calculating intersections between rays and triangles, which is a fundamental part of occlusion computation in ray casting. Nevertheless, given the quadratic complexity of wavefront propagation, per-pixel occlusion testing is still not feasible for fast hologram synthesis. Here, we make use of our triangle-based representation of the circular cone bases. We emit one occlusion ray per disk segment and discard the segment if it is occluded. The ray starts at the center of the respective disk segment and heads towards the source point on the scene object. In this way, the computation is very similar to common shadow ray computations in Computer Graphics. If a ray intersects any geometry between the disk and the source point, the respective disk segment is discarded. This approach has several advantages over previous occlusion computations. Firstly, the complexity of the occlusion computation is decoupled from the target hologram resolution and only depends on the number of source points and the disk subdivision. Secondly, discarding disk segments can potentially even increase performance on top of improving the hologram quality, since it reduces the workload on the rasterizer. Thirdly, due to the aforementioned improvements in modern hardware architecture, this approach is both straightforward to implement and surprisingly fast in practice.
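On the GPU, this per-segment test maps to the hardware ray tracing units; as a CPU-side sketch, a single occlusion ray can be tested against one triangle with the standard Möller-Trumbore algorithm (illustrative code, not the Vulkan ray query API):

```python
import numpy as np

def ray_hits_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: does the ray hit the triangle at some t > eps?"""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:
        return False                      # ray parallel to the triangle plane
    inv = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv
    if u < 0 or u > 1:
        return False                      # outside first barycentric bound
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv
    if v < 0 or u + v > 1:
        return False                      # outside second barycentric bound
    t = np.dot(e2, q) * inv
    return t > eps                        # hit in front of the origin
```

A disk segment would be discarded as occluded if any scene triangle between the segment center and its source point passes this test.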

Implementation
Here we discuss implementation details of our system, an overview of which is given in Fig. 3. We build upon the work of Fricke et al. [29] in that we describe the wavefront as a triangle fan and occlude individual triangles on demand. The following subsections go into detail on how this is utilized.

Field computation
Our goal is to determine the field generated in the WRP by the collection of source points. We need to determine the contribution of each source point to every pixel in the WRP. An object point approach computes the result for all source points in parallel. However, scheduling the GPU in this manner results in race conditions: multiple threads would write to the same WRP pixel in parallel, leading to wrong results. The race conditions can be avoided by the hologram point approach, which computes the result for all pixels on the WRP in parallel. Our approach is to choose a hybrid between object point and hologram point parallelization.
Fig. 3. Overview of the whole system. Note that the steps highlighted in green, i.e., point density calculation and scattering, can also be done as pre-processing, thus having a constant impact on the efficiency of the system that can be ignored in the running time.
As explained above and shown in Fig. 4, each source point generates a circular field distribution on the WRP, which we treat as a circular fan of triangle patches. In order to schedule these overlapping patches efficiently on a modern GPU, we would need to massively parallelize the accumulation of the field patches, and proper thread synchronization using specialized atomic operations would be necessary, which in turn impacts performance and implementation complexity. However, the determination of all hologram points to which a given source point actually contributes is identical to the rasterization of the spherical wavefront in the WRP. Therefore, we formulate our wavefront accumulation problem as a real-time rasterization problem. This is exactly what GPUs are designed for. We approximate each object point wave as a triangular circular fan, as shown in Fig. 4, and feed this mesh to the hardware rasterizer.
The hardware fragment program, which is executed by the GPU for each rasterized point of each individual source point, computes the values of phase and amplitude as a complex number. We configure the blending hardware to use the function "add", which results in the correct summation of all source point contributions without the need for any manual thread scheduling or synchronization.
The computation of phase and amplitude utilizes the rotationally symmetric structure of the field contribution of each source point. Furthermore, this field structure is highly redundant and identical for all source points with the same distance to the WRP, which suggests the use of Look-Up Tables (LUTs). In our approach, we build upon the radially symmetric LUT described in the work of Fricke et al. [30].

Occlusion computation
By leveraging the hardware-accelerated ray casting found on modern GPUs, we can efficiently compute occlusions. We trace occlusion rays from the centers of the triangles shown in Fig. 4 to the object point on the surface of the object. This is performed in a separate compute step after the point scattering has finished. This step fills a mask buffer that contains the occlusion information as a binary value. In the Task Shader step we read this information to decide whether to generate triangles or not.
It is to be expected that this binary decision of discarding whole triangles introduces unintended artifacts due to the high frequencies it creates. However, we found in our tests that if the point density is high enough, this binary occlusion test converges to a reasonably smooth transition. To minimize any additional artifacts, we apply a slight low-pass filter as a post-processing effect after point accumulation has finished. This can be efficiently integrated as a multiplication into the field calculation step at no significant additional cost.
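Such a low-pass step can be sketched as a Gaussian multiplication in the frequency domain (a minimal sketch; the Gaussian shape and the cutoff parameter are illustrative assumptions, not the exact filter used here):

```python
import numpy as np

def lowpass(field, pitch, cutoff):
    """Slight Gaussian low-pass of a complex field via a spectral multiplication."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=pitch)
    FX, FY = np.meshgrid(fx, fx)
    H = np.exp(-(FX**2 + FY**2) / (2 * cutoff**2))   # Gaussian transfer function
    return np.fft.ifft2(np.fft.fft2(field) * H)
```

Because the filter is a pure multiplication in the Fourier domain, it folds into the ASM propagation step at essentially no extra cost.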

Task and mesh shader
NVidia introduced the new Mesh Shading pipeline with their Turing hardware generation. This pipeline introduces two new shader stages which execute specialized user-defined programs on the hardware. These are executed in the context of the rasterization pipeline and are meant to replace the geometry processing stages that were previously available, i.e., Vertex, Geometry, and Tessellation Shaders. In contrast to general purpose compute kernels, where everything including the input, output, and number of executions can be freely chosen, specific outputs are pre-defined. For Task Shaders, these are the number of Mesh Shaders to spawn and any information that needs to be passed to the Mesh Shaders. A Mesh Shader by default outputs a vertex buffer containing the triangle vertex positions and an index buffer of user-defined size describing how to assemble the vertices into triangles. This output is referred to as a Meshlet. In our case, we define the part of a disk belonging to a particular occlusion ray as a Meshlet. The vertex positions and index list that define the triangles are then directly forwarded to the rasterizer.
This system allows a form of dynamic thread scheduling on the GPU, resulting in a fully GPU-driven pipeline in which even the number of generated triangles, i.e., the number of points minus the triangles discarded due to occlusion, is determined on the GPU.
Similar to compute kernels, the Task Shader invocations are subdivided into work groups. We spawn n work groups with 16 threads per work group for n object points, resulting in a total of at most 16n Task Shader invocations. Each Task Shader execution reads the occlusion value from the mask buffer that has been computed in the previous ray casting step. Using warp intrinsics, we can efficiently compute the total number of Mesh Shaders to spawn and the offset information necessary to generate and orient the remaining triangles to the correct part of the disk.
The triangle rasterization hardware of the GPU is used to compute the coverage for each disk patch that is projected onto the target plane. This way, no manual thread counting and scheduling is necessary, and efficient hardware utilization is trivial.
Blending of the individual point contributions is done by the hardware as well. For this, a configuration flag is specified that tells the hardware to perform basic accumulation of the individual fragment shader outputs. This makes any manual synchronization, which would otherwise be necessary, obsolete. Yanagihara et al. [12] reported that it is possible to omit this synchronization step when using a compute-based implementation, with the trade-off of errors in the phase distribution caused by race conditions when multiple contributions are written to the same target memory area at the same time. This impacts the quality of the created hologram negatively, and thus we avoid this case in all our discussed approaches.
All that is left to do for the Fragment Shader is to compute the complex phase and amplitude value for the given part of the disk of an object point, which can be either computed on demand or read from a LUT.
In preliminary tests, computing the sin and cos functions inside the fragment shader was always slightly slower than using the LUT. Our LUT is small enough that memory fetches are negligible, since the whole per-point LUT fits in one cache line. This is another advantage of our triangle-based approach, since all fragments belonging to a point are likely scheduled together by the rasterizer.

Baseline
For comparison, we also implement a compute-based method which more closely follows the traditional approaches discussed in, e.g., the work of Yanagihara et al. [12]. We refer to this approach from here on as the baseline method. In this approach, we first determine the number of compute threads necessary so that every thread writes the wavefront contribution from one source point to one destination point. To do this, we compute the minimum quadratic pixel area for each source point which contains the projected field disk of this point on the WRP. We then compute the cumulative sum of these areas to get not only the overall number of computations necessary, but also the offset at which the compute threads for a particular object point start. As a third step, we spawn the appropriate number of compute threads for wavefront computation according to the last value in our cumulative sum. Similar to the IS approach discussed in Section 2, we can then retrieve the object point for a particular thread index by performing a binary search on the cumulative sum buffer in this compute kernel. Finally, we use atomic operations to safely accumulate and write the results to the wavefront buffer. Optionally, we perform ray occlusion queries per thread to check if the object point is visible from the particular destination point, similar to the method of Shi et al. [20], where this approach is used for the computation of their training data.
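The cumulative-sum bookkeeping of the baseline can be sketched as follows (illustrative names; on the GPU, the lookup is a per-thread binary search on the cumulative sum buffer):

```python
import numpy as np

def thread_to_point(areas):
    """Map each compute-thread index to its source point via a cumulative sum."""
    offsets = np.cumsum(areas)            # offsets[i] = first thread of point i+1
    total = int(offsets[-1])              # total number of threads to spawn
    def lookup(thread_idx):
        # binary search: first point whose offset exceeds the thread index
        return int(np.searchsorted(offsets, thread_idx, side='right'))
    return total, lookup
```

Each thread thus knows both its source point and, via the offset, which destination pixel of that point's disk area it must write.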

Results and discussion
All tests are done on an NVIDIA GeForce RTX 3090 Ti. The algorithm is implemented in Vulkan [31] using a custom framework written in C++. All experiments were done at a resolution of 4096 × 4096. The WRP was set to be between 0.003 m and 0.03 m away. The hologram plane is an additional 0.1 m away, resulting in a total scene geometry distance of between 0.103 m and 0.13 m.
The pixel pitch is chosen to be 7.5 µm in all tests except one, where the pitch is explicitly mentioned, which puts it in the range of commonly available Spatial Light Modulator (SLM) devices. The wavelength λ was set to 532 nm. If not stated otherwise, all results are computed with Halton-based 2D point sampling with IS enabled, as this combination gave the best compromise between numeric Peak Signal to Noise Ratio (PSNR) based quality and perceived quality in our experiments.
We evaluate everything both qualitatively and quantitatively. For this, we perform tests with and without our occlusion algorithm enabled, with different numbers of points, and with different point distribution methods.
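The quantitative metric used throughout is PSNR; a minimal reference implementation, assuming intensity images normalized to a given peak value:

```python
import numpy as np

def psnr(reference, test, peak=1.0):
    """Peak Signal to Noise Ratio in dB between two intensity images."""
    mse = np.mean((reference.astype(np.float64)
                   - test.astype(np.float64)) ** 2)
    # Identical images have zero error and thus infinite PSNR.
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```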
All reconstructions are simulated using a simple propagation model based on [32]. The hologram is propagated to a lens plane using the angular spectrum method, where it is multiplied by a lens and aperture function to extract the view point. Afterwards, it is converted to the angular spectrum to extract the view direction. The focal plane is chosen by varying the lens function. The reconstruction depth is set to 0.1 m from the hologram plane.
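The angular spectrum propagation step can be sketched as follows; this is the generic textbook formulation, not the paper's exact implementation, and parameter names are ours:

```python
import numpy as np

def angular_spectrum_propagate(field, wavelength, pitch, distance):
    """Propagate a sampled complex field by `distance` with the angular
    spectrum method (illustrative sketch, square fields only)."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=pitch)          # spatial frequencies [1/m]
    FX, FY = np.meshgrid(fx, fx)
    # Propagating components satisfy 1/lambda^2 - fx^2 - fy^2 > 0;
    # evanescent components are suppressed.
    arg = 1.0 / wavelength ** 2 - FX ** 2 - FY ** 2
    kz = 2.0 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    H = np.exp(1j * kz * distance) * (arg > 0)
    return np.fft.ifft2(np.fft.fft2(field) * H)
```

For example, a uniform plane wave propagated this way only picks up a global phase, so its amplitude stays constant.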
All qualitative comparisons are made against an image rendered with a simple Ray-Tracing renderer that is limited to direct lighting and diffuse shading, as seen in Fig. 5(a), from the same viewpoint under the same lighting conditions. The lighting and reflectance parameters are chosen to best match our point based approach. To illustrate the results of our method, Fig. 5(b) shows a time multiplexed reconstruction of the scene from 100 holograms with our occlusion method enabled. Even though this comparison is not a perfect match to our reconstructed holograms in terms of shading and depth of field, all compared methods share the same dissimilarities, and the relative quality differences between them are what we are interested in.
We additionally compare against a compute based implementation (i.e., the baseline method described in Section 3.3) with either pixel granular Ray-Casting based occlusion queries or no occlusion computation. This method does not use a LUT and uses the same point scattering method as our main method. We do this to isolate the impact that our approximations have on the overall quality, without regarding the influence of our chosen numerical reconstruction method on the overall quality of the hologram. Pixel granular occlusion queries obviously make the comparison method more expensive, but give us an estimate of the quality impact of our approximate triangle based occlusion. For a fair comparison in terms of speed, we disable the occlusion computation for the comparison method to better observe the impact of manual thread scheduling and synchronization on the overall computational cost.
In the following, we discuss the influence of both the number of points selected to generate the hologram and their distribution, as well as the impact of the occlusion computation on the generated results.

Impact of number of points
Here we discuss the quality and performance impact of different numbers of points. All timings contain all steps from start to finish, including point scattering and WRP propagation. We discuss the relative impact of these individual parts later. Figure 6 shows the influence of the number of points used when generating holograms with our method, computed with occlusion and using the quasi random Halton distribution for the point placement. As expected, increasing the number of points also increases the overall quality. Interestingly, however, the returns diminish already after a comparatively low number of points. This tells us that we get a reasonable scene representation especially for sparse scenes like our test scene, which, given the timings reported in Table 1, makes this approach well suited for real-time applications under these particular scene constraints. In our tests, we did not see a significant quality difference to the much more expensive compute based method. It makes sense that these two approaches should perform similarly in terms of quality, since they mainly differ in how threads are scheduled and how the superpositions are computed. What is surprising, however, is that neither our Look Up Table nor the occlusion approach seems to have a significant impact. The latter might be explained by the point scattering approach: most of the occlusions are already encoded here, so culling is barely necessary, which also might explain the slight overhead in the occlusion computation for 10⁶ points. This can also be seen in Fig. 7, as the relative quality with and without occlusions is very similar. Furthermore, in our tests we found that, when using the more naive triangle based point scattering approach, the occlusion computation time is more than compensated for by the reduced rasterization cost, as the rasterization speed alone increases by 20% for 10⁶ points in that case. It should also be noted that there is an apparent sweet spot for the number of points between 10⁵ and 10⁶, from which on the time seems to increase more linearly with the number of points. This might be the point at which the GPU compute and cache utilization is saturated.

Impact of point distribution
We compare a total of seven different point scattering methods: the three 2D scattering methods with projection described in Section 2, each with and without IS (i.e., distributing the points proportionally to a given density function), plus scattering directly on the triangles of the object. Figure 8 shows the relative impact these different methods have on the PSNR values. Grid based scattering achieved the highest numerical quality in our tests, though, as discussed in Section 2, this comes at the cost of visual quality. Halton points had overall the most visually pleasing results and are thus a reasonable compromise in our opinion. The triangle based scattering performed the worst, which is unsurprising, given that approximately half of the points are distributed on surfaces facing away from the target plane, which leads to them being culled by the occlusion computation. Note that this also explains the dip in quality, since the holograms are effectively generated from approximately half the number of points.
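Scattering points directly on the triangles of a mesh is typically done by choosing triangles proportionally to their surface area and then sampling uniformly in barycentric coordinates; a minimal sketch (function name and RNG seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def scatter_on_triangles(vertices, triangles, n):
    """Scatter n points on a triangle mesh, picking each triangle with
    probability proportional to its area (illustrative sketch)."""
    v0 = vertices[triangles[:, 0]]
    v1 = vertices[triangles[:, 1]]
    v2 = vertices[triangles[:, 2]]
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    # Pick one triangle per point via the cumulative area distribution.
    idx = np.searchsorted(np.cumsum(areas), rng.uniform(0, areas.sum(), n))
    # Uniform barycentric sampling; the fold keeps points inside the triangle.
    u, v = rng.uniform(size=(2, n))
    flip = u + v > 1
    u[flip], v[flip] = 1 - u[flip], 1 - v[flip]
    return v0[idx] + u[:, None] * (v1 - v0)[idx] + v[:, None] * (v2 - v0)[idx]
```

Note that, as discussed above, such samples land on back-facing surfaces as well, which is exactly why roughly half of them get culled by the occlusion computation.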

Impact of occlusion
Table 1 shows that, depending on the scenario, our occlusion computation has either a negligible or no negative performance impact on the overall computation time. This stems from the fact that, in our experiments, the ratio of occlusion rays cast to the resulting culled disk triangles is such that the reduced rasterization workload outweighs the additional Ray-Casting cost most of the time. Even for our 2D scattering approach, the overall cost reduction can be significant, even though the quality is barely impacted (see Fig. 7). This might be due to points on parts of the surface that are tangential to the hologram plane being discarded by our occlusion computation.
A visual comparison between the hologram reconstructed with and without occlusion computation for the triangle based scattering method is given in Fig. 9. Here, the visual impact of the occlusion computation is very apparent. As mentioned earlier, the performance impact is also most significant here, as the computation with occlusion enabled is 20% faster. In Fig. 10 we show further details of our occlusion computation. To better illustrate the impact our approximation has, we lowered the pixel pitch for this experiment to 3.8 µm, which more closely matches what is available on the most recent commercially available devices. Our approximation correctly removes occluded areas in most places, while failing to correctly occlude some areas due to strong parallel shifts of the viewpoint. Note that this limitation can most likely be alleviated by randomly distributing the occlusion rays within the disk segments instead of sampling only the segment centers.
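Each occlusion query boils down to a ray/triangle intersection test against the scene geometry; a CPU sketch of the standard Möller–Trumbore algorithm, the kind of test the hardware Ray Casting accelerates:

```python
import numpy as np

def ray_hits_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Möller–Trumbore ray/triangle intersection (illustrative sketch).
    Returns True if the ray hits the triangle in front of the origin."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:              # ray parallel to the triangle plane
        return False
    inv = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv          # first barycentric coordinate
    if u < 0 or u > 1:
        return False
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv  # second barycentric coordinate
    if v < 0 or u + v > 1:
        return False
    t = np.dot(e2, q) * inv         # hit distance along the ray
    return t > eps
```

A disk triangle is culled when the ray cast from its sample position toward the destination point is blocked by scene geometry in this sense.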

Further performance remarks
Here we discuss the relative performance impact of individual parts of the system illustrated in Fig. 3.
Out of the total computation time of 135.77 ms, only ≈ 46% was spent on the rasterization of the disks. A significant 26% was spent on the complete point scattering pipeline, including viewport grid generation, cumulative sum generation, point placement, and projection. Note that for the triangle based point generation, this time can be excluded, since in that case the point cloud generation is viewport independent and can be considered a preprocessing step. The remaining time is spent on the Fast Fourier Transform based propagation.

Conclusion
We presented a computer generated hologram method that makes use of modern GPU features like Mesh Shading and Hardware Accelerated Ray Casting. It also uses more traditional GPU features, such as the rasterization hardware, that have largely been overlooked by the CGH community. Additionally, we discussed different methods of deriving point distributions from geometry represented as triangular meshes, which is the most common geometric representation in computer graphics, bridging the gap between the flexibility of Point Based Holography methods and computer graphics research. The presented system is capable of producing high quality holograms while remaining flexible enough to also be fast, depending on the chosen parameters. Our method is faster than the more common and vastly more complex compute based approach of scheduling work on graphics hardware, which is usually done using CUDA, while retaining a similar quality. Furthermore, our novel approach for work scheduling using the rasterization hardware allows us to schedule the right amount of work and even enables dynamic GPU driven scheduling, which allows for occlusion computation in a way that speeds up the computation further while also improving the quality.

Funding. Deutsche Forschungsgemeinschaft (390833453).
Disclosures. The authors declare no conflicts of interest.

Fig. 1 .
Fig. 1. Information captured from a standard rendering algorithm used for point definition. (a) contains the surface normal stored as the red, green and blue color components of the image. (b) contains the surface 3D position, and (c) contains the Lambertian diffuse lighting approximation computed from the dot product of the surface normal and the light direction. This information is used to distribute the points proportionally to their perceived prominence.
Fig. 3. Overview of the whole system. Note that the steps highlighted in green, i.e., point density calculation and scattering, can also be done as pre-processing, thus contributing a constant cost that can be excluded from the running time.

Fig. 4 .
Fig. 4. Ray Casting logic: the projected areas of complex phase and amplitude from individual points, approximated as triangulated disks, are procedurally generated on the GPU and centered at each point (highlighted in black in the figure). The triangles are then inspected for occlusions using hardware accelerated Ray Casting and removed if occluded (red areas in the figure).

Fig. 5 .
Fig. 5. Baseline scene for all our comparisons (a) side by side with a hologram reconstructed with our method (b). The shown result is the time multiplexed reconstruction from 100 holograms using triangle based point scattering, computed with occlusions.

Fig. 6 .
Fig. 6. Hologram reconstructions from different numbers of points. All reconstructions are computed with occlusion and using the Halton sequence for point distribution.

Fig. 7 .
Fig. 7. Quality impact of the number of points. The graph portrays the relative PSNR values, compared to the overall best approach, for different generation methods with increasing numbers of points.

Fig. 8 .
Fig. 8. Relative PSNR values for different generation methods. As can be seen, all of our results (first 6 bars) improve quality over simple triangle scattering. Additionally, when comparing different point distributions with (bars 1-3) or without IS (bars 4-6), the results show the higher impact of IS over the selected strategy for point distribution. All holograms were generated with 10⁶ points.

Fig. 9 .
Fig. 9. Visual impact of the occlusion computation. Both images depict time multiplexed reconstructions from 100 holograms using triangle based point distribution for illustration. Note that the computation with occlusion is significantly faster than without.

Fig. 10 .
Fig. 10. Details of the occlusion computation. The impact our occlusion method has on the visual quality of the reconstructed hologram with a pixel pitch of 3.8 µm. The top two rows (a) show reconstructions with occlusion and the bottom two rows (b) show reconstructions without occlusion. Perspectives in the reconstruction are shifted to the right (r) and to the left (l). The green detail boxes show areas where our occlusion computation succeeds, while red areas show the limitations of our approximate occlusion.

Table 1. Computation time in ms for different numbers of points. We present the time to compute a hologram using the baseline algorithm without occlusions, followed by the results of our method without and with occlusion computation.

Number of Points | Baseline Method Without Occlusion | Our Method Without Occlusion | Our Method With Occlusion
24.95 ms | 24.82 ms | 25.67 ms | 0.24 ms
16.93 ms | 16.20 ms | 20.08 ms | 0.99 ms
15.14 ms | 13.74 ms | 21.58 ms | 2.66 ms