Comprehensive Performance Analysis of Objective Quality Metrics for Digital Holography

Objective quality assessment of digital holograms has proven to be a challenging task. While predicting the perceptual quality of the recorded 3D content directly from the holographic wavefield is an open problem, perceptual quality assessment after rendering requires a time-consuming rendering step and must cover a multitude of possible viewports. In this research, we use the 96 Fourier holograms of the recently released HoloDB database to evaluate the performance of well-known and state-of-the-art image quality metrics on digital holograms. We compare the reference holograms with their distorted versions: (i) before rendering, on the real and imaginary parts of the quantized complex wavefield; (ii) after converting the Fourier holograms to Fresnel holograms; (iii) after rendering, on the quantized amplitude of the reconstructed data; and (iv) after subsequently removing speckle noise with a Wiener filter. For every experimental track, the quality metric predictions are compared to the Mean Opinion Scores (MOS) gathered on a 2D screen, a light field display, and a holographic display. Additionally, a statistical analysis of the results and a discussion of the performance of the metrics are presented. The tests demonstrate that while, for each test track, a few quality metrics correlate well with the multiple sets of available MOS, none of them performs consistently well across all four test tracks.


Introduction
Predicting the perceived visual quality of 3D media in general is highly desired. In this regard, objective quality evaluation of 3D content has typically been pursued based on the type of technology utilized to capture the depth information and visual parallax. For example, in [1,2,3,4] methods are introduced to evaluate stereoscopic scenes. Alternatively, for Depth-Image-Based-Rendering techniques, which essentially provide a depth value per regular 2D image pixel, methods like [5,6,7,8,9] have been proposed. For light field (LF) imaging, some recent efforts have proved efficient in their perceptual analysis, among them [10,11,12,13,14].
One of the emerging plenoptic modalities, which can provide the most complete set of visual depth cues compared to the other 3D data modalities [15], is digital holography. Holography can present a faithful reconstruction of the real object, with continuous parallax (within the recorded Field of View (FoV)) and the freedom to refocus on an unlimited number of focal distances, just like watching the real 3D scene.
However, perceptual quality prediction of holographic content requires taking into account extra layers of complexity, on top of the general difficulties associated with quality prediction for plenoptic content [15,16], such as stereoscopic or light field content. A major complexity added to the Visual Quality Assessment (VQA) of digital holograms consists of predicting the visual quality of the captured 3D scene from the complex-valued holographic fringes, instead of analyzing several reconstructed viewports of a scene, which involves an expensive rendering step. Even without considering the perceptual quality and visual appearance aspects, direct calculation of a mathematical error between a pair of complex-valued data sets is a challenge of its own: given the wrapping nature of the phase and the unboundedness of the magnitude, quantifying the maximum possible error, or more generally defining a bounded measure that gives the relative error compared to the magnitude of the reference data, is not straightforward. Nonetheless, in [17,18], the authors have recently proposed a framework which can potentially be utilized to this extent. The next major challenge for objective holographic VQA is introduced by the lack of comprehensive perceptually-annotated holographic data sets. This comes from the fact that creating such data sets is in itself a challenging task. First of all, collecting appropriately diverse sets of holograms that take into account the different characteristics of the many types of holograms is critical. This diversity can be addressed via an increased complexity of the recorded scene, i.e. via the number of objects in the scene, the objects' depth, the distance from the recording sensor, the presence of spatial occlusions, and the object materials and surface structures, e.g. diffuse, transparent, specular, etc. Various types of hologram production are possible, e.g. computer-generated holograms (CGH) or optically recorded holograms (ORH).
Each type is limited by the (virtual) sensor resolution, pixel pitch, reference wavelength, numerical aperture, sources of noise, etc. All of these factors contribute to the attributes of the produced hologram. Some initial efforts to create publicly available holographic data sets can be found in [19,20]. Apart from data selection, conducting subjective experiments requires inventive but reproducible methodologies and scoring protocols for such plenoptic content.
Yet another issue for holographic subjective experiments arises from the fact that holographic displays with acceptable visual attributes are still rare and mostly operate under laboratory conditions, requiring advanced technical skills to be properly configured. This makes it difficult to evaluate the holograms directly. To address this issue, most related studies have utilized other types of displays to conduct holographic subjective experiments. In an early effort, Lehtimaki et al. [21,22] utilized a stereoscopic display to investigate applicable visual depth cues and the appearance of a handful of optically recorded holograms. In [23], the perceptual quality of a limited set of holograms was even studied after printing on glass plates. Later, in [24], a large, high-resolution 2D screen was utilized to conduct a subjective experiment on the central view of CGHs. In [25,26], a light field display was proposed to render a set of CGHs. Nonetheless, the most comprehensive subjective experiment for holographic content to date is provided in [27], where a subjective experiment was conducted on a holographic display [28], a light field display [29], and a regular 2D display [30] using the same data set. A diverse set of both CGHs and ORHs was generated from single- and multi-object scenes carefully picked to represent various depths, recording distances, and surface materials. The 96 holograms of this data set, called HoloDB, and the Mean Opinion Scores (MOS) gathered from each display setup have been made publicly available [31].
An additional complexity is the presence of speckle noise, which is due to the coherent light used for recording and displaying holograms [32]. Speckle stems from the interference of multiple wavefronts, each generated from a different illuminated point on the scene surfaces and superposed with the reference light beam. Although those interfering wavefronts are generated from, e.g., the reflection of the highly coherent reference beam, the slightly different distances of the points to the recording sensor introduce spatial incoherence. Constructive as well as destructive interference shows up in the reconstruction as randomly positioned, very dark and bright spots which alter the appearance of the recorded data. The visibility of speckle noise may vary significantly from object to object, depending on the smoothness of the object surface or, in case of an in-line recording setup, on the thickness of the recorded sample. In [33], Bianco et al. provide a good review of the many proposed solutions for the suppression of speckle noise. More recently, in [34] and [35], state-of-the-art denoising methods have been evaluated both objectively and subjectively, using different sets of digital holograms and various objective metrics.
To the best of our knowledge, a full-fledged holographic perceptual quality metric is yet to be introduced, due to the obstacles reviewed above. Meanwhile, optimized compression techniques for digital holograms [15,36,37,38,16] and sophisticated numerical methods for CGH generation [39,40,41,42,43,44] continue to advance. Consequently, the lack of reliable perceptual quality predictors is felt more than ever. As a first step towards designing such algorithms, we report in this paper a comprehensive analysis of the prediction performance of state-of-the-art Image Quality Metrics (IQMs), based on the subjective scores provided by the HoloDB data set. We also explore their strengths and weaknesses with regard to the studied holographic data and summarize their behaviour when used to compare the holograms before and after rendering. Additionally, we study a conventional speckle denoising filter and compare the performance of the IQMs before and after denoising. Furthermore, to ensure that the results of our study are not biased towards the characteristics of Fourier holograms, we study the performance of the IQMs after a lossless numerical conversion to the more conventional Fresnel hologram type.
The main objectives and novelties of this manuscript include:
• analysis and comparison of the accuracy of the visual quality predicted by the IQMs operating on: the hologram plane; numerically reconstructed holograms for different viewing angles and focal depth planes; and numerically reconstructed holograms after speckle denoising;
• assessment of the impact of the hologram type, Fresnel or Fourier, on the prediction performance of the IQMs;
• definition of general guidelines for the deployment of currently available IQMs on holographic content.
In Section 2, we describe the characteristics of the tested data and explain the experimental pipeline and the conversion process from Fourier to Fresnel holograms. The tested IQMs, as well as their input requirements, are also discussed there, along with the tools utilized for the statistical analysis of the results. The test results, together with the statistical analysis and a discussion of the performance of the quality metrics, are provided in Section 3. Finally, Section 4 provides the concluding remarks.

Experimental pipeline and test methodology
In this section, we describe the test data, the experimental setup, the IQMs under test, and the methods deployed for the statistical analysis.

Description of test data
The Fourier holograms of the HoloDB database [27] are used for our experiments. The database consists of 8 reference holograms, 4 of which were captured from macroscopic real objects, while the other 4 were numerically generated from point clouds. For the optical recording, off-axis lensless Fourier holography was employed to obtain holograms of comparatively large objects with a wide angular FoV while maximizing the used Space Bandwidth Product (SBP) [45,46]. Compared to regular Fresnel holography, this method circumvents the common limitations on the sensor pixel pitch and resolution [47,48]. From the ORHs, on-axis holograms were obtained by bandpass filtering of the positive frequency components. The computer-generated holograms were directly generated as matching on-axis Fourier holograms. Fig. 1 shows the centre view of the rendered reference holograms at their frontal focal distances.
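The off-axis to on-axis step can be sketched as follows. This is a minimal illustration only: the carrier-frequency index and the band width retained around it are hypothetical parameters chosen for the toy example, not the values derived from the actual recording geometry in [27].

```python
import numpy as np

def offaxis_to_onaxis(offaxis, carrier):
    """Sketch: keep the positive-frequency object band of an off-axis
    hologram and shift it to baseband to obtain an on-axis hologram.

    `carrier` is the integer carrier-frequency index along axis 1; the
    retained band width (N/4 columns) is an illustrative assumption.
    """
    spec = np.fft.fft2(offaxis)
    N = offaxis.shape[1]
    # retain only the band around the positive carrier frequency
    mask = np.zeros_like(spec)
    band = slice(carrier - N // 8, carrier + N // 8)
    mask[:, band] = spec[:, band]
    # demodulate: roll the selected band down to baseband
    mask = np.roll(mask, -carrier, axis=1)
    return np.fft.ifft2(mask)

# Toy off-axis hologram: a pure carrier fringe at index 16
x = np.arange(64)
offaxis = np.tile(np.cos(2 * np.pi * 16 * x / 64), (64, 1))
onaxis = offaxis_to_onaxis(offaxis, carrier=16)
```

For this pure carrier, the filtered on-axis field is the constant complex amplitude 0.5, i.e. half of the real fringe's energy, as expected when the conjugate band is discarded.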

Compression specifications
Each reference hologram was later separated into its algebraic components and quantized to 8 bits per pixel (bpp). The quantized real and imaginary parts were encoded separately using JPEG 2000 [49,50], intra-frame H.265/HEVC [51] and wave atom coding (WAC) [52]. All three encoders compressed the reference holograms at bitrates of 0.25 bpp, 0.5 bpp, 0.75 bpp, and 1.5 bpp (bpp = bits per complex-valued pixel), creating a data set of 96 compressed holograms. We remark that, upon compression with JPEG 2000, aliasing artifacts were introduced in the reconstructions at the lower two bitrates. In these cases, fine scales of the 4-level Mallat CDF 9/7 wavelet decomposition were suppressed, leading to a downsampling of the hologram by a factor of 2 per suppressed scale. This shrank the aliasing-free cone [15] and caused the aliasing after reconstruction [53]. The test subjects were instructed to ignore these artifacts during the subjective experiments where the MOS values were produced.
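The component-wise 8 bpp quantization can be sketched as below. A uniform min-max quantizer is an assumption here; the paper does not specify the exact mapping, only that the real and imaginary parts are quantized to 8 bits each before coding.

```python
import numpy as np

def quantize_8bpp(component):
    """Uniformly quantize one real-valued hologram component to 8 bits.

    Returns the uint8 data plus the (min, max) range needed later to
    rescale back to the original floating-point range (assumed scheme).
    """
    lo, hi = float(component.min()), float(component.max())
    q = np.round((component - lo) / (hi - lo) * 255.0).astype(np.uint8)
    return q, (lo, hi)

def dequantize(q, rng):
    """Map the 8-bit data back to the stored floating-point range."""
    lo, hi = rng
    return q.astype(np.float64) / 255.0 * (hi - lo) + lo

# Split a toy complex wavefield into its algebraic components and quantize:
holo = np.exp(1j * np.random.default_rng(0).uniform(0, 2 * np.pi, (64, 64)))
q_re, r_re = quantize_8bpp(holo.real)
q_im, r_im = quantize_8bpp(holo.imag)
```

The quantized pair (q_re, q_im) is what the image codecs then compress, channel by channel.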

Rating modalities
Along with each hologram, a total of 12 MOS values are provided in HoloDB, accounting for renders from 2 focal distances and 2 perspectives (centre view and right-corner view), which were displayed on 3 display setups (a holographic display, a Light Field (LF) display and a regular 2D monitor (2D)). The exceptions are CG-Chess, which was reconstructed at 3 focal distances (because of its deep volume of 310 mm compared to its recording distance of 491 mm, containing multiple chess pieces positioned in 3 rows), and OR-Mermaid, which was rendered at a single focal distance (due to its narrow thickness of only 5 mm compared to its recording distance of 450 mm).

Quality assessment pipeline
The typical procedure of visual quality assessment for a set of compressed digital images consists of simply comparing their perceptual appearance before and after the compression step. In digital holography, however, the quality assessment can be performed at several points in the processing pipeline, due to the additional steps involved in visualizing the 3D content of each hologram. Below, the operations implemented in our assessment pipeline and the assessment points are explained sequentially:
• Quality assessment in the hologram domain: The real and imaginary parts of the holograms are originally stored in floating-point format. However, the current implementations of codecs like HEVC do not support such inputs, and thus the algebraic components of the holograms are quantized to 8 bpp prior to compression. Hereafter, the 8 bpp hologram (8 bpp for each of the real and imaginary channels) acts as the reference and undergoes the same operations as its compressed version. Choosing the 8 bpp format is also related to the strict input requirements of some of the tested quality metrics, as depicted in Table 1.
The reference is then encoded and decoded. At the first measurement point (QA 1), the input of the encoder and the output of the decoder are given to the quality metrics to predict the visual quality of the holograms after reconstruction. These predictions are then compared to the MOS.
To assess the impact of the hologram type on the performance of the quality metrics, the compressed and reference holograms are converted to the Fresnel type. At the second measurement point (QA 2), the same quality evaluations as for QA 1 are performed.
• Quality assessment after reconstructing the objects: Visualization of holographic 3D content mainly requires two additional steps. First, the data values of the quantized hologram need to be scaled back to the original data range of the source hologram. Second, each hologram has to be back-propagated to reveal the captured 3D content. This process is repeated for both the reference and the compressed (decoded) holograms.
After the centre and corner views are rendered from both holograms of the pair, those views are quantized to 8 bpp again (to meet the input requirements of the IQMs). Before quantization, data range clipping is performed to ensure that a few extreme values resulting from the speckle noise do not create an unnecessary expansion of the dynamic range. The views are then given to the quality metrics to predict their visual quality, which is compared to the corresponding MOS (QA 3).
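The render-clip-quantize sequence for QA 3 can be sketched as follows. A Fourier hologram reconstructs (up to scaling) with a single Fourier transform, and the synthetic aperture selects a viewpoint; the 99.5th-percentile clipping threshold is an assumed value, since the paper only states that a few speckle outliers are clipped before quantization.

```python
import numpy as np

def render_view(holo, aperture_slice, clip_pct=99.5):
    """Render one view of a Fourier hologram and quantize it to 8 bpp.

    Sketch only: aperture crop -> FFT reconstruction -> amplitude ->
    percentile clipping of speckle outliers -> 8-bit quantization.
    """
    view = holo[aperture_slice]                       # synthetic aperture
    amp = np.abs(np.fft.fftshift(np.fft.fft2(view)))  # amplitude of reconstruction
    amp = np.minimum(amp, np.percentile(amp, clip_pct))  # tame speckle outliers
    return np.round(amp / amp.max() * 255.0).astype(np.uint8)

# Toy hologram and a centred aperture standing in for the centre view:
rng = np.random.default_rng(1)
holo = rng.normal(size=(128, 128)) + 1j * rng.normal(size=(128, 128))
centre = (slice(32, 96), slice(32, 96))
img = render_view(holo, centre)
```

Both the reference and the decoded hologram pass through the same function, and the two 8-bit views are then fed to the IQMs.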
Finally, to evaluate the impact of speckle noise on the performance of the quality measures, a speckle denoising step is performed on the extracted views. Then, in the QA 4 track, the same quality predictions and performance evaluations as in QA 3 are conducted. Fig. 2 shows an abstract scheme of the experimental pipeline and the four steps for which we conducted the quality evaluations in this paper. For the sake of clarity, we have omitted separate processing lines for the real and imaginary parts in the depicted hologram domain.
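A minimal QA-4-style denoising step can be sketched with scipy's adaptive Wiener filter, in line with the Wiener filtering mentioned in the abstract. The 5x5 window and the multiplicative speckle model below are assumptions; the paper does not state the filter settings.

```python
import numpy as np
from scipy.signal import wiener

# Toy rendered amplitude image: a horizontal intensity ramp
rng = np.random.default_rng(2)
clean = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))

# Multiplicative exponential noise as a crude fully-developed speckle model
speckled = clean * rng.exponential(scale=1.0, size=clean.shape)

# Adaptive Wiener filtering over a 5x5 neighbourhood (assumed window size)
denoised = wiener(speckled, mysize=(5, 5))
```

In the QA 4 track, the IQMs then compare the denoised decoded view against the identically denoised reference view.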

Conversion of Fourier to Fresnel holograms
Aside from the Fourier form of the considered holograms, which makes the most efficient use of the available space-bandwidth product, we also considered synthetically obtained Fresnel forms. Fourier holograms are marked by their wavefronts not being convergent; under a planar reference wave they are thus reconstructed only in the conjugate plane of a lens. The wavefronts of Fresnel holograms converge under plane wave illumination without any additional lenses. Both forms are common and possess a substantially different space-frequency behaviour. Representing the same content as in a Fourier hologram H_Four in a Fresnel hologram H_Fres without aliasing requires upsampling US(·) of H_Four, such that the effective pixel pitch is halved (sampled bandwidth doubled), and demodulation with a parabolic wavefront K of monochromatic light of wavelength λ.
The distance z used in the kernel K(z) is not the actual scene distance, as that would require upsampling by a large factor m. Instead, we compute z such that objects are brought into focus upon direct observation as close to the hologram plane as possible without causing aliasing after upsampling. We provide a general formula, but consider only m = 2, the smallest such integer. Only upsampling by an integer m preserves the original samples and ensures invertibility.
Given a rectangular Fourier hologram recorded with a square pixel pitch p̂, it is necessarily bandwidth-limited at ±(2p̂)^(-1), and it is sufficient to consider only the larger of the two hologram dimensions, i.e. N̂ pixels. If the scene centre is in focus after a Fourier transform of the hologram, the main ridge of the phase-space footprint will be aligned with the spatial axis, cf. Fig. 3a, and no additional offset of z needs to be considered. Combining the ansatz that, after upsampling (p = p̂/m, N = mN̂) and demodulation, the largest frequencies (2p)^(-1) should be reached at the edges of the major dimension of the Fresnel hologram, |ξ| = Np/2, with the space-frequency law of Fresnel diffraction [54], it is straightforward to show that z is given as

z = N̂ p̂² / ((m - 1) λ).    (1)

Equation (1) is reversible without any loss, and thus testing the reconstruction from either representation alone is sufficient. The conversion may be understood as obtaining a Fresnel hologram from its compact space-bandwidth representation [55]. Alternatively, the Fourier hologram can be interpreted as the interim hologram formed by a Fresnel hologram and the parabolic phase kernel K, which is used to facilitate numerical back-propagation of a hologram within the Fresnel approximation. The space-frequency footprints of either form and the parabolic kernel are given in Fig. 3. As we will see, they significantly influence the performance of the quality metrics.
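The upsample-and-demodulate conversion can be sketched as below. This is a minimal sketch under explicit assumptions: a square hologram, upsampling by spectral zero-padding, a particular sign convention for the parabolic kernel, and the in-focus distance z passed in as a parameter rather than derived from the recording geometry.

```python
import numpy as np

def fourier_to_fresnel(holo, pitch, wavelength, z, m=2):
    """Sketch of the Fourier-to-Fresnel conversion.

    Upsamples the hologram by an integer factor m by zero-padding its
    centred spectrum (halving the effective pixel pitch for m = 2), then
    demodulates with a parabolic wavefront K(z).  Assumes a square
    hologram; the kernel sign convention is an assumption.
    """
    N = holo.shape[0]
    # 1) upsample: zero-pad the centred spectrum from N to m*N samples
    spec = np.fft.fftshift(np.fft.fft2(holo))
    pad = (m * N - N) // 2
    spec = np.pad(spec, pad)
    up = np.fft.ifft2(np.fft.ifftshift(spec)) * m * m
    # 2) demodulate with the parabolic kernel at distance z
    n = np.arange(m * N) - m * N // 2
    x = n * (pitch / m)                    # new pixel pitch p = p_hat / m
    X, Y = np.meshgrid(x, x)
    K = np.exp(-1j * np.pi * (X**2 + Y**2) / (wavelength * z))
    return up * K

# Toy example: a flat wavefield of unit amplitude
holo = np.ones((32, 32), dtype=complex)
fres = fourier_to_fresnel(holo, pitch=8e-6, wavelength=633e-9, z=0.25)
```

Because both steps are invertible (the padding preserves the original samples and K has unit magnitude), the conversion is lossless, matching the reversibility argument in the text.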

Quality metrics
For the current experiment, we use the mathematical measures which can directly evaluate the complex data, including the Peak Signal-to-Noise Ratio (PSNR), the Mean Squared Error (MSE) and its normalized form (NMSE), where the MSE is normalized by the Frobenius norm of the reference hologram. We also test our recently proposed IQM called the Sparseness Significance Ranking Measure (SSRM) [56], which is native to the complex domain and showed good suitability for the quality evaluation of a limited set of CGHs [57]. In its original form, SSRM operates solely in the Fourier domain and predicts the similarity by comparing the Fourier coefficients of the reference and the impaired data. It calculates a separate quality score for the DC term, which is later combined with the score of the other coefficients. However, in some preliminary tests, we found that, in the case of Fourier holograms, this does not necessarily result in a more robust performance. Consequently, we considered a version of this method which treats the DC term just like the other Fourier coefficients. The results for this version are presented under the name SSRMt. Since the compression of the holograms was performed separately on their real and imaginary parts, we can also separately evaluate those real and imaginary parts with state-of-the-art IQMs (which by default operate on real-valued data) and then calculate the arithmetic mean to obtain one quality prediction score for each compressed hologram. The goal is to check whether they can relate their signal fidelity measurements on holograms to the overall perceptual quality of the rendered scenes (represented by the provided MOS). The tested IQMs include: FSIM [58], IWSSIM [59], MS-SSIM [60], VIF [61], NLP [62] and GMSD [63].
The UQI [64] and SSIM [65] have been utilized in the context of holographic data, e.g. in [66,67] and [68]; consequently, we added them to the experiments as well. It should be noted that machine-learning-based IQMs, along with those requiring information from the chrominance channels of color images, were omitted from this evaluation. The former group requires large training sets to be adapted to the case of holography, while our holographic data set is too small for such purposes. The latter group requires information from the color channels, while our holograms are all monochromatic. Table 1 summarizes the input requirements of the tested quality metrics and provides their parameter settings as used in this experiment. It also summarizes their main features and underlying principles. For the experiments in the QA 1 and QA 2 tracks, where the input data are the complex-valued holograms, the real and imaginary parts of the holograms were separately tested by all 13 studied methods, though methods 1 to 5 in Table 1 can also be deployed directly on the complex-valued data.
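The complex-domain measures and the real/imaginary averaging scheme described above can be sketched as follows. The NMSE normalization by the squared Frobenius norm and the PSNR peak definition are assumptions about conventions the paper does not spell out.

```python
import numpy as np

def mse_c(ref, dis):
    """MSE computed directly on the complex wavefield."""
    return np.mean(np.abs(ref - dis) ** 2)

def nmse_c(ref, dis):
    """Squared error normalized by the squared Frobenius norm of the
    reference (assumed NMSE convention)."""
    return np.sum(np.abs(ref - dis) ** 2) / np.sum(np.abs(ref) ** 2)

def psnr_c(ref, dis, peak=None):
    """PSNR on complex data; peak = max reference magnitude (assumption)."""
    peak = np.abs(ref).max() if peak is None else peak
    return 10.0 * np.log10(peak ** 2 / mse_c(ref, dis))

def real_imag_mean_score(metric, ref, dis):
    """Apply a real-valued IQM to the real and imaginary parts
    separately and average the two scores into one prediction."""
    return 0.5 * (metric(ref.real, dis.real) + metric(ref.imag, dis.imag))

ref = np.array([[1 + 1j, 2 + 0j], [0 + 0j, 0 + 1j]])
dis = ref + 0.1  # a small, purely real perturbation
```

Any of the real-valued IQMs listed above can be slotted into `real_imag_mean_score` in place of the toy metric used in the test.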

Statistical analysis
In this experiment, we follow the guidelines of the Video Quality Experts Group (VQEG) [69] for evaluating the predictive performance of the tested IQMs. In this regard, three evaluation criteria are considered, namely: prediction monotonicity, accuracy, and consistency.
The Spearman Rank-order Correlation Coefficient (SROCC) and Kendall's tau Rank-Order Correlation (KROC) are utilized to measure the strength of the IQMs in predicting the rank-ordering of the MOS. Next, we utilize the Pearson Correlation Coefficient (PCC) to measure the linear correlation between the MOS and the predicted scores.
The score scales of several IQMs differ from that of the MOS, and their score functions have a non-linear behaviour. To compensate, it is recommended in [69] to fit a logistic function to the predicted scores, with the constraint of being monotonic over the fitted interval. Afterwards, measuring the PCC and the Root Mean Squared Error (RMSE) between the MOS and the fitted scores provides an estimate of the prediction accuracy of the tested IQMs. In this experiment, we utilized a 4-parameter logistic function, as recently used in [70], which automatically guarantees the monotone behaviour of the fitted function. Hereafter, the PCC measured before and after the logistic regression is referred to as PCC NoFit and PCC Fitted, respectively. The recommended measure of prediction consistency is the outlier ratio. It is simply calculated as the number of fitted predictions falling outside the 95% confidence intervals of the MOS, divided by the total number of MOS.
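The fitting and the correlation criteria can be sketched as below. The particular 4-parameter logistic form and the initial guesses are assumptions (a common monotone parameterization), not necessarily the exact function of [70].

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr, kendalltau

def logistic4(s, a, b, c, d):
    """Monotone 4-parameter logistic mapping IQM scores to the MOS scale."""
    return (a - b) / (1.0 + np.exp(-(s - c) / d)) + b

def evaluate_metric(scores, mos):
    """VQEG-style criteria sketch; the initial guesses p0 are assumptions."""
    p0 = [mos.max(), mos.min(), scores.mean(), scores.std() or 1.0]
    params, _ = curve_fit(logistic4, scores, mos, p0=p0, maxfev=10000)
    fitted = logistic4(scores, *params)
    return {
        "SROCC": spearmanr(scores, mos)[0],
        "KROC": kendalltau(scores, mos)[0],
        "PCC_NoFit": pearsonr(scores, mos)[0],
        "PCC_Fitted": pearsonr(fitted, mos)[0],
        "RMSE": float(np.sqrt(np.mean((fitted - mos) ** 2))),
    }

# Synthetic example: IQM scores mapped to MOS through a noisy logistic
rng = np.random.default_rng(3)
scores = np.linspace(0.2, 0.9, 24)
mos = 1.0 + 4.0 / (1.0 + np.exp(-(scores - 0.55) / 0.1)) + rng.normal(0, 0.05, 24)
stats = evaluate_metric(scores, mos)
```

PCC Fitted and RMSE are computed on the fitted scores, while SROCC and KROC are invariant to the monotone fit and so are computed on the raw scores.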
Additionally, we perform a statistical significance test to determine whether the difference between the performance of two quality metrics (here represented by the absolute differences between the fitted scores and the MOS) is statistically significant. A two-sided t-test is performed [71], where the null hypothesis is that there is no difference between the performance of the quality metrics, against the alternative hypothesis that the difference in performance is significant. The null hypothesis is rejected at the 5% significance level.
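The significance test can be sketched as follows. Using a paired t-test is an assumption made here because both metrics are evaluated on the same stimuli, so their absolute residuals are naturally paired; the paper only specifies a two-sided t-test at the 5% level.

```python
import numpy as np
from scipy.stats import ttest_rel

def significantly_different(fitted_a, fitted_b, mos, alpha=0.05):
    """Two-sided paired t-test on the absolute prediction residuals
    |fitted - MOS| of two quality metrics."""
    res_a = np.abs(fitted_a - mos)
    res_b = np.abs(fitted_b - mos)
    return bool(ttest_rel(res_a, res_b).pvalue < alpha)

# Synthetic example: one accurate and one inaccurate metric on 96 stimuli
rng = np.random.default_rng(4)
mos = rng.uniform(1, 5, 96)
good = mos + rng.normal(0, 0.05, 96)   # accurate fitted predictions
poor = mos + rng.normal(0, 1.0, 96)    # inaccurate fitted predictions
```

A True outcome corresponds to rejecting the null hypothesis that the two metrics perform equally well.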

Evaluation of quality metrics in hologram plane
In this section, we discuss the outcome of the first test track (QA 1), referring to the evaluation of the performance of the studied IQMs when compressing Fourier holograms in the hologram plane, and of the second test track (QA 2), which performs this test after converting the reference and decoded Fourier holograms to Fresnel holograms. Following the procedure depicted in Fig. 2, for every complex-valued hologram, the same synthetic aperture used in [27] to extract the centre and right-corner views for display and MOS collection was utilized to select exactly the same data from the original hologram and the corresponding compressed versions. Note that the compression was applied to the real and imaginary parts of the complete holograms before the synthetic apertures were applied, not vice versa. Thereafter, the algebraic components of those cropped compressed holograms were compared to their non-compressed unsigned 8-bit versions, i.e. a cropped part of the input data to the encoder was compared to the same cropped part of the output of the decoder. The obtained results are compared to the MOS for the corresponding perspective, averaged over the focal distances. In order to adhere to the page limits, only the overall statistics across all perspectives are presented, not the per-view evaluations.

QA 1 -Evaluation on Fourier Holograms
The overall results of our statistical analysis for this experimental track are presented in Tab. 2. The results are shown for each set of MOS obtained from the subjective tests on the holographic display (MOS OPT), the light field display (MOS LF) and the regular 2D display (MOS 2D).
A quick look at the table makes it evident that SSIM ranks best on almost all evaluation criteria, independent of which MOS it is compared to. Given that the majority of the tested IQMs were previously shown to be superior to SSIM, at least for classical digital images, such a good performance of SSIM here was not expected. In digital imaging, SSIM is designed to measure the visual similarity between the structural information of the compared pair of images. However, in the hologram domain in particular, there is no specific structure in the data, and the holograms mostly appear as noise. So how is SSIM able to make such an accurate prediction, both correctly ranking the distorted holograms (high SROCC) and producing scores linearly correlated with the MOS (high PCC)?
To answer this question, we took the CG-Ball hologram as an example and calculated the SSIM quality map for the real and imaginary parts of the compressed holograms separately. Fig. 4 shows these quality maps. Due to the resemblance of the quality maps for the real and imaginary parts, we only show the results for the real parts. The top row shows the results for CG-Ball compressed with HEVC in intra mode. While not directly visible in the compressed data itself, a certain grid artifact (related to the boundaries of the HEVC code blocks) appears in the SSIM quality map, where the data is normalized based on the luminance and the standard deviation of the pixel values (and compared to the non-compressed data using the SSIM formula). The second row depicts the quality maps for the hologram compressed with JPEG 2000. Here, heavy blocking artifacts also show up in the SSIM quality maps, which otherwise are not clearly distinguishable when the compressed data is observed directly. These artifacts can at least partially be related to what appears after reconstruction as aliasing artifacts (see section 2.1.1). In the third row, where the quality maps for WAC are shown, no structured compression artifacts are found. However, just as with the other two encoders, SSIM is able to correctly distinguish the compression levels by registering the increases or decreases in the magnitude of similarity (i.e. the changes in brightness observed when passing from one quality map to another; see the brightness of Fig. 4(a) to (d) or Fig. 4(i) to (l), and also compare Fig. 4(i) to (e) and (a)).
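Per-component SSIM quality maps of the kind discussed above can be produced as sketched below. The synthetic hologram component and the artificial block-grid degradation are stand-ins for the real CG-Ball data, which is not reproduced here.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Stand-in for the 8-bit real part of a reference hologram: noise-like data
rng = np.random.default_rng(5)
ref_real = rng.integers(0, 256, (128, 128)).astype(np.uint8)

# Crude stand-in for block-coding artifacts: zero out every 8th row
dec_real = ref_real.copy()
dec_real[::8, :] = 0

# Full SSIM map: local similarity values reveal the structured artifact
score, ssim_map = structural_similarity(
    ref_real, dec_real, data_range=255, full=True
)
```

Averaging `ssim_map` yields the scalar SSIM score, while displaying it as an image exposes grid-like artifacts that are invisible in the noise-like hologram component itself.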
Overall, it appears that, by amplifying the compression artifacts which are otherwise wrapped up and masked by the heavily noisy environment of the hologram (i.e. the interference pattern), SSIM is able to rather easily predict the relative quality score and rank of the decoded holograms without any explicit knowledge of the actual appearance of the objects after reconstruction.
Another observation based on the results of Tab. 2 is that the multi-scale metrics generally perform worse than single-scale methods (such as SSIM and PSNR). This is probably due to the weights assigned to the scores of each scale in the multi-scale metrics. These weights are based on characteristics of the human visual system (HVS) and psychophysical measurements of digital images, and might not be relevant to complex holographic data before reconstruction. However, it must be noted that even the predictions of the worst metrics in each category do not fail completely. For example, the UQI metric, which achieves the lowest correlation results, still has an average SROCC of 0.8.
To better understand the significance of the SSIM performance in relation to the other metrics, we provide 3 significance tables, one per set of MOS, in Fig. 5. It can be seen that the performance of SSIM compared to IQMs like MSE, NMSE, PSNR, GMSD, and NLPD is not significantly better, while compared to the other metrics it shows a clear superiority. For the MOS LF scores shown in Fig. 5(b), the results are even closer, and most of the quality metrics are on par with respect to the criterion explained in section 2.5. The reason for these results could be related to the nature of Fourier holography. The object-field wavefronts captured by a Fourier hologram are not convergent at any finite distance. Thus, we may interpret them as being convergent at a point at infinity. Propagation over infinite distances is described by Fraunhofer propagation and essentially reduces to a Fourier transform. Hence, when analyzing Fourier holograms, we essentially study the Fourier domain of the in-focus light field of an object. Thus, the metrics which relate their local (e.g. SSIM, GMSD) or global error measurements (e.g. PSNR, MSE) to the perceptual quality often perform better than metrics which rely on internal transformations (e.g. SSRM has an internal Fourier transform which, when applied to a Fourier hologram, brings the data back to the object plane).

QA 2 -Evaluation on Fresnel holograms
The next test relates to the synthesized Fresnel holograms. The same analysis process as for the Fourier holograms in the previous section is deployed. The Fourier-to-Fresnel hologram conversion was detailed in section 2.3. While the space-frequency behaviour of the holograms changed drastically (see Fig. 3), the data still represents wavefields with statistics very different from natural images.
Interestingly, VIFp shows the best performance across all 3 sets of MOS in Tab. 3, although its rather lower PCC NoFit values indicate that its behaviour with respect to the MOS is less linear than that of competitors like SSIM, which follows closely behind VIFp. Another point to be noticed is the rather large gap between the top- and worst-performing metrics with respect to each criterion. This gap becomes more evident in the significance tables of Fig. 6. It appears that all encoders, especially HEVC in intra mode and JPEG 2000, create particular compression artifacts, which clearly intensify when operating at lower bitrates. This creates a distinctive gradient over the quality maps, enabling SSIM to correctly rank the holograms and predict their relative quality scores.
Tab. 2: Statistical evaluation of quality metrics for the Fourier holograms, rated after hologram decoding, thus before rendering. The statistics based on the MOS scores obtained from the optical holographic display (OPT), the light field display (LF) and the regular 2D display (2D) and the IQM scores are shown separately. Quality metrics directly evaluated on the complex-valued data have "C" as a postfix.

Here, it is clear that the top 3 quality metrics, namely VIFp, UQI and SSIM, perform rather similarly, while their predictions are significantly better than those of the others.

Evaluation on reconstructed holograms
In this section we provide the evaluation results of the experimental tracks QA 3 and QA 4, i.e. the evaluations after reconstruction of the Fourier holograms and after reconstruction plus speckle denoising. Here, we used the reconstructions of the centre and right-corner views at different focal distances, as provided by the HoloDB. These reconstructions were generated with the same synthetic aperture as used in the previous experimental tracks, as well as in [27] to obtain the MOS. Only the absolute amplitude of the reconstructed wavefield is examined by the IQMs, as it is the principal part of the light sensed by our eyes and the basis of the MOS scores. The predictions and the MOS scores for all views and focal depths per rate point and codec are averaged to facilitate a direct comparison of IQM performances before and after reconstruction. Also, as in the previous test tracks QA 1 and QA 2, only the overall statistics are presented.
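The scoring protocol described above (amplitude extraction, quantization, per-view scoring, averaging) can be sketched as follows, here with PSNR as the example IQM. The function names and the 8-bit quantization step are our assumptions, not the paper's exact pipeline:

```python
import numpy as np

def psnr(ref, dist, peak=255.0):
    # Standard PSNR on real-valued images
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def score_reconstructions(ref_fields, dist_fields, peak=255.0):
    # Average one IQM over several reconstructed views / focal depths.
    # Only the quantized amplitude of each complex wavefield is scored,
    # mirroring the protocol described in the text (a sketch).
    scores = []
    for r, d in zip(ref_fields, dist_fields):
        ra = np.round(np.abs(r) / np.abs(r).max() * peak)
        da = np.round(np.abs(d) / np.abs(d).max() * peak)
        scores.append(psnr(ra, da, peak))
    return float(np.mean(scores))
```

Averaging over views and depths yields one prediction per rate point and codec, directly comparable with the MOS aggregation.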

QA 3 -Evaluation on reconstructed Fourier holograms
Tab. 4 provides the statistical results for the IQMs, measured immediately after the reconstruction process, with respect to the measured MOS. It is interesting to note that at this point the reconstructions are much more similar to natural images. However, their statistical properties in general do not exactly follow those of natural imagery: the reconstructions are, for example, contaminated by speckle noise and contain out-of-focus objects, as explained in section 1. Nonetheless, the IQMs are expected to predict perceptual quality much better than prior to reconstruction, since most of them are optimized for natural images.
For the majority of the tested IQMs we find a very competitive performance, with the top ranks shared among the MSE family (MSE, NMSE, and PSNR) as well as the SSRM and SSRMt measures. However, the SSIM, UQI, and VIFp are this time ranked worst, in strong contrast to their performance before rendering. The significance tables in Fig. 7 clearly emphasize the gap between these three IQMs plus FSIM and the rest, within which all candidates perform very similarly.

QA 4 -Evaluation after speckle denoising of reconstructed Fourier holograms
In this test track we address the concern regarding the potential effect of speckle noise on the IQM performances. To this end, we denoised the reconstructions via a two-dimensional Wiener filter, using the same filter parameters for the reconstructions of the reference and the compressed holograms. Fig. 9 demonstrates an exemplary case before and after denoising. Tab. 5 presents the evaluation results for the IQMs tested on the denoised reconstructions. With the denoised reconstructions being much more similar to natural images, the Fourier transform-based SSRM once more dominates the top spots w.r.t. most of the evaluation criteria, closely followed by its twin version, the SSRMt, and the MSE family (NMSE, MSE, and PSNR). When compared to the results of Tab. 4, we notice that on the one side the top performing IQMs before denoising do not experience a significant improvement or degradation, while on the other side almost all of the remaining IQMs clearly step up, closing the gap. Only the NLPD can be considered an outlier, as it significantly drops in performance after the denoising step. One reason for this can be the sensitivity of the NLPD to modifications of the image gradient, which is exactly what a Wiener filter would flatten out, hence potentially directly impacting the prediction performance of NLPD; the exact reason for such behaviour is, however, not known to us. The statistical significance tables in Fig. 8 also corroborate our observations.

Tab. 3: Statistical evaluation of quality metrics for the synthesized Fresnel holograms, rated before hologram reconstruction. The statistics based on the MOS scores obtained from the optical holographic display (OPT), the light field display (LF) and the regular 2D display (2D) and the IQM scores are shown separately. Quality metrics directly evaluated on the complex-valued data are prefixed by "C".
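A denoising step of this kind can be sketched with SciPy's adaptive 2-D Wiener filter. The window size below is a placeholder, not the paper's actual parameter; what matters, as noted above, is that the identical filter is applied to the reference and the compressed reconstructions:

```python
import numpy as np
from scipy.signal import wiener

def despeckle(amplitude, window=5):
    # Adaptive 2-D Wiener filtering of a reconstructed amplitude
    # image (scipy's implementation; the paper's exact parameters are
    # not reproduced here). Use the same window for reference and
    # distorted reconstructions so the IQMs compare like with like.
    return wiener(amplitude.astype(np.float64), mysize=window)
```

The filter estimates a local mean and variance per window and shrinks each pixel toward the local mean, which suppresses speckle but also flattens image gradients, consistent with the NLPD behaviour discussed above.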

Global analysis
After reporting the results in each track, in this section a cross-track analysis of the IQM performances is provided. To do so, we present the overall scatter plots of quality predictions for each IQM in Fig. 10. For each IQM, the quality predictions for QA 1 to QA 4 vs. the MOS of the holographic setup are shown together. The scattered points from each test track are accompanied by a logistic fit curve, calculated as explained in section 2.5. The results from QA 1 to QA 4 are colour coded in blue, orange, green, and yellow, respectively. Here, in the cases of MSE (d), NMSE (g), PSNR (h), and VIFp (m), the graphs reveal a distinctive shift when the measurements are repeated at different steps of the holographic processing pipeline, i.e. these IQMs show a similar behaviour overall but in a shifted score range for each test track. For PSNR, MSE, and VIFp, these data-range shifts occur in exactly the same order as the test-track progression. While these plots show that measures like the MSE family in general exhibit a similar behaviour independent of the point in the processing pipeline where they are calculated, the range shifts imply that their values are strictly comparable only with measurements performed at the same place. As an example, a PSNR of 30 dB does not necessarily correspond to the same visual quality when measured once on the hologram and once on the denoised reconstruction. For the IQMs designed to always provide a bounded measurement this is usually not an issue, though their overall behaviour compared to the MOS can drastically change depending on the QA step where they are used for quality prediction, e.g. in the case of SSRM and SSIM. Overall, there seems to be a trade-off between generality and comparability.
One should decide between having an unbounded measurement, exclusively comparable with measurements at the same test point but with stable behaviour across the processing pipeline, and having a bounded measurement whose behaviour varies depending on where it is used to predict the visual quality.
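The logistic fit mapping raw IQM scores onto the MOS scale can be sketched as below. The four-parameter monotonic logistic is a common choice for this purpose, but the exact functional form of section 2.5 is an assumption here:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, b1, b2, b3, b4):
    # Monotonic 4-parameter logistic mapping raw IQM scores to the
    # MOS scale before fitted correlations are computed (the exact
    # form used in the paper's section 2.5 may differ).
    return b2 + (b1 - b2) / (1.0 + np.exp(-(x - b3) / (np.abs(b4) + 1e-12)))

def fit_iqm_to_mos(iqm, mos):
    # Least-squares fit of the logistic parameters; simple data-driven
    # initial guesses keep the optimisation well behaved.
    iqm, mos = np.asarray(iqm, float), np.asarray(mos, float)
    p0 = [mos.max(), mos.min(), iqm.mean(), iqm.std() + 1e-3]
    params, _ = curve_fit(logistic, iqm, mos, p0=p0, maxfev=20000)
    return params
```

After such a fit, PCC Fitted and RMSE are computed between `logistic(iqm, *params)` and the MOS, which removes the per-track range shifts discussed above without changing the rank order of the predictions.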
Next, we want a simple quantitative evaluation of the performance of each IQM w.r.t. all of the benchmarked criteria and all three sets of MOS. To do so, within each row of Tab. 2-5 we ranked the IQMs (e.g. in Tab. 2, based on the SROCC criterion, SSIM has the lowest, i.e. best, rank and so receives the highest score of 15 out of the 15 tested IQMs, while UQI receives a score of 1 for being the worst IQM w.r.t. the SROCC). Thus, for each table, every IQM receives 18 ranking scores for the 6 evaluation criteria (i.e. SROCC, KRCC, PCC NoFit, PCC Fitted, RMSE and Outlier Ratio) and all 3 sets of MOS. Then, we calculated a column-wise sum of these ranking scores per IQM. That way, we obtained a compact indicator of the IQM performance per test track, for which the results are depicted in the bar charts of Fig. 11. We could have continued summing the ranking scores across all 4 test tracks, but due to the significant changes in the rankings of the IQMs between test tracks, we prefer to provide the ranking scores for each test track separately. The bar charts of Fig. 11 reveal the top performing IQMs in each track. Moreover, the ranking scores of the IQMs help to see their relative performance compared to each other. For example, in Fig. 11a, the plateau of the ranking scores from FSIM to VIFp shows that these IQMs are, w.r.t. all evaluation criteria, almost equally unreliable compared to the top three IQMs.
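The rank aggregation described above can be sketched as follows. This is our reconstruction of the procedure, not the authors' code; note that "higher is better" must be flagged per criterion, since RMSE and Outlier Ratio rank in the opposite direction to the correlation measures:

```python
import numpy as np

def ranking_scores(table, higher_is_better):
    # `table` has one row per (criterion, MOS set) pair and one
    # column per IQM. Per row, the best IQM receives the highest
    # score (n_iqms) and the worst receives 1; scores are then summed
    # column-wise, as done for the bar charts of Fig. 11 (a sketch).
    table = np.asarray(table, dtype=float)
    n_rows, n_iqms = table.shape
    scores = np.zeros(n_iqms)
    for row, hib in zip(table, higher_is_better):
        order = np.argsort(row if hib else -row)  # worst ... best
        rank = np.empty(n_iqms, dtype=int)
        rank[order] = np.arange(1, n_iqms + 1)    # 1 = worst
        scores += rank
    return scores
```

With 6 criteria and 3 MOS sets per table, each IQM accumulates 18 rank scores per test track before the column-wise sum.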

Final verdict
Our experiments provide for the first time a full view of the performance of the available quality measures for the case of digital holography w.r.t. MOS. We visualized the behaviour of each method when tested across the four test tracks (Fig. 10) and quantified the performance of these quality measures based on several evaluation criteria (see Fig. 11). Considering all aspects tested in this research, we would like to conclude the analysis of the tested IQMs by providing a usage guideline to help interested users choose the most appropriate method for holographic VQA tasks until more hologram-oriented quality metrics are designed. For each testing track, Tab. 6 provides our recommendations, categorized into three colour-coded groups: green: usage is advised; light-blue: neutral, or the results were not conclusive enough to provide a strong recommendation; and red: the application is discouraged. According to our experimental results, MSE and NMSE, whose measurements are purely based on signal fidelity and do not take into account any perceptual aspect, did not show any major drawback; in fact, they are among the top three recommended methods, especially after the reconstruction. To a lesser degree, PSNR follows the same performance across all testing tracks. This means that its additional steps compared to the two simpler methods (i.e. the fractional measurement of error relative to the maximum possible signal value and the applied logarithmic scale) have a negative impact on the quality predictions w.r.t. the MOS.

Tab. 5: Statistical evaluation of quality metrics for the reconstructed Fourier holograms after speckle noise removal. The statistics based on the MOS scores obtained from the optical holographic display (OPT), the light field display (LF) and the regular 2D display (2D) and the IQM scores are shown separately.
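The relationship between the three fidelity measures, including the two "additional steps" of PSNR mentioned above, is compact enough to state in code. NMSE is taken here in its common form, error energy normalised by reference signal energy, which is an assumption about the paper's exact definition:

```python
import numpy as np

def mse(ref, dist):
    # Plain mean squared error between two images
    ref = np.asarray(ref, dtype=np.float64)
    dist = np.asarray(dist, dtype=np.float64)
    return np.mean((ref - dist) ** 2)

def nmse(ref, dist):
    # MSE normalised by the reference signal energy (one common
    # definition; the paper's exact normalisation is not given here)
    ref = np.asarray(ref, dtype=np.float64)
    return mse(ref, dist) / np.mean(ref ** 2)

def psnr(ref, dist, peak=255.0):
    # PSNR adds two steps on top of MSE: normalisation by the peak
    # signal value and a logarithmic (dB) scale
    m = mse(ref, dist)
    return np.inf if m == 0 else 10.0 * np.log10(peak ** 2 / m)
```

Since PSNR is a strictly monotonic function of MSE, the three measures always rank a fixed set of distorted images identically; differences in correlation with MOS arise only through the non-linear rescaling of the score range.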

The SSIM and SSRM appear to demonstrate the most dramatic behaviour across all tracks. As mentioned earlier, while SSIM is found to be the top performing IQM prior to reconstruction and the worst after it, SSRM performs exactly the opposite: poorly on holograms, jumping to the top of the list when tested on the reconstructed scenes. The SSRMt demonstrates behaviour similar to the vanilla SSRM, which suggests a deeper connection between their identical operational core and their performance. Regarding the other members of the SSIM family (e.g. MS-SSIM, IWSSIM), it appears that the application of more complex perceptual models degraded their performance in the hologram domain. After the reconstruction, their multi-scale measurements could only improve their performance over the default SSIM, but not enough to make them competitive. The case of IWSSIM can also be considered in conjunction with the NLPD: both methods rely on a multi-resolution decomposition using a Laplacian pyramid, and both present a similar performance across our evaluation pipeline, suggesting that such a decomposition may not necessarily be the most effective approach for VQA in digital holography. The other SSIM family member, UQI, which does not benefit from regularization in its operational core, is expected to perform worse than the more mathematically stable SSIM. The lack of regularization in this metric is known to cause instability, especially in cases where the sum of squared means or the sum of variances of the reference and distorted data gets close to zero. For this reason, we discourage its usage in favour of the SSIM, and in the case of Fresnel holograms, where it shows a competitive performance, we recommend utilizing it alongside secondary quality prediction methods.
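The instability argument for UQI can be made concrete with a single global SSIM-style core (no windowing, a sketch rather than a full SSIM implementation). Setting both constants to zero reduces the expression exactly to the UQI, whose denominator vanishes when the sum of squared means or the sum of variances approaches zero; SSIM's small constants regularise precisely those two terms:

```python
import numpy as np

def ssim_core(x, y, c1=0.0, c2=0.0):
    # Global SSIM/UQI core on two patches. With c1 = c2 = 0 this is
    # the UQI; with the usual SSIM constants the denominator is
    # regularised and can no longer vanish.
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return num / den
```

On a flat, near-zero patch, typical of dark background regions in hologram reconstructions, the UQI form degenerates to 0/0 while the regularised SSIM form stays well defined, which is the practical reason for preferring SSIM over UQI here.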
Also, the image-gradient based methods (e.g. FSIM, GMSD), which are normally able to make very accurate quality predictions on natural imagery, do not show an outstanding performance, suggesting that such approaches may not necessarily be effective for VQA in digital holography. However, the rather stronger performance of GMSD could be related to its use of the standard deviation of the local gradient-magnitude comparisons, which enables a more stable prediction that is less affected by the noisy nature of holograms and the speckle noise.
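GMSD's deviation pooling, the property highlighted above, can be sketched as follows. The Prewitt kernels and the constant follow the original GMSD formulation, but the usual 2x downsampling pre-step is omitted for brevity, so this is a simplified sketch rather than a reference implementation:

```python
import numpy as np
from scipy.ndimage import convolve

def gmsd(ref, dist, c=170.0):
    # Gradient Magnitude Similarity Deviation (simplified sketch):
    # pool the STANDARD DEVIATION of the per-pixel gradient-magnitude
    # similarity map rather than its mean, which the text suggests
    # makes it more robust to speckle.
    hx = np.array([[1 / 3, 0, -1 / 3]] * 3)  # Prewitt, x-direction
    hy = hx.T
    def grad_mag(img):
        gx = convolve(img.astype(np.float64), hx)
        gy = convolve(img.astype(np.float64), hy)
        return np.sqrt(gx ** 2 + gy ** 2)
    mr, md = grad_mag(ref), grad_mag(dist)
    gms = (2 * mr * md + c) / (mr ** 2 + md ** 2 + c)
    return gms.std()
```

Because speckle perturbs gradient magnitudes roughly uniformly across the image, it shifts the mean of the similarity map more than its spread, which is one plausible reading of GMSD's relative robustness here.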
The VIFp, which represents the information-theoretic approach to VQA and has been a long-standing contender in the realm of IQA, also does not achieve reliable performance across all of our experimental tracks; only when the Fresnel holograms are tested does it outperform the others in most cases. This method rests on a number of assumptions, which may or may not hold in the case of holography, causing large fluctuations in its performance. For example, it employs natural image statistics by using a Gaussian Scale Mixture (GSM) to statistically model the wavelet coefficients after applying a steerable pyramid decomposition. This model is expected to deviate substantially from the statistics of holographic signals. However, VIFp also models the distortion channel using attenuating additive noise in the wavelet domain, which in this case can potentially relate well to the highly noisy nature of holograms. On top of that, VIFp assumes that, while passing through the HVS, the uncertainty level for the perception of visual data increases: the wavelet subbands for both the source and distortion channels are subject to extra additive white Gaussian noise which models this increase in uncertainty. Wherever all these assumptions happen to fall in line with the statistics of the tested holograms, a good prediction performance can be foreseen; in most cases, however, it is going to be the other way around. A re-adjustment of these assumptions based on holographic statistical properties may potentially solve the unreliability issue for information-theoretic metrics like VIFp, although, because of the extremely noisy nature of holographic interference fringes and their strong dependence on the properties of the recorded scene, such statistical modelling may not be generalizable. We have recommended the usage of VIFp for the Fresnel holograms in Tab. 6.
However, we strongly advise using this metric alongside the other recommended IQMs, due to the instability caused by the reasons explained above.
It should be noted that these recommendations hold only within the bounds of the current experiment (i.e. as long as the degradations are limited to compression artifacts, and if the same procedure as in this research is followed for processing, rendering and quality assessment of the holograms). Moreover, being the best among themselves does not necessarily make these IQMs reliably accurate in predicting the visual quality of holograms. This becomes clearer when the correlations achieved w.r.t. MOS in this research are compared to those of the same IQMs tested on digital images. Especially the correlations achieved before the reconstruction (reported in Tab. 2 and Tab. 3) still leave much room for improvement. Furthermore, our results showed that careful consideration is needed when choosing the right quality prediction method among the available options, based on the application and the characteristics of the holograms at the point where the measurement occurs. Therefore, in spite of the provided recommendations, we emphasize that none of the available options are at a level that confidently alleviates the strong need for the design and development of dedicated quality assessment algorithms for holographic data.

Conclusions
In this research, for the first time, we utilized 96 holograms of HoloDB to make a systematic performance evaluation of the most viable options for the quality assessment of digital holograms. The results of 4 separate sets of evaluations are provided. We compared the reference holograms with their distorted versions: before rendering, on the real and imaginary parts of the complex wavefield; after converting the Fourier holograms to regular Fresnel holograms; after rendering, on the quantized amplitude of the reconstructed data; and after speckle denoising. For every experimental track, the quality metric predictions are rigorously compared to three sets of MOS which were previously obtained via subjective experiments. A statistical analysis of the IQM performances and a discussion of the behaviour of the outstanding methods are presented. Finally, their overall performance based on all of the utilized evaluation criteria and all three sets of MOS is summarized per test track to introduce the best performing quality metrics for each testing track. All aspects considered, it turns out that while for each test track a couple of quality metrics present a significantly correlated performance compared to the multiple sets of available MOS, none of them shows a consistently high performance across all four test tracks. This emphasizes their sensitivity to the characteristics of the input and to the position in the processing pipeline where the quality predictions are obtained. It also reveals once more the dire need to design efficient quality metrics for holographic content. We hope that this research provides a thorough understanding of the complexities involved in the VQA of digital holograms and consequently acts as a first systematic step toward designing advanced perceptual quality prediction methods for digital holograms.