Light-field compression using a pair of steps and depth estimation

Advanced handheld plenoptic cameras are being rapidly developed to capture information about light fields (LFs) from the 3D world. Rich LF data can be used to develop dense sub-aperture images (SAIs) that can provide a more immersive experience for users. Unlike conventional 2D images, 4D SAIs contain both the positional and directional information of light rays; however, the practical applications of handheld plenoptic cameras are limited by the huge volume of data required to capture this information. Therefore, an efficient LF compression method is vital for further application of the cameras. To this end, the pair of steps and depth estimation (PoS&DE) method is proposed in this paper, and the multiview video and depth (MVD) coding structure is used to relieve the LF coding burden. More specifically, a precise depth-estimation approach is presented for SAIs based on a cost function, and an SAI-guided depth optimization algorithm is designed to refine the initial depth map based on the pixel variation tendency. Meanwhile, to balance running time, intermediate SAI synthesis quality and coding bitrate, the pair of steps, namely the key-SAI selection step and the cost-computation step, is set via extensive statistical experiments. In this way, only a limited number of optimally selected SAIs and their corresponding depth maps must be encoded. The experimental results demonstrate that the proposed LF compression solution using PoS&DE obtains satisfactory coding performance.


Introduction
The rapid development of computational photography has brought great advances to immersive 3D applications. As a computational photography technique, a light field (LF) represents all the light information in a scene and thus provides a more immersive experience for audiences [1,2]. Because conventional imaging technologies project a 3D scene onto a 2D plane, stereoscopic sensation is lost. In contrast to conventional cameras, plenoptic cameras have a micro-lens array in front of their sensor, so that both the spatial and angular information of light rays can be captured simultaneously in one shot [3]. An image captured by a plenoptic camera is called a light-field image (LFI), as shown in Fig. 1. As seen, an LFI is composed of a large number of hexagonal blocks (also called micro-images, MIs). Using LFIs, focal plane shifting and dense viewpoint image acquisition can be implemented in postprocessing, and even a viewpoint at a new position can be synthesized [4]. To capture additional data about the target, light-field hardware has been improved using special materials [5]. Owing to the additional information within the LFI, it can help solve many traditional vision problems, such as image segmentation [6], saliency detection [7], depth estimation [8][9][10][11], and video stabilization [12] after geometric calibration [13]. Another prevalent application of the light field is microscopy [14][15][16], where holographic visualization can explore the potential of the light field in medical and biological research. Such equipment increases the field of view but captures many semi-transparent objects, for which the existing pixel-based depth-estimation algorithms do not work. The various transformations of a light-field image are illustrated in Fig. 2.
Fig. 2 shows that the raw LFI is composed of MIs, that the sub-aperture image (SAI) is derived by rearranging the co-located pixels of each MI, and that the epipolar plane image (EPI) is generated by stacking and cutting SAIs. Additionally, SAIs can provide an immersive experience with the naked eye [17]. A raw LFI with MIs obstructs the elimination of spatial correlation; it can be observed in Fig. 2 that the pixels of the same MI contribute to different SAIs. Consequently, LFIs represented as SAIs have been a focus of investigation, as their angular correlation is easier to exploit. However, the huge number of SAIs hinders the practical storage and transmission of LFIs, and many technological challenges remain associated with the huge amount of data. To this end, we previously proposed an LF compression method in [18,19]. This article, based on our earlier work, thoroughly analyses all combinations of the depth-computation step and the sparse SAI-sampling step to balance computational complexity against coding efficiency. Although these improvements may seem modest, they positively affect depth estimation, bitrate reduction and reconstruction performance.
With regard to LF compression, many solutions have been proposed recently. As shown in Fig. 3, these methods are usually classified into two categories: those that compress raw LFIs and those that encode SAIs extracted from raw LFIs. Moreover, as codecs have improved, LF coding approaches have evolved along with their redundancy-elimination tools. Because there is incoherence among MIs in raw LFIs, the high-efficiency video coding (HEVC) encoder is frequently used for such LFIs. The HEVC codec contains many flexible and novel prediction tools, and the high-order prediction model proposed in [20] takes advantage of the HEVC encoder, with geometric transformations of up to eight degrees of freedom.
However, MIs in raw LFIs are difficult to handle because of their enormous quantity and extraordinarily low resolution. Some other compression structures have been proposed to improve the compression ratio for raw LFIs using the video codec [21,22], but the numerous MIs limit their ability to exploit spatial correlation. Because raw LFIs with honeycomb arrangements limit the standard encoder [23], an MI reshaping method is proposed for raw LFIs in [24,25] for the HEVC encoder. Nevertheless, all these compression solutions require the transmission of camera parameters for further applications, which increases the coding burden on the compressed data stream [26].

Related works
LF data formats and coding conditions can strongly impact compression performance, and there is much room for improvement in LFI compression. Another representation of LFIs is the dense-view SAI, which can be applied for autostereoscopic display and has very strong correlation. Consequently, an increasing number of researchers have focused on SAI compression. Two LF compression schemes that obtain a high compression ratio and quality, based on video coding techniques and disparity compensated prediction, respectively, have been proposed in [27]. As reported in [28], each SAI can be considered a random linear combination of angular viewpoints. Consequently, the method in [28] reconstructs all SAIs using only a few SAIs, thus removing the high correlation. The SAI streaming compression scheme proposed in [29] adopts two SAI scan orders (i.e., line scan mapping and rotation scan mapping) to improve compression efficiency. To find an optimal scan order, homography-based low-rank approximation is presented in [30,31]; this method can adaptively adjust the scan order according to LFI content. Recently, the optimized SAI arrangement based on content correlation introduced in [32] achieved bitrate reduction. All the above approaches convert SAIs into a single-viewpoint pseudo-sequence, which is useful for coding using the video encoder; however, with the increase in the number of SAIs in LFIs, the compression efficiency decreases significantly.
Because there is a high cross-correlation among SAIs, both the horizontal and vertical disparities can be utilized to compress LFIs. Multiview video is transformed using SAIs, and the possibility for efficient SAI compression via multiview video coding (MVC) is verified in [33]. A similar compression algorithm is proposed in [34] to encode LFIs using MVC; this algorithm achieves higher compression performance than that achieved by converting SAIs into a single-view video. Because LFIs encoded by these compression methods are computer-generated, the perfect disparities make efficient inter-view prediction possible. A scheme based on a pseudo-sequence for LFI compression is designed in [35]. Specifically, this method decomposes raw LFIs into multiple SAIs, assigns each SAI a layer, and then encodes all SAIs using a symmetric 2D hierarchical scheme in accordance with the assigned layers. To further improve coding efficiency, the 2D hierarchical reference structure proposed in [36] supplies a motion vector scaling algorithm based on spatial coordinates to enhance the precision of inter-view prediction.
Although compressing SAIs using cross correlation can improve compression efficiency, it still has high computational complexity and requires high bitrates due to the huge number of SAIs in LFIs [37]. To handle this issue, a graph-based representation is presented in [38] to code the LF using geometric information to improve coding performance. However, this method requires extra colour information, which increases the coding burden. In addition, the scalable LF compression scheme described in [39] codes only a subset of SAIs and reconstructs all SAIs using a patch-based restoration method. Moreover, the bi-level view compensation with rate-distortion optimization reported in [40] first classifies all SAIs as either key or non-key SAIs and then performs learning-based angular super-resolution to compensate the non-key SAIs. This approach exploits the intra- and inter-view relationships, but compensation using only super-resolution is unreliable. It should be noted that compression efficiency can be significantly improved using image geometry information, such as disparity and depth maps. However, because this algorithm cannot reduce the number of SAIs, the problem of high bitrates has not yet been fully solved.
Owing to its depth-enhanced format, the advanced multiview video and depth (MVD) map structure can decrease bitrate and save coding time because only a few views and their corresponding depth maps are coded and transmitted, while arbitrary views are rendered from the decoded data [41]. According to the literature, methods have been developed recently to compress SAIs with the depth map or disparity. Similar to [40], an LF codec with disparity-guided sparse coding is proposed in [26] that optimally selects several key structural SAIs; all SAIs are reconstructed from the key SAIs and the associated disparities. The depth-image-based rendering (DIBR) technique is employed in [42] to synthesize non-corner SAIs instead of coding them. Coding only the four corner SAIs decreases bitrate, but such corner SAIs have distortions that degrade the synthesis results. The compression results of these approaches verify their validity and confirm that the quality of the depth map or disparity is vital to such an MVD coding scheme. Therefore, an LFI-compression structure using depth estimation and view synthesis was proposed in our previous work. A prototype is proposed in [18], in which the depth map is computed only from the horizontal EPI and optimized through crude repair of erroneous pixels, so it generates an unsatisfactory depth map. To obtain a more accurate depth map, the depth estimation of the structure has been improved in [19]. Although the number of SAIs selected and coded is small, an intermediate SAI can still be approximated once the depth map is precisely generated, which contributes to low computational complexity and bitrate. Note that this article is based on our earlier work, as described in [18,19], and the novelties of this article are summarized as follows.
(1) This paper estimates depth maps for SAIs according to the angular domain information of the LF. Based on an epipolar plane image (EPI), the pixel displacement in LFIs is fully exploited. A valid cost function is designed that can precisely determine the optimal slope of the epipolar line for each point.
(2) The slope computation procedure generates massive noise points in flat areas, which present a new challenge for depth optimization. Considering the strong relationship between an SAI and its corresponding depth map, an SAI-guided depth map optimization method is presented that can eliminate noise points according to the pixel variation of the associated SAI.
(3) To reduce computational complexity and bitrate while maintaining depth-estimation precision, an appropriate cost-computation step and a key SAI-selection step are determined based on statistical experiments. Because the pair of steps and depth estimation are embedded in the proposed method (PoS&DE), the coding and reconstruction process can be optimized.
(4) Advanced MVD is employed in the proposed compression structure to encode the selected key SAIs. In this way, unselected SAIs can be synthesized at the decoder using DIBR so that no extra information must be sent to the decoder. Additionally, the proposed PoS&DE confirms that the virtual view synthesis solution can significantly improve LFI-compression efficiency.
The rest of this paper is organized as follows. The 4D light-field model is analysed in section 3. Based on the analysis, the proposed PoS&DE is introduced comprehensively in section 4. Section 5 demonstrates and discusses the experimental results. Finally, section 6 concludes this paper.

4D light-field model
From a geometric optics perspective, a light ray in space can be denoted by position (x, y, z), direction (θ, φ), wavelength λ and time t, so that the 7D plenoptic function P(x, y, z, θ, φ, λ, t) is constructed as the light-field model to describe the set of light rays travelling in every direction through every point in 3D space. Although this multidimensional function represents light rays comprehensively, it is difficult to process by computer. Because it is concise and intuitive, the 4D model P(u, v, x, y) is the most common model in computational photography. In the 4D model, (u, v) denotes the angular coordinates and (x, y) denotes the spatial coordinates. A diagram of the 4D light-field model is illustrated in Fig. 4. As seen, the distance between the two planes is represented as f; ∆u and ∆v denote the geometrical distances between two virtual cameras in the horizontal and vertical directions, respectively; and ∆x and ∆y are the distances that the corresponding scene points move in the image in the horizontal and vertical directions, respectively. Therefore, the depth value of each point in every SAI can be derived from the expression $Z = f \cdot r$, where $r = \Delta u / \Delta x$ or $r = \Delta v / \Delta y$ [2].
Owing to their strong mutual correlation, SAIs have been a prevalent topic of study for LF compression. In this section, we propose an improvement to SAI-compression performance based on view synthesis. To reduce computational complexity while keeping satisfactory reconstruction performance, we find a reasonable pair of steps to generate the depth map and encode a small number of selected SAIs. The proposed compression approach encodes only the selected SAIs; the unselected SAIs are synthesized using DIBR. Meanwhile, the MVD coding scheme is used in our proposed method to fully exploit inter-view (inter-SAI) and inter-component correlations. A flowchart of the proposed compression method is shown in Fig. 5.
According to the depth computation expression given in section 3, ratio r plays a crucial role in the depth-estimation process. To obtain ratio r, we use an EPI, which is a well-known concept in the field of multiview computer vision [43]. By gathering SAI pixels at a specific spatial coordinate x (or y) with a fixed angular coordinate u (or v), a horizontal (or vertical) EPI can be generated, as illustrated in Fig. 6.
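The EPI construction can be illustrated with a minimal sketch. It assumes a hypothetical in-memory layout `lf[u][v][y][x]` holding luma values, with u as the horizontal angular index as in the 4D model above; the function names are illustrative and are not taken from the paper.

```python
def horizontal_epi(lf, v_star, y_star):
    """Slice a horizontal EPI: rows follow u, columns follow x.

    The vertical angular index v* and the image row y* are held fixed.
    """
    return [list(lf[u][v_star][y_star]) for u in range(len(lf))]


def vertical_epi(lf, u_star, x_star):
    """Slice a vertical EPI: rows follow v, columns follow y.

    The horizontal angular index u* and the image column x* are held fixed.
    """
    n_v = len(lf[u_star])
    n_y = len(lf[u_star][0])
    return [[lf[u_star][v][y][x_star] for y in range(n_y)]
            for v in range(n_v)]


# Tiny synthetic light field (3 x 3 views of 2 x 4 pixels); each sample
# encodes its own coordinates so that the slicing is easy to inspect.
lf = [[[[1000 * u + 100 * v + 10 * y + x for x in range(4)]
        for y in range(2)]
       for v in range(3)]
      for u in range(3)]
epi_h = horizontal_epi(lf, v_star=1, y_star=0)
epi_v = vertical_epi(lf, u_star=1, x_star=2)
```

Each scene point traces a line across the rows of such a slice, and the slope of that line is what the following depth estimation relies on.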

The proposed SAI-compression method
The epipolar lines in each EPI reveal the spatial structure of a 3D scene [11]. More specifically, the slope can be used to deduce the depth value of an associated point in each SAI using a reformulated equation, as follows:
$$Z = f \tan \alpha_{(i)} \qquad (1)$$
where i is an enumeration value that indicates horizontal or vertical, $\alpha_{(i)}$ specifies the slope of a specific point in the EPI and $\tan \alpha = r$. Therefore, the process to obtain an accurate depth value is to precisely compute all slopes in the horizontal and vertical EPIs.

The optimal slope decision
Following the analysis above, the challenge in obtaining the depth value is to make the best slope decision for each epipolar line in an EPI. To this end, the optimal slope decision model, Eq. 2, is constructed to find the best slope from a candidate set for each point. Because points on an epipolar line are assumed to have nearly the same pixel values, the candidate slope with the lowest matching cost at a point is selected as its optimal slope:
$$\alpha^{*}_{(i)}(p) = \arg\min_{\alpha_{(i)}} C(\alpha_{(i)}, p) \qquad (2)$$
Here, $C(\alpha_{(i)}, p)$ denotes the cost of slope α at point p = (x(y*), v(y*)) (in a horizontal EPI) or point p = (u(x*), y(x*)) (in a vertical EPI). The tested $\alpha_{(i)}$ is selected from the slope candidate set, and $\alpha^{*}_{(i)}(p)$ indicates the best slope determined for point p in a specific EPI. Since the disparity between two adjacent SAIs is very small in an LFI captured by a plenoptic camera, the range of candidate slopes is restricted to −45° to 45°. The cost $C(\alpha_{(i)}, p)$ is computed by Eq. 3.
For simplicity, $V_p$ denotes the luma value of the current point p, and $V^{(\alpha_{(i)})}_{\Delta p}$ represents the luma value of another point on the epipolar line with candidate slope $\alpha_{(i)}$. In addition, $\omega_n$ is a flag that is assigned 0 if the tested point ∆p is invalid and 1 otherwise. Computing all points on the epipolar line exhaustively for matching is time-consuming. Therefore, a proper computation step should be determined to balance computational complexity and estimation accuracy, which is analysed in detail in section 5.
Using the proposed model on the horizontal and vertical EPIs, two preliminary depth maps are produced for each SAI. Because these two depth maps carry the same weight, they are merged into a single depth map by averaging them with equal weights. To suit the MVD coding scheme and maintain the linear structure of the estimated depth map, min-max normalization is adopted in this paper to restrict the depth values to the range from 0 to 255.
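The slope decision can be sketched as follows. This is only an illustration under stated assumptions, not the paper's cost function: the cost is taken to be the mean absolute luma difference over valid points, one angular step is assumed to shift the spatial coordinate by tan α pixels, and `step` plays the role of the cost-computation step. The names `slope_cost` and `best_slope` are hypothetical.

```python
import math


def slope_cost(epi, row, col, alpha_deg, step=1):
    """Assumed cost: mean |V_p - V_dp| over valid points on the line."""
    shift = math.tan(math.radians(alpha_deg))  # pixels per angular step
    v_p = epi[row][col]
    total, valid = 0.0, 0
    for d in range(-row, len(epi) - row):
        if d == 0 or d % step != 0:      # skip p itself and unsampled rows
            continue
        c = round(col + d * shift)
        if 0 <= c < len(epi[0]):         # omega_n = 1 only for valid points
            total += abs(v_p - epi[row + d][c])
            valid += 1
    return total / valid if valid else float("inf")


def best_slope(epi, row, col, candidates, step=1):
    """Pick the candidate slope with the lowest cost (argmin)."""
    return min(candidates,
               key=lambda a: slope_cost(epi, row, col, a, step))


# Synthetic EPI: one bright feature moving one pixel per angular step.
epi = [[100 if c == 2 + r else 0 for c in range(9)] for r in range(5)]
alpha = best_slope(epi, row=2, col=4, candidates=[-45, 0, 45])
```

A larger `step` visits fewer rows of the EPI per candidate, which is the source of the trade-off between running time and depth accuracy discussed in section 5.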
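The merging and normalization step can be sketched in a few lines. The sketch assumes the two preliminary depth maps are equally sized lists of rows; the function name is illustrative.

```python
def merge_and_normalize(depth_h, depth_v):
    """Average two depth maps with equal weights, then map to 0..255."""
    merged = [[0.5 * (a + b) for a, b in zip(row_h, row_v)]
              for row_h, row_v in zip(depth_h, depth_v)]
    lo = min(min(row) for row in merged)
    hi = max(max(row) for row in merged)
    if hi == lo:                       # flat map: avoid division by zero
        return [[0 for _ in row] for row in merged]
    scale = 255.0 / (hi - lo)
    # Min-max normalization is linear, so relative depth order is preserved.
    return [[int(round((z - lo) * scale)) for z in row] for row in merged]


depth_h = [[0.0, 0.0], [2.0, 4.0]]
depth_v = [[0.0, 2.0], [4.0, 6.0]]
depth_8bit = merge_and_normalize(depth_h, depth_v)
```

Because the mapping is linear, the 8-bit map keeps the relative structure of the estimated depth while fitting the sample range expected by a video encoder.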

SAI-guided depth optimization
Homogeneous regions in the EPI, which contain many pixels with almost the same value, lead to miscalculations in the depth map. There is no dividing line in homogeneous areas, and so the value of the computed least cost for one point may be over 2. The proposed cost function fails to select the optimal slope for these points, which are marked as noise points. The noise points therefore need to be eliminated to optimize the depth map.
Many depth optimization algorithms have been proposed in the literature, including the interpolation method [44], Markov random field (MRF) propagation [8], locally linear embedding (LLE) [2] and sub-pixel-based matching (SPM) [45]. However, these algorithms refine the depth map based on its intrinsic characteristics, which is not suitable for viewpoint images with narrow baselines. Therefore, the correlation between the depth map and its associated texture is considered in the proposed optimization approach, and SAIs are regarded as the reference to refine the estimated depth map.
In general, if points are located on a boundary in the depth map, their corresponding points in the texture image are also located on a boundary. For points astride a boundary in the texture image, the associated points in the depth map may have distinct depth values. Meanwhile, in the texture image, the homogeneous regions contain numerous similar pixels whose co-located points in the depth map vary regularly. Based on these observations, the noise points in the initial depth map can be eliminated according to the variation tendency of their corresponding points in the texture image. To this end, a weighted mean filter is designed in this work that refines the initial depth map via the generated parameters.
To repair an erroneous depth value $Z_e$, the proposed weighted mean filter uses a patch. In Eq. 4, $P_e$ is a 3 × 3 patch centred on a specific noise point, $|P_e|$ represents the cardinality of $P_e$, $Z_r$ denotes the depth value of a neighbouring point in patch $P_e$ (excluding the noise point), and $\gamma_r$ is the tendency coefficient, which reflects the pixel variation tendency of the associated SAI. The tendency coefficient is obtained automatically via Eq. 5, where $V_e$ indicates the luma value of the pixel in the SAI corresponding to the noise point in the depth map, and $V_r$ is the luma value of its adjacent point. In this way, tendency coefficient $\gamma_r$ is assigned according to the variation tendency of the associated SAI. For validity, the tendency coefficient must be constrained, and ε denotes the offset that controls smoothness; the larger this offset, the stronger the smoothing. To balance the smoothness, offset ε is empirically set to 0.012 in this paper. Because SAIs are regarded as the reference images that guide depth optimization, noise points can be removed based primarily on the characteristics of an SAI. Meanwhile, the final depth map maintains good consistency.
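A minimal sketch of an SAI-guided weighted mean filter is given below. The weighting is an assumption made for illustration and is not the paper's Eq. 4 or Eq. 5: each neighbour is weighted by $\gamma_r = 1 / (|V_e - V_r| + \varepsilon)$, with luma assumed to be normalized to [0, 1], so that neighbours whose texture resembles the noise point dominate and a larger ε flattens the weights (stronger smoothing). The function name is hypothetical.

```python
def repair_noise_point(depth, luma, y, x, eps=0.012):
    """Replace depth[y][x] by a luma-guided weighted mean of its 3x3 patch.

    Assumed weight: gamma_r = 1 / (|V_e - V_r| + eps); the centre (noise)
    point is excluded, and neighbours outside the image are skipped.
    """
    h, w = len(depth), len(depth[0])
    num = den = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            if (dy == 0 and dx == 0) or not (0 <= ny < h and 0 <= nx < w):
                continue
            gamma = 1.0 / (abs(luma[y][x] - luma[ny][nx]) + eps)
            num += gamma * depth[ny][nx]
            den += gamma
    return num / den if den else depth[y][x]


# The noise point sits in a flat texture region next to a texture edge.
luma = [[0.2, 0.2, 0.9]] * 3
depth = [[10, 10, 200], [10, 255, 200], [10, 10, 200]]
repaired = repair_noise_point(depth, luma, y=1, x=1)
```

In this toy case the repaired value stays close to the depth of the texture-similar neighbours (10) instead of being pulled towards the region across the edge (200), which is the behaviour that the texture guidance is meant to provide.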

LF reconstruction and compression with depth maps
The proposed method focuses on compressing LF images captured by a Lytro camera, which is the most frequently used plenoptic camera on the consumer market. With the compact arrangement of the micro-lens, the camera balances the resolution and view number well. However, the increase in resolution and view number for 4D LF requires a much larger storage capacity and transmission bandwidth than a conventional 2D image. Therefore, LF data compression has become a great challenge for its further application. As mentioned in section 2, the estimated depth map is coded with the MVD coding structure from the proposed approach, which can fully exploit the inter-view (inter-SAI) and the inter-component correlations within LFIs.

Pair of steps
Because the MVD coding structure requires only a small number of SAIs and their corresponding depth maps, it can essentially remove the burden associated with coding an LFI. However, reliable depth map estimation fails if there are too few SAIs, which degrades the quality of the synthesized SAIs, whereas too many SAIs increase the data volume, which raises computational complexity and obstructs the usage of the LF. Consequently, this work requires an appropriate key-SAI selection solution to balance the quality of the estimated depth map against the bitrate of the selected SAIs. Additionally, exhaustive testing for cost computation achieves high depth-estimation performance but significantly increases computational complexity. Conversely, sparse testing saves estimation time but decreases depth quality. Furthermore, key-SAI selection and cost computation jointly impact the reconstruction performance. In the proposed scheme, key SAIs are selected with a proper step, converted into pseudo-multiview sequences and encoded by the 3D extension of the HEVC coding structure. A detailed analysis of the pair of steps, comprising the key SAI-selection step and the cost-computation step, is presented in section 5.

LF coding using MVD
To maximize coding and reconstruction performance, the MVD coding scheme, which can exploit the inter-view (SAI) and inter-component correlations, is used in this work. The advanced MVD structure can encode a small number of texture videos composed by the key SAIs and their corresponding depth maps; then, the coded bitstream packets are multiplexed into a 3D video bitstream [41]. On the decoder side, the view synthesis algorithm proposed in [46] can be employed to render missing SAIs using the decoded depth map and the corresponding key SAIs, and the entire LFI can be reconstructed.
A raw LFI cannot be directly used for display, but after decomposing, the numerous SAIs generated with slightly different disparities can provide an immersive 3D experience. Augmenting SAIs will dramatically increase the size of the LF data. The most conventional solution to handle this issue is to transform all SAIs into a single-viewpoint pseudo-sequence; this coding approach codes the pseudo-sequence using a video encoder. However, the coding for such huge data is complicated and time consuming. To this end, this work converts the key SAIs into multiview pseudo-sequences and encodes them with their associated depth maps using the MVD structure. Note that the appropriate number and order of views will be discussed in section 5.
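The idea of rendering a missing SAI from a key SAI and its depth map can be illustrated with a simplified one-row forward warp. This is a sketch under stated assumptions and is not the view synthesis algorithm of [46]: a per-pixel disparity (in pixels per view step) is assumed to have already been derived from the decoded depth, larger disparity is assumed to mean a closer object, and holes are filled from the nearest rendered neighbour.

```python
def warp_row(ref_row, disp_row, view_offset):
    """Forward-warp one image row to a view `view_offset` steps away."""
    w = len(ref_row)
    out = [None] * w
    best = [None] * w                    # disparity that won each target
    for x in range(w):
        d = disp_row[x]
        tx = x + int(round(view_offset * d))
        # Assumed occlusion rule: the pixel with larger disparity wins.
        if 0 <= tx < w and (best[tx] is None or d > best[tx]):
            out[tx], best[tx] = ref_row[x], d
    last = None                          # fill holes from the left
    for x in range(w):
        if out[x] is None:
            out[x] = last
        else:
            last = out[x]
    nxt = None                           # fill leading holes from the right
    for x in range(w - 1, -1, -1):
        if out[x] is None:
            out[x] = nxt
        else:
            nxt = out[x]
    return out


ref_row = [10, 20, 30, 40, 50]
synth = warp_row(ref_row, disp_row=[0, 0, 1, 0, 0], view_offset=1)
```

Errors in the disparity move pixels to wrong target positions, and the displacement grows with `view_offset`, which is consistent with the distortion growth for larger selection steps reported in section 5.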

Experimental results
Extensive experiments, including the analysis of PoS&DE, are conducted in this section to evaluate the proposed compression method. The LF dataset from the ICME 2016 Grand Challenge [47], which contains 12 LFIs captured by the Lytro camera in various scenes, is used for the assessment. Two of the 12 LFIs, Color_Chart_1 and ISO_Chart_12, are mainly used for calibration, and their depth varies only slightly; therefore, they are excluded from the PoS&DE evaluation. The SAIs are decomposed from the raw LFIs (with dimensions of 15 × 15 × 624 × 432) using the MATLAB Light Field Toolbox 0.4 [4]. Table 1 shows the names of the 10 selected LFIs; Fig. 8(a) shows the centre SAIs of those LFIs.
The influence of the pair of steps on reconstruction distortion and running time is shown in Fig. 7. Note that, because the depth maps are used only to synthesize missing SAIs as non-visual data, the distortion of the reconstructed LFs rendered using the SAIs and depth maps is measured by the mean square error (MSE). The computation step ($S_c$) indicates the step length used to compute the costs for matching epipolar line slopes. In addition, the selection step ($S_s$) sets the interval, and hence the disparity, between any two adjacent viewpoint sequences. The pair of steps ($S_c$, $S_s$) influences reconstruction distortion. As seen from Fig. 7(a), as both the computation and selection steps increase, the reconstructed SAI distortion increases severely. When $S_c$ is fixed, the quality of the estimated depth map is unchanged for a specific light-field image, and the disparity is the only factor that affects reconstruction distortion. The reference SAIs used for synthesis are sampled along the vertical direction, so only the horizontal disparity needs to be taken into consideration. Once the depth map is produced, intermediate SAI synthesis is regarded as the horizontal pixel displacement of the reference SAI according to the depth value. Consequently, as $S_s$ increases linearly, the error propagated through pixel displacement causes a linear increase in reconstruction distortion.
Moreover, every depth value determines the displacement of the reference SAI pixels during view synthesis, and the precision of the depth map depends on the size of $S_c$. Reconstruction from an accurate depth map is robust whether the disparity is large or not, so the distortion changes slowly with $S_s$ when $S_c$ is small. Conversely, the distortion is very sensitive to the disparity if the estimated depth map contains many erroneous points. Therefore, the slope of the distortion curve differs for each $S_c$, as shown in Fig. 7(a). When $S_s$ is set to 1, all SAIs are transformed into pseudo-sequences; because no missing SAIs need to be synthesized using the depth map, there is no reconstructed SAI to evaluate. An increase in $S_s$ leads to a large disparity between the pseudo-sequences composed of the key SAIs, and a larger $S_c$ decreases the quality of the estimated depth map. In this case, the reconstructed SAI distortion is severe, and the result is unsuitable for practical use. Considering feasibility, the reconstructed SAI distortion should be under 30. Within this restriction, the pair of steps ($S_c$, $S_s$) should not exceed (4, 4). Owing to the special structure of LFIs, the centre SAI has the most representative features, so it must be contained in the base view sequence to maintain high coding and synthesis performance. For the 15 × 15 angular resolution, the number of pseudo-sequences converted from SAIs is 15 at most. Within this restriction, as $S_s$ increases from 1 to 7, the number of pseudo-sequences reduces gradually and remains fixed at 3 once $S_s$ exceeds 3.
Additionally, Fig. 7(b) shows the average running time for one depth map estimation under varied $S_c$. $S_c$ determines the number of points to be matched in the EPI slope decision. A small $S_c$ means more points to be computed, resulting in a longer running time for depth estimation.
In contrast, a larger $S_c$ reduces the sampling rate of the computed points, which saves estimation time. Running time is closely related to $S_c$ but is essentially independent of the image content; as a result, four images with different contents have almost the same time consumption under the same $S_c$. However, an over-sparse calculation produces a low-quality depth map, which significantly damages the reconstructed SAIs. Based on the aforementioned analysis and the computational complexity statistics, the pair of steps ($S_c$, $S_s$) is set to (3, 4).
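The pseudo-sequence counts quoted above can be reproduced with a short sketch. It assumes that key viewpoints are sampled symmetrically around the centre viewpoint at multiples of the selection step, which is one way to guarantee that the centre SAI is always included; the function name and the `border` parameter are illustrative.

```python
def key_view_indices(n_views=15, step=4, border=0):
    """Indices of the viewpoints kept for a given selection step.

    Views are taken at the centre index plus multiples of `step`;
    `border` drops that many outermost views on each side.
    """
    centre = n_views // 2
    lo, hi = border, n_views - 1 - border
    return [i for i in range(lo, hi + 1) if (i - centre) % step == 0]


counts = [len(key_view_indices(15, s)) for s in range(1, 8)]
chosen = key_view_indices(15, step=4, border=1)
```

Under this assumption the count falls from 15 to 7 and 5 and then stays at 3 for every step from 4 to 7, in line with the behaviour described above; a step of 4 keeps three viewpoints around the centre even when the outermost views are dropped.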

Depth-estimation comparison
In this section, typical methods, line-assisted graph cut (LAGC) [9] and accurate depth map estimation (ADME) [10], are considered anchors to compare the estimated depth map with the proposed depth-estimation algorithm. Figure 8 shows the experimental results for 10 LFIs from the LF dataset. Because LAGC estimates the depth map based on stereo-pair matching, it fails to produce an available depth for an LFI with a narrow baseline. It is difficult to match stereo pairs in LFIs with low contrast, and so LAGC cannot distinguish the objects in such LFIs. In comparison with LAGC, depth-estimation method ADME performs better. More specifically, the depth maps generated by ADME are more continuous than those generated by LAGC. Moreover, much of the depth information in various LFIs can be exploited by ADME. The proposed depth-estimation method precisely computes the slope of every epipolar line in the horizontal and vertical EPIs and employs an efficient depth optimization with the proposed SAI-guided algorithm. Therefore, the proposed algorithm can comprehensively generate a continuous depth map and preserve much of the detailed information in the complex and background regions.

Reconstruction evaluation
To further assess the estimated depth map, the objective and subjective quality of the intermediate SAI is shown in Fig. 9 and Fig. 10, respectively, in which the SAI is synthesized by the algorithm in [46]. The average structural similarity index (SSIM) and multi-scale (MS)-SSIM of the synthesized SAIs are computed as depicted in Fig. 9. Note that LAGC fails to produce a feasible depth map, and so it does not appear in Fig. 9. In general, our proposed method outperforms ADME because it preserves more information than ADME. More specifically, most of the SAIs produced with our depth map can achieve 0.83 in SSIM and 0.9 in MS-SSIM. With regard to I03, too much coding distortion and complex textural information in the associated SAI result in poor synthesis. However, as illustrated in Fig. 10, only the edge of the flowers in the front of Flowers is blurred, which is negligible. Similarly, the wheel in Bikes has some distortion. I06 and I08 contain very sparse foreground objects in the studio with large homogeneous areas, and so they are insensitive to the depth map and their quality can reach 0.95 in SSIM and MS-SSIM.
Although there is some noise in their estimated depth maps, the synthesized SAI is almost the same as the original SAI. In Fig. 10

Coding performance
As mentioned in section 5.1, the selection step is set to 4, and so a 3-viewpoint pseudo-sequence is developed from a subset of the SAIs. In Fig. 11, the red arrow denotes the base viewpoint and its scan order, and the two neighbouring blue arrows represent the non-base viewpoints and their scan order. The border SAIs (in white) suffer severe geometric distortion and blurring, which makes them of little practical value; therefore, they are not coded in the proposed PoS&DE. The SAIs in red represent the missing SAIs, which are synthesized from the decoded multiview sequence and the associated depth map using the view synthesis method in [46]. Consequently, coding performance is jointly measured using the average peak signal-to-noise ratio (PSNR) of the 13 × 13 = 169 viewpoint images (the 13 × 3 = 39 coded viewpoint images and the 130 synthesized viewpoint images, as shown in Eq. 6) and the total bitrate (viewpoints and depth maps). The PSNR and bitrate changes are reported as the Bjontegaard delta PSNR (BD-PSNR) and Bjontegaard delta bitrate (BD-BR), respectively [48]. The proposed PoS&DE is evaluated with 7- and 3-viewpoint pseudo-sequences. The HEVC codec under the "Low-Delay P-main" configuration is regarded as the benchmark, and the SAIs are arranged as pseudo-sequences to be encoded. Moreover, a lower BD-BR implies greater bitrate savings, and a higher BD-PSNR denotes higher reconstruction quality. SC-SKV uses disparity to guide non-key SAI coding, which reduces the residual of the non-key SAIs. Thus, SC-SKV can achieve lower bitrates than the HEVC codec.
To evaluate coding efficiency intuitively, the typical LF compression solution, the multiple view structure (MVS) [33], is added for comparison, as illustrated in Fig. 12. As seen, in general, PoS&DE consumes fewer bits than the other two methods.
For some manually constructed scenes, such as I06, I07 and I08, PoS&DE achieves only nearly the same reconstruction performance as the other two methods at low bitrates. We believe a better reconstruction solution will improve the overall performance, which will be investigated in our future work. Additionally, at high reconstruction quality, PoS&DE greatly decreases the number of bits, especially for 3 viewpoints. Therefore, these results confirm that PoS&DE can maintain acceptable quality while drastically improving compression performance.
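The view bookkeeping and the averaged quality measure used in this section can be sketched as follows. The sketch assumes 8-bit samples and a plain arithmetic mean of per-view PSNR over coded and synthesized viewpoints; the exact weighting of Eq. 6 may differ, and the per-view MSE values below are made up purely to exercise the code.

```python
import math


def psnr_from_mse(mse, peak=255.0):
    """PSNR in dB for a given mean square error (8-bit peak assumed)."""
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak * peak / mse)


def average_psnr(coded_mse, synthesized_mse):
    """Assumed measure: plain mean of per-view PSNR over all viewpoints."""
    values = [psnr_from_mse(m)
              for m in list(coded_mse) + list(synthesized_mse)]
    return sum(values) / len(values)


grid = 13                       # border SAIs of the 15 x 15 grid excluded
n_total = grid * grid           # 169 viewpoint images
n_coded = grid * 3              # three coded pseudo-sequences of 13 views
n_synth = n_total - n_coded     # viewpoints rendered at the decoder

# Illustrative MSE values only: 20 dB coded views, 10 dB synthesized views.
avg = average_psnr([650.25] * n_coded, [6502.5] * n_synth)
```

Because the synthesized viewpoints outnumber the coded ones by more than three to one, the averaged PSNR is dominated by synthesis quality, which is why the accuracy of the depth map matters so much for the reported results.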

Conclusion
In this paper, we propose an LF compression solution that significantly reduces coding bitrate by coding a subset of selected SAIs with their corresponding depth maps, synthesizing the remaining unselected SAIs and ultimately reconstructing all SAIs. By computing the cost of each point on the epipolar line in the horizontal and vertical EPIs, an initial depth map is obtained. SAI-guided depth optimization is designed to remove the noise in the initial depth map in accordance with the pixel variation in the associated SAI. The MVD coding structure is adopted to fully exploit the relationships among SAIs. Based on statistical experimentation, depth estimation is accelerated using an optimal cost-computation step. Meanwhile, the key SAI-selection step is determined to decrease the number of coded SAIs. Experiments have been conducted on the benchmark LF dataset, and the results demonstrate that the proposed PoS&DE LF compression scheme can generate an accurate depth map and improve coding performance compared with other LF compression methods. This method results in a 4.58-dB PSNR gain and an 86.1% bitrate reduction, on average, over the HEVC codec.