Light-Field Raw Data Synthesis From RGB-D Images: Pushing to the Extreme

Light-field raw data captured by a state-of-the-art light-field camera is limited in its spatial and angular resolutions by the camera's optical hardware. In this paper, we propose an all-software algorithm to synthesize light-field raw data from a single RGB-D input image, motivated largely by the needs of research on light-field data compression. Our synthesis algorithm consists of three key steps: (1) each pixel of the input image is regarded as a spot lighting source that emits directional light rays with an equal strength; (2) the optical path of each directional light ray through the camera's main lens as well as the corresponding micro lens is considered as accurately as possible; and (3) the occlusion of light rays among objects at different distances within the input image is handled with the depth information. The spatial and angular resolutions of our synthesized light-field data scale up as the spatial resolution of the input RGB-D image increases. Meanwhile, for a given input image with a fixed size, we pay special attention to how far we can push the parameters involved in our synthesis algorithm, such as the number of rays emitted from each pixel, the number of micro lenses, and the number of sensors associated with each micro lens. The usefulness of our synthesized data is validated by refocusing, all-in-focus, and sub-aperture reconstructions. In particular, all-in-focus images are evaluated objectively by computing the structural similarity (SSIM) index, which allows us to find the extreme for each of the parameters mentioned above.


I. INTRODUCTION
The light field is a vector function that records the amount of light flowing in every direction through every point in 3D space. Such directional light rays can be defined as a 5D plenoptic function [1], parameterized by three coordinates and two angles. If we restrict ourselves to locations outside the convex hull of the object, we can measure the plenoptic function by taking multiple photos using a digital camera, resulting in a four-dimensional function.
Light-field cameras can capture three-dimensional scene information by recording the directions of the incoming light rays in addition to their intensities. The idea is inspired by and similar to using an array of conventional cameras to record different views of a scene at the same time. Clearly, the novelty of a light-field camera is that it has reduced the large volume of an array of cameras or the complicated gantry to just a single shot with a single portable camera. Here, the number of micro lenses determines the spatial resolution of the image and the number of sensors after each micro lens determines the angular resolution of the image. Notice that each sensor produces one pixel in the resulting light-field data.
The associate editor coordinating the review of this manuscript and approving it for publication was Naveed Akhtar.
Today's light-field cameras available on the market are mostly from Lytro and Raytrix. Lytro has developed consumer-level light-field digital cameras that are capable of capturing images using the plenoptic technique, whereas Raytrix mainly targets industrial and scientific applications, with resolutions starting from 1 megapixel at a very expensive price. The newest Lytro model (Lytro ILLUM) can generate raw data that can be extracted by the external light-field toolbox [2], with spatial resolution of 434 × 625 and angular resolution of 15 × 15. There are also some other light-field cameras, such as Pelican imaging [3], Adobe light-field camera [4], CAFADIS camera [5], and Mitsubishi Electric Research Laboratories's (MERL) light-field camera [6]. Although there are a number of manufacturers working on the production of light-field cameras and many improvements have been made for different applications, there exists a trade-off between the spatial resolution and the angular resolution. For the Lytro model mentioned above, though the raw light-field data is of size 6510 × 9375 (more than 60 megapixels of photosensors in total), its spatial resolution is only 434 × 625 and its angular resolution is 15 × 15. If the photosensor array remains unchanged but we wish to have a higher spatial resolution, the only solution is to reduce the angular resolution. Therefore, with raw data captured by today's light-field cameras, it seems impossible to reconstruct images (refocusing or all-in-focus) with the spatial resolution achieved by today's high-end conventional cameras (e.g., 4K or 8K) and at the same time to produce a large number of sub-aperture images in order to offer some 3D capability.
VOLUME 8, 2020
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
We believe that a camera-based hardware solution that produces light-field raw data with both sufficiently high spatial resolution (e.g., 4K or higher) and angular resolution (e.g., 20 × 20 or higher) is still far away. Nevertheless, certain research topics on light-field image data are emerging rapidly. For instance, how to compress light-field image data efficiently has drawn great attention recently [7]-[11], especially for data with large spatial resolution (4K or higher) and angular resolution (e.g., 20 × 20 or higher). This is because the compression algorithms for high-resolution light-field raw data could be very different from those applicable to low-resolution light-field raw data. For instance, the popular video coding standard H.265 [12] defines many more intra-prediction modes than the earlier H.264 [13] in order to provide a more accurate prediction for each intra frame. According to the ISO/IEC JPEG coding [14], light-field data compression has been regarded as a focus of still-image coding. The evaluation metrics are still under discussion, but they mainly consist of objective metrics (PSNR and SSIM) and subjective evaluation (animation comparisons). Furthermore, the ongoing effort for the next-generation standard H.266 designs even more intra-prediction modes than H.265. One rationale for doing this is that the resolution of video frames keeps growing.
In order to develop and test compression algorithms for high-resolution light-field data, we first need access to light-field data with high resolution, which is unfortunately impossible with the existing capturing devices. To overcome this dilemma, we are motivated to develop a synthesis-based solution that provides light-field raw image data whose spatial resolution and angular resolution can be made arbitrarily high. For example, we can provide synthetic light-field raw data of size 80000 × 80000 (6400 megapixels in total: 4000 × 4000 spatial resolution and 20 × 20 angular resolution), which is much higher than the existing light-field raw data captured by any hardware device.
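The relation between spatial resolution, angular resolution, and raw-data size quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names `raw_size` and `megapixels` are illustrative and not part of the original work.

```python
def raw_size(spatial_h, spatial_w, angular_h, angular_w):
    """Size (rows, cols) of the raw sensor image: every micro lens
    (one spatial sample) covers angular_h x angular_w sensors."""
    return spatial_h * angular_h, spatial_w * angular_w


def megapixels(rows, cols):
    """Total number of photosensors, in megapixels."""
    return rows * cols / 1e6


# Lytro ILLUM figures quoted in the text: 434 x 625 spatial, 15 x 15 angular.
lytro = raw_size(434, 625, 15, 15)
# Synthetic example quoted in the text: 4000 x 4000 spatial, 20 x 20 angular.
synthetic = raw_size(4000, 4000, 20, 20)

print(lytro, megapixels(*lytro))
print(synthetic, megapixels(*synthetic))
```

With a fixed sensor array, the product of the two resolutions is fixed, which is the trade-off described in the introduction.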
Specifically, we propose an all-software algorithm to synthesize light-field raw data from a single RGB-D input image, in order to go beyond the aforementioned resolution limitations of current optical hardware manufacturing. As for the availability of depth information, our algorithm can use a captured depth map (e.g., from Kinect). It can also be applied to the stereoscopic case where the depth information is estimated from a pair of images taken at different view angles. Our synthesis algorithm consists of three key steps: (1) each pixel of the input image is regarded as a spot lighting source that emits directional light rays with an equal strength; (2) the optical path of each directional light ray through the camera's main lens as well as the corresponding micro lens is considered as accurately as possible; and (3) the occlusion of light rays among objects at different distances within the input image is handled with the depth information.
Given an input image with a fixed resolution, an important goal is to push to the extreme in terms of the resolution (both spatial and angular) of the synthesized light-field raw data. This is accomplished by selecting three primary parameters involved in our synthesis algorithm, namely, the number of rays emitted from each pixel of the input image, the number of micro lenses, and the number of sensors associated with each micro lens. In principle, the higher these parameters are, the better the rendering result is. However, our study shows that, for an input image with a fixed size, there exists an upper limit for each of these parameters, which is regarded as the extreme because no further improvement can be obtained beyond it. We would also like to point out that our algorithm scales with the input resolution without any re-design cost. As the input size rises, our algorithm remains feasible by increasing the sampling rates of the main lens and the micro lenses (i.e., the spatial and angular resolutions), which can be chosen according to the input size. Furthermore, other parameters involved in our synthetic camera system (e.g., focus, aperture, and other camera parameters) are all adjustable according to the specific application of the light-field raw data, without the limitations of the existing capture devices.
The rest of the paper is organized as follows. In Section II, we review some existing light-field raw data and the related algorithms. In Section III, the imaging process of our synthetic camera system is illustrated in detail. The evaluation of our synthetic data is presented in Section IV, where both objective and subjective evaluations are considered and special attention is paid to pushing to the extreme through choosing several important parameters involved in our synthetic system. Finally, we summarize our work in Section V.
FIGURE 1. Example of synthetic light-field data (with computer-generated model) generated by MIT media lab [15], which is in the form of an image array.

II. RELATED WORKS
Because of the large demand for light-field raw data with high spatial resolution and high angular resolution in both academic research and practical applications, several algorithms for the synthesis of light-field raw data have emerged in recent years. In this section, we will go through these algorithms and some existing datasets.

A. EXISTING DATASETS
Synthetic light-field data rendered with POV-Ray is provided by MIT media lab [15] in the form of an image array. One example is shown in Fig. 1. These light-field data contain either 5 × 5 or 7 × 7 views of different 3D scenes. They are mainly intended for research on 3D displays and are rendered on a regular grid that facilitates easy processing. However, the problem is that they are computer-rendered (with computer-generated models and scenes) rather than being captured from real scenes.
The Stanford light-field archive [16] provides light-field data that are captured from real scenes with equipment such as camera arrays, where the calibration data is also provided. Since the light-field data is not captured with a light-field camera, the data is again in the form of an image array rather than light-field raw data.
The dataset captured by the first generation of Lytro camera is available in [17]. It contains 30 light-field images, both indoor and outdoor. However, these light-field images have a low resolution (220 × 220 in spatial resolution and 15 × 15 in angular resolution).
Another dataset with an increased resolution (434 × 625 in spatial resolution and 15 × 15 in angular resolution) captured by Lytro ILLUM is available in [18]. It contains 118 light-field images that are classified into ten different categories. Compared to the resolution supported by today's conventional cameras, the spatial resolution achieved by this Lytro camera is still quite low.

B. EXISTING SYNTHESIS ALGORITHMS
A temporal reconstruction technique has been proposed in [19], which exploits the anisotropy in the temporal light field and permits efficient reuse of samples between pixels. Although its purpose is mainly for the rendering of simultaneous motion blur, depth of field, and soft shadows, it proposes the idea of making use of a temporal light field that is reconstructed by the information of sparse samples to reduce the noise due to the variance of the high-dimensional integrand.
Reconstruction of 4D light-fields is proposed in [20], which makes use of sparsity in the Fourier domain. The main contribution is an upsampling of the light-field data by an optimization in the continuous Fourier domain, which solves to a certain extent the problem about the low resolution of light-field data. Meanwhile, how to reconstruct (partially) high-resolution 4D light-fields from a stack of differently-focused photographs taken with a fixed camera is proposed in [21], in which the camera calibration for the stack of photographs has been considered carefully.
On the other hand, some algorithms have been proposed to use the characteristics of light-field data to do the image-based rendering. For instance, image-based rendering from unstructured lumigraph is proposed in [22], a view rendering that conquers the missing of the depth map from several views is proposed in [23], and a novel algorithm is proposed in [24] to simulate the depth-of-field (DOF) effect from the perspective of light-field rendering. The difference between these algorithms and ours is: the goal of the above-mentioned algorithms is to render a new image (new view or new DOF), whereas our algorithm intends to synthesize the high-resolution light-field raw data. Although we also consider several rendering tasks, such as refocusing, all-in-focus, and multi-view, in the verification part of our work, the purpose of doing so is to demonstrate the usefulness and usability of the synthesized light-field data. Because of this, we do not compare our rendering results directly with the most recent state-of-the-art works that often utilize some different settings. However, we believe that the light-field data synthesized in our work offers a rich set of transitional characteristics that would become exploitable in the image-based rendering methods.
Recently, a learning-based method has been proposed in [25] to synthesize the 4D light-field from a single input image. This algorithm works well with one particular type of image (for example, flowers) according to its training dataset, but unfortunately cannot deal with all kinds of images at one time. An end-to-end view synthesis for light-field imaging with a pseudo 4DCNN has been proposed in [26]. The main purpose of that work is to realize view synthesis with good quality and fast speed, and the algorithm needs a large amount of input data captured by a light-field camera, whereas our main purpose is to provide a simple and easy-to-execute way (even without a light-field camera) to generate synthetic light-field raw data that can be used in various applications, e.g., view synthesis, refocusing, depth estimation, and compression of LF data. Meanwhile, we have proposed an algorithm to reconstruct light-field raw data from a single RGB-D image [27]. The present paper extends [27] substantially as follows. First, the optical path of each directional light ray through the camera's main lens as well as the corresponding micro lens is kept in 3D coordinates, whereas this 3D scenario is simplified into two 2D processes in [27], so that the rendering results there are visually defective. Second, all approximations made in [27] are replaced by much more accurate representations, e.g., the sampling of the continuous light-ray emitted from each pixel is now much denser and all calculations along the optical path of each light ray are now conducted much more precisely. Third, the depth map is kept in its 8-bit coded format, whereas this map has been quantized into 5 layers in [27]. Last, we have tried to push to the extreme through the selection of several important parameters involved in our synthetic system.

III. OUR SYNTHETIC CAMERA SYSTEM
The pipeline of our synthesis algorithm for light-field raw data is shown in Fig. 2. At the input end, the RGB image is first organized into a 3D point cloud according to the associated depth map (D). Each color pixel is assumed to be a spot lighting source that emits a continuous light-ray towards the camera's main lens. We sample this continuous light-ray into a large number of directional light rays. Each directional light ray propagates through the camera's main lens and one corresponding micro lens and is eventually recorded on the corresponding sensor in the sensor array. This process is the core part of our synthesis algorithm and is described in detail in the following.

A. LIGHT SOURCE ASSUMPTIONS
Whether from the sun or other lighting sources, light rays with certain brightness, color and direction always reflect off objects and travel in various directions to produce a continuous light-field. Of the infinite rays in the light field, we can only see those that shine towards our eyes. Conventional cameras follow exactly this mechanism to record the color and intensity of light rays within the camera's view angle (or aperture).
When we take a picture, the camera records the intensity and color of every light ray reflected from the objects in the scene that the camera is aiming at. When we observe the picture, our eyes perceive the light rays reflected from the picture. Although the reflected light rays originate from some lighting sources (the sun, some illuminating lights, or a mixture), we can regard these light rays as being emitted from the picture directly. Since the picture is usually a digital one, we make two grounding assumptions as follows:
• Each pixel within the picture serves as a spot lighting source to emit a light-ray continuously along two incident angles.
• After sampling this continuous light-ray, all resulting light rays have the same strength but propagate in different directions.
Note that pixels have different depths so that some directional light rays emitted from a pixel may be blocked by other pixels. Such an occlusion effect will be discussed later in this section. For scenes containing objects with unknown bidirectional reflectance distribution functions (BRDF), there are two kinds of reflection, i.e., diffuse reflection and specular reflection. For diffuse reflection, which means that the objects are Lambertian, every point of the object can indeed be regarded as a spot light source that emits uniformly distributed light rays [28]. For specular reflection, the situation can be simplified such that only a single directional light ray is emitted from the point. On the other hand, for objects with known BRDF, one can make the corresponding adjustment, such as to the angle of each emitted light ray and its strength. Since most objects in the real world are Lambertian, the diffuse reflection model has been widely used, which justifies our assumptions.

B. OPTICAL PATH
In this subsection, we describe as accurately as possible the optical path of each directional light ray through the camera's main lens as well as the corresponding micro lens.

1) SYSTEM SETUP
Our system is made up of four parts: object, main lens, micro lens array, and photosensors. We take a single RGB-D image as the input, where the image size is M × N. The parameters of the camera's main lens as well as each micro lens are listed in Table 1. In general, a lens is composed of two sphere surfaces that are mirror images with respect to each other, as shown in Fig. 3. In practice, we know the sphere's curvature R, the maximum thickness between the two spheres d, and the refraction index n of the lens. Then, the focal length f can be derived from R, d, and n according to [29]. This relationship can be applied to the camera's main lens as well as each micro lens by substituting R, d, and n with the corresponding parameters listed in Table 1.
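As an illustration of how f can be obtained from R, d, and n, the sketch below uses the standard thick-lens form of the lensmaker's equation for a symmetric biconvex lens (surface radii +R and −R). Whether this is exactly the expression taken from [29] is an assumption, and the function name is hypothetical.

```python
def thick_lens_focal_length(R, d, n):
    """Focal length of a symmetric biconvex thick lens in air.

    Assumes the standard lensmaker's equation
        1/f = (n - 1) * [1/R1 - 1/R2 + (n - 1) * d / (n * R1 * R2)]
    with R1 = +R and R2 = -R, which simplifies to
        1/f = (n - 1) * [2/R - (n - 1) * d / (n * R^2)].
    R: radius of curvature of each surface, d: maximum thickness,
    n: refraction index (all lengths in the same unit).
    """
    inv_f = (n - 1.0) * (2.0 / R - (n - 1.0) * d / (n * R * R))
    if inv_f <= 0.0:
        raise ValueError("parameters do not describe a converging lens")
    return 1.0 / inv_f


# Thin-lens limit (d = 0) reduces to f = R / (2 * (n - 1)).
print(thick_lens_focal_length(R=100.0, d=0.0, n=1.5))
print(thick_lens_focal_length(R=100.0, d=10.0, n=1.5))
```

The same function would apply to the main lens and to each micro lens by passing the respective values of R, d, and n.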
For a real light-field camera, distortions caused by lens are inevitable [30]. To minimize the impact of such distortions, the whole camera system needs a very complex lens design, which is extremely hard to simulate. Thus, the synthetic camera system used in our algorithm has been simplified so that lens distortions are not considered. Based on this kind of simplified ''ideal'' system, however, we find that the synthesized light-field data is still usable for applications such as refocusing, all-in-focus, and multi-view, as demonstrated in Section IV.
The origin of the 3D coordinate system is at the center of the lens. Note that the input image is placed at the right side of the lens and each pixel's coordinates (x, y) scan as x = 1, 2, · · · , M (from bottom to top) and y = 1, 2, · · · , N (from left to right), respectively.

2) GENERATION OF 3D POINT CLOUD
With an input RGB image and its corresponding depth map, we only have the pixel coordinates $(x, y)$ ($x = 1, 2, \cdots, M$ and $y = 1, 2, \cdots, N$) and the corresponding depth value $i_d$ (usually coded in an 8-bit depth map), without knowing the real position in the world coordinate system. In order to generate the 3D point cloud of the input RGB-D image, we need to calculate the corresponding coordinates $(X_z, Y_z, Z_z)$ in the world coordinate system for every pixel at $(x, y)$ with $i_d$, as shown in Fig. 4. The depth map only reveals the relative distance between the object point and the camera. Specifically, the larger the value $i_d$ is, the nearer the point is to the camera. We denote by $z_{near}$ and $z_{far}$, respectively, the distance of the nearest point to the camera (with the biggest value $i_{d\_max}$ in the 8-bit depth map) and the distance of the farthest point to the camera (with the smallest value $i_{d\_min}$ in the depth map). Then, for any point $(x, y)$ in the image with depth value $i_d$, its distance $z$ to the camera is determined from $i_d$, $z_{near}$, and $z_{far}$, and the original $M \times N$ image is amplified to $H_z \times W_z$ at distance $z$ from the lens.
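To make the depth conversion concrete, the following sketch maps an 8-bit depth value to a metric distance by linear interpolation between the nearest and farthest distances. The linear form is an assumption made here for illustration (an inverse-depth mapping is another common choice), and the function name and default range are hypothetical.

```python
def depth_to_distance(i_d, z_near, z_far, i_min=0, i_max=255):
    """Map a depth-map value to a distance from the camera.

    Assumption: linear interpolation, with the largest depth value
    (i_max) mapped to the nearest distance z_near and the smallest
    depth value (i_min) mapped to the farthest distance z_far.
    """
    if i_max == i_min:
        raise ValueError("depth map has no dynamic range")
    fraction = (i_d - i_min) / (i_max - i_min)  # 0 = farthest, 1 = nearest
    return z_far - fraction * (z_far - z_near)


# Larger depth values are nearer to the camera.
for value in (0, 128, 255):
    print(value, depth_to_distance(value, z_near=100.0, z_far=300.0))
```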

3) LIGHT RAY PROPAGATION
Suppose that the pixel value of a point at $(X_z, Y_z, Z_z)$ is $I$ and the light-ray irradiated from this pixel toward the main lens is sampled into $K \times K$ directional light rays. This is equivalent to sampling the main lens into $K \times K$ cells so that each cell receives one light ray. Then, every light ray irradiated by the pixel has an intensity of

$$i = \frac{I}{K \times K} \cdot C_{area}, \tag{4}$$

where $C_{area} = \frac{K \times K}{h \times w}$ and $h \times w$ is the real coverage of the reflection of the pixel $I$. If the pixel is Lambertian, $h \times w = K \times K$, and the intensity of every light ray irradiated by one pixel with value $I$ is $i = \frac{I}{K \times K}$, which is the diffuse reflection. If the pixel is regarded as being caused by the specular reflection, the real coverage of the main lens is $h \times w = 1$, $C_{area}$ should be $K \times K$, and the resulting intensity is $i = I$. Here, we only consider the diffuse reflection so that all pixels are Lambertian. In addition, we do not consider the decay of the intensity of a light ray during propagation.
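The two limiting cases described above (diffuse and specular) can be checked numerically. This is a minimal sketch of the per-ray intensity i = I / (K × K) · C_area with C_area = (K × K) / (h × w); the function name is illustrative.

```python
def ray_intensity(pixel_value, K, h, w):
    """Intensity carried by one of the K x K sampled rays of a pixel.

    h x w is the real coverage of the pixel's reflection on the main
    lens, measured in lens cells: h = w = K for a Lambertian pixel,
    h = w = 1 for a purely specular one.
    """
    c_area = (K * K) / (h * w)
    return pixel_value / (K * K) * c_area


K = 10
print(ray_intensity(200.0, K, h=K, w=K))  # diffuse: I / (K * K)
print(ray_intensity(200.0, K, h=1, w=1))  # specular: I
```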
Referring to Fig. 5, a directional light ray $l_1$ is irradiated from an object point $P_{start}$ and arrives at the front surface of the main lens at point $P_{front}$. Then, it refracts through the front surface of the main lens and reaches the point $P_{back}$ at the back surface of the main lens, with the direction changed to $l_2$. After refracting through the back surface of the main lens, it changes the direction to $l_3$ and hits the point $P_{end}$ on the front surface of one micro lens in the micro lens array, which is placed at the focal length of the main lens.
Two directional lines $l_2$ and $l_3$ as well as their corresponding end points (i.e., $P_{back}$ and $P_{end}$) can be calculated through Algorithm 1.
Specifically, the directional line $l_1$ is known (because both $P_{start}$ and the irradiating direction are known) so that $P_{front}$ can be located. Steps 1-4 in Algorithm 1 solve the directional line $l_2$, after which the point $P_{back}$ can be determined. Then, $l_2$ will refract out the back sphere surface of the main lens and travels continuously to reach one micro lens. Steps 1-4 in Algorithm 1 are repeated here to solve the directional line $l_3$ and the corresponding point $P_{end}$.

Algorithm 1 Light Ray Propagation Through Convex Lens
1: $l_1 = P_{front} - P_{start}$; $l_{nfront} = P_{cfront} - P_{front}$; $P_{cfront} = (0, 0, R_{main} - d_{main}/2)$ is the centre of the sphere of the front surface of lens
2: According to Snell's law [31], $\cos(\theta_1) = \frac{l_1 \cdot l_{nfront}}{\|l_1\| \, \|l_{nfront}\|}$; $\sin\theta_2 = \frac{\sin\theta_1}{n_{main}}$;
3: $norm = l_1 \otimes l_{nfront}$ is the normal of the plane where $l_1$ and $l_{nfront}$ are in.
4: Solve the following equations to get $l_2$
$$\begin{cases} l_2 \cdot norm = 0 \\ l_2 \cdot l_{nfront} = \cos\theta_2 \, \|l_{nfront}\| \, \|l_2\| \\ l_2 \cdot P_{front} = 0 \end{cases} \tag{5}$$
5: $P_{back}$ is the intersection point of $l_2$ and sphere $\|X\|_2^2 = R^2$ (centred at $P_{cback} = (0, 0, -(R - d/2))$).
6: Repeat Step 1 to 4 to get $l_3$ and $P_{end}$.
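The refraction and intersection steps can be sketched in a few lines. Note that this sketch swaps in the closed-form vector version of Snell's law instead of solving the equation system of Step 4, and a generic ray-sphere intersection for Step 5; all function names are illustrative and the lens geometry is left to the caller.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def normalize(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))


def refract(direction, normal, n_in, n_out):
    """Refracted unit direction by Snell's law (vector form).

    Returns None on total internal reflection. The surface normal may
    point to either side; it is flipped to oppose the incoming ray.
    """
    d = normalize(direction)
    n = normalize(normal)
    cos_i = -dot(d, n)
    if cos_i < 0.0:
        n = scale(n, -1.0)
        cos_i = -cos_i
    eta = n_in / n_out
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None
    cos_t = math.sqrt(1.0 - sin2_t)
    return add(scale(d, eta), scale(n, eta * cos_i - cos_t))


def intersect_sphere(origin, direction, center, radius):
    """First intersection (in front of the origin) of a ray and a sphere."""
    d = normalize(direction)
    oc = sub(origin, center)
    b = dot(d, oc)
    c = dot(oc, oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for s in (-b - root, -b + root):
        if s > 1e-12:
            return add(origin, scale(d, s))
    return None


# A ray hitting a flat air-to-glass interface at 30 degrees of incidence.
t = refract((0.5, 0.0, -math.sqrt(3.0) / 2.0), (0.0, 0.0, 1.0), 1.0, 1.5)
print(t)
```

Chaining `intersect_sphere` and `refract` twice (front surface, then back surface) would trace one ray through a lens; the same pair of calls could be reused for a micro lens after translating the coordinate system.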
Notice that the propagation of a light ray through one micro lens can be modeled in the same way as we did above. Here, we need to translate the coordinate of the system with vector $(0, 0, f_{main} + \frac{d_{main}}{2} + \frac{d_{micro}}{2})$, which means that the origin of the coordinate system is now at the center of the micro lens array.
In summary, for every directional light ray irradiated by an object point $P_{start}$ (i.e., $l_1$), its optical path through the main lens and the corresponding micro lens can be determined accurately. Then, the intensity recorded at the corresponding photosensor is $i_s = i \cdot \cos(\theta_{end})$, where $\theta_{end}$ is the angle at which the light ray arrives at the sensor. After adding up the intensities of all light rays arriving at the same sensor, we get the final light-field raw data. The computational complexity is related to the input image size and the sampling of the main lens, i.e., $O(M \times N \times n_{main}^2)$.
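The accumulation step can be sketched as follows. The data layout (a list of per-ray hits, each with a sensor index, a ray intensity, and an arrival angle) is a hypothetical choice made for this illustration.

```python
import math
from collections import defaultdict


def accumulate_rays(hits):
    """Sum the cosine-weighted intensities of all rays per sensor.

    hits: iterable of (sensor_index, intensity, theta_end), where
    sensor_index is any hashable key such as (row, col) and theta_end
    is the arrival angle in radians. Rays removed by the occlusion
    check are simply not included in hits.
    """
    raw = defaultdict(float)
    for sensor_index, intensity, theta_end in hits:
        raw[sensor_index] += intensity * math.cos(theta_end)
    return dict(raw)


hits = [
    ((0, 0), 1.0, 0.0),
    ((0, 0), 2.0, 0.0),
    ((1, 2), 4.0, math.pi / 3.0),
]
print(accumulate_rays(hits))
```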

C. OCCLUSION EFFECT
In order to model our synthetic imaging system as accurately as possible, we need to consider the noise and distortion in the input data and the corresponding occlusion effect due to rectilinear propagation of visible light.

1) PREPROCESSING OF INPUT DATA
First of all, we would like to point out that there are no specific requirements on the input RGB image, i.e., any color image with any resolution is acceptable in our system. On the other hand, since our synthetic imaging process requires relatively accurate 3D scene information, a refinement of the given depth map is usually recommended, particularly when the input depth map contains large holes and/or is corrupted by noise. To this end, we choose to apply the depth map refinement algorithm proposed in [33] to fill the holes and sharpen the boundaries of objects.
Obviously, the more accurately the depth map matches the corresponding color image, the more accurate the 3D point cloud generation will be. Eventually, this contributes to a more accurate synthesis of light-field raw data. Depth maps with high accuracy can be obtained by high-end depth cameras. Nevertheless, this solution is quite expensive. Moreover, today's depth cameras (e.g., the newest Kinect) support a resolution of only 1000 × 1000. Other solutions include designing algorithms to enhance the accuracy of today's depth cameras or estimating the depth map as accurately as possible from two or more pictures taken at different angles with high-end conventional cameras.

2) OCCLUSION CHECK
After we get the input RGB image and its corresponding refined depth map, we can transfer every pixel to its world coordinates according to Eq. (3). In this 3D point cloud, some light rays emitted from a pixel might be blocked by other objects that are sitting in front, leading to the occlusion effect, as illustrated in Fig. 6. We carry out the occlusion check according to Algorithm 2. If the light ray is indeed blocked, we will neglect its influence and regard that this light ray cannot reach the final sensor plane.
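The occlusion check of Algorithm 2 can be sketched as a loop over depth layers. Several details are assumptions made for this illustration: the scene is given as a list of (z_layer, mask) pairs, mask[x][y] is nonzero where a scene point exists at that layer, world X and Y are taken to coincide with pixel indices, and only layers strictly between the ray's start and end points are tested.

```python
def is_blocked(p_start, p_target, layers):
    """Return True if the ray from p_start to p_target hits a scene
    point on any intermediate depth layer.

    p_start, p_target: (X, Y, Z) end points of the ray segment.
    layers: list of (z_layer, mask), mask being a 2D list indexed as
    mask[x][y] with nonzero entries marking occupied positions.
    """
    x0, y0, z0 = p_start
    x1, y1, z1 = p_target
    if z0 == z1:
        return False
    z_lo, z_hi = sorted((z0, z1))
    for z_layer, mask in layers:
        if not z_lo < z_layer < z_hi:
            continue
        # Intersection of the ray with the plane z = z_layer.
        t = (z_layer - z0) / (z1 - z0)
        x_ins = round(x0 + t * (x1 - x0))
        y_ins = round(y0 + t * (y1 - y0))
        inside = 0 <= x_ins < len(mask) and 0 <= y_ins < len(mask[0])
        if inside and mask[x_ins][y_ins] != 0:
            return True
    return False


mask = [[0] * 10 for _ in range(10)]
mask[5][5] = 1
layers = [(-5.0, mask)]
print(is_blocked((5, 5, -10.0), (5, 5, 0.0), layers))
print(is_blocked((5, 5, -10.0), (9, 5, 0.0), layers))
```

A blocked ray is then simply dropped before the sensor accumulation, as described above.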

IV. EVALUATION AND APPLICATION
The purpose of our algorithm is to synthesize the light-field raw data, but the ground-truth data is not available. Because of this, we conduct the objective evaluation between the all-in-focus image from the synthesized light-field data and the original input RGB image by computing the structural similarity index (SSIM index) [34]. The subjective evaluation is carried out by evaluating the results of several main applications of the light-field data, such as refocusing, all-in-focus and multi-view.

Algorithm 2 Occlusion Check
2: Find $P_{ins}(X_{ins}, Y_{ins}, Z_{ins})$, the intersection point of light ray $l$ (irradiated from the light source at depth $z_{pixel}$) and the plane at $z_{layer} = Z_{ins}$;
3: Find $(x_{ins}, y_{ins})$, the integer coordinate of $P_{ins}$ at depth layer $z_{layer}$, obtained by rounding $X_{ins}$ and $Y_{ins}$ of $P_{ins}$;
4: Check the depth image at $z_{layer}$, $Id_{z_{layer}}$, to see whether the pixel value of $Id_{z_{layer}}$ at $(x_{ins}, y_{ins})$ is not equal to 0. If so, exit the loop and the light ray $l$ is assumed to be blocked.
5: end for

One example of our synthetic light-field raw data is first shown in Fig. 7. The input RGB-D data is obtained from [32]. In this example, the input RGB image and the corresponding depth map are of size 1000 × 1000. However, only a portion of size 400 × 400 is shown in Fig. 7(a) and (b). A portion of the synthesized raw data is shown in Fig. 7(c).(1) Visually, the result shown in Fig. 7(c) looks very much like light-field data. Nevertheless, we need to evaluate, both subjectively and objectively, the usefulness of our synthetic light-field raw data. In this section, this is done through three reconstructions, i.e., refocusing, all-in-focus, and sub-apertures. In this process, we pay special attention to pushing to the extreme by selecting the involved parameters, namely, the number of directional light rays emitted from each pixel, the number of micro lenses, and the number of sensors associated with each micro lens.
(1) Please go to http://pan.baidu.com/s/1c1263Ws to observe the original raw light-field data. This website also includes the raw light-field data with size 2000 × 2000 × 21 × 21.

A. RENDERING APPLICATIONS
For the refocusing application, which is performed according to [35], we perform the subjective evaluation because each refocused image can hardly be aligned accurately with the original image. On the other hand, all-in-focus images will be evaluated objectively by computing the SSIM index [34] between the all-in-focus image and the original RGB image, where the all-in-focus image is calculated by [36]. Lastly, sub-aperture images will be compared with the multi-view results that are obtained by the depth image-based rendering (DIBR) algorithms [37]-[40].
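For readers unfamiliar with the SSIM index, the sketch below computes a simplified, single-window (global) version of it on two flattened grayscale images. This is a simplification chosen for brevity: the reference SSIM of [34] averages the same expression over local windows, so the numbers produced here are not comparable with those reported in this paper. The function name is illustrative; the constants 0.01 and 0.03 are the usual defaults.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equally sized pixel sequences.

    x, y: flat sequences of grayscale values.
    dynamic_range: 255 for 8-bit images.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


reference = [10, 50, 90, 200]
print(global_ssim(reference, reference))
print(global_ssim(reference, [200, 90, 50, 10]))
```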

B. CHOICES OF PARAMETERS
Different choices of parameters in our synthetic camera system will lead to different rendering results in terms of subjective and objective evaluations. Supposing that the optical structure of our synthetic light-field camera system is fixed, we put emphasis on the choices of three parameters: the number of light rays (towards the main lens) K × K, the number of micro lenses P × P, and the number of sensors for one micro lens Q × Q. We will demonstrate how to push to the extreme through selecting these parameters.
Since the input depth map (either from the internet database or from Kinect) is limited in its size at 1000 × 1000, the size of the input RGB image is set to 1000 × 1000. However, we would like to emphasize that this size can be easily extended to a higher level (e.g. 4K or 8K) as long as the depth map of a larger size is available.

1) EXTREME OF SAMPLING OF LIGHT RAYS
The light-ray emitted from a spot lighting source is continuous along two incident angles so that there are an infinite number of light rays. We need to sample these light rays, resulting in K × K light rays. In principle, the bigger the value assigned to K, the better the rendering performance will be, which will be demonstrated in the next subsection through computing the SSIM index between the all-in-focus image and the original input image. However, these results also show that, as the sampling rate K rises, the SSIM curve tends to saturate at some point. Therefore, we can regard such a turning point as the extreme choice of the sampling rate K.
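One simple way to locate such a turning point on a measured SSIM curve is to look for the first sampling rate beyond which the gain drops below a small threshold. The sketch below illustrates this idea only; the threshold value and the function name are assumptions and not part of the original method.

```python
def saturation_point(k_values, ssim_values, min_gain=1e-3):
    """Return the first K whose step to the next K improves SSIM by
    less than min_gain; if the curve never flattens, return the last K.

    k_values: increasing sampling rates that were tested.
    ssim_values: SSIM index measured for each sampling rate.
    """
    if len(k_values) != len(ssim_values) or not k_values:
        raise ValueError("inputs must be non-empty and of equal length")
    for idx in range(len(k_values) - 1):
        if ssim_values[idx + 1] - ssim_values[idx] < min_gain:
            return k_values[idx]
    return k_values[-1]
```

The same helper could be applied to the curves obtained for P and Q.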

2) EXTREME OF SPATIAL RESOLUTION
The spatial resolution of light-field data is determined by the number of micro lenses P × P. In principle, P can be pushed to an extremely high level. However, the original input image has a spatial resolution of M × N. The question is whether it is necessary to push the micro lens array size beyond the original resolution. The answer will be provided in the next subsection, also through computing the SSIM index between the all-in-focus image and the original input image.

3) EXTREME OF ANGULAR RESOLUTION
The angular resolution of the light-field data is determined by the number of sensors Q × Q for every micro lens. In our algorithm, there is no limit on the choice of Q, which means that one can push to the extreme by choosing a large sensor array, e.g., 50 × 50 or 100 × 100. However, since the angular resolution (multi-view information) is related to and restricted by the input depth map, we will demonstrate, again by computing the SSIM index between the all-in-focus image and the original input image, that an increase of Q might not always contribute positively to the rendering performance.
FIGURE 8. … [27] and [24]. (a) to (d) are the images refocused at the distances of 243.75, 262.50, 287.50, and 306.25 cm, respectively, from the main lens, generated by our algorithm from the synthetic light-field raw data shown in Fig. 7(c); (e) to (h) are the images refocused at the same distances from the main lens, generated by the algorithm of [27]; (i) to (l) are the images refocused at the same distances from the main lens, generated by the algorithm of [24] without filtering.

C. RENDERING RESULTS
We carry out experiments on several test images to generate the rendering results of three applications: refocusing, all-in-focus, and sub-apertures.

1) REFOCUSING
Four refocused results at different distances from the camera are shown in Fig. 8(a)-(d), where the involved parameters are listed in Table 2. When we focus at a specific distance z_focus, the points with Z_z = −z_focus become clear while others are blurred. The larger ||Z_z| − z_focus| is, the more blurred the point at (X_z, Y_z, Z_z) will be. Fig. 8(a) is focused at the front part of the scene so that the back part gets blurred, and the focus moves from front to back from Fig. 8(a) to (d). The subjective perception of a clear blurring shift at different focal distances serves as a positive evaluation for this rendering application of our synthesized light-field raw data. Compared to our previous results [27] shown in Fig. 8(e)-(h), our present results are clearly better. In particular, all the black lines that appeared in Fig. 8(e)-(h) have disappeared. There are several reasons for this. Firstly, the parameters used in [27] are set at a much lower level, e.g., only 81 × 81 light rays (K = 81) and 200 × 200 micro lenses (P = 200) are used, whereas the sensor array is the same (Q = 20). Secondly, the propagation of each directional light ray in the 3D space has been simplified into two 2D scenarios (i.e., X-Z and Y-Z) in [27], so that the calculations along the optical path of one light ray through the main lens and the corresponding micro lens are not accurate. Lastly, the depth map has been quantized into 5 layers in [27], whereas we keep the original 8-bit depth map with no quantization in the current work.
FIGURE 9. SSIM index comparison: The SSIM index between the input RGB image (a) and the all-in-focus image generated from the synthetic light-field data generated by our algorithm (b) is 0.9236. The SSIM index between the input RGB image (a) and the all-in-focus image generated from the synthetic light-field data generated by [27] (c) is 0.5974. The SSIM index between the input RGB image (a) and the all-in-focus image generated from the stack of refocused images generated by [24] (d) is 0.7414.
Compared to the results generated by [24], which uses a similar ray tracing method to generate the refocused image from stereo images (shown in Fig. 8(i)-(l)), the boundary artifacts are removed more effectively by our algorithm. Since [24] only considers the refocusing task, the synthetic 4D light field serves as an intermediate result, so that other tasks such as building the sub-aperture images may not be supported. On the other hand, our algorithm is applicable not only to refocusing but also to other tasks, as presented in the following.

2) ALL-IN-FOCUS
The SSIM index [34] has been widely used to measure the similarity between a processed image and the original one. This index has been shown to be more consistent with human visual perception than other quantitative metrics such as the peak signal-to-noise ratio (PSNR) and the mean squared error (MSE). In our experiments, we compute the SSIM index between the original input RGB image and the all-in-focus image (also called the extended depth of field image) that is obtained by the digital photomontage algorithm developed in [36]. As an example, the all-in-focus image generated from the raw data shown in Fig. 7(a) is presented in Fig. 9(b). Again, the parameters listed in Table 2 have been used to obtain this result. Comparing it with the original image shown in Fig. 9(a), we obtain an SSIM index of 0.9236. Meanwhile, the SSIM index between the all-in-focus result from the light-field data generated by [27] (Fig. 9(c)) and the original image (Fig. 9(a)) is 0.5974. For the all-in-focus result from the stack of refocused images generated by [24], the SSIM index between it (Fig. 9(d)) and the original image (Fig. 9(a)) is 0.7414.
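For readers unfamiliar with the metric, the SSIM formula can be illustrated with a simplified sketch. It evaluates the formula once over the whole image, whereas the index of [34] is normally computed over local windows and then averaged, so the values it returns are not comparable with those reported here. The function name `global_ssim` and the list-of-rows image format are assumptions made for this illustration.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equally sized grayscale images.

    x, y          : images given as lists of rows of pixel values
    dynamic_range : L in the SSIM constants (255 for 8-bit data)

    Simplification: statistics are taken over the whole image instead of
    over local windows, so this only illustrates the formula itself.
    """
    xs = [float(p) for row in x for p in row]
    ys = [float(p) for row in y for p in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("images must be non-empty and of equal size")
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    var_x = sum((a - mu_x) ** 2 for a in xs) / n
    var_y = sum((b - mu_y) ** 2 for b in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    # Stabilizing constants with the commonly used K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Tiny synthetic images for illustration only.
a = [[0, 255], [255, 0]]
b = [[255, 0], [0, 255]]
print(global_ssim(a, a), global_ssim(a, b))
```

An image compared with itself scores 1 (up to rounding), and the score drops as luminance, contrast, or structure diverge.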
Thanks to the SSIM index, we now have a quantitative metric for pushing the quality of all-in-focus images to the extreme by testing a range of values for each of the three important parameters involved in the synthesis process. That is, while keeping all other parameters unchanged, we vary the number of light rays (K), the number of micro lenses (P), or the number of sensors (Q), one at a time. The resulting SSIM curves are shown in Fig. 10. Note that these curves are the averaged results over ten test images.
As shown in Fig. 10(a), the more light rays are kept after the sampling, the better the rendering result is. This is consistent with the real-world situation, where the number of light rays radiated from a spot lighting source is infinite. Therefore, one can push K as high as possible to get a better rendering result. However, Fig. 10(a) also shows that the SSIM curve grows more and more slowly with K. If we set the acceptable threshold at 0.9 for SSIM, we find that 800 × 800 light rays are sufficient for our test images of size 1000 × 1000, which may serve as a simple rule for the selection of the parameter K.
FIGURE 11. (a) The sub-aperture image composed by taking the sensor at location (4,5) behind each micro lens from the light-field raw data with the parameters listed in Table 2. (b) The sub-aperture image after adjustment by Algorithm 3.
According to Fig. 10(b), for the test images of size 1000 × 1000, it seems that the SSIM curve reaches its maximum when about 800 × 800 micro lenses are used. We believe that this may serve as the second simple rule for the selection of the parameter P. Obviously, when the size of the input image increases, the parameter P should increase proportionally. Such a relationship also applies to the selection of K .
Finally, Fig. 10(c) shows that, when more sensors are used, the rendering performance declines, but only very slightly. Generally speaking, an increase in Q has very little influence on the all-in-focus image. In the meantime, we know that the information provided by the input depth map is limited and not very accurate. We believe that this is one reason why more sensors do not contribute positively to the rendering performance in the all-in-focus application. However, we still recommend that a relatively big sensor array be used, as more sensors can produce more sub-aperture images.
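The two simple rules for K and P can be written down as a short sketch: pick the smallest tested value whose SSIM reaches the threshold, then scale it proportionally when the input image grows. The helper names and the sweep values are hypothetical and serve only as an illustration; they are not read from Fig. 10.

```python
def smallest_meeting_threshold(values, ssim, threshold=0.9):
    """Return the smallest tested value whose SSIM reaches the threshold, or None."""
    for value, score in sorted(zip(values, ssim)):
        if score >= threshold:
            return value
    return None


def scale_parameter(reference_value, reference_side, new_side):
    """Scale K or P proportionally to the side length of the input image."""
    return round(reference_value * new_side / reference_side)


# Made-up sweep for illustration only (not measured data).
tested = [200, 400, 600, 800, 1000]
scores = [0.70, 0.82, 0.88, 0.91, 0.92]
k_ref = smallest_meeting_threshold(tested, scores, threshold=0.9)
print(k_ref)                                # 800
print(scale_parameter(k_ref, 1000, 4000))   # 3200
```

The proportional scaling is taken from the rule stated above; whether it holds for much larger inputs would still need to be checked against measured SSIM curves.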

3) MULTI-VIEW SUB-APERTURE IMAGES
The straightforward method to compose a sub-aperture image is to group the pixels in the light-field raw data that are located at the same position in the Q × Q sensor array corresponding to each micro lens [35]. As a result, each sub-aperture image is of size P × P and corresponds to one view through the desired sub-region of the main lens. Fig. 11(a) shows one result obtained by following this method, where a large portion becomes black, i.e., no pixels can be recovered. The reason is explained below.
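The straightforward composition described above can be sketched as follows. The sketch assumes that the raw data is stored as a single (P·Q) × (P·Q) image in which each micro lens covers one contiguous Q × Q block, with u indexing the column and v the row inside a block; this layout and the function name `sub_aperture` are assumptions for illustration. It does not include the sensor substitution of Algorithm 3.

```python
def sub_aperture(raw, P, Q, u, v):
    """Collect the sensor at position (u, v) behind every micro lens.

    raw  : raw light-field image as a list of P*Q rows with P*Q samples each
           (assumed layout: one contiguous Q x Q block per micro lens)
    P    : number of micro lenses per side
    Q    : number of sensors per side behind each micro lens
    u, v : 1-based sensor location inside a block (u: column, v: row)
    """
    if not (1 <= u <= Q and 1 <= v <= Q):
        raise ValueError("u and v must lie in [1, Q]")
    if len(raw) != P * Q or any(len(row) != P * Q for row in raw):
        raise ValueError("raw must have P*Q rows of P*Q samples")
    # One output pixel per micro lens, so the result has size P x P.
    return [
        [raw[i * Q + (v - 1)][j * Q + (u - 1)] for j in range(P)]
        for i in range(P)
    ]


# Toy example: 2 x 2 micro lenses with 2 x 2 sensors each.
raw = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
print(sub_aperture(raw, P=2, Q=2, u=1, v=1))  # [[0, 2], [8, 10]]
```

Calling the function for every (u, v) in [1, Q] × [1, Q] yields the full Q × Q array of views.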
Referring to Fig. 12, we assume that three light rays l_1, l_2, and l_3 with the same direction arrive at the first sensor behind micro lenses L_micro1, L_micro2, and L_micro3, respectively; this direction represents the direction of view for the sub-aperture image composed by the first sensor of each micro lens. However, for micro lenses such as L_micro4, L_micro5, and L_micro6, light rays l_4, l_5, and l_6 with the same direction are all out of the bound of the main lens, which means that no light rays can reach the first sensor of micro lenses L_micro4, L_micro5, and L_micro6.
FIGURE 12. l_1 to l_6 are light rays with the same direction that represents the view determined by the composition of the first sensor behind each micro lens, but only l_1, l_2, and l_3 can reach the first sensor of micro lenses L_micro1, L_micro2, and L_micro3, respectively. l_4_substitute, l_5_substitute, and l_6_substitute are the substitute light rays for l_4, l_5, and l_6, respectively, which results in substitutions of sensors for micro lenses L_micro4, L_micro5, and L_micro6 to compose the desired view.
In order to solve this problem, we need to take advantage of light rays within a range of each desired direction. The range of direction of a light ray is determined by the number of sensors Q.
As illustrated in Fig. 12, for micro lens L_micro4, we locate the light ray l_4_substitute whose direction differs slightly from the direction of l_4. This small difference must be within the range that is covered by l_4; otherwise such a substitute will not be used. Note that l_4_substitute will not go through the centre of L_micro4 but will be refracted to reach the second sensor of L_micro4, thus resulting in a sensor substitution when composing the corresponding sub-aperture image. The same substitution applies to L_micro5, L_micro6, and so on, and, obviously, the shift of the sensor's location becomes larger and larger.
The relationship between the direction of the desired view (represented by the location of the sensor behind each micro lens (u_sensor, v_sensor), where u_sensor ∈ [1, Q] and v_sensor ∈ [1, Q]), the location of the target micro lens ((u_micro, v_micro), where u_micro ∈ [1, P] and v_micro ∈ [1, P]), and the shift of the location of the substituted sensors ((u_shift, v_shift), where u_shift ∈ [1, Q] and v_shift ∈ [1, Q]) can be derived as follows (the calculation is the same for u and v so that we only present the calculation of u):
where u_ds_main is the projection of the light ray that represents the desired view on the main lens along the X axis and u_ds_sensor is the distance between u_sensor and Q/2. Then, we can define the location of the boundary micro lens u_boundary as
where ceil() returns the smallest integer greater than or equal to a given number. For those micro lenses that are out of the boundary, the shift of the location of the selected sensor u_shift should be
The final location of the selected sensor (u_select, v_select) follows Algorithm 3. We have applied Algorithm 3 to the example considered in Fig. 11, where Part (b) shows the result. Compared with Part (a), we have obtained the full-size sub-aperture image.
FIGURE 13. An 8 × 8 sub-aperture array generated from the synthetic light-field raw data with the parameters listed in Table 2. The right part shows the zoom-in versions from three selected views (left column) in comparison with the similar views synthesized using the DIBR algorithm [39] (middle column) and using the algorithm in [41] (right column).
Finally, Fig. 13 shows an 8 × 8 array of sub-aperture images that represent multiple views of the original scene. Compared with the multi-view images generated by the DIBR algorithms [37], [38] (middle column) and the novel view from [41] (right column), we believe that our results (left column) are quite comparable. In fact, all methods (ours, DIBR, and Zhang's) have certain drawbacks. For instance, our algorithm produces some overly bright regions at the object borders, the DIBR algorithm leads to more dark pixels (holes) in areas other than the object borders, and Zhang's algorithm [41] shows more distortion when the novel view is further from the input view. Here, we do not attempt to produce better results than the DIBR algorithm; rather, we wish to show results that are comparable, which demonstrates that our synthesized raw data is usable.

4) MORE RESULTS
In this part, we show more results obtained from different input sources (an existing RGB-D dataset, images captured by Kinect, and stereo image pairs) with different parameter setups. In Fig. 14, the synthesized light-field raw data has a size of 800 × 800 × 20 × 20 (P = 800, Q = 20, and other parameters are the same as in Table 2). The input RGB-D image pair is from Kinect, which captures color and depth images at the same time.
In Fig. 15, the synthesized light-field raw data has a size of 1000 × 1000 × 20 × 20 (P = 1000, Q = 20, and other parameters are the same as in Table 2). The input RGB-D image pair is from the dataset provided in [32], which provides high-quality pairs of color and depth images.
In Fig. 16, the synthesized light-field raw data has a size of 2100 × 2100 × 20 × 20 (P = 2100, Q = 20, and other parameters are the same as in Table 2). The input RGB-D image pair consists of the right view of a stereo camera and the depth map estimated from the stereo image pair.
In Fig. 17, the synthesized light-field raw data has a size of 2000 × 2000 × 30 × 30 (P = 2000, Q = 30, and other parameters are the same as in Table 2). The input RGB-D image pair is from the dataset provided in [32], which provides high-quality pairs of color and depth images. In our experiments, the maximum values chosen for P (spatial resolution) and Q (angular resolution) are 2100 and 30, respectively. Given the mechanism of our algorithm and the time consumed by the experiments (up to 8 hours on CPU), we can push the parameters (such as P and Q) even higher if we run the synthesis on GPU. Here, the only concern is the memory demand of the uncompressed raw light-field data, which can be addressed by upgrading storage devices and by applying corresponding compression methods for light-field data.
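The memory demand mentioned above can be estimated with a short back-of-the-envelope sketch. It assumes that every sensor stores one 8-bit RGB sample (3 bytes), which is an assumption made for illustration rather than a stated property of our raw data format; the function name `raw_size_bytes` is likewise hypothetical.

```python
def raw_size_bytes(P, Q, channels=3, bytes_per_sample=1):
    """Uncompressed size of P x P x Q x Q raw light-field data in bytes.

    channels and bytes_per_sample default to 8-bit RGB, which is an
    assumption of this sketch.
    """
    return P * P * Q * Q * channels * bytes_per_sample


# Largest spatial and angular settings mentioned in the experiments.
size = raw_size_bytes(P=2100, Q=30)
print(f"{size / 1e9:.1f} GB")  # 11.9 GB
```

Since the size grows with P² · Q², doubling both P and Q multiplies the storage by 16.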

V. CONCLUSION
In this paper, we have proposed an algorithm for the synthesis of light-field raw data from a single RGB-D input image. Our synthetic data can easily be extended beyond existing limits on the spatial and angular resolutions of light-field data. To this end, we have proposed to make use of the depth information to build up a 3D point cloud from the input RGB image and to follow as accurately as possible the optical path within our synthetic imaging system.
To demonstrate the usefulness of our synthesized light-field raw data, we have considered three rendering applications, i.e., refocusing, all-in-focus, and sub-apertures, where both subjective and objective evaluations have been carried out. For a given input RGB-D image, the rendering performance depends on the choices of three parameters in our synthetic camera system: the number of light rays, the number of micro lenses, and the number of sensors corresponding to each micro lens. In particular, for the all-in-focus scenario, we have used the SSIM index to help us push the rendering performance to the extreme by selecting each of these three parameters separately.
Sub-aperture images have been generated from our synthetic light-field raw data. Although these images can provide 3D information similar to that of the multi-view images produced by the DIBR methods, we notice that such 3D information still cannot deal very well with the dis-occlusion problem, as the input is from only one view. This remains one of the issues for our future work.
In summary, our synthesis algorithm has effectively solved the problem associated with the limited spatial and angular resolution of the existing light-field raw data. With these synthetic raw data, we believe that more applications can be developed, e.g., how to compress light-field raw data and how to extract the depth information from light-field raw data.