Deep learning-based fringe modulation-enhancing method for accurate fringe projection profilometry

Abstract: Fringe projection profilometry (i.e., FPP) has become one of the most popular 3-D measurement techniques. The phase error due to system random noise becomes non-ignorable when the fringes captured by a camera have a low fringe modulation, which is inevitable for object surfaces with non-uniform reflectivity. The phase calculated from these low-modulation fringes may have a non-ignorable phase error and generate 3-D measurement error. Traditional methods reduce the phase error at the cost of losing details of 3-D shapes or sacrificing measurement speed. In this paper, a deep learning-based fringe modulation-enhancing method (i.e., FMEM) is proposed that transforms two low-modulation fringes with different phase shifts into a set of three phase-shifted high-modulation fringes. FMEM enables the desired phase to be calculated from the transformed set of high-modulation fringes, resulting in accurate 3-D FPP without sacrificing speed. Experimental analysis verifies its effectiveness and accuracy.


Introduction
Fringe projection profilometry (i.e., FPP) plays an important role in non-contact, high-resolution and high-precision 3-D measurement [1][2][3][4]. In FPP, the desired phase is calculated by using either a phase-shifting algorithm [5] or a transform-based method [6]. The former achieves higher accuracy but requires at least three fringes [7], and the latter requires fewer fringes but loses 3-D details [8][9][10]. The measurement error mainly comes from the system's nonlinear response (i.e., gamma distortion), which can be compensated by using a look-up table [11], pre-coding [12], or a gamma model-based technique [13], etc. However, an FPP system inevitably contains random noise due to unstable ambient light, camera and projector flicker, camera noise, and quantization error in the frame grabber and the projector [14], etc. The random noise may generate non-ignorable phase error when the captured fringes have a low fringe modulation [15,16], because the random phase error is inversely proportional to the fringe modulation [17]. The phase error could be larger than 0.06 rad when the fringe modulation is lower than 11, and the 3-D measurement error could exceed 0.4 mm according to the experimental set-up in this study.
In FPP, the fringe modulation is influenced by the projector light reaching the object's surface, the surface reflectivity, and the aperture and exposure of the camera [18,19], etc. The projector light can be assumed constant, and the aperture of the camera is pre-determined. The measured object's surface may have non-uniform reflectivity (i.e., part of the surface with relatively low reflectivity and the rest with high reflectivity) [20]. The camera can use a large exposure to capture high-modulation fringes for local surfaces with low reflectivity, but may then capture saturated fringes for other surfaces with high reflectivity. Furthermore, high-speed 3-D measurement requires a small camera exposure [21,22], which also makes it challenging to capture high-modulation fringes.
Traditional methods reduce the phase error by temporally increasing the number of fringes [14] or by filtering the spectrum of the random noise [23,24], but the former sacrifices the measurement speed and the latter loses 3-D details [25]. Therefore, it is of great importance for FPP to reduce the phase error effectively and accurately. Deep learning has been successfully used in computer vision applications [26][27][28][29]. For low-light image enhancement, deep learning can map low-light images to normal-light images [30]. Recently, deep learning has also been introduced to reduce the number of required fringes in FPP [31][32][33][34] and for denoising in optical interferometric techniques [25,35,36], etc. Most deep learning-based FPP applications focus on improving the measurement speed by using a single fringe [31,33], which makes it difficult to achieve accuracy comparable to traditional phase-shifting algorithms that use relatively more fringes.
In this paper, a deep learning-based fringe modulation-enhancing method (i.e., FMEM) is proposed by designing a fringe modulation enhancement convolutional neural network (i.e., FMENet) to transform two low-modulation fringes with different phase shifts into a set of three phase-shifted high-modulation fringes. The desired phase can be calculated from these transformed high-modulation fringes. The phase error is significantly reduced, and accurate 3-D shapes can be obtained even for fringes with a very low modulation.
The rest of the paper is organized as follows. Section 2 analyzes the phase error. Section 3 presents the proposed FMEM. Section 4 gives experiments. Section 5 concludes this paper.

Analysis of the phase error due to low fringe modulation
In FPP, a set of phase-shifted sinusoidal fringes is first projected by a projector and then captured by a camera as [37][38][39][40]

    I_n(x, y) = a(x, y) + Δa_n(x, y) + b(x, y) cos[φ(x, y) − δ_n],  n = 1, 2, 3, ..., N,    (1)

where (x, y) denotes the camera coordinate, φ denotes the desired phase, δ_n = 2π(n − 1)/N is the phase shift amount, N is the number of phase steps, Δa_n denotes the random noise, a denotes the fringe background, and b denotes the fringe modulation, which is described by [18,19,41]

    b(x, y) = s t_e α(x, y) b_p(x, y),    (2)

where s is the camera sensitivity, t_e is the camera exposure, α is the surface reflectivity, and b_p is the fringe modulation of the projected fringes. The object's surface reflectivity is unknown, while the camera sensitivity and the modulation of the projected fringes are constant; the fringe modulation can therefore be adjusted by using different camera exposures. In practice, the fringe modulation is calculated by [21]

    b = (2/N) √[(Σ_{n=1}^{N} I_n sin δ_n)² + (Σ_{n=1}^{N} I_n cos δ_n)²].    (3)

For simplicity, the notation (x, y) is omitted hereafter. The actual phase is calculated by using a least-squares algorithm [42]

    φ = tan⁻¹[(Σ_{n=1}^{N} I_n sin δ_n) / (Σ_{n=1}^{N} I_n cos δ_n)].    (4)
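The least-squares phase calculation and the fringe-modulation formula above can be sketched in a few lines of NumPy. This is an illustration, not the authors' code; the background value of 120 and modulation of 50 are arbitrary synthetic choices.

```python
import numpy as np

def phase_and_modulation(fringes):
    """Least-squares phase and fringe modulation from N phase-shifted
    fringes of shape (N, H, W) with shifts delta_n = 2*pi*(n-1)/N."""
    N = fringes.shape[0]
    delta = 2 * np.pi * np.arange(N) / N
    s = np.tensordot(np.sin(delta), fringes, axes=(0, 0))  # sum of I_n*sin(delta_n)
    c = np.tensordot(np.cos(delta), fringes, axes=(0, 0))  # sum of I_n*cos(delta_n)
    phi = np.arctan2(s, c)               # wrapped phase in (-pi, pi]
    b = 2.0 / N * np.sqrt(s**2 + c**2)   # fringe modulation
    return phi, b

# Noise-free three-step fringes with background 120 and modulation 50
# (illustrative values only).
phi_true = np.tile(np.linspace(-3.0, 3.0, 64), (64, 1))
delta = 2 * np.pi * np.arange(3) / 3
fringes = 120 + 50 * np.cos(phi_true[None] - delta[:, None, None])

phi, b = phase_and_modulation(fringes)
```

In the noise-free case the recovered phase and modulation match the synthetic values exactly, which is a convenient sanity check before adding noise.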
The phase error caused by the random noise can be described by [39]

    Δφ = √2 σ / (√N b),    (5)

where σ is the standard deviation of the random noise Δa_n; the phase error is proportional to the random noise and inversely proportional to the fringe modulation. Because the noise level in an FPP system can be assumed constant over a short camera exposure time [43,44], the phase error can be reduced by enhancing the fringe modulation according to Eq. (5). A large camera exposure directly enhances the fringe modulation, but reduces the measurement speed and causes motion-induced problems [45]. For clarity, a simulation of the phase error is provided for the commonly used three-step FPP. A Gaussian distribution with zero mean and a standard deviation of 1 is used to generate the random noise [16]. The phase error of the three-step FPP for different modulations is shown in Fig. 1(a). When the fringe modulation is smaller than 30, the phase error becomes non-ignorable, exceeding 0.02 rad [46]. When a small fringe modulation of 10 is selected, the phase error of FPP with different numbers of phase steps is shown in Fig. 1(b); at least 25 phase steps are required to reduce the phase error below 0.02 rad.
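The dependence of the phase error on the fringe modulation can be reproduced with a small Monte-Carlo simulation. This is a sketch with illustrative values (background 128, unit-variance Gaussian noise), not the simulation used for Fig. 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_phase_error(b, N=3, trials=20000, sigma=1.0):
    """Monte-Carlo estimate of the mean absolute phase error of an N-step
    phase-shifting algorithm for modulation b and Gaussian noise std sigma."""
    phi_true = rng.uniform(-np.pi, np.pi, trials)
    delta = 2 * np.pi * np.arange(N) / N
    fringes = 128 + b * np.cos(phi_true[None, :] - delta[:, None])
    fringes = fringes + rng.normal(0.0, sigma, fringes.shape)
    s = np.tensordot(np.sin(delta), fringes, axes=(0, 0))
    c = np.tensordot(np.cos(delta), fringes, axes=(0, 0))
    # Wrapped difference between recovered and true phase.
    err = np.angle(np.exp(1j * (np.arctan2(s, c) - phi_true)))
    return np.mean(np.abs(err))

e10, e30, e60 = mean_phase_error(10), mean_phase_error(30), mean_phase_error(60)
```

With unit noise and three steps, the estimated mean error at modulation 10 comes out above 0.06 rad, consistent with the figures quoted in the Introduction, and falls monotonically as the modulation grows.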

Proposed FMEM
The proposed FMEM designs a fringe modulation enhancement convolutional neural network (i.e., FMENet) to transform two low-modulation fringes with different phase shifts into a set of three phase-shifted high-modulation fringes, and includes two steps: training and testing. A three-step phase-shifting algorithm is used for FMEM due to its high speed [47]. Two low-modulation fringes with respective phase shift amounts of 0 and 2π/3 are captured under a small camera exposure and selected as the input. A set of three phase-shifted high-modulation fringes is captured under a large camera exposure (i.e., the fringes have the highest fringe modulation without intensity saturation) and selected as the ground truth. In the training step, FMENet is trained to minimize the loss function (i.e., the difference between the output and the ground truth) [26]. In the testing step, FMENet outputs a set of three phase-shifted high-modulation fringes from the input of two low-modulation fringes according to the trained model. Figure 2 illustrates the schematic of FMEM. The input, output and ground-truth fringes are denoted by I_n (n = 1, 2), I_n^out (n = 1, 2, 3) and I_n^gt (n = 1, 2, 3), respectively. By substituting I_n with I_n^out in Eq. (4), the desired phase φ^out is calculated. The desired phase is wrapped in the range (−π, π], and the gray code-based method is used to retrieve the absolute phase [48,49]. Because of the projector defocusing and the camera's discrete sampling, the random noise generates unwrapping errors along the 2π phase jumps, which can be effectively removed by introducing a median filter, even for captured fringes with a low fringe modulation [5]. The 3-D shape can be reconstructed by combining the absolute phase with the calibrated parameters [50]. The loss function of FMENet is expressed as

    L(θ₁) = (1/(3m)) Σ_{n=1}^{3} ‖I_n^out − I_n^gt‖₂²,    (6)

where θ₁ denotes the parameter space of the network, including the weights, biases and convolutional kernels [32], m denotes the number of pixels, and ‖·‖₂ is the ℓ₂-norm.
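The loss function above is a pixel-wise squared L2 difference between the three output fringes and the ground-truth fringes, which in PyTorch can be sketched as follows. The exact normalization constant used by the authors may differ by a fixed factor; the tensor shapes here are illustrative.

```python
import torch

def fmem_loss(out, gt):
    """Squared L2 difference between the three output fringes and the
    ground-truth fringes, averaged over all pixels and channels."""
    return torch.mean((out - gt) ** 2)

out = torch.zeros(2, 3, 32, 32)  # batch of three-channel output fringes
gt = torch.ones(2, 3, 32, 32)    # corresponding ground-truth fringes
loss = fmem_loss(out, gt)
```

Because the mean is taken over every element, the gradient scale is independent of image resolution, which keeps the learning rate comparable across input sizes.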
The structure of FMENet is illustrated in Fig. 3, where a fringe resolution of 512×512 is selected. FMENet consists of a U-shaped structure [51,52] and a residual structure [26,53], which increase the computational efficiency of the network and achieve better restoration of image details. The U-shaped structure consists of the following operations: convolution (Conv) [51] and batch normalization (BN) [54]; Conv, BN and Dropout [55], repeated three times; and Conv, BN and the rectified linear unit (ReLU) [56]. The input fringes are spatially downsampled and upsampled, which achieves the extraction of high-level features and the recovery of detailed features. Meanwhile, skip connections are used to integrate low-level information with high-level information to retrieve missing feature information, fully exploiting pixel-level information and semantic-level features [52]. The feature information is fed into the residual structure, which includes the operations of Conv, three residual blocks [53] and Conv. The residual structure alleviates the vanishing-gradient problem caused by increasing depth in deep neural networks and further restores local details in the output fringes. FMENet is trained by the adaptive moment estimation (ADAM) optimizer [57]. We set a batch size of 8 and start with a learning rate of 5 × 10⁻⁴, which is halved (usually every 200 epochs) whenever the training error stagnates.

Experiment analysis
We first verify the designed FMENet, and then compare the proposed FMEM with traditional methods. A high-speed three-step defocusing FPP is used [46,58], and the defocusing level is properly selected to remove high-order harmonics [59]. The FPP system includes a projector (DLP6500, Texas Instruments) with a resolution of 1920 × 1080 and a CMOS camera (Basler acA800-510um) with a resolution of 400 × 400 and a 12 mm focal-length lens. The distance between our system and the measured objects is about 1.5 m. The fringes are designed with a fringe period of 30 pixels and a fringe frequency of 64, according to the projector's resolution of 1920. FMENet is implemented in Python with the PyTorch framework on a PC with an Intel Core i9-7900X CPU (3.30 GHz), 32 GB of RAM, and an NVIDIA Titan RTX GPU.
The training, validation and testing sets of FMENet contain 100, 20 and 20 scenes, respectively. For simplicity, fringes with a low modulation are captured by using a small camera exposure. For each scene, seven sets of low-modulation phase-shifted fringes are captured under seven relatively small camera exposures (i.e., 250 µs, 350 µs, 450 µs, 550 µs, 750 µs, 950 µs and 1,150 µs), and a set of high-modulation fringes is captured under a large camera exposure (i.e., 3,500 µs). FMENet uses two low-modulation fringes with respective phase shift amounts of 0 and 2π/3 as the input and a set of three phase-shifted high-modulation fringes as the ground truth. In addition, we emphasize that all the experimental results are obtained from the testing set.
In our datasets, the collected samples are independent and identically distributed, and similar results are obtained on the training, validation and testing sets, which indicates that our datasets contain sufficient samples and extensively cover the sample space of our experimental environment. It should be noted that the proposed FMEM requires additional samples for different measurement systems due to its data dependence.

Verification of FMENet
By randomly selecting one scene from the testing set, the experimental result is obtained and shown in Fig. 4. The fringes captured under the seven relatively small camera exposures, the fringes outputted by FMENet, and the fringes captured under the large camera exposure of 3,500 µs are shown in Figs. 4(a)-4(c), respectively. In detail, the first phase-shifted fringe of each set is selected, and the intensity of the 158th row is plotted in Fig. 4(d). The fringes outputted by FMENet are as clear and bright as the fringes captured under the large exposure. The fringe modulations of the seven inputted sets are provided in Table 1, where the modulations of the inputted and outputted fringes are denoted by b_in and b_out, respectively. Because different pixels in the captured fringes may have different fringe modulations, the mean fringe modulation over the whole image is used to describe the fringe modulation. The fringe modulations of the seven inputted sets range from 3.53 to 17.59, all the fringe modulations of the outputted sets are above 53, and the fringe modulation of the ground truth is 54.86. FMENet enhances the fringe modulation to a value close to the desired modulation of the ground truth and performs consistently well for the other scenes in the testing set.

Comparison between the proposed FMEM and traditional methods
The proposed FMEM is compared with the three-step FPP (i.e., 3-step), windowed Fourier filtering (i.e., WFF) [24] and the fifteen-step FPP (i.e., 15-step) [60]. The ground-truth phase is calculated from fifteen phase-shifted high-modulation fringes. By subtracting the actual phase from the ground-truth phase, the mean absolute value of the phase error is obtained and provided in Table 2, where ∆Φ_3-step, ∆Φ_FMEM, ∆Φ_WFF and ∆Φ_15-step denote the phase errors of 3-step, FMEM, WFF and 15-step, respectively. FMEM performs slightly better than WFF and obviously better than 15-step when the fringe modulation is very low (i.e., ranging from 3.53 to 8.16), and reduces the phase error to around 0.02 rad for different fringe modulations. Although 15-step performs similarly to FMEM when the fringe modulation is higher than 8.16, it requires fifteen fringes, which obviously reduces the measurement speed compared with FMEM using two fringes. The phase error distributions of two scenes (i.e., 250 µs and 750 µs) are shown in Fig. 5(a) and Fig. 6(a), respectively. For clarity, two areas are enlarged and shown in Fig. 5(b) and Fig. 6(b), respectively. As shown in Fig. 5, the phase error of FMEM is obviously smaller than that of 15-step, and also smaller than that of WFF in areas with a relatively complex surface. The phase error shown in Fig. 6 is generally smaller than that in Fig. 5, because a larger camera exposure results in a smaller phase error. The absolute phases of the above two scenes are retrieved from the obtained phases with the assistance of gray code-based patterns [61], and then combined with the system's calibrated parameters to reconstruct the 3-D shapes [62], as shown in Fig. 7(a) and Fig. 8(a), respectively, where the first, second, third, fourth and fifth columns represent the 3-D shapes calculated from the phases obtained by the ground truth, 3-step, FMEM, WFF and 15-step, respectively. For clarity, two areas are enlarged and shown in Figs. 7(b)-7(c) and Figs. 8(b)-8(c), respectively.
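A mean-absolute-phase-error metric of the kind reported in Table 2 can be computed as follows. This is a sketch of the metric, not the authors' evaluation code; computing the difference through the complex exponential keeps 2π wrap-arounds from inflating the error.

```python
import numpy as np

def mean_abs_phase_error(phi, phi_gt):
    """Mean absolute difference between two wrapped phase maps, computed
    through the complex exponential so that 2*pi jumps do not inflate it."""
    d = np.angle(np.exp(1j * (phi - phi_gt)))
    return np.mean(np.abs(d))

phi_gt = np.linspace(-np.pi, np.pi, 100, endpoint=False)
phi = phi_gt + 0.02   # a uniform 0.02 rad error for illustration
err = mean_abs_phase_error(phi, phi_gt)
```

Taking the difference naively would report errors near 2π wherever one map wraps and the other does not; the wrapped difference avoids that without requiring phase unwrapping first.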
The measurement errors of the 126th row for the four methods are plotted in Fig. 7(d) and Fig. 8(d), respectively. As shown in Fig. 7, the resulting 3-D shapes of FMEM and WFF are smoother than the shape of 15-step, which contains speckles caused by the larger phase error. FMEM preserves more 3-D details than WFF around the outline of the eyes, the underside of the nose, the mouth and the whiskers, etc., and achieves accurate 3-D shapes similar to the 3-D shape of the ground truth. The measurement error of 3-step has a mean value of 0.93 mm, which is reduced to 0.23 mm, 0.26 mm and 0.40 mm by the proposed FMEM, WFF and 15-step, respectively. As shown in Fig. 8, FMEM also performs well, obtaining a smooth 3-D shape similar to the 3-D shape of the ground truth. Due to the larger camera exposure, WFF preserves more 3-D details and 15-step generates fewer speckles compared with Fig. 7. The measurement error of 3-step is 0.41 mm, which is reduced to about 0.17 mm by FMEM, WFF and 15-step.
The calculation times of 3-step, FMEM, WFF and 15-step are 0.9 ms, 79.5 ms, 231.1 ms and 1.8 ms, respectively, when these methods are run on the GPU. FMEM is obviously more time-consuming than 3-step and 15-step, but less time-consuming than WFF. However, FMEM achieves accuracy similar to 15-step by using only two fringes, and the 3-D shape can be reconstructed offline, which is important for measuring dynamic objects.

3-D reconstruction for scenes with non-uniform reflectivity
One scene containing two objects with non-uniform reflectivity is reconstructed and shown in Fig. 9. In Fig. 9(a), the first, second and third columns are the fringes captured under camera exposures of 4,000 µs and 50,000 µs, and the fringes outputted by FMENet, respectively. It is difficult to capture high-modulation fringes for both objects due to their non-uniform reflectivity. A small camera exposure causes low-modulation fringes of the right object, as shown in the first column, and a large camera exposure causes saturated fringes of the left object, as shown in the second column. FMEM outputs the desired high-modulation fringes, as shown in the third column. The fringe modulations of the left and right objects are 47.99 and 8.77, respectively, in the first column, and 13.35 and 77.11, respectively, in the second column. FMEM enhances the fringe modulations of the two objects to 47.35 and 80.97, respectively. Figure 9(b) shows the resulting 3-D shapes. The low-modulation fringes and the saturated fringes introduce speckles for the right object and large holes for the left object in the 3-D shape, respectively. FMEM obtains an accurate 3-D shape of the scene with non-uniform reflectivity.

Conclusion
In this paper, a fringe modulation enhancement method (i.e., FMEM) is proposed by designing a fringe modulation enhancement convolutional neural network (i.e., FMENet). Two low-modulation fringes with different phase shifts are transformed into a set of three phase-shifted high-modulation fringes by using FMENet. With sufficient training samples, the desired results are reliable and repeatable. FMEM is verified on scenes captured under different low camera exposures and on scenes containing non-uniform reflectivity; it enables 3-D shapes to be reconstructed from only two low-modulation fringes and achieves high accuracy similar to a traditional phase-shifting algorithm with a large number of phase steps (e.g., 15-step).

Disclosures
The authors declare no conflicts of interest.