Multi-scale band-limited illumination profilometry for robust three-dimensional surface imaging at video rate

: Dynamic three-dimensional (3D) surface imaging by phase-shifting fringe projection profilometry has been widely implemented in diverse applications. However, existing techniques fall short in simultaneously providing the robustness in solving spatially isolated 3D objects, the tolerance of large variation in surface reflectance, and the flexibility of tunable working distances with meter-square-level fields of view (FOVs) at video rate. In this work, we overcome these limitations by developing multi-scale band-limited illumination profilometry (MS-BLIP). Supported by the synergy of dual-level intensity projection, multi-frequency fringe projection, and an iterative method for distortion compensation, MS-BLIP can accurately discern spatially separated 3D objects with highly varying reflectance. MS-BLIP is demonstrated by dynamic 3D imaging of a translating engineered box and a rotating vase. With an FOV of up to 1.7 m × 1.1 m and a working distance of up to 2.8 m, MS-BLIP is applied to capturing full human-body movements at video rate.


Introduction
Three-dimensional (3D) surface imaging techniques [1][2][3][4] have found diverse applications, including industrial manufacturing, archaeological inspection, entertainment, and biomedicine [5][6][7][8][9]. The incessant demands of higher accuracy in extracting surface spatial information in complex scenes pose strict requirements to existing 3D imaging methods. For example, a millimeter-level spatial resolution is desired in reliable 3D facial recognition for attendance checks [10]. Meanwhile, video-rate 3D imaging is necessary for fluid-flag interaction analysis in applied bionics [11]. Moreover, spatially isolated 3D objects with large depth discontinuities must be identified for in-situ monitoring and robotic classification [12]. Furthermore, a high dynamic range is indispensable to measure highly reflective objects in defect inspection of steel castings for the automobile industry [13]. Finally, a large field of view (FOV), ideally covering the full human body, is needed to detect and collect 3D movements for virtual reality gaming [14].
Among existing techniques, phase-shifting fringe projection profilometry (PSFPP) has proven to be a potent technique for 3D surface imaging [15]. In the most widely used configuration, PSFPP works by first projecting sets of phase-shifting sinusoidal fringe patterns onto 3D objects and then analyzing deformed structure images reflected from the objects to retrieve 3D surface information [16]. In contrast to other structured-light 3D imaging techniques [17][18][19][20], PSFPP uses sinusoidal fringe patterns to encode grayscale phase values for each pixel, followed by phase unwrapping to extract 3D information with a high depth resolution [21].
The speed of sinusoidal fringe projection is one of the most important specifications in PSFPP. These patterns are commonly generated by digital micromirror devices (DMDs) [22]. Though being a binary amplitude spatial light modulator [23], the DMD can generate grayscale sinusoids in various ways. The conventional dithering method forms a grayscale image by controlling the average reflectance of each micromirror over time, which clamps the projection rate at hundreds of hertz [24]. To improve the fringe projection speed, binary defocusing techniques [25,26] and band-limited illumination profilometry (BLIP) [27,28] have been developed, both of which can produce a grayscale sinusoidal pattern from a single binary DMD mask. Their fringe pattern projection speeds can keep up with the DMD's refreshing rate [up to tens of kilohertz (kHz)], showing promise for high-speed 3D visualizations.
Although overcoming the limitation in projection speed, binary defocusing could bring instability in pattern quality. In this method, quasi-sinusoidal fringe patterns are generated behind the system's image plane. Consequently, DMD's uneven surface could induce image distortion to the defocused binary DMD masks at any unconjugate plane, which may decrease the 3D measurement accuracy, especially under coherent illumination [29]. In addition, because the proper defocusing degree to eliminate the high-order harmonics from the binary mask is frequency-dependent [30], this technique is less compatible with sinusoidal patterns of different periods. Furthermore, the defocusing operation reduces the contrast of quasi-sinusoidal fringes [31], especially at the far end of the measurement volume. Difficult to maintain the defocusing degree for fringe patterns with different frequencies at a fixed image plane, binary defocusing techniques pose challenges in imaging spatially isolated 3D objects [32]. Finally, the binary defocusing technique compromises the depth-sensing range [25]. Hence, it has not reached a cubic meter (m 3 )-level measurement volume, falling short in multi-scale 3D imaging with tunable working distances and FOVs.
BLIP can overcome the limitations in binary defocusing techniques. It generates sinusoids by controlling the optical bandwidth of a 4f imaging system. As a result, regardless of their frequencies, sinusoidal fringe patterns are always generated on the image plane of the DMD. This configuration eliminates the distortion brought by uneven DMD surface and maintains the contrast of projected fringe patterns. It is also inherently compatible with multi-frequency fringe projection, varying working distances, and different FOVs. Despite these attractive advantages, to date, BLIP has not yet been demonstrated for imaging 3D objects with highly reflective surfaces. Meanwhile, BLIP has only been applied with the single-frequency fringe projection method, which is less robust for reconstructing the 3D information of spatially separated objects. Finally, limited by the power of the light sources and the focal lengths of the lenses on the projector and the camera in the existing apparatus [33], BLIP's FOV has not yet reached the meter-square (m 2 ) level.
To surmount these limitations, in this paper, we report multi-scale (MS) BLIP for robust 3D surface imaging at video rate. MS-BLIP implements multi-frequency fringe projection with the associated phase unwrapping, which enables imaging spatially isolated 3D objects. Meanwhile, MS-BLIP adopts dual-level intensity projection to enhance its dynamic range, which allows recovering 3D information from surfaces with high reflectance. Moreover, MS-BLIP applies an iterative method for distortion compensation, which improves the 3D reconstruction quality over a m 2 -level FOV. Finally, MS-BLIP is demonstrated for video-rate 3D surface measurements at varying working distances of up to 2.8 m with tunable FOVs of up to 1.7 m × 1.1 m.

Setup
The schematic of the MS-BLIP system is shown in Fig. 1. A pulsed laser with an average power of 400 mW (AONano 559-0.2-10, Advanced Optowave) is used as the light source. After expansion and collimation, the laser beam is directed to a 0.45" DMD (AJD-4500, Ajile Light Industries) at an incident angle of ∼24°to its surface normal. Binary fringe masks, generated by an error diffusion algorithm from their corresponding grayscale patterns [34], are loaded onto the DMD and displayed at up to 1 kHz. A band-limited 4f imaging system that consists of two lenses (Lens 1 and Lens 2 in Fig. 1; AC254-075-A, Thorlabs) and one pinhole converts these binary patterns to grayscale fringes at the intermediate image plane. The smallest period in the used sinusoidal fringe patterns is 388.8 µm (i.e., 36 DMD pixels), which demands a 150-µm-diameter pinhole (P150K, Thorlabs) to pass the spatial frequency components of these patterns while filtering all noise induced by the digital halftoning [35]. A dove prism (PS992M, Thorlabs), placed between Lens 2 and the intermediate image plane, rotates the generated fringe patterns to match the aspect ratio of the targeted scene. Then, a camera lens (AF-P DX NIKKOR 10-20mm f/4.5-5.6G VR, Nikon) projects these fringe patterns onto 3D objects. The deformed structure images are captured by a high-speed CMOS camera (CP70-1HS-M-1900, Optronis) with a camera lens (AZURE-3520MX5M, AZURE Photonics). Synchronized by the DMD's trigger signal, the acquired images are transferred to a computer via a CoaXPress cable connected to a frame grabber (Cyton-CXP, Bitflow).

System modeling
The projective behaviors of the camera and the projector in the MS-BLIP system are modeled based on the "pinhole model" with the consideration of image distortion. For a 3D point (x, y, z) in the world coordinate system, in a perfect imaging condition, the pinhole model describes its corresponding pixel on the camera sensor, (û,v), to be (1) Here, the matrix A contains the camera's intrinsic parameters: f u and f v describe the effective focal lengths along the axes of the camera, and (u pp , v pp ) is the coordinate of the camera's principal point. Defined as the camera's extrinsic parameters, R is a 3×3 matrix accounting for rotation, and T is a 3×1 matrix for translation. Finally, s is a scalar factor of numerical extraction of (û,v). Overall, this operation is expressed by (û,v) = Proj c (x, y, z). Similarly, the projection of the 3D point to the projector is modeled by (û ′′ ,v ′′ ) = Proj p (x, y, z). The superscript of "double prime" is used hereafter to refer to the coordinates for the projector.
To take into consideration of distortion in the acquisition of deformed structure images, we first define the normalized camera coordinate without distortion as ( Then, the distorted normalized camera coordinate (u n , v n ) is determined by [36] where d 1 ,. . . , d 5 are the camera system distortion coefficients, andr 2 n =û 2 n +v 2 n . Once (u n , v n ) is computed, the distorted camera pixel coordinate (u, v) is found by using Eq. (2). Overall, this operation is expressed as (u, v) = Dist c (û,v). Similarly, the distortion for the projector is modeled by (u ′′ , v ′′ ) = Dist p (û ′′ ,v ′′ ).

System calibration
MS-BLIP's system calibration aims to compute the intrinsic parameters, extrinsic parameters, and image distortion coefficients for both the camera and the projector. In the calibration of the camera, a checkerboard pattern was imaged with 20-40 poses. A MATLAB toolbox [37] was used to extract the grid corners, which allowed calculating all the calibration parameters. In the calibration of the projector, the projector was treated as another camera to adopt the similar process used in camera calibration [38]. The method involved capturing additional images of the checkerboard pattern under the illumination of both horizontally and vertically shifted fringe patterns to determine the pixel-wise mapping between the camera view and the projector view. Then, the camera-captured images of the checkerboard were transformed to generate the corresponding images from the perspective of the projector. Finally, the same MATLAB toolbox used in camera calibration was implemented to compute the corresponding calibration parameters of the projector.

Dual-level intensity projection
To image 3D objects with a large variety of reflectance, projection intensity modulation [39] is implemented to improve MS-BLIP's dynamic range. In particular, two normalized intensity levels, empirically chosen to be 1.0 and 0.3, are set for each fringe pattern. For any camera pixel in the acquired deformed structure images, if any values in the sequence of high-intensity projection are saturated, then it is replaced with the corresponding sequence with low intensity.

Coordinate recovery using multi-frequency fringe projection
MS-BLIP implements the multi-frequency fringe projection method [40]. The intensity of the camera pixel (u, v) in the captured deformed structure images can be expressed as Here, I km (u, v) represents the intensity in the k th deformed structure image in the m th sequence. K is the number of phase-shifting steps. In this work, we use four-step phase shifting in six sets of fringes, so that K = 4, k ∈ [0, 3], and m ∈ [0, 5]. I b (u, v), I va (u, v), and φ m (u, v) represent the background, intensity variation, and depth-dependent phase for the m th sequence, respectively. From Eq. (4), the calculated φ m (u, v) is wrapped in the interval (−π, π]. Then, phase unwrapping is performed to determine the horizontal coordinate of the projector, which is then used to solve the 3D surface information by triangulation.
In single-frequency fringe projection, a single unwrapped phase map is generated by a twodimensional weighted phase unwrapping algorithm [41], which must make assumptions about the surface smoothness. The unwrapped phase value of each pixel is derived according to the phase values within a local neighborhood of this pixel. Consequently, this method falls short in solving the phase ambiguities induced by the depth-discontinued regions or isolated objects [42].
This limitation is lifted by multi-frequency fringe projection that unwraps the phase value of each pixel individually. The values of P m p in our work were determined by synthetically considering the system's signal-to-noise ratios and the measurement accuracy. The coarsest fringe pattern is chosen to have no more than one period so that its absolute phase, defined as φ 0 (u, v), equals the wrapped phase map φ 0 (u, v). The ensuing sets of fringe patterns, i.e., m>0, have the periods of P m p = Round(N u /2 m ). Here, N u is the width of the DMD expressed in the unit of DMD pixel. Because unwrapped phase maps of two adjacent sets of fringe projection Equation (5) shows that, unlike single-frequency phase unwrapping, the phase of each pixel is unwrapped independently, which avoids incorrect depth quantification in analyzing spatially isolated objects [42]. Using this method, the unwrapped phase map of each set of fringe patterns is computed successively, andφ 5 (u, v) is used to compute the horizontal coordinate of the projector, denoted by u ′′ , by

Distortion compensation and 3D point recovery
Distortion compensation contains two major steps. First, the undistorted camera pixel coordinate (û,v) is extracted by performing the inverse operation of the distortion model (i.e., Dist c ). However, the direct inversion of Eq. (3) takes the form of a 7 th degree polynomial in bothû n and v n , rendering direct recovery difficult. Thus, we apply a fixed point iteration method [43] for computing the undistorted normalized camera coordinate (û n ,v n ). With the initial condition of , the i th iteration is described by wherer 2 n,i =û 2 n,i +v 2 n,i . The corresponding undistorted pixel coordinate (û,v) on the camera plane is then calculated by using Eq. (2). Overall, this operation is expressed as (û,v) = DistComp c (u, v).
The next step is to obtain the corresponding undistorted projector horizontal pixel coordinatê . By using Proj c and Proj p , the coordinate of the distortion-compensated 3D point is calculated by triangulation as Ideally,û ′′ could be computed by (û ′′ ,v ′′ ) = DistComp p (u ′′ , v ′′ ). However, v ′′ cannot be calculated from the projection of vertical fringe patterns. To overcome this limitation, we have developed an iterative method that recovers distortion-compensated 3D information without prior knowledge of v ′′ from fringe measurements. We first estimate the coordinate of the 3D point by the Tri function [i.e., Eq. (8)] by using the coordinate (û,v, u ′′ ). This 3D point is then projected to the projector plane to extract the initial estimate of v ′′ with the function of Dist p , which is input to an iterative algorithm. Each iteration starts with performing the function DistComp p by using u ′′ calculated from (Eq. (6)) and the variable v ′′ to compute the distortion compensated projector coordinateû ′′ , which is then used to extract the 3D coordinate by the function Tri. The method is presented as the following pseudo-code:     Fig. 2(b). The 220 effect of different intensity levels in deformed structure images is compared in Fig. 2(c). The

Demonstration of MS-BLIP
To verify the feasibility of MS-BLIP, we imaged static 3D objects at a working distance of 0.5 m, with an FOV of 0.3 m × 0.2 m. The depth resolution, quantified by the standard deviation of a reconstructed planar surface [28], was measured to be 0.3 mm. The lateral resolution, quantified by imaging the sharp edge of the planar surface [44], was 0.4 mm.
To demonstrate MS-BLIP's ability to tolerate a large variety in reflectance, we imaged an open box that had a highly reflective matte surface with a weakly reflective foam surface [ Fig. 2(a)]. The representative fringe masks with two intensity levels are displayed in Fig. 2(b). The effect of different intensity levels in deformed structure images is compared in Fig. 2(c). The high-intensity projection (i.e., I 1.0 km ) results in a large number of saturated pixels in the left region of the matte surface. The low-intensity projection (i.e., I 0.3 km ), despite eliminating saturation, fails to resolve the white foam surface and the right bottom part of the matte surface. The dynamic ranges of I 1.0 km and I 0.3 km were calculated to be 48.1 dB and 41.8 dB, respectively. In comparison, dual-level intensity projection preserves low-intensity pixels while avoiding any saturation [ Fig. 2(d)]. With an enhanced dynamic range of 58.6 dB, this method fully recovers this 3D object. MS-BLIP's high accuracy in imaging 3D objects with depth discontinuities is proven by comparing the multi-frequency phase unwrapping method applied in MS-BLIP with the standard single-frequency counterpart. As shown in Fig. 3(a), a 50-mm-long cylinder with a 15-cmdiameter flat front surface stands on a base. A square foam plate is placed ∼230 mm behind the cylinder. The single-frequency method used four phase-shifting fringe patterns with a period of 36 pixels (i.e., P 5 p ) and an additional pattern of the DMD's vertical central line. Figures 3(b)-(c) show the captured deformed structure images by using the two methods. Because the distance between the cylinder and the foam plate resulted in a shift of multiple periods of fringes in deformed structure images, single-frequency fringe projection fails to correctly resolve the depth in the left half of the cylinder [ Fig. 3(d)]. In contrast, displayed in Fig. 3(e), the multi-frequency method, adopted in MS-BLIP, accurately recovers the detailed 3D information. Specifically, the calculated depth between the cylinder's front surface to the foam plate is 281.7 mm, which well matches the ground truth (with a deviation of 0.61%). To demonstrate the necessity of distortion compensation, we analyzed the 3D reconstruction results of a selected region on the foam plate [red dashed box in Fig. 3(a)]. As shown in Fig. 3(f), the system distortion mistakenly guides the 3D reconstruction, which displays a curved structure of the reconstructed surface. With the iterative method for distortion compensation applied in MS-BLIP, the surface with correct depth information is presented in Fig. 3(g).
MS-BLIP can accurately resolve 3D information of fine structures. Figure 3(h) shows the 3D reconstruction of the cylinder's flat round surface with two dents [blue dashed box in Fig. 3(a)]. The quantitative analysis of selected line profiles shows that the two dents on the flat surface have a depth of 6.0 mm and a diameter of 10.0 mm [ Fig. 3(i)]. The deviations from the ground truth are 0.3 mm and 0.5 mm, respectively, which show an excellent agreement with the object's configuration.

MS-BLIP of translational movements
To highlight the potential application of MS-BLIP in automated detection and classification, we imaged an engineered box moving translationally at a speed of ∼15 cm/s [ Fig. 4(a)]. The box used a piece of curved cardboard as the base, on which pipes made of different materials were fixed. On the top row, the left pipe was made of black foam with low reflectance, while the other three metal pipes, as well as the white swelling glue, had high reflectance. MS-BLIP had a 3D imaging speed of 10.4 fps, an FOV of 0.3 m × 0.2 m at a working distance of 0.5 m, a depth resolution of 1.3 mm, and a lateral resolution of 1.1 mm. Figures 4(b)-(d) show the advantage of dual-level intensity projection. For the I 1.0 km projection, because saturated intensity distorted the pixel-wised sinusoidal intensity profile, fringe residues show up as artifacts on the regions with high reflectance [ Fig. 4(b)]. Meanwhile, due to the lack of scattered light intensity from the I 0.3 km projection, part of the black foam pipe was failed to be reconstructed [ Fig. 4(c)]. In contrast, the dual-level intensity projection enhanced the dynamic range by over 12 dB. Presented in Fig. 4(d), the dual-level intensity projection successfully reconstructs the full shape of this 3D object. Figure 4(e) shows the reconstructed 3D images of the moving engineered box at four time-lapse frames with an interval of 960 ms (see the full evolution in Visualization 1). MS-BLIP allowed tracking the 3D movements of all spatial points. As an example, the time-resolved positions of two points, selected from the pipes with both low and high reflectance [marked by P A and P B in Fig. 4(e)], are shown in Fig. 4(f). The results reveal that the two points have linear movement in the y axis at 15.4 cm/s but stay relatively stationary in the z axis, well matching the pre-set experimental condition.

MS-BLIP of rotational movement
To demonstrate MS-BLIP's potential in industrial inspection, we imaged the rotational movement of a bamboo vase with extending branches [ Fig. 5(a)]. With a height of 1.3 m, this 3D object was glued on a stand rotating at 0.6 rad/s. To fit the projection region with the desired imaging area, the dove prism was placed at 45°, which results in a 90°rotation of the projected patterns with respect to the fringe masks loaded onto the DMD. MS-BLIP was operated at a working distance of 2 m, with an FOV of 1.5 m × 1.0 m, and at a 3D imaging speed of 20.8 fps. Under these working conditions, the depth resolution was quantified to be 3.7 mm, and the lateral resolution was measured to be 1.7 mm. Figure 5(b) shows six time-lapse 3D images of the object (the full sequence is shown in Visualization 2). The close-up view of the vase mouth presents detailed structural information on its surface. The depth-encoded color change of the branches reflects the rotation movement of the object. By tracking the 3D position of the tip of the branches [marked by P in Fig. 5(a)

MS-BLIP of human-body movement
To explore the potential of MS-BLIP in human-computer interaction, we imaged the full-body movements of a volunteer (the full sequence is shown in Visualization 3). All the experiments were conducted in accordance with the protocol (Project CÉR-22-649) approved by the human ethics research committee of Institut National de la Recherche Scientifique, Université du Québec. A photo of the scene is shown in Fig. 6(a). The volunteer had a height of ∼1.6 m, wearing proper protective clothing and a laser goggle. Both hands were exposed to the laser illumination with  intensity (and fluence) of 7.8 × 10 −7 W/cm 2 (and 3.9 × 10 −11 J/cm 2 ), which is much lower than the maximum permissible exposure (i.e., 0.2 W/cm 2 and 2 × 10 −2 J/cm 2 ) regulated by the ANSI safety limit [45]. A part of an optical table on the left of this volunteer was also included in the FOV. Akin to the 3D imaging of rotational movement, the dove prism was set to 45°. The MS-BLIP system was operated with a working distance of 2.8 m and an FOV of 1.7 m × 1.1 m. It had a 3D imaging speed of 10.4 fps. The depth and the lateral resolutions were measured to be 4.7 mm and 2.4 mm, respectively. Figure 6(b) shows five time-lapse 3D images of the instantaneous poses of the volunteer. The detailed surface structures of the volunteer's lab coat and hands are illustrated by the close-up views. The edge of an optical table and optomechanical components can also be seen as a static background. As displayed in Fig. 6(c), we tracked the 3D positions of two selected points over time: the ring finger's tip of the volunteer's left hand and the edge point of a marker on the volunteer's coat. In the experiment, the 3D image acquisition began when the volunteer sat on a stepladder with both hands on the knees. Afterward, the volunteer moved up both hands at 2016 ms, resulting in a change of the fingertip position along all three axes [marked by the arrows for "Hands up" in Fig. 6(c)]. The volunteer's body stayed at the same position until 4128 ms when the volunteer stood up, which results in movements along the -x axis for both points [marked by the arrow for "Stand up" in Fig. 6(c)]. In addition, the forward incline movement of the volunteer's body while standing up was also reviewed according to the position changes along the + z axes. At 6048 ms, the volunteer stretched out both hands, which is responsible for the sudden increase of the y value [marked by the arrow for "Hands out" in Fig. 6(c)].

Discussion and conclusions
We have developed MS-BLIP for robust multi-scale 3D surface imaging at video rate. Compared to existing BLIP systems, MS-BLIP employs dual-level intensity projection to tolerate the variation in 3D objects' reflectance. It implements the multi-frequency fringe projection method to robustly resolve spatially isolated 3D objects with depth discontinuity. Finally, it uses an iterative method for distortion compensation in 3D point recovery to improve the quality of reconstruction. These supports have empowered MS-BLIP for dynamic 3D visualization of translational movements of an engineered box, rotational movements of a craft vase, and full human body movements at an imaging speed of up to 20.8 fps and a measurement volume of up to 1.5 m 3 -a three order of magnitude increase compared to existing BLIP systems. Specifications used for main demonstrations are summarized in Table 1. MS-BLIP contributes several conceptual advantages to PSFPP. First, its illumination scheme could be readily adapted to other multi-frequency PSFPP systems. Holding a strict imaging relationship between the DMD and the 3D objects, MS-BLIP preserves high accuracy and high contrast of grayscale fringe patterns at different frequencies. Despite being demonstrated only with six selected frequencies, MS-BLIP can be easily adapted to accommodate varying frequencies, arbitrary sets of patterns, and different working distances and FOVs for specific studies. Second, MS-BLIP employs a nanosecond pulsed laser as the illumination source. Since the laser pulse width is much shorter than the DMD's display time, no photons would be wasted during pattern switching, which is otherwise unavoidable for a continuous-wave laser. This advantage also sheds light on implementing MS-BLIP using high-speed cameras [46,47] to increase 3D imaging speeds while maintaining signal-to-noise ratios. Third, a dove prism is used to adjust the orientation of the FOV. In existing PSFPP techniques, the aspect ratio of the projected area is fixed by the DMD's shape. Changing the aspect ratio limits the DMD's active region, which, however, reduces the overall light throughput. The dove prism increases the flexibility in the orientation of the projection region, which thus helps MS-BLIP fit different experimental scenarios.
The future work will be carried out in the following aspects. First, we could further improve MS-BLIP's imaging speed by adopting multiple cameras, a faster DMD, and a more powerful light source. Moreover, we plan to implement online feedback to adaptively adjust the intensity of projected patterns [48], which would optimize MS-BLIP's dynamic range for different working conditions. Furthermore, to take into account possible measurement inaccuracy induced by laser speckles, we could implement a superluminescent diode and a rotating diffuser in the MS-BLIP system as well as apply a filtering algorithm in image reconstruction [49]. Finally, a graphic processing unit could be used to accelerate the computation of phase maps for real-time display of the 3D reconstruction results. Besides technical improvement, we will continue exploring new applications of MS-BLIP, including automated industrial inspection [50], archaeology [51], and human-computer interaction [52].