Synchronization-free top-down illumination photometric stereo imaging using light-emitting diodes and a mobile device

Abstract: Three-dimensional reconstruction of objects using a top-down illumination photometric stereo imaging setup and a hand-held mobile phone device is demonstrated. By employing binary encoded modulation of white light-emitting diodes for scene illumination, this method is compatible with standard lighting infrastructure and can be operated without the need for temporal synchronization of the light sources and camera. The three-dimensional reconstruction is robust to unmodulated background light. An error of 2.69 mm is reported for a 48 mm object imaged at a distance of 42 cm. We also demonstrate the three-dimensional reconstruction of a moving object with an effective off-line reconstruction rate of 25 fps.


Introduction
Photometric stereo (PS) imaging [1] is one of the most common 3D imaging methods for indoor scenarios. It can achieve better resolution than structured illumination imaging [2][3][4] or state-of-the-art laser scanners [5], offers fast image computation [6], and can deal with objects in motion and untextured areas [7,8]. Compared to stereovision [9,10], only one camera needs to be calibrated, which reduces the reconstruction time, the footprint and the cost [10]. PS imaging relies on having one fixed camera perspective and different illumination directions to image an object in 3D. This technique determines the surface normal vectors and surface albedo at each pixel of the captured images, assuming a perfectly diffuse (Lambertian) surface of the imaged object [1]. Surface normal components can then be integrated to recover the 3D shape. Much of the work on PS combines it with additional methods, such as multi-view PS imaging [11], non-calibrated PS imaging [12] or self-calibrating PS imaging [13]. These techniques report good reconstruction accuracy, within the millimeter range [11,12,14], and include real-time reconstruction [15].
Even though current work on PS imaging tackles major challenges such as uncalibrated PS imaging [13] and non-Lambertian PS imaging [16][17][18][19], most PS methods fail to demonstrate an easily deployable imaging technique that could be used in existing building infrastructure. If this were achieved, PS imaging could provide an attractive route to using 3D imaging in industrial settings for process control and robot navigation, in public spaces for security and surveillance applications, and for structural monitoring.
Two major obstacles inhibit the widespread use of PS imaging for these purposes: the compatibility of the PS-specific illumination with indoor or outdoor lighting installations, and the cabling required to synchronize several luminaires with each other and with the camera, which may itself be mobile. Usually the camera and the luminaires are placed in the same plane, and a particularly common configuration employs four luminaires surrounding the camera in a top/bottom/left/right or X-shaped configuration [5,8,14]. While such a setup is known to provide high-fidelity imaging results, it is incompatible with an application scenario where the luminaires are installed at the ceiling to provide room lighting, and a wall-mounted or mobile camera views the scene from the side (see Fig. 1(a)). Current PS systems use cables between the camera and the luminaires to enable synchronization, which is an undesired complication when retrofitting to existing lighting fixtures. A WiFi or optically encoded clock signal could remove the cabling, but additional infrastructure would be needed to implement it, and the transmitters, camera and clock signal would still have to be synchronized. By contrast, the "self-clocking" Manchester-encoded modulation scheme described here is easy to use, requires no additional infrastructure and can work in environments where WiFi is not available. Finally, traditional PS imaging often has strong visual flicker and a low illumination duty cycle, which is detrimental for indoor or outdoor lighting. In this work, we present efforts to make PS imaging synchronization-free, reduce flicker, and demonstrate compatibility with ceiling lighting and both wall-mounted and mobile cameras.
In this scenario, PS imaging would coexist with light fidelity (LiFi) networks [20] or visible light positioning (VLP) [21], potentially using the same light-emitting diode (LED) luminaires for all of these functions [22] as well as general lighting.
We demonstrate PS imaging using a hand-held mobile phone camera running at 960 fps and ceiling-mounted LEDs as illustrated in Fig. 1(a) and (b), operating in the presence of additional unmodulated lighting. Four LEDs, mounted on a gantry and operated by a controller board, were modulated with a bespoke binary multiple access (MA) format, referred to as Manchester-encoded binary frequency division multiple access (MEB-FDMA), that removes the need for synchronization and reduces flicker through Manchester encoding while maintaining a 50% duty cycle. A mobile phone set within the scene acquired a stack of frames at 960 fps, which were then processed to obtain the surface normal components, and finally integrated to obtain the topography of the object. As the object is static, we do not reconstruct the full 3D object but rather its topography, commonly known as 2.5D reconstruction. For a static 48 mm diameter sphere, we report a root mean square error (RMSE, see Eq. (8)) of 2.69 mm at a distance of 40 cm with an angle of reconstruction from the top-down illumination of 120°, which represents 78% reconstruction of the surface of the sphere. Finally, we also demonstrate a dynamic imaging scheme, using a high-speed camera, where an ellipsoid rotates at 7.5 rotations per minute (RPM), and report an effective off-line 2.5D reconstruction rate of 25 fps.

Orthogonal LED modulation
One key feature of our setup is the removal of the requirement to synchronize the LEDs with the camera or with each other, both of which are needed in the time division multiple access (TDMA) scheme used in conventional PS imaging. This is achieved through MA; frequency division multiple access (FDMA) has been used by the authors before to achieve unsynchronized PS imaging [23]. However, FDMA has some drawbacks, in particular strong perceived flicker, since some of the LEDs have to be modulated at a fraction of the camera frame rate, and the need for analog control of the LED brightness to produce the sinusoidal modulation. Here, we use a bespoke modulation scheme, called Manchester-encoded binary FDMA (MEB-FDMA), which works with direct digital modulation of the LEDs, has significantly reduced flicker compared to FDMA, and keeps the advantage of not having to synchronize sources and camera. A comparison of MEB-FDMA with other MA schemes that could be used for PS imaging (FDMA, code division multiple access (CDMA) [24], space division multiple access (SDMA) [25], wavelength division multiple access (WDMA) [26] and TDMA) is provided in Table 1. MEB-FDMA and some other MA schemes have the additional feature that they enable visible light positioning of receivers within the imaged scene through a relative signal strength approach (Sec. 3.3.4 in [21]).

Phase invariant orthogonal modulation
The important property that allows us to use FDMA without having to synchronize LEDs and camera is that if one frequency carrier experiences an arbitrary phase shift due to the lack of synchronization, it still remains orthogonal to the other frequency carriers. We call this property "phase-invariant orthogonality" and introduce it here formally before describing an alternative modulation scheme that shares this property. Consider $N$ emitters illuminating the scene over $n$ discrete time steps. Then the $N \times n$ signal matrix $s_{i,j} \in \{-1, 1\}$ describes the time sequence of on/off states of the individual LEDs. Here, $s_{i,j} = +1$ or $-1$ indicates that at time $j$ the $i$-th LED element transmits a binary value of '1' or '0', respectively, in on-off keying (OOK), and after $n$ binary values one 3D image frame is completed. Phase-invariant orthogonality requires that the rows of the matrix $s$ remain orthogonal to each other even if they are time-shifted with respect to each other by an arbitrary phase $\Delta j$. Since camera pixels operate as integrating receivers, particularly at high camera frame rates, this requirement can be formalized as Eq. (1).
Here the phase shift between rows $i$ and $i'$ is $\Delta j = k + \alpha$, and % is the modulo operator. Equation (1) represents the requirement from the experimental layout; mathematically, however, it is equivalent to the simpler Eq. (2).
FDMA with square wave carriers is phase-invariant orthogonal. If $s$ is phase-invariant orthogonal, then its Manchester-encoded version $s^{(1)}$ given by Eq. (3), where $\otimes$ is the Kronecker product, is also phase-invariant orthogonal, i.e. all the benefits of Manchester encoding are readily available.
Decoding of phase-invariant orthogonal encoded signals is less trivial than for CDMA schemes with synchronization. Equation (2) effectively means that the operation of phase-shifting scatters the source signals into orthogonal sub-spaces of $\mathbb{R}^n$. Therefore, to enable successful decoding, the rows of the matrix $s_{i,j}$ need to be complemented by appropriately chosen orthonormal vectors $e^{(i)}_{k,j}$ that together with the rows of $s_{i,j}$ span all of these sub-spaces.
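The displayed forms of Eqs. (1)-(3) are not reproduced in this text. As a reading aid only, one set of expressions consistent with the definitions above is sketched below. The summation limits, the linear weighting by $\alpha$ between adjacent time slots (modelling an integrating pixel that straddles two emitter bits), the ranges of $k$ and $\alpha$, and the ordering inside the Kronecker product are all assumptions, not the authors' exact typesetting.

```latex
% Assumed form of Eq. (1): orthogonality under a shift of k whole slots plus a fraction alpha
\sum_{j=0}^{n-1} s_{i,j}\,\Bigl[(1-\alpha)\, s_{i',\,(j+k)\,\%\,n} + \alpha\, s_{i',\,(j+k+1)\,\%\,n}\Bigr] = 0
\qquad \forall\, i \neq i',\quad k \in \{0,\dots,n-1\},\quad \alpha \in [0,1)

% Assumed form of Eq. (2): orthogonality under whole-slot circular shifts only
\sum_{j=0}^{n-1} s_{i,j}\, s_{i',\,(j+k)\,\%\,n} = 0
\qquad \forall\, i \neq i',\quad k \in \{0,\dots,n-1\}

% Assumed form of Eq. (3): Manchester encoding, each entry b becomes the pair (b, -b)
s^{(1)} = s \otimes \begin{pmatrix} 1 & -1 \end{pmatrix}
```

Under these assumed forms, the left-hand side of Eq. (1) is a weighted sum of two instances of Eq. (2) at shifts $k$ and $k+1$, and setting $\alpha = 0$ in Eq. (1) recovers Eq. (2), which would explain why the two conditions are equivalent.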

Manchester-encoded binary FDMA
We construct the MEB-FDMA carriers by starting with binary-valued square-wave FDMA. In order to be phase-invariant orthogonal over a sampling period $T$, the frequencies $\nu_i$ of the square waves must be in a fixed relationship given by Eq. (4).
A convenient choice of the integer values $p_i$ is given by Eq. (5), in which $i = 1, \ldots, N$ identifies each LED. This means that the frame length $n^{(0)}$ without Manchester encoding is $n^{(0)} = 2^N$. We then construct a binary FDMA emitter signal $s^{(0)}_{i,j}$ according to Eq. (6). When using $s^{(0)}$, individual emitters may have long on and off times, leading to unacceptable visual flicker. Therefore, we use Manchester encoding, Eq. (7). If the emitters are modulated with $s_{i,j}$ according to Eq. (7) using OOK, then they provide MEB-FDMA. A decoding algorithm for MEB-FDMA and the underlying mathematical proofs are given in the appendix.
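The construction described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes that LED $i$ completes $p_i = 2^{i-1}$ square-wave cycles per frame (a choice consistent with the stated frame length $n^{(0)} = 2^N$, but not confirmed by the text available here), and that Manchester encoding maps each entry $b$ to the pair $(b, -b)$. All function names are hypothetical.

```python
import numpy as np

def square_wave_carriers(num_leds):
    """Binary square-wave FDMA carriers s0 with entries in {-1, +1}.

    Assumption: LED i (1-based) completes 2**(i-1) cycles per frame of
    n0 = 2**num_leds samples. This choice is illustrative only.
    """
    n0 = 2 ** num_leds
    j = np.arange(n0)
    rows = []
    for i in range(1, num_leds + 1):
        half_period = n0 // 2 ** i  # samples per half cycle for LED i
        rows.append(np.where((j // half_period) % 2 == 0, 1, -1))
    return np.array(rows)

def manchester_encode(s0):
    """Replace every entry b by the pair (b, -b) via a Kronecker product."""
    return np.kron(s0, np.array([[1, -1]]))

def max_cross_correlation(s):
    """Largest |circular cross-correlation| between different rows, over all shifts."""
    worst = 0
    for a in range(s.shape[0]):
        for b in range(s.shape[0]):
            if a == b:
                continue
            for k in range(s.shape[1]):
                worst = max(worst, abs(int(np.dot(s[a], np.roll(s[b], k)))))
    return worst

def longest_run(row):
    """Length of the longest run of identical consecutive entries (non-circular)."""
    best = run = 1
    for prev, cur in zip(row[:-1], row[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

s0 = square_wave_carriers(4)   # four LEDs, 16 samples per frame
s = manchester_encode(s0)      # 32 samples per frame after Manchester encoding

print("frame length:", s.shape[1])
print("max cross-correlation over all shifts:", max_cross_correlation(s))
print("longest on/off run before and after encoding:",
      max(longest_run(r) for r in s0), max(longest_run(r) for r in s))
```

Under these assumptions, every pair of carriers stays orthogonal for every whole-sample circular shift, and Manchester encoding shortens the longest constant run of any carrier, which is the mechanism the text credits for reduced flicker.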

Properties of MEB-FDMA
For MEB-FDMA, similar to FDMA, any DC offset can be added to the received signal without affecting the decoding result. This is a prerequisite for applying the scheme to LED illumination since the intensity-modulated LED emission has only positive values. Furthermore, it allows installation of additional lighting fixtures that either do not carry a modulation signal or carry one at a much higher frequency, e.g. for LiFi. Another remarkable property is that the transmitter and receiver can use the same sampling rate. This is surprising because the scheme uses Manchester encoding and the Nyquist theorem requires oversampling by a factor 2 to reliably identify each Manchester encoded bit. However, by requiring the modulation to fulfil the stringent criterion Eq. (1), the scheme was implicitly designed such that not every single Manchester bit needs to be identified individually. This property of the scheme means that the frequency of the LED modulation is the same as the camera frame rate and thus flicker is significantly reduced.
The number of OOK-bits needed for a single frame in MEB-FDMA scales exponentially with the number of emitters. Therefore, this modulation scheme is suitable for modest numbers of modulated emitters illuminating the camera field of view, typically 4-6 emitters in our suggested application.
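The DC-offset and synchronization-free properties can be illustrated with a simplified decoder. This is a sketch under stated assumptions and is not the appendix algorithm: the carriers use the assumed choice of $2^{i-1}$ cycles per frame, only whole-sample delays are simulated, and each LED's contribution is recovered by projecting the pixel samples onto the subspace spanned by all circular shifts of that LED's carrier. The function names are hypothetical.

```python
import numpy as np

def meb_fdma_carriers(num_leds):
    """Assumed carriers: square waves with 2**(i-1) cycles per frame, Manchester encoded."""
    n0 = 2 ** num_leds
    j = np.arange(n0)
    s0 = np.array([np.where((j // (n0 // 2 ** i)) % 2 == 0, 1, -1)
                   for i in range(1, num_leds + 1)])
    return np.kron(s0, np.array([[1, -1]]))

def shift_subspace_basis(carrier, tol=1e-9):
    """Orthonormal basis of the space spanned by all circular shifts of a carrier."""
    shifts = np.array([np.roll(carrier, k) for k in range(carrier.size)], dtype=float)
    _, sing, vt = np.linalg.svd(shifts, full_matrices=False)
    return vt[sing > tol]

def decode_amplitudes(samples, carriers):
    """Estimate each LED's on-state amplitude from one frame of samples at one pixel."""
    n = carriers.shape[1]
    amps = []
    for carrier in carriers:
        projected = shift_subspace_basis(carrier) @ samples
        # An OOK signal of on-amplitude a contributes (a / 2) * shifted carrier,
        # whose Euclidean norm is (a / 2) * sqrt(n).
        amps.append(2.0 * np.linalg.norm(projected) / np.sqrt(n))
    return np.array(amps)

carriers = meb_fdma_carriers(4)
true_amps = np.array([0.5, 1.0, 2.0, 3.0])   # hypothetical per-LED brightness at a pixel
delays = [3, 0, 17, 30]                       # unknown, unsynchronized start offsets
background = 10.0                             # unmodulated ambient light

samples = np.full(carriers.shape[1], background)
for amp, delay, carrier in zip(true_amps, delays, carriers):
    samples += amp * (np.roll(carrier, delay) + 1) / 2   # OOK: on = amp, off = 0

estimated = decode_amplitudes(samples, carriers)
print(np.round(estimated, 6))
```

Because every carrier sums to zero over a frame, the constant background drops out of each projection, and because shifted carriers of different LEDs are mutually orthogonal, the unknown delays do not mix the channels. Sub-sample delays, noise and camera integration effects are outside the scope of this sketch.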

3D reconstruction process
The process flow of PS imaging in our setup is illustrated in Fig. 2, which comprises MEB-FDMA modulation and demodulation, PS processing and surface normal integration.

Frame acquisition and surface normal map
The transmitted signal is encoded as a clock signal, hence no trigger signal is needed to start the acquisition. Therefore, the acquisition simply starts when the recording button of the mobile phone is pressed. In practice, the LEDs have a 50% duty cycle and, thanks to the known optical fingerprint of each LED, a decoding matrix can be created (see Supplement 1). The received stack of images is then demodulated by applying the decoding matrix to each pixel of each image, see Fig. 2. At the end of the demodulation, four images corresponding to the four different illumination directions are retrieved.
The four retrieved images are then processed using established methods [1,14,27] to obtain the surface normal components $N_x$, $N_y$, $N_z$ and the albedo $A$ under the assumption of a Lambertian surface (see Fig. 2). As we are focusing our work on a new modulation scheme, we decided to use a conventional calibrated PS method to determine the surface normal map; hence, the coordinates of each LED relative to the position of the object are needed to determine the lighting vectors.
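The calibrated Lambertian step can be sketched as a per-pixel least-squares problem, $I = L\,(A\,\mathbf{N})$, where the rows of $L$ are unit lighting vectors. The example below is a generic textbook formulation, not the authors' code; the lighting directions, test normals and function name are hypothetical, and shadows and specular highlights are ignored.

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """Least-squares Lambertian photometric stereo.

    images:     array of shape (num_lights, H, W), one image per light source
    light_dirs: array of shape (num_lights, 3), unit lighting vectors
    Returns unit normals of shape (H, W, 3) and albedo of shape (H, W).
    """
    num_lights, h, w = images.shape
    intensities = images.reshape(num_lights, -1)
    # Solve light_dirs @ g = intensities for g = albedo * normal at every pixel.
    g, *_ = np.linalg.lstsq(light_dirs, intensities, rcond=None)
    albedo = np.linalg.norm(g, axis=0)
    normals = np.divide(g, albedo, out=np.zeros_like(g), where=albedo > 0)
    return normals.T.reshape(h, w, 3), albedo.reshape(h, w)

# Hypothetical lighting directions (four lights above and in front of the object).
raw = np.array([[-0.5, 0.7, 0.5], [-0.3, 0.7, 0.65], [0.3, 0.7, 0.65], [0.5, 0.7, 0.5]])
light_dirs = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Synthetic 1 x 2 pixel scene with known normals and albedo.
true_normals = np.array([[0.0, 0.6, 0.8], [0.6, 0.0, 0.8]])
true_albedo = np.array([0.7, 0.4])
shading = light_dirs @ true_normals.T                 # shape (4, 2), all positive here
images = (shading * true_albedo).reshape(4, 1, 2)

normals, albedo = photometric_stereo(images, light_dirs)
print(np.round(normals, 3))
print(np.round(albedo, 3))
```

With four lights and three unknowns per pixel the system is overdetermined, so the least-squares solution also averages out some measurement noise.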

Fast marching method
The next step of the reconstruction program is to integrate the surface normal vectors to obtain the topography of the object. Surface integration is a well-known challenge and there are multiple methods in the literature for addressing it [28][29][30]. In this work, the surface normal vectors are integrated with the Fast Marching method [31][32][33][34][35] to take advantage of its reconstruction speed. The algorithm was implemented in MATLAB and assessed on a data set from Yvain Quéau [36]; see Supplement 1 for details. The reconstruction process takes a few minutes to run on a desktop PC.
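To make the integration step concrete, the sketch below converts normals to depth gradients and integrates them along a fixed path. This is deliberately not the Fast Marching method used in this work; it is a much simpler path integration, shown only to illustrate the relationship $\partial z/\partial x = -N_x/N_z$, $\partial z/\partial y = -N_y/N_z$ (assuming normals point towards the camera and unit pixel spacing). The function name and the `eps` guard are hypothetical.

```python
import numpy as np

def integrate_normals_naive(nx, ny, nz, eps=1e-6):
    """Integrate a normal map into a depth map by trapezoidal path integration.

    Rows are the y direction, columns the x direction. The depth at pixel (0, 0)
    is fixed to zero. Small nz values are clamped to eps to avoid division by zero.
    """
    nz_safe = np.maximum(nz, eps)
    p = -nx / nz_safe                      # dz/dx
    q = -ny / nz_safe                      # dz/dy
    z = np.zeros_like(p, dtype=float)
    # First row: integrate dz/dx along x.
    z[0, 1:] = np.cumsum(0.5 * (p[0, 1:] + p[0, :-1]))
    # Remaining rows: integrate dz/dy down each column, starting from the first row.
    z[1:, :] = z[0, :] + np.cumsum(0.5 * (q[1:, :] + q[:-1, :]), axis=0)
    return z

# Synthetic smooth surface with analytically known gradients.
y, x = np.mgrid[0:20, 0:30].astype(float)
z_true = 0.01 * (x - 15) ** 2 + 0.02 * (y - 10) ** 2
z_true -= z_true[0, 0]
p_true = 0.02 * (x - 15)
q_true = 0.04 * (y - 10)
norm = np.sqrt(p_true ** 2 + q_true ** 2 + 1.0)
nx, ny, nz = -p_true / norm, -q_true / norm, 1.0 / norm

z_rec = integrate_normals_naive(nx, ny, nz)
print("max abs error:", np.abs(z_rec - z_true).max())
```

The division by $N_z$ also shows why pixels with $N_z$ close to zero are numerically delicate for any gradient-based integration: small errors in the normal are amplified into large errors in the gradient.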

Optical acquisition system
As illustrated in Fig. 1, our system consists of a mobile phone device (Samsung Galaxy 9), four white LEDs (Osram OSTAR Stage LE RTUDW S2W) placed on a gantry above the object at a height of H = 46 cm, a controller board (Arduino Uno) for the LED modulation and a computer to communicate with the controller board and run the reconstruction program [37]. A series of objects was 3D printed, namely a sphere with a 48 mm diameter, a cube which is 75 mm wide, and a complex shape of a monkey head that is 130 × 94.5 mm² wide and 79 mm deep. 3D printing ensures that the ground truth shape of the objects is known. On the setup, the geometric center of the object is the reference (0, 0, 0) and the location of the LEDs is determined from this reference point. The phone and the LEDs are located in two different planes. The phone is in front of the object on the z axis at a distance of d = 42 cm from (0, 0, 0) with a field of view (FOV) of 43 degrees. The LEDs are located at (x, y, z): LED1 (−27, 42, 10), LED2 (−14, 42, 35), LED3 (14, 42, 35) and LED4 (27, 42, 10), all in cm. The position of the LEDs relative to the camera and the object has an impact on the extraction of the surface normal vectors. These coordinates are the best fit regarding the FOV of the scene and the object's size. A trade-off was made between the resolution of the reconstruction and the FOV. By placing the mobile phone at 42 cm from the object we can keep a mm-range depth resolution while assuming an orthogonal projection for the determination of the surface normal components. Moreover, strict alignment of the mobile phone is not necessary in this work: as long as the FOV contains the front of the object, the orientation of the phone will not affect the accuracy of the surface normal components.
Each LED was modulated with an individual MEB-FDMA carrier signal at an on-off keying rate of 960 b/s. The phone captured frames with a resolution of 1280 × 720 at a rate of 960 fps for 0.2 s, with a black background to simplify image processing. The capture time is limited by the on-device data storage limit.
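The lighting vectors needed for calibrated PS follow directly from the LED coordinates listed above. The sketch below assumes a distant-light approximation in which a single direction per LED, taken from the object center at (0, 0, 0), is used for every pixel; whether the authors apply per-pixel corrections is not stated here. The function name is hypothetical.

```python
import numpy as np

# LED positions (x, y, z) in cm, relative to the object center, as listed in the text.
led_positions_cm = np.array([
    [-27.0, 42.0, 10.0],   # LED1
    [-14.0, 42.0, 35.0],   # LED2
    [14.0, 42.0, 35.0],    # LED3
    [27.0, 42.0, 10.0],    # LED4
])

def lighting_vectors(positions, target=(0.0, 0.0, 0.0)):
    """Unit vectors pointing from the target point towards each light source."""
    offsets = positions - np.asarray(target, dtype=float)
    return offsets / np.linalg.norm(offsets, axis=1, keepdims=True)

light_dirs = lighting_vectors(led_positions_cm)
print(np.round(light_dirs, 3))
print("rank:", np.linalg.matrix_rank(light_dirs))
print("condition number:", np.linalg.cond(light_dirs))
```

All four vectors have a large positive y component, reflecting the top-down layout. The matrix still has rank 3, so the normals are recoverable, but the condition number gives a rough indication of how strongly noise is amplified in the least-squares solution.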

Decoded frames
The decoded images of the three objects are displayed in Fig. 3. For the three cases, the light level clearly shows the different illumination directions. The brightness is slightly different depending on the position of the LEDs. This can be explained by the possibly imperfect match between the camera integration time and the MEB-FDMA scheme, i.e. the integration time may be shorter than the frame duration.

Surface normal vectors
From this set of images, the surface normal components ($N_x$, $N_y$, $N_z$) and the albedo were calculated and are displayed in Fig. 4. For $N_x$, left and right facing surfaces are correctly distinguished, with component values ranging from −1 to 1. Similarly, $N_y$ indicates up and down facing surfaces correctly, albeit with lower fidelity, and its value range is limited to −0.2 to 1 instead of −1 to 1. The poorer fidelity of $N_y$ is due to the top-down illumination design, as the bottom of each object is not suitably illuminated, which is also visible in the albedo plot. Moreover, as the camera is facing the object, $N_z$ is positive and ranges from 0 to 1 with some variations due to the depth of the object. The albedo is normalized and is useful in understanding imperfections in the reconstruction. We notice that the albedo is more directional for the sphere and the cube corner than for the monkey, which is caused by the slight brightness variations seen in Fig. 3. Importantly, the surface normal components, which are the basis of the topography reconstruction, are observed to be less susceptible to these brightness variations than the albedo.

3D reconstruction
Figure 5 and Fig. 6 plot the 2.5D reconstruction of the sphere, the cube corner and the monkey head in a perspective view, as well as a 3D rendered view (Blender). To render the reconstruction in Blender, a camera is set at a distance of 10 cm from the imported 2.5D reconstruction. For the sphere, Figs. 5(a)-(c) show a satisfactory global reconstruction for the top half of the object. The bottom half is poorly reconstructed and is "flat", which is also clearly shown in the rendered view. Because of the lack of information on the negative (downward facing) y axis, only 78.4% of the visible surface is reconstructed. The reconstruction error is quantified by the root mean square error (RMSE) and the normalised RMSE (NRMSE), which are defined as [14]:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} z_i^2} \qquad (8)$$
$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{z_{\max} - z_{\min}} \qquad (9)$$
where $n$ is the number of data pairs, $z_i$ is the difference between measured depth values (along the z-axis) and reference values, and $(z_{\max} - z_{\min})$ is the range of measured values. According to the RMSE error map in Fig. 5(d), the most significant error is found at the bottom and on the edge of the sphere, while smaller errors in the top area are related to inaccuracies in $N_z$. Nonetheless, most of the error stays below 5 mm. Figure 1(c) shows the projected angle that can be retrieved using the top-down illumination setup, which is 120°. An RMSE of 2.69 mm and an NRMSE of 5.61% are obtained within the 78.4% of the surface reconstructed. Despite the lack of information on downward facing facets, both error metrics for the sphere are within the same range as in [14]. For the cube corner, despite the unequal distribution of light on the cube and the large gradient variation, the reconstruction can retrieve the shape of the cube corner. Nonetheless, the top view shows that the reconstruction on the edge deteriorates at the bottom of the object. This is explained by the top-down illumination configuration. Overall, by comparing
Finally, the monkey head has been chosen for its complex features, such as the eyes, the nose and the top of the head. The discontinuity between the face and the ears is a challenge for the Fast Marching algorithm to deal with. Indeed, the 2.5D reconstruction in Figs. 6(d)-(f) shows the shape of the nose, the eyes and also the upper head. However, the shape of the ears is harder to determine and the depth decreases substantially, which demonstrates the difficulty the algorithm has in dealing with those discontinuities. To quantify the error, a few features on the monkey face are measured: the horizontal size of an eye is 22 mm, the size of the nose is 1.1 mm and the distance between the eye orbits is 27 mm.
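The two error metrics described above (the root mean square of the depth differences, and that value normalised by the range of the measured depths) can be computed as in the following sketch. The function names and the small test arrays are hypothetical; real use would pass the reconstructed and ground-truth depth maps restricted to the reconstructed pixels.

```python
import numpy as np

def rmse(measured, reference):
    """Root mean square of the differences between measured and reference depths."""
    diff = np.asarray(measured, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def nrmse(measured, reference):
    """RMSE divided by the range (max - min) of the measured depths."""
    measured = np.asarray(measured, dtype=float)
    return rmse(measured, reference) / float(measured.max() - measured.min())

# Tiny hypothetical example: only the last sample deviates from the reference.
measured_mm = [1.0, 2.0, 3.0, 5.0]
reference_mm = [1.0, 2.0, 3.0, 1.0]
print(rmse(measured_mm, reference_mm), nrmse(measured_mm, reference_mm))
```

As a consistency check on the reported figures, an RMSE of 2.69 mm together with an NRMSE of 5.61% implies a measured depth range of roughly 48 mm, which is in line with the sphere diameter.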
After calibration, the same features are measured on the 2.5D reconstruction and we obtained the following measurements: 22.5 mm, 9.3 mm and 29.8 mm, respectively. A difference of a few millimeters can be observed, which is close to the RMSE values obtained for the sphere. The top view reconstruction gives an idea of the percentage of the surface that is correctly reconstructed. The relative position between discontinuous regions is not handled well. However, features within each region, such as the ears, are reproduced with good fidelity in the rendered view. Small depth details on the millimeter scale, such as the earlobes and the eyes, are detectable and well reconstructed. Moreover, the distance between the face and the ears is about 50 mm, and on the top view of the reconstruction the distance between the two features is also about 50 mm.
The 3D reconstruction relies on the surface normals, therefore most of the error on the reconstructed object topography will be dominated by the error on the surface normal vectors. Whenever N z values are close to zero, the gradient integration during the Fast Marching process is numerically ill-conditioned. This phenomenon is clearly shown by the artefacts on the different reconstructions in Fig. 6.

Signal to noise measurement
Our modulation scheme can operate in the presence of additional unmodulated lighting. In order to assess its robustness, the reconstruction of the sphere is tested with different levels of background light in the room. For this experiment, the ceiling light of the room is switched on and a voltage divider has been added to the LEDs in order to control their brightness and hence modify the signal power. After measuring the optical power of the signal and background light at the object, the SNR, in dB, is determined following Eq. (10), with $P_{signal}$ the optical power in watts of the LEDs and $P_{noise}$ the optical power in watts of the ceiling light. Figure 7(a) shows a graph of the RMSE and the percentage of surface reconstructed versus the SNR for the sphere in the out-of-plane configuration. The SNR ranges from 0 dB to 5 dB. Across this range, the RMSE does not show a specific trend and the error ranges from 4 mm to 6 mm, which is acceptable for our range of applications. The percentage of surface reconstructed over the measured range of SNR does show a dependence on the SNR. Figure 7(b) shows that as the SNR decreases, the reconstruction of the bottom part of the sphere becomes more and more challenging, which is a consequence of the top-down illumination configuration. This is explained by our reconstruction process. To avoid high error values being incorporated in the final image, a threshold is set 10 mm above the expected reconstruction value. Any values in the final reconstruction above the threshold are discarded. This means that as the SNR decreases, we can reconstruct less area of the object, but the portion that is reconstructed is not affected by the SNR. Nonetheless, 65% of the object view can be reconstructed even with an SNR just below 0 dB.
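The SNR calculation and the thresholding step can be sketched as follows. Two assumptions are made loudly here: the SNR is taken as ten times the base-10 logarithm of the optical power ratio (the exact form of Eq. (10) is not reproduced in this text), and the "expected reconstruction value" is treated as a single scalar depth. Function names and sample values are hypothetical.

```python
import numpy as np

def snr_db(p_signal_w, p_noise_w):
    """Assumed definition: optical power ratio expressed in decibels."""
    return 10.0 * np.log10(p_signal_w / p_noise_w)

def mask_outliers(depth_mm, expected_mm, margin_mm=10.0):
    """Discard depth values more than margin_mm above the expected value.

    Returns the depth map with discarded pixels set to NaN, and the percentage
    of pixels that were kept.
    """
    depth = np.asarray(depth_mm, dtype=float)
    keep = depth <= expected_mm + margin_mm
    masked = np.where(keep, depth, np.nan)
    coverage_percent = 100.0 * keep.mean()
    return masked, coverage_percent

# Hypothetical depth samples (mm) for an object whose expected depth is 24 mm.
depth_mm = np.array([0.0, 5.0, 24.0, 40.0])
masked, coverage = mask_outliers(depth_mm, expected_mm=24.0)
print(masked, coverage)
print(snr_db(1.0e-3, 1.0e-3))   # equal signal and background power
```

This reflects the behaviour described above: pixels with large errors are removed rather than kept, so the reconstructed area shrinks as the SNR drops while the error on the remaining area stays roughly constant.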

Dynamic imaging
Finally, we reconstructed a moving object using our top-down illumination setup. A stepper motor (RS PRO Hybrid, Permanent Magnet Stepper Motor, 1.8° step) was added in order to rotate a 3D printed ellipsoid which was 100 mm long and 60 mm wide. A high-speed camera (Photron MiniUx100) replaced the mobile phone for its larger video capture memory, thus enabling a longer acquisition time. The camera frame rate was set to 1000 fps with a shutter speed of 1 ms and was matched to the LED modulation rate. The stepper motor rotated at a speed of 7.5 RPM. Therefore, the acquisition ran for 8 s, and the real-time video of the ellipsoid in motion can be found under the folder 'Dynamic imaging' in [38] (see Visualization 1). The 3D reconstruction is done off-line with the same reconstruction pipeline as in Fig. 2. The reconstruction requires at least 32 frames for a full reconstruction, and here we chose to record 40 camera frames for each 3D frame to match the motor step duration and to achieve effective full 3D reconstruction at a standard video rate of 25 fps.
Detailed analysis has been carried out on 21 representative 3D video frames, all separated by an angle of 18°. Therefore, our full 3D reconstruction relies on 21 topographies of the ellipsoid out of 200 possible reconstructions. To be able to clearly see the reconstruction and to match it with the display speed of the real-time video of the ellipsoid in motion, we decided to display the surface normal components and the reconstruction at 3 fps. This video display rate is 10 times slower than the effective rate we can achieve but it still represents a real-time 3D reconstruction video matching the object in motion. Two videos, one for the surface normal components and one for the 3D reconstruction, can be found under the folder 'Dynamic imaging' in [38] (see Visualization 2 and Visualization 3, respectively).
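The timing figures quoted above fit together as shown in the short calculation below. It uses only numbers from the text, plus the assumption that the minimum frame count equals $2 \cdot 2^N$ for $N$ LEDs (consistent with the stated minimum of 32 frames for four LEDs). The function name and dictionary keys are hypothetical.

```python
def dynamic_imaging_budget(camera_fps=1000, frames_per_3d_frame=40,
                           rpm=7.5, step_deg=1.8, num_leds=4):
    """Derive the timing quantities of the rotating-object experiment."""
    seconds_per_rev = 60.0 / rpm
    steps_per_rev = round(360.0 / step_deg)
    return {
        # Assumed: Manchester encoding doubles a frame of 2**num_leds samples.
        "min_frames_per_3d_frame": 2 * 2 ** num_leds,
        "effective_3d_fps": camera_fps / frames_per_3d_frame,
        "seconds_per_revolution": seconds_per_rev,
        "camera_frames_per_motor_step": camera_fps * seconds_per_rev / steps_per_rev,
        "3d_frames_per_revolution": camera_fps * seconds_per_rev / frames_per_3d_frame,
    }

for name, value in dynamic_imaging_budget().items():
    print(f"{name}: {value}")
```

With these inputs, one motor step lasts as long as 40 camera frames, which is why 40 rather than the minimum of 32 frames per 3D frame was a natural choice, and one 8 s revolution yields 200 possible 3D frames at 25 fps.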
The surface normals behave very similarly to the static situation in that a high fidelity is observed in $N_x$, while $N_y$ and $N_z$ have lower but still useful fidelity. It is important to notice that although the ellipsoid is constantly moving, its boundaries are sharp and well-defined, which shows that the imaging rate is adequate.
In the 3D reconstruction video, the 21 reconstructed frames are repeated three times, showing both a color-coded plot of the reconstruction and a rendered view. Some errors can be observed at the edge of the ellipsoid, which are artefacts caused by poor numerical conditioning where $N_z \approx 0$. A flat reconstruction of the bottom of the object is also visible, which is expected with the top-down illumination. As with the sphere in the static configuration, the ellipsoid is not entirely reconstructed. Some of the bottom part is missing, which is due not only to the top-down illumination but also to the support piece used to hold the ellipsoid in a tilted position. Nonetheless, the global 2.5D reconstruction of each view is satisfactory and of comparable quality to the static scenes.

Conclusion
In this work, we have demonstrated an accurate 2.5D reconstruction of objects with different shape complexity, with an RMSE of 2.69 mm, using a new photometric stereo imaging configuration that can readily be employed in conventional room lighting scenarios. We have also shown the 3D reconstruction of a moving object with an off-line effective 3D frame rate of 25 fps. Most importantly, MEB-FDMA encoding enables simple installation by removing the need for synchronization and, equally importantly, modulates the LEDs above the visual flicker recognition threshold, thus significantly simplifying deployment, which was not possible before with successively flashed LEDs. Furthermore, we demonstrated that this method can be implemented using commercially available hand-held mobile devices. Our work on synchronization-free top-down illumination photometric stereo imaging is currently at a proof-of-concept stage. Future work will focus on applying the method to digital lighting applications in public areas or industrial settings for surveillance, and also for process and structural monitoring.