Optics-free imaging of complex, non-sparse and color QR-codes with deep neural networks

We demonstrate optics-free imaging of complex color and monochrome QR codes using a bare image sensor and trained artificial neural networks (ANNs). The ANN is trained to interpret the raw sensor data for human visualization. The image sensor is placed at a specified gap (1 mm, 5 mm or 10 mm) from the QR code. We studied the robustness of our approach by experimentally testing the ANN outputs under perturbations of this gap and of the translational and rotational alignment of the QR code relative to the image sensor. Our demonstration opens up the possibility of using completely optics-free, non-anthropocentric cameras for application-specific imaging of complex, non-sparse objects.

sensor. In this paper, we demonstrate imaging of objects with low temporal and low spatial coherence (a liquid-crystal display), with the object spaced 1 mm or more away from the sensor. Previously, we demonstrated imaging of simple objects with an optics-free camera (using only the bare image sensor) [14]. Machine learning was used to classify these simple objects without image reconstruction for human consumption [15]. Such non-anthropocentric cameras promise enhanced privacy, among other advantages. However, it is not clear whether more complex objects can be imaged in the same manner. Compared to our previous work, where only images from the relatively simple MNIST dataset were considered, here we utilize arbitrary QR codes, which have much higher spatial frequencies (see discussion in the supplement). Furthermore, identification of QR codes is a very useful application. Rather than using regularization-based singular-value decomposition, here we utilize a deep artificial neural network (ANN) to perform the conversion from the machine ("raw sensor") image to a human-readable form. We note that such conversion may be unnecessary in the future, when only machine inferencing is required [12].
Similar to our previous work [14,15], the experiment is performed with a bare image sensor (Mini-2MP-Plus, Arducam) placed at a distance z away from a liquid-crystal display (LCD, Acer G276HL, 1920 × 1080). The QR code is displayed on the LCD as illustrated in Fig. 1(a). The QR code comprises 21 × 21 boxes, each of which can be either white (or colored, in the case of color codes) or black. There is a white 4-box-wide frame around the periphery of the code, resulting in a total size of 29 × 29 boxes. The physical size of the 21 × 21 boxes is 6 mm × 6 mm, corresponding to a box width of 286 μm. The sensor pixel width is 2.2 μm. Each QR code is created using Python (the "qrcode" library) from a randomly generated 10-character string. The experiment was performed in a dark room with only the LCD turned on. The exposure time for each image was 66 ms.
Previously, we used the singular-value decomposition (SVD) method with regularization to invert a transfer function in order to obtain images for human consumption from the raw data [14]. Here, we train artificial neural networks (ANNs) to achieve the same result. The main advantage of ANNs is that they scale to larger images and higher resolutions more efficiently. Furthermore, their performance can be enhanced by additional data and by transfer learning. Finally, although not demonstrated here, ANNs could be adapted to perform inferencing directly from the raw data, bypassing the image-reconstruction step entirely.
We utilized an ANN based on the encoder-decoder architecture, as shown in Fig. 1(b) [11]. A separate ANN was trained for each value of z = 1 mm, 5 mm and 10 mm. Each ANN comprised 87 hidden layers and was trained for 100 epochs on a set of 30,000 training images. Subsequently, the trained network was validated with a set of 5,000 images, which were excluded from the training set. Each batch contained 32 frames from the sensor. Each frame comprised 3 color channels, each of size 320 × 240 sensor pixels. The output of the ANN was a single frame of size 320 × 240 pixels. The frame sizes at each major layer of the network are illustrated in Fig. 1(b). We empirically determined that images at the full sensor resolution performed worse, which led us to use this reduced image size. Using the fact that a QR code is a binary array, we formulated a classification problem in which the ANN predicts a 0 or 1 for each box in the 29 × 29 QR-code array. This allows us to use the average classification accuracy (labelled "accuracy" in Fig. 2) as the metric. We also augmented the dataset (only for the monochrome codes) by synthetically rotating the images by a random angle between −5 and 5 degrees about the axis normal to the image, which improved performance. The training and validation accuracies, averaged over the 30,000 training and 5,000 validation images, are plotted as functions of epoch in Fig. 2(a) for z = 1 mm, 5 mm and 10 mm, labelled ANN1, ANN5 and ANN10, respectively. The smallest value of z performs best. This is expected, since in any optics-free system, free-space propagation over longer distances will tend to increase the mixing of the spatial details of the image. Exemplary images reconstructed by each ANN are illustrated in Fig. 2(b). Clearly, qualitatively good reconstructions of the QR code are obtained at z = 1 mm. However, we note that the reconstruction is not of sufficient fidelity for reading by a conventional QR-code scanner.
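The mapping from a 3-channel 320 × 240 sensor frame to a 29 × 29 grid of per-box 0/1 predictions can be sketched in a minimal, hypothetical PyTorch form. The paper's network has 87 hidden layers and is not reproduced here; the class name, layer counts and channel widths below are illustrative assumptions only:

```python
import torch
import torch.nn as nn

class RawToQR(nn.Module):
    """Toy encoder + classification head; input (batch, 3, 240, 320),
    output one logit per box of the 29x29 QR-code array."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # -> 120x160
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # -> 60x80
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # -> 30x40
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),  # -> 15x20
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 15 * 20, 29 * 29),  # one logit per QR box
        )

    def forward(self, x):
        logits = self.head(self.encoder(x))
        return logits.view(-1, 29, 29)

model = RawToQR()
x = torch.randn(2, 3, 240, 320)  # a batch of 2 raw sensor frames
out = model(x)
print(out.shape)  # torch.Size([2, 29, 29])
```

With this formulation, training would minimize a per-box binary loss (e.g. `nn.BCEWithLogitsLoss`) against the 29 × 29 ground-truth array, and the average fraction of correctly predicted boxes gives the accuracy metric used in Fig. 2.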
The sensitivity of our optics-free camera is important for practical applications. To study this, we experimentally collected images while varying the gap (z) from 1 mm to 10 mm (100 images were captured at each gap), used each of the previously trained ANNs to reconstruct the images, and computed the average accuracy at each gap (Fig. 2(c)). Changing the gap is equivalent to defocus in a lensed camera. As expected, each ANN performs best at the gap that it was trained on ("focus"). The rate of degradation of accuracy with z appears to be similar for all 3 ANNs. We conclude from this study that a focus precision of ±0.5 mm should be sufficient to keep the accuracy within ~10% of its peak value. We also explored the robustness of the ANNs to other perturbations, such as relative translation and rotation between the sensor and the object. The results, summarized in the supplement, suggest that the ANNs can be very robust to small perturbations of the system. Finally, we collected images of color QR codes on a white background. The QR codes were equally divided among the 3 primary colors: red, green and blue. The same network architecture was used as before, but the output was of size 29 × 29 × 3. The ANNs were trained over 30,000 images and validated over 5,000 images. The results are summarized in Fig. 3(a). Exemplary images for each of the 3 colors are shown for z = 1 mm (Fig. 3(b)), 5 mm (Fig. 3(c)) and 10 mm (Fig. 3(d)).
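The defocus study can be summarized as an evaluation loop. Here `models`, `images_at_gap` and `truths_at_gap` are hypothetical containers standing in for the trained ANNs and the 100-image capture sets described above; only the per-box accuracy metric is taken directly from the text:

```python
import numpy as np

def box_accuracy(pred, truth):
    """Fraction of the 29x29 QR boxes classified correctly (the paper's metric)."""
    return float(np.mean((np.asarray(pred) > 0.5) == (np.asarray(truth) > 0.5)))

def defocus_sweep(models, images_at_gap, truths_at_gap):
    """Evaluate every trained network at every capture gap."""
    results = {}
    for name, model in models.items():          # e.g. networks trained at 1/5/10 mm
        for gap, frames in images_at_gap.items():
            accs = [box_accuracy(model(f), t)
                    for f, t in zip(frames, truths_at_gap[gap])]
            results[(name, gap)] = float(np.mean(accs))
    return results

# Toy check: a "perfect" model that returns the ground truth scores 1.0.
truth = np.random.rand(29, 29)
res = defocus_sweep({"ANN_1": lambda f: f}, {1: [truth]}, {1: [truth]})
print(res[("ANN_1", 1)])  # 1.0
```

Plotting `results` per network against the gap would reproduce the structure of Fig. 2(c), with each curve peaking at its training gap.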
Next, we also trained a new network (ANN1C*) with QR codes of 6 colors (including 3 non-primary colors). A total of 60,000, 5,000 and 1,000 images were used for training, validation and testing, respectively. The experiments were performed at z = 1 mm, and an accuracy of 96% was achieved after 70 epochs (Fig. 4(a)). Exemplary images are illustrated in Fig. 4(b). To demonstrate the versatility of the approach, we also trained another network on a dataset acquired by displaying colored emojis (Emojipedia.org) on the LCD. Here, 30,000, 5,500 and 1,110 images were used for training, validation and testing, respectively. An accuracy of 77% was achieved. All experiments in Fig. 4 were performed at z = 1 mm. Also, note that although the QR codes contain binary values in each color channel, the emojis contained 8-bit values.
Finally, we also explored the impact of sensor noise on the ANN reconstructions by synthetically adding Gaussian noise with zero mean and varying standard deviations, and then processing these noisy frames with the trained ANN1 (monochrome QR codes at z = 1 mm). The results, summarized in the supplement, show that our approach can attain a reconstruction accuracy of 90% even with noise of 10% standard deviation. In conclusion, we demonstrated an optics-free camera, comprised of a trained artificial neural network and a bare image sensor, that converts raw sensor frames to human-recognizable images. It is important to note that our approach does not rely on high temporal or spatial coherence, and it can be used for fast imaging of objects spaced 1 mm or more from the sensor. Such optics-free cameras have the potential to enable ultra-thin, lightweight, and inexpensive application-specific non-anthropocentric imaging.
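The synthetic-noise step can be sketched as follows; the clipping range assumes sensor frames normalized to [0, 1], and the relative-standard-deviation interpretation of "10% noise" is an assumption, not the authors' stated definition:

```python
import numpy as np

def add_sensor_noise(frame, rel_std, rng=None):
    """Add zero-mean Gaussian noise whose standard deviation is a fraction
    (rel_std) of the frame's peak value, then clip back to the valid range."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=rel_std * frame.max(), size=frame.shape)
    return np.clip(frame + noise, 0.0, 1.0)

frame = np.random.rand(240, 320, 3).astype(np.float32)   # a normalized raw frame
noisy = add_sensor_noise(frame, rel_std=0.10)             # 10% standard deviation
print(noisy.shape)  # (240, 320, 3)
```

Sweeping `rel_std` and passing each noisy frame through the trained network, then computing the per-box accuracy, would reproduce the noise-robustness curve summarized in the supplement.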

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Fig. 2. Results from the ANNs. (a) Accuracy for training and validation with z = 1 mm, 5 mm and 10 mm. (b) Exemplary sensor, ground-truth and ANN-output images at z = 1 mm and 5 mm. More images are included in the supplement. (c) Effect of defocus on accuracy. Accuracy is highest at the trained value of z ("focus").