Object detection neural network improves Fourier ptychography reconstruction

High-resolution microscopy depends heavily on superb optical elements, and superresolution microscopy even more so. Correcting unavoidable optical aberrations during post-processing is an elegant way to reduce the optical system's complexity. A prime method that promises superresolution, aberration correction, and quantitative phase imaging is Fourier ptychography. This microscopy technique combines many images of the sample, recorded at differing illumination angles akin to computed tomography, and minimises the error between the recorded images and those generated by a forward model. The more precisely the illumination angles are known to the image formation forward model, the better the result. Illumination estimation from the raw data is therefore an important step that supports correct phase recovery and aberration correction. Here, we derive how illumination estimation can be cast as an object detection problem that permits the use of a fast convolutional neural network (CNN) for this task. We find that faster RCNN delivers highly robust results and outperforms classical approaches by far, with an up to 3-fold reduction in estimation errors. Intriguingly, we find that conventionally beneficial smoothing and filtering of raw data is counterproductive in this type of application. We present a detailed analysis of the network's performance and provide all our developed software openly. © 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement


Introduction
Superresolution microscopy received its accolades with the Nobel Prize in Chemistry in 2014 and has since been firmly established as a method of choice in biomedical imaging core facilities. A constant burden that accompanies ever higher resolution is the stark dependence on superb system alignment and on the performance of the employed optical elements. In practice, this is impossible to guarantee at all times, and hence post-acquisition computational aberration correction has developed rapidly in recent years [1][2][3][4][5][6].
Fourier ptychographic microscopy (FPM) [7] stands out from the superresolution family as it is a label-free technique. It is one of the latest microscopy methods to be developed and offers a range of benefits over conventional brightfield imaging. Its main features are (1) retrieval of the optical density of a sample without the need for interferometric detection, (2) correction of optical aberrations induced by the employed optics through recovery of the microscope's complex-valued coherent transfer function [8], (3) imaging with an extraordinarily large space-bandwidth product, and (4) the ability to achieve a resolution well beyond that dictated by the microscope objective lens, thus allowing even nanoscopic resolution [9,10].
FPM setups illuminate the sample from a multitude of directions sequentially and capture the scattered light using an objective lens that forms an image of the sample on a camera. Different illumination angles cause different features of the sample to be more pronounced. One can easily verify this with a flashlight and a relief surface. Mathematically, the sample's complex scattering field gets phase-modulated (a rigorous derivation is provided as supplementary information). This modulation is non-linearly linked to both the pupil function and a sample-dependent phase-delay, typically called quantitative phase or simply phase. The pupil function describes optical aberrations and transmission strength of the imaging system, whereas the phase is a quantitative measure related to the sample's optical density.
Fig. 1. Flowchart of the calibration and reconstruction process. Raw data is Fourier transformed and (optionally) pre-processed. Then the system parameters are extracted with the neural network (FRCNN) and processed via the FPM phase retrieval algorithm using the alternating projections method. Outputs are the sample's amplitude, phase, and the recovered system- (and patch-) specific pupil function.
To disentangle the pupil and phase components, FPM relies on a reconstruction algorithm (see Fig. 1) that minimises the error between the recorded frames and computer-generated 'raw' images based on a forward model of the imaging process. The parameters to optimise in this process are the sample phase and the pupil function, whereas knowledge of the illumination geometry serves as a necessary constraint.
The accurate extraction and calibration of the illumination geometry from the raw data is the main focus of this article and we show how the problem of illumination estimation can be cast as an object detection problem (find and locate an image feature) in the Fourier domain, which permits the use of recently published high-performance object-detection neural networks. In the following, we explain how illumination calibration can be formulated in such a way and proceed by applying and evaluating a suitable neural network to the task. All developed code is freely available on GitHub (github.com/IAmSuyogJadhav/NN-Illumination-Estimation-FPM/).

Theory and methods
The illumination of a scattering sample with a plane wave is equivalent to shifting the object spectrum by the lateral component of the illumination wave vector; the shifted spectrum is subsequently low-pass filtered by the objective's pupil function.
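The spectral shift described above is an instance of the Fourier shift theorem and can be checked numerically. The following is a minimal sketch, not the authors' forward model: the Gaussian test object, the grid size, and the function name `spectrum_peak_shift` are assumptions chosen only for illustration, and the tilt is restricted to an integer number of cycles across the field of view so that the shift falls exactly on the pixel grid.

```python
import numpy as np

def spectrum_peak_shift(n=64, kx=5, ky=-3):
    """Return the (row, col) displacement of the spectral peak caused by a
    tilted plane wave with kx (ky) cycles across the field along x (y)."""
    rows, cols = np.mgrid[0:n, 0:n]
    # Smooth Gaussian test object (illustrative); its spectrum peaks at zero frequency.
    obj = np.exp(-((cols - n / 2) ** 2 + (rows - n / 2) ** 2) / (2 * 6.0 ** 2))
    # Oblique plane-wave illumination multiplies the object by a linear phase ramp.
    plane_wave = np.exp(2j * np.pi * (kx * cols + ky * rows) / n)
    spec_normal = np.abs(np.fft.fftshift(np.fft.fft2(obj)))
    spec_tilted = np.abs(np.fft.fftshift(np.fft.fft2(obj * plane_wave)))
    peak_normal = np.unravel_index(np.argmax(spec_normal), spec_normal.shape)
    peak_tilted = np.unravel_index(np.argmax(spec_tilted), spec_tilted.shape)
    return (int(peak_tilted[0] - peak_normal[0]),
            int(peak_tilted[1] - peak_normal[1]))

shift = spectrum_peak_shift()
print(shift)
```

In this sketch the peak of the spectrum moves by exactly the lateral wave vector of the illumination (in Fourier pixels), which is the quantity the calibration needs to recover.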
As derived rigorously in the supplementary document, the oblique illumination in FPM introduces a spectral shift of the pupil function in the recorded spectrum that is directly proportional to the illumination wave vector (in both the positive and the negative direction). These shifts are clearly visible in the raw data and are a dominant feature that is largely independent of the sample (see Fig. 2(b) for an example). Detecting and locating one of these disks in the Fourier domain is thus equivalent to determining the illumination angle. In addition to the illumination angle, it is possible to retrieve the effective coherent cut-off frequency by accurately determining the radius of the displaced pupils. The FPM phase retrieval (with additional embedded pupil function recovery [8]) makes use of this information to disentangle the pupil from the underlying sample spectrum, whose extent is governed by the combined extent of all raw image spectra. This is commonly implemented as an iteratively solved error minimisation problem (see Supplement 1 for further details).
To summarise, the problem of calibrating the brightfield illumination in FPM can be simplified into an object detection problem in which a "noisy disk" needs to be automatically identified and accurately located. This is a common task in computer vision and can be completed with impressive robustness and fidelity by specialised object-detection neural networks.
Neural networks (NNs) are used more and more regularly for computer vision and related machine learning tasks. In microscopy, a whole range of application areas for NNs has emerged [11], including denoising [12,13], digital staining [14][15][16][17], counting and labelling [18], tracking [19], image reconstruction [20][21][22], computational microscopy [23][24][25][26], virtual focusing [27,28], aberration estimation [29], and segmentation [30][31][32].
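To make the link between a detected disk and the illumination parameters concrete, the sketch below converts a bounding box in Fourier-space pixels into a lateral spatial frequency, a cut-off frequency, and an illumination angle. It is an illustration under stated assumptions rather than the published pipeline: the function name, the box format (x_min, y_min, x_max, y_max), the zero-frequency pixel at N/2, illumination in air, and all numerical values are hypothetical. It uses the standard relations that one Fourier pixel spans 1/(N p) and that a plane wave at angle θ has lateral spatial frequency sin θ / λ.

```python
import math

def box_to_illumination(box, n_pixels, pixel_size_um, wavelength_um):
    """Convert a detected disk bounding box (Fourier pixels) to illumination parameters."""
    x_min, y_min, x_max, y_max = box
    cx = 0.5 * (x_min + x_max)                     # disk centre, x
    cy = 0.5 * (y_min + y_max)                     # disk centre, y
    radius_px = 0.25 * ((x_max - x_min) + (y_max - y_min))
    dk = 1.0 / (n_pixels * pixel_size_um)          # size of one Fourier pixel in 1/um
    dc = n_pixels // 2                             # zero-frequency pixel (assumed convention)
    fx = (cx - dc) * dk                            # lateral spatial frequency, 1/um
    fy = (cy - dc) * dk
    cutoff = radius_px * dk                        # coherent cut-off frequency, 1/um
    sin_theta = wavelength_um * math.hypot(fx, fy) # assumes illumination in air
    theta_deg = math.degrees(math.asin(min(1.0, sin_theta)))
    return {"fx": fx, "fy": fy, "cutoff": cutoff, "theta_deg": theta_deg}

# Hypothetical detection on a 128 x 128 spectrum, 1 um effective pixel size, 0.5 um wavelength.
result = box_to_illumination((62, 50, 102, 90), 128, 1.0, 0.5)
print(result)
```

The box centre encodes the illumination direction and the box size encodes the cut-off frequency, so a single detection provides both calibration quantities.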
In the context of FPM, attempts have been made to perform the whole phase retrieval process with a neural network, although this is still an area of active research. So far, such approaches have found little use in practice, either because of limited performance gains over classical phase retrieval [33] or because of impractically long reconstruction times in unsupervised networks [24]. Reliable and fast deep-learning-based phase imaging has been shown to be possible in networks with supervised training [23], but care must be taken when applying such networks to sample types they were not trained on. Moreover, supervised networks require training sets composed of raw data and conventionally reconstructed FPM images, highlighting the need for high-fidelity ground truth reconstructions. The performance thus still hinges on a reliable illumination estimation.
Illumination calibration, in contrast, is largely independent of the sample and thus well-suited for neural network processing without training on vast sets of experimental ground truth data. Popular networks for this type of task are two-stage region-proposal convolutional neural networks (RCNNs). A fast and very robust implementation of this network type is faster RCNN (FRCNN) [34], which is our architecture of choice. Its simplified architecture is shown in Fig. 2(a) to illustrate the two stages.
In brief, the network's first stage consists of the ResNet-101 unit, which contains a sequence of 33 residual blocks of four varieties (referred to as Conv2, Conv3, Conv4, and Conv5 for ease of reference in literature). The first 3 blocks are of type Conv2, the next 4 of Conv3, the next 23 of Conv4, and the remaining 3 of Conv5. Each of these blocks has three convolution layers connected in series and includes a residual link that skips all three layers, such that the input can be added directly to the output. The architectures of the three convolution layers of Conv2 to Conv5 are described in Table 1. Note that the 1 × 1 convolutions of layer 1 and layer 3 are used to change the number of feature channels (reducing it in layer 1 and restoring it in layer 3). This is the most common application of this type of filter and hence these layers are often called feature map pooling layers.
The ResNet-101 unit is followed by the region proposal network (RPN) unit. As the name suggests, it creates proposals for regions that likely contain the objects of interest. It takes in the feature map generated by ResNet-101 and creates several candidate anchor boxes (or region proposals) at each pixel in the feature map. The anchor boxes pass through a classifier and a regressor in parallel, such that the anchor boxes with good classification accuracy are identified by the classifier, and appropriate coordinates of the bounding boxes in the original image are identified by the regressor. The bounding boxes with good classification accuracy are then returned as the outputs.
Our architecture choice is motivated by the following characteristics of our data. Consider the three region proposals (RPs) shown in Fig. 2(b), which may be generated by an object detection approach when learning bounding box locations. RPs A and B are evidently ambiguous as to whether they contain foreground or background, while RP C is much more useful for detection. In this situation, most of the RPs created by an object detection approach will be ambiguous and not useful for learning. Conventional single-stage approaches cannot deal with such a poor ratio of useful to less meaningful RPs. Two-stage approaches, on the other hand, use their second stage for classification, which allows them to handle even poorer ratios.
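The anchor-box step of the RPN and the poor ratio of useful to ambiguous proposals can be illustrated with a small sketch. This is not the FRCNN implementation used here: the anchor sizes, aspect ratios, stride, the intersection-over-union (IoU) threshold of 0.5, and the disk bounding box are all hypothetical values picked for illustration.

```python
def make_anchors(feat_h, feat_w, stride, sizes=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Candidate boxes (x_min, y_min, x_max, y_max) centred on every feature-map pixel."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride          # anchor centre in image coordinates
            cy = (row + 0.5) * stride
            for size in sizes:
                for ratio in ratios:           # ratio = width / height, area = size**2
                    w = size * ratio ** 0.5
                    h = size / ratio ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def iou(a, b):
    """Intersection over union of two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

anchors = make_anchors(8, 8, 16)               # 8 x 8 feature map of a 128 x 128 image
disk_box = (36.0, 36.0, 76.0, 76.0)            # hypothetical ground-truth disk box
overlaps = [iou(a, disk_box) for a in anchors]
useful_fraction = sum(o >= 0.5 for o in overlaps) / len(anchors)
print(len(anchors), useful_fraction)
```

With a single disk per image, only a small fraction of the candidate boxes overlaps the target well, which is the imbalance that the second stage of a two-stage detector is meant to absorb.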
We used a pre-trained version of FRCNN [35] with ResNet-101 [36] as backbone that uses dilated convolutions [37] in Conv5 to benefit from transfer learning. This can be justified as low-level features are largely independent of the detection task at hand. Thus, using pre-trained networks shortens the training period for microscopy system-specific features like disks or rings. Furthermore, the spectra of natural images are commonly alike (decreasing amplitude with higher spatial frequencies). In addition, as microscopy systems are commonly designed to record at the Nyquist limit, the size of the apparent pupil in the raw data spectra relative to the image size is comparable between most microscopes. This limited search space is beneficial, as training on a limited range of cut-off values is sufficient to realise a universally applicable illumination finder. We therefore conclude that the network can be applied to spectra of microscopy images without additional sample-specific training.
We trained the network on the magnitude spectra of computer-generated raw data obtained using the FPM forward model (see Supplement 1 for details). Note that since object detection with conventional NNs is not directly applicable to the complex-valued Fourier domain, we operated only on the Fourier transform magnitudes. This step can be justified as the imprint of the pupil function on the spectrum's phase contains only values between ±π and is thus more susceptible to the influence of the sample spectrum than the pupil magnitude.
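A simulated training sample of this kind could be generated along the following lines. This is a simplified sketch and not the forward model of Supplement 1: it assumes an ideal, aberration-free circular pupil, a weak random phase object, an integer spectral shift, and hypothetical names and values throughout. It returns the magnitude spectrum of the simulated intensity image together with the bounding boxes of the two displaced pupil disks.

```python
import numpy as np

def simulate_training_sample(obj, shift_px, radius_px):
    """Simulate one raw frame; return its magnitude spectrum and the two disk boxes.

    obj       : complex N x N sample transmission
    shift_px  : (sx, sy) illumination-induced spectral shift in Fourier pixels
    radius_px : coherent cut-off radius of the pupil in Fourier pixels
    """
    n = obj.shape[0]
    rows, cols = np.mgrid[0:n, 0:n]
    sx, sy = shift_px
    # Oblique plane-wave illumination shifts the object spectrum.
    plane_wave = np.exp(2j * np.pi * (sx * cols + sy * rows) / n)
    spectrum = np.fft.fftshift(np.fft.fft2(obj * plane_wave))
    # Ideal circular pupil centred on zero frequency (no aberrations modelled).
    c = n // 2
    pupil = (cols - c) ** 2 + (rows - c) ** 2 <= radius_px ** 2
    # The camera records the intensity of the low-pass filtered field.
    field = np.fft.ifft2(np.fft.ifftshift(spectrum * pupil))
    intensity = np.abs(field) ** 2
    # Network input: magnitude of the Fourier transform of the raw image.
    magnitude = np.abs(np.fft.fftshift(np.fft.fft2(intensity)))
    # Ground-truth boxes (x_min, y_min, x_max, y_max) of the disks at +shift and -shift.
    boxes = [(c + sign * sx - radius_px, c + sign * sy - radius_px,
              c + sign * sx + radius_px, c + sign * sy + radius_px)
             for sign in (+1, -1)]
    return magnitude, boxes

rng = np.random.default_rng(0)
sample = np.exp(1j * 0.1 * rng.standard_normal((64, 64)))   # weak phase object (illustrative)
magnitude, boxes = simulate_training_sample(sample, shift_px=(7, -4), radius_px=12)
print(boxes)
```

For a weakly scattering sample, most of the spectral energy of the intensity image falls inside the two displaced disks, which is what makes them detectable as objects in the magnitude spectrum.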

Results and discussion
An example reconstruction using illumination calibration with FRCNN is shown in Fig. 3(a). The reconstruction quality of FRCNN is on a par with a reconstruction for which illumination calibration was performed with the classical circular edge detection (CED) method (github.com/Waller-Lab/Angle_SelfCalibration) [38] (shown in panel b). However, looking at the disk detections in Fourier space (panel c), an improvement in favour of FRCNN can be seen (all frames of this data set are contained in Supplement 1, Figure S4). To objectively quantify the performance of FRCNN with respect to CED, we performed disk detection with both techniques on over 1000 images (128 × 128 pixels) generated from ground truth images using the FPM forward model. The error in localisation of disks is shown in the violin plot [39] of Fig. 4(a). The violin plots (github.com/bastibe/Violinplot-Matlab) were generated in MATLAB and show each data point as well as an estimate of the probability density of the data. The mean absolute error for CED was 2.4 pixels, while the mean absolute error of FRCNN was 0.9 pixels (∼ 3× reduction in error).
Intriguingly, we find an almost bimodal distribution in CED, which indicates that a failure of the algorithm can at times be severe. Nevertheless, even the portion of "successful" disk estimations in CED is below the performance of FRCNN. Further, the error spread of FRCNN is smaller and shows only very few outliers, which speaks for its robustness.
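For readers who wish to reproduce this kind of comparison on their own detections, a localisation error and its mean can be computed as sketched below. The definition of the error as the Euclidean distance between estimated and true disk centres is an assumption made here for illustration, and the coordinates are invented placeholder values, not data from this study.

```python
import math

def localisation_errors(estimated, truth):
    """Euclidean distance (in pixels) between estimated and true disk centres."""
    return [math.hypot(ex - tx, ey - ty)
            for (ex, ey), (tx, ty) in zip(estimated, truth)]

def mean_absolute_error(errors):
    """Mean of the (non-negative) localisation errors."""
    return sum(errors) / len(errors)

# Placeholder centres for illustration only (not results from the paper).
true_centres = [(70.0, 60.0), (50.0, 66.0), (64.0, 80.0)]
estimated_centres = [(70.6, 60.8), (50.0, 65.0), (67.0, 84.0)]

errors = localisation_errors(estimated_centres, true_centres)
mae = mean_absolute_error(errors)
print(errors, mae)
```

A single large outlier, as in the third placeholder point, dominates the mean, which is why inspecting the full error distribution alongside the mean is informative.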
Fig. 4 (caption excerpt). Differences between the performance on raw and filtered raw data are assessed using the Kolmogorov-Smirnov test with outlier removal via the generalised extreme Studentized deviate test (ns = not significant, ** = null hypothesis (no difference between distributions) rejected at the 1% significance level).
Next, we investigated the effect of image size. Because of spatially varying pupil aberrations, the imaging forward model can only be considered linear shift-invariant (LSI) in a small image area called an isoplanatic patch. Calibration (and reconstruction) should hence be performed on image sizes no larger than such a patch, so that the LSI imaging model holds and the pupil and illumination are estimated correctly. However, a smaller image size makes accurate illumination detection more difficult, as Fourier space pixels become larger and overall less information is available to determine the illumination angle. Hence we compare FRCNNs trained separately on image sizes of 64 × 64, 128 × 128, 256 × 256, and 512 × 512 pixels to find the best performing model for a given isoplanatic patch size. When localisation errors in pixels are compared between image sizes, the values from larger images appear larger, because a relatively small error in a large image corresponds to more pixels than a 'similar' error in a smaller image. To remove this inherent bias, we report the error both in µm⁻¹ and in pixels. The conversion formula is
$$\delta d_{\mathrm{pxl}} = N \, p \, \delta d_{\mu\mathrm{m}^{-1}} \qquad (1)$$
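As a quick numerical illustration of Eq. (1), the sketch below converts a fixed physical error into pixels for the four image sizes considered, with N the image size in pixels and p the effective pixel size. The pixel size of 0.5 µm and the error of 0.01 µm⁻¹ are hypothetical values chosen for the example, and the function names are not taken from the published code.

```python
def error_um_inv_to_pixels(err_um_inv, n_pixels, pixel_size_um):
    """Eq. (1): error in Fourier pixels from an error in 1/um."""
    return n_pixels * pixel_size_um * err_um_inv

def error_pixels_to_um_inv(err_px, n_pixels, pixel_size_um):
    """Inverse of Eq. (1): error in 1/um from an error in Fourier pixels."""
    return err_px / (n_pixels * pixel_size_um)

pixel_size_um = 0.5     # hypothetical effective pixel size in the sample plane
err_um_inv = 0.01       # hypothetical physical localisation error in 1/um

for n in (64, 128, 256, 512):
    err_px = error_um_inv_to_pixels(err_um_inv, n, pixel_size_um)
    print(f"N = {n:3d}: {err_um_inv} 1/um corresponds to {err_px:.2f} px")
```

The same physical error spans more Fourier pixels as the image grows, which is why pixel errors alone are not comparable across patch sizes.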
In Eq. (1), p is the effective camera pixel size in the sample plane and N is the image size in pixels. Since the conversion factor between the error in pixels (δd_pxl) and the error in µm⁻¹ (δd_µm⁻¹) scales with the image size N, converting to µm⁻¹ yields error values on a common scale that can be compared across different patch sizes. As illustrated in Fig. 4(b), we find that increasing the patch size reduces the error in terms of the physical parameter (µm⁻¹), while in terms of pixel accuracy a larger patch size increases the error. The lower limit for the necessary wave vector estimation precision is determined by the degree of spatial coherence of the illumination light [38], while the maximal image size is determined by the isoplanatic patch size. Therefore, it is possible to choose an image size that both fits within the isoplanatic patch and achieves high illumination estimation precision. Note that we also used CED on these patch sizes, which was always at least three times less accurate than FRCNN (results contained in Supplement 1).
Thirdly, we explored the effect of pre-processing the Fourier spectra before feeding them to the neural network. In standard machine learning tasks, pre-processing of the raw data improves results, and smoothing and denoising an image is generally deemed beneficial for algorithms that detect image features. On the other hand, altering the image spectrum might obscure the actual location of the shifted pupil. The effects of pre-processing on the disk centre estimation are summarised in the violin plots in Fig. 4(c), which compare the localisation error (in pixels) between unprocessed raw spectra and spectra filtered using the full pre-processing pipeline. We find a limited or at times even adverse effect of pre-processing.
A detailed analysis (see Supplement 1) shows that bilateral filtering can have a small positive (though not significant) effect on some patch sizes, whereas other filters for image smoothing can be highly disadvantageous. In contrast to most deep-learning applications in computer vision, our experiments thus indicate that pre-processing provides no advantage when applied to object detection tasks in Fourier space, and might indeed worsen the performance while further increasing computation time.
Lastly, we investigated the generalisability of illumination calibration via neural networks. We used both conventional refractive objective data ("normal") and reflective objective data possessing a prominent obscuration ("obscuration") as input to three differently trained FRCNNs. Network 1 had seen only "normal" objective data, network 2 was trained on reflective objective data only, and network 3 was trained on both types of data. Note that the network architecture was the same in all cases and only the obscuration was modelled additionally in the forward model for networks 2 and 3. The patch size was 256 × 256 pixels and no Fourier space pre-processing was applied. As is evident in Table 2, the presence of the obscuration significantly worsened the performance of the network that had never seen it during training.
Vice versa, the network trained on obscuration data showed a steep decline in calibration accuracy when applied to "normal" data. Two other observations can be made, though. Firstly, the presence of an obscuration is beneficial for illumination calibration (the overall error is smaller). We assume that any additional feature of the pupil would have this effect, as more useful RPs would be present for the network to work with. Secondly, the performance of a more broadly trained network is largely on a par with that of a specialised one. Intriguingly, this extension of illumination calibration to reflective objectives only required a small adaptation of the forward model, which would not have been feasible in such a simple and straightforward way with conventional approaches like CED.
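The kind of small forward-model adaptation referred to above can be sketched as a pupil mask with an optional central obscuration. This is a hypothetical illustration, not the model used for networks 2 and 3: the function name, the binary (aberration-free) pupil, and the obscuration ratio of 0.4 are assumptions.

```python
import numpy as np

def annular_pupil(n, outer_radius_px, obscuration_ratio=0.0):
    """Binary pupil of a given radius with an optional central obscuration.

    obscuration_ratio is the blocked inner radius as a fraction of the outer
    radius; 0.0 gives a conventional (refractive-objective) circular pupil.
    """
    rows, cols = np.mgrid[0:n, 0:n]
    r = np.hypot(cols - n // 2, rows - n // 2)     # distance from zero frequency
    pupil = r <= outer_radius_px
    if obscuration_ratio > 0:
        pupil &= r >= obscuration_ratio * outer_radius_px
    return pupil.astype(float)

normal_pupil = annular_pupil(256, 40)                            # "normal" objective
obscured_pupil = annular_pupil(256, 40, obscuration_ratio=0.4)   # reflective objective
print(normal_pupil.sum(), obscured_pupil.sum())
```

Swapping one mask for the other in a simulation of the raw data changes the appearance of the displaced disks (rings instead of filled disks) without touching the detection network itself.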

Summary
Illumination estimation can be posed as an object detection problem, a common deep learning task that, to the best of our knowledge, has not conventionally been attempted in Fourier space. We observe that the use of faster RCNN for illumination estimation improves on traditional methods like CED [38], with a 3-fold reduction in disk localisation error. Further, deep learning allows us to design tailor-made algorithms unique to particular microscopy setups with different isoplanatic patch sizes. The increased degree of abstraction in neural networks further eliminates the need for devising dedicated feature detection routines for distinct microscopy setups: the network can adapt to any pupil shape, as found for example in reflective microscope objectives. This can also help mitigate the loss of precision usually observed when an algorithm is applied to a type of data that differs substantially from the data it was designed for. It also renders our approach highly user-friendly, as successful illumination estimation requires no user-set parameters. We additionally investigated the effect of pre-processing. Contrary to common experience with many other image processing tasks involving neural networks, where such pre-processing proves very useful, we found that it is less viable on small image patch sizes in FPM illumination calibration. Finally, with the progress made in computational hardware in recent years, deep learning proved computationally feasible and provided much more precise estimations, with a computational overhead lower than that of CED by a factor of 2.
Looking ahead, using neural networks for the full FPM reconstruction pipeline is also an area of active research. An interesting approach in this respect is the combination of classical reconstruction routines and neural networks, where only the first FPM image of a time-series is reconstructed classically and serves as the sole training set for a neural network [40]. Given a long enough time-series, training of the network and application to consecutive frames is then considerably faster than classical reconstruction of each frame, while maintaining equivalent image quality. Moreover, as fewer raw frames are required for reconstruction with neural networks, the overall frame-rate can be increased tremendously. In the end, though, such approaches still hinge on a reliable illumination estimation.

Author contributions
FS conceived and supervised the project, implemented the forward model, analysed data, and wrote the manuscript. SJ implemented and trained the networks, and performed simulations. DKP provided expertise on neural networks. KA and BSA provided guidance and research tools. All authors assisted in writing the manuscript.