Deep-learning-augmented Computational Miniature Mesoscope

Fluorescence microscopy is essential to study biological structures and dynamics. However, existing systems suffer from a tradeoff between field-of-view (FOV), resolution, and complexity, and thus cannot fulfill the emerging need of miniaturized platforms providing micron-scale resolution across centimeter-scale FOVs. To overcome this challenge, we developed Computational Miniature Mesoscope (CM$^2$) that exploits a computational imaging strategy to enable single-shot 3D high-resolution imaging across a wide FOV in a miniaturized platform. Here, we present CM$^2$ V2 that significantly advances both the hardware and computation. We complement the 3$\times$3 microlens array with a new hybrid emission filter that improves the imaging contrast by 5$\times$, and design a 3D-printed freeform collimator for the LED illuminator that improves the excitation efficiency by 3$\times$. To enable high-resolution reconstruction across the large imaging volume, we develop an accurate and efficient 3D linear shift-variant (LSV) model that characterizes the spatially varying aberrations. We then train a multi-module deep learning model, CM$^2$Net, using only the 3D-LSV simulator. We show that CM$^2$Net generalizes well to experiments and achieves accurate 3D reconstruction across a $\sim$7-mm FOV and 800-$\mu$m depth, and provides $\sim$6-$\mu$m lateral and $\sim$25-$\mu$m axial resolution. This provides $\sim$8$\times$ better axial localization and $\sim$1400$\times$ faster speed as compared to the previous model-based algorithm. We anticipate this simple and low-cost computational miniature imaging system will be impactful to many large-scale 3D fluorescence imaging applications.


Introduction
Fluorescence microscopy is indispensable for studying biological structures and dynamics [1]. However, the emerging need for compact, lightweight platforms achieving micron-scale resolution across centimeter-scale fields of view (FOVs) has created two new challenges. The first challenge is to overcome the barrier of imaging at large scale while preserving resolution [2]. Recently developed table-top systems [3,4] have enabled multiscale measurements with sufficient resolution; however, they are complex and bulky. The second challenge is to perform large-scale imaging in a compact, lightweight platform. Miniaturized fluorescence microscopes, i.e., miniscopes [5], have enabled neural imaging in freely moving animals. However, most miniscopes rely on a gradient index (GRIN) objective lens [5] that limits the FOV to <1 mm$^2$. Wide-FOV miniscopes have recently been developed by replacing the GRIN with a compound lens [6,7], but at the cost of degraded resolution and increased size, weight, and system complexity. In general, fundamental physical limits preclude meeting the joint requirements of FOV, resolution, and miniaturization using conventional optics.
Computational imaging techniques have shown unique capabilities of overcoming the limitations of conventional optics by jointly designing optics and algorithms. Lightfield microscopy (LFM) [8] and related technologies [9,10] achieve single-shot high-resolution 3D fluorescence imaging [11,12]. LFM works by attaching a microlens array (MLA) to an existing microscope to collect both spatial and angular information, which enables reconstructing the 3D fluorescence from a single shot. While miniaturized LFMs [13,14] enable single-shot 3D imaging on modified miniscope platforms, the FOV is limited by the GRIN lens. Lensless imaging is another computational imaging technique for single-shot 3D imaging, where a mask [15] or a diffuser (random microlens array) [16,17] is placed directly in front of a CMOS sensor. However, the removal of focusing optics imposes penalties on the measurement's contrast and signal-to-noise ratio (SNR) [4], severely limiting the sensitivity for imaging weak fluorescent signals [4]. We recently developed the Computational Miniature Mesoscope (CM$^2$) [18], which aims to overcome all the key limitations of FOV, resolution, contrast, SNR, size, and weight in existing miniature fluorescence imaging systems. The CM$^2$ combines the merits of both LFM and lensless designs. It places a 3 × 3 MLA directly in front of a CMOS sensor for imaging, like the lensless design. This ensures compactness and light weight while further exploiting the microlenses' focusing power to provide high image contrast. The CM$^2$ captures an image with multiple views, which enables robust recovery of 3D fluorescence in a single shot, like the LFM. Previously, we demonstrated CM$^2$ V1, which achieved 3D fluorescence imaging in a 7 × 8 mm$^2$ FOV with 7-µm lateral and 200-µm axial resolution [18]. CM$^2$ V1 is the first standalone computational miniature fluorescence microscope with an integrated illumination module.
Using a four-LED array in an oblique epi-illumination geometry, CM$^2$ V1 can uniformly illuminate a 1-cm$^2$ FOV and achieves a ∼24% light efficiency. In this work, we significantly advance the CM$^2$ technology and report CM$^2$ V2, which integrates innovations in both hardware and computation to address limitations in light efficiency, image contrast, reconstruction quality, and speed, as summarized in Fig. 1.
On the hardware, we present updates that significantly improve the image contrast and light efficiency. First, we complement the 3×3 MLA with a hybrid emission filter [19] (Fig. 1a) that suppresses the spectral leakage suffered by the V1 system. Second, we design and 3D-print a miniature freeform LED collimator (Fig. 1a) that improves the excitation efficiency while preserving compactness and light weight. This new illuminator achieves ∼80% efficiency, a ∼3× improvement over the V1 design, and provides a confined uniform illumination with up-to 75 mW excitation power across an 8-mm diameter FOV. Built around a back-side illuminated (BSI) 3D fluorescence imaging across a mesoscale FOV. Built on off-the-shelf and 3D-printed components, we expect this simple and low-cost miniature system can find utility in a wide range of large-scale 3D fluorescence imaging and neural recording applications.

Methods
The CM$^2$ V2 hardware platform
CM$^2$ V2 is a standalone miniature fluorescence microscope built with off-the-shelf and 3D-printed components, as illustrated in Figs. 1a and 2a. It consists of two main parts: a newly designed illumination module and an upgraded imaging module. Compared with the V1 platform, the V2 platform features freeform LED collimators that improve the illumination efficiency by ∼3× and a hybrid emission filter design that improves the image contrast by ∼5×. In designing the illumination module, our goal is to achieve ∼50 mW total excitation power across a centimeter-scale FOV, which is sufficient for one-photon widefield calcium imaging in mouse brains [4]. In addition, the illumination module needs to be highly efficient without incurring excessive heat burden. Our solution is to incorporate a compact, lightweight freeform collimator between the surface-mounted LED (Lumileds, LXML-PB01-0040) and the excitation filter (Chroma Technology, no. 470). The collimator is based on a refraction-reflection freeform design [24]. It consists of an inner refractive lenslet and an outer parabolic reflective surface (Fig. 2b). The lenslet collimates the light within a ∼52-degree conical angle. The parabolic surface satisfies the total internal reflection (TIR) condition and collimates the light emitted at high angles. The LED is placed around the shared focal point of the lenslet and the parabolic reflector. Each collimator is ∼4 × 4 × 1 mm$^3$ in size, weighs ∼0.03 grams, and is 3D-printed with clear resin (printed on Formlabs Form 2, no. RS-F2-GPCL-04). The design achieves an efficiency of ∼80% in Zemax simulation, which accounts for the finite-sized LED emitter, the broadband LED emission spectrum, and the angle-dependent transmission spectrum of the excitation filter.
The entire illumination module consists of four LED illuminators placed symmetrically around the imaging module. After optimization in Zemax, each LED illuminator is placed ∼6.7 mm away from the imaging optical axis and tilted by ∼45 degrees to direct the light towards the central FOV. The Zemax simulation shows that this design provides nearly uniform illumination confined within an 8-mm circle (Fig. 2c). The experimental validation on a green fluorescence calibration slide (Thorlabs, no. FSK2) closely matches the simulation (Fig. 2d). The total excitation power is measured to be up to 75 mW (at the maximum driving current of 350 mA) at ∼470 nm excitation wavelength.
The imaging module is built around an off-the-shelf 3 × 3 MLA (Fresnel Technologies Inc., no. 630) to form a finite-conjugate imaging geometry with ∼0.57 magnification. The lateral resolution is primarily limited by the NA of a single microlens and is measured experimentally to be ∼6 µm (see Supplement 1). We incorporate an interference-absorption emission filter pair to improve the signal-to-background ratio (SBR) in the raw measurement. An interference filter (Chroma Technology, no. 535/50) is placed in front of the MLA. An additional long-pass absorption filter (Edmund Optics, Wratten color filter no. 12) is placed after the MLA to suppress the leakage light. The emission spectra of the emission and absorption filters are optimized for the green fluorescence, as detailed in Supplement 1. Compared with the interference-filter-only measurement, this hybrid filter design improves the SBR by >5× on a phantom consisting of 10-µm fluorescent beads (Fig. 2e). This improvement makes the new CM$^2$ V2 platform more robust in low-light fluorescence imaging conditions.
The CM$^2$ V2 is built around a backside-illuminated (BSI) CMOS sensor (IDS Imaging, IMX178LLJ), which gives a 4.15-µm effective pixel size. The dome-shaped 3D-printed housing (printed on Formlabs Form 2, black resin, no. RS-F2-GPBK-04) provides mechanical support and light shielding. The size and weight of CM$^2$ V2 are limited only by the CMOS sensor. The CM$^2$ V2 prototype is ∼36 × 36 × 15 mm$^3$ in size, including the commercial CMOS PCB. The custom parts excluding the PCB are ∼20 × 20 × 13 mm$^3$ in size and weigh only ∼2.5 grams.

3D Linear Shift Variant model of the CM$^2$
Our goal is to build an accurate and efficient 3D linear shift variant (LSV) model to describe the CM$^2$ image formation. Using the synthetic data simulated from this model, we will later train the proposed CM$^2$Net to perform 3D reconstruction. In this section, we describe a sparse PSF calibration procedure and a low-rank approximation-based 3D-LSV model.
First, to calibrate the spatially varying PSFs, we scan a 5-µm point source on a 3-axis translation stage. The point source is scanned across an 8 mm × 8 mm × 1 mm volume with steps of 1 mm laterally and 100 µm axially, which yields a stack of 9 × 9 × 11 calibrated PSFs, as illustrated in Fig. 3a. Several example calibrated PSFs are shown in Fig. 3b, which highlight the following key features of the CM$^2$ image formation. At a given lateral position, the off-axis foci shift laterally with depth, akin to the lightfield. At a given depth, the PSFs are still shift variant because of the spatially varying aberrations from the microlenses and the missing side foci at large off-axis locations (when the lateral location is > 1.7 mm) [18]. As a result, fully characterizing the CM$^2$ 3D PSF requires a 3D-LSV forward model. Unfortunately, scanning the point source on the entire dense grid at our desired 3D resolution (4.15 µm × 4.15 µm × 10 µm) across the targeted imaging volume (∼8 mm × 8 mm × 1 mm) would require ∼370 million PSF measurements, which is impractical. Next, we describe a computational procedure to address this challenge.
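The grid sizes quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `grid_points` is hypothetical, and the dense count assumes one PSF per voxel of the target sampling grid.

```python
# Sketch: size of the sparse calibration grid vs. a dense point-source scan.
# Helper and variable names are hypothetical, not taken from the paper.

def grid_points(extent_um: float, step_um: float) -> int:
    """Number of sample points covering [0, extent] inclusive at a fixed step."""
    return round(extent_um / step_um) + 1

# Sparse grid from the text: 8 mm x 8 mm x 1 mm, 1 mm lateral and 100 um axial steps.
n_lateral = grid_points(8000.0, 1000.0)
n_axial = grid_points(1000.0, 100.0)
n_sparse = n_lateral * n_lateral * n_axial

# Dense grid at the target sampling (4.15 um lateral, 10 um axial),
# counted as one PSF per voxel (an assumption).
dense_lateral = round(8000.0 / 4.15)
dense_axial = round(1000.0 / 10.0)
n_dense = dense_lateral * dense_lateral * dense_axial

print(n_sparse, n_dense, n_dense / n_sparse)
```

Under these assumptions the sparse grid holds 891 PSFs, while the dense scan would need several hundred million, which is the gap the low-rank model is meant to close.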
We develop a low-rank approximation based 3D-LSV model for simulating the CM$^2$ measurement in the following steps (more details in Supplement 1):
1. We denote the sparsely calibrated PSFs as $H(u, v; x, y, z)$, where $(u, v)$ are the pixel coordinates of the PSF image, and $(x, y, z)$ is the 3D location of the point source. In total, the calibrated PSF set contains $N = 891$ images (the effect of the PSF calibration grid is studied in Supplement 1). Each raw PSF image contains ∼6.4M pixels, which is too large to be directly operated on for the low-rank decomposition. To address this issue, we develop a memory efficient scheme by exploiting the highly confined foci in the PSF image. We remove most of the dark regions in the images and then align the cropped foci. The alignment step essentially compensates for the depth-dependent lateral shift in the off-axis foci. We denote this "compressed" and aligned PSF calibration set as $H_c(u', v'; x, y, z)$, where $(u', v')$ are the new pixel coordinates after cropping and alignment.
2. We approximate the calibrated PSFs by a rank-$K$ singular value decomposition (SVD):
$$H_c(u', v'; x, y, z) \approx \sum_{i=1}^{K} B_i(u', v')\, C_i(x, y, z), \qquad (1)$$
where $B_i(u', v')$, $\{i = 1, \ldots, K\}$, denotes the $i$th basis PSF and $C_i(x, y, z)$ is the corresponding coefficient volume. Equation (1) approximates the set of calibrated PSFs as a linear combination of basis PSFs. The first five basis PSFs and coefficient volumes are shown in the first two rows in Fig. 3c. We choose $K = 64$, which has a small ∼2.5% approximation error on the calibration set (Fig. 3d). The choice of $K$ incurs a tradeoff between the model accuracy and computational cost. In addition, this low-rank approximation also helps suppress noise in the raw PSF measurements (more details in Supplement 1).
3. To obtain the coefficient volumes at any uncalibrated 3D location, we perform 3D bilinear interpolation from the sparse calibration grid to the dense reconstruction grid. This procedure relies on the assumption that the PSFs are slowly varying in 3D [25], which means that 1) the basis PSFs can be accurately estimated from a sparse set of PSF measurements, and 2) the decomposition coefficients are smooth in 3D. The interpolated coefficient volumes for the first five basis PSFs are shown in the third row in Fig. 3c.
4. The final 3D-LSV model is computed by weighted 2D depth-wise convolutions in the lateral dimensions $(u, v)$, followed by a summation along the axial dimension $z$:
$$g(u, v) = \sum_{z} \sum_{i=1}^{K} \left[ O(x, y, z)\, C_i(u, v, z) \right] \ast_{u,v} B_i(u, v, z). \qquad (2)$$
Here, $O(x, y, z)$ is the 3D fluorescence distribution of the object. Both the basis PSF $B_i(u, v, z)$ and coefficient volume $C_i(u, v, z)$ have been placed back to the original sensor pixel coordinates by accounting for the expected lateral shift at each depth $z$. The pixel coordinates $(u, v)$ in the image and the object space coordinates $(x, y)$ are related by the magnification $M$ by $u = Mx$, $v = My$.
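The decomposition and forward-model steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `low_rank_psf_basis` and `lsv_forward` are hypothetical, the PSFs are assumed to be already cropped and aligned, the coefficient volumes are assumed to be already interpolated onto the dense grid, and circular FFT convolution is used for brevity where a real simulator would pad.

```python
import numpy as np

def low_rank_psf_basis(psfs, rank):
    """Truncated SVD of aligned calibration PSFs.

    psfs: array (N, H, W) of cropped and aligned PSF images.
    Returns basis PSFs (rank, H, W) and per-PSF coefficients (N, rank).
    """
    n, h, w = psfs.shape
    u, s, vt = np.linalg.svd(psfs.reshape(n, h * w), full_matrices=False)
    coeffs = u[:, :rank] * s[:rank]          # fold singular values into the coefficients
    basis = vt[:rank].reshape(rank, h, w)
    return basis, coeffs

def lsv_forward(obj, coeff_vols, basis_psfs):
    """Weighted depth-wise 2D convolutions summed over basis index and depth.

    obj:        (Z, H, W) object volume resampled to sensor pixels (assumption).
    coeff_vols: (K, Z, H, W) interpolated coefficient volumes.
    basis_psfs: (K, Z, H, W) basis PSFs centered at pixel (H//2, W//2).
    """
    k, z, h, w = coeff_vols.shape
    meas = np.zeros((h, w))
    for zi in range(z):
        for ki in range(k):
            weighted = obj[zi] * coeff_vols[ki, zi]
            kernel_ft = np.fft.fft2(np.fft.ifftshift(basis_psfs[ki, zi]))
            meas += np.real(np.fft.ifft2(np.fft.fft2(weighted) * kernel_ft))
    return meas

# Toy check of the decomposition on synthetic rank-2 "PSFs".
rng = np.random.default_rng(0)
toy_psfs = np.einsum("nk,khw->nhw", rng.random((12, 2)), rng.random((2, 8, 8)))
basis, coeffs = low_rank_psf_basis(toy_psfs, rank=2)
approx = np.einsum("nk,khw->nhw", coeffs, basis)
rel_err = np.linalg.norm(approx - toy_psfs) / np.linalg.norm(toy_psfs)
print(rel_err)
```

The cost of the forward model scales with the number of basis PSFs times the number of depth planes, which is the accuracy versus computation tradeoff noted in step 2.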

CM$^2$Net design
To enable fast and accurate 3D reconstruction from a CM$^2$ measurement, we implement a modular deep learning model, CM$^2$Net, that incorporates the key features of the CM$^2$ physical model. Each CM$^2$ image contains 3 × 3 multiplexed views that capture projection information about the 3D object [18]. This multi-view geometry introduces two challenges to the network design. First, the image features needed for 3D reconstruction are non-local: they are separated by a few thousand pixels. Fully capturing this non-local information requires a sufficiently large receptive field, which is not easily achieved by a standard convolutional neural network. Second, the view-multiplexing requires the network not only to reconstruct 3D information, but also to remove crosstalk artifacts. To address these challenges, CM$^2$Net combines three modules to break the highly ill-posed inverse problem into three simpler tasks: view-demixing, view-synthesis, and lightfield-refocusing enhancement, as illustrated in Fig. 4.
The first module, view "demixing-net", demultiplexes a CM$^2$ image into nine "demixed" views, each corresponding to the image captured by a single microlens without crosstalk from the other microlenses. To perform this task, demixing-net synthesizes the information contained in the entire CM$^2$ measurement. To facilitate this process, we first construct a "view stack" by cropping and view-aligning nine patches from the raw measurement based on the chief ray of each microlens (Fig. 4). This input view stack contains multiplexed information, which demixing-net seeks to demultiplex. The ground-truth output is the demixed view stack containing nine crosstalk-free images, which is made possible on simulated training data using our 3D-LSV model. Our results show that this task can be accurately performed by utilizing the distinctive aberration features from different microlenses.
In Supplement 1, we further perform an ablation study on demixing-net and show that it significantly reduces the false positives in the reconstructions.
The demixed view stack is akin to a 3 × 3 view lightfield measurement, which is processed by two reconstruction branches. The first branch is "view-synthesis-net", which directly performs the 3D reconstruction based on disparity information in the views, as inspired by deep-learning enhanced LFM [22]. The second branch explicitly incorporates the geometrical optics model of lightfield. The demixed views are first processed by the lightfield refocusing algorithm [26] to generate a refocused volume, which is then fed into "enhancement-net" to remove artifacts and enhance the reconstructed resolution. The refocused volume already provides most of the 3D object information but suffers from three artifacts: severe axial elongation due to limited angular coverage, boundary artifacts from the "shift-and-add" operation, and missing object features at the peripheral FOV regions due to inexact view matching between the 3 × 3 MLA. To achieve the best performance, the outputs from the two branches are summed and further processed to yield the final 3D reconstruction. To highlight the effectiveness of this design, we conduct ablation studies and visualize the respective activation maps of the two branches in Supplement 1. Our results show that the lightfield refocusing enhancement-net achieves high-quality reconstruction at the central FOV region, and the view-synthesis-net improves the performance at the peripheral FOV regions. Together, the two reconstruction branches utilize complementary information to achieve high-resolution reconstruction across a wide FOV.
Fig. 4. CM$^2$Net structure. The raw CM$^2$ measurement is first preprocessed to form a multiplexed view stack. The demixing-net removes the crosstalk artifact and outputs the demixed view stack by learning view-dependent aberrations. The demixed view stack is processed by the "shift-and-add" lightfield refocusing algorithm to form a geometrically refocused volume. The enhancement-net branch removes the refocusing artifacts and enhances the reconstructed 3D resolution. The view-synthesis-net branch directly processes the demixed views to perform the 3D reconstruction. The sum of the output from the two branches is further processed to form the final reconstruction. CM$^2$Net is trained with a mixed loss function combining the demixing and reconstruction losses.
Overall, CM$^2$Net is trained entirely on simulated data from our 3D-LSV model. The loss function combines a demixing loss and a reconstruction loss, $\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{demix}} + \lambda_2 \mathcal{L}_{\mathrm{rec}}$, which promote the fidelity of the demixed views and the 3D reconstruction results, respectively. For both loss components, we use the binary cross entropy (BCE) since it promotes sparse reconstructions [27]. It is defined by $\mathrm{BCE}(y, \hat{y}) = -\sum_i \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$, where the summation is over all the voxels indexed by $i$, and $y$ and $\hat{y}$ denote the ground-truth and reconstructed intensity, respectively. The weights of the two loss functions $(\lambda_1, \lambda_2)$ are set to $(1, 1)$ after hyperparameter tuning, which indicated that the demixing and reconstruction losses have equal importance. CM$^2$Net is implemented in Python 3.7 with TensorFlow 2.3. The multiple sub-networks are trained together in an "end-to-end" fashion on an Nvidia P100 GPU (16 GB) with a batch size of 2. We use the Adam optimizer with an adaptive learning rate schedule. The initial learning rate is $10^{-4}$ and automatically decreases by a factor of 0.9 after the loss on a small validation set (∼400 patches) plateaus for 2 consecutive epochs. The training takes ∼48 hours to complete. Additional implementation details are provided in Supplement 1.
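The mixed loss and the plateau-based learning rate rule described above can be illustrated in plain Python. This is a hedged sketch, not the TensorFlow training code: the names `bce_sum`, `mixed_loss`, and `plateau_schedule` are hypothetical, the BCE is summed over voxels as the text describes (framework implementations typically average instead), and the exact plateau criterion is an assumption.

```python
import math

def bce_sum(y_true, y_pred, eps=1e-7):
    """Binary cross entropy summed over voxels (flat sequences of values in [0, 1])."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)      # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total

def mixed_loss(demix_true, demix_pred, rec_true, rec_pred, w_demix=1.0, w_rec=1.0):
    """Weighted sum of the demixing and reconstruction BCE terms (weights 1 and 1 in the text)."""
    return w_demix * bce_sum(demix_true, demix_pred) + w_rec * bce_sum(rec_true, rec_pred)

def plateau_schedule(val_losses, lr0=1e-4, factor=0.9, patience=2):
    """Learning rate per epoch: multiply by `factor` after `patience` epochs without improvement."""
    lr, best, wait, lrs = lr0, float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        lrs.append(lr)
    return lrs

example_loss = mixed_loss([1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.7])
example_lrs = plateau_schedule([1.0, 0.9, 0.9, 0.9, 0.8])
print(example_loss, example_lrs)
```

In a TensorFlow setting the same behavior would normally come from the built-in BCE loss and a reduce-on-plateau callback; the pure-Python version is used here only to make the arithmetic explicit.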

Synthetic training data generation
We generate a large-scale training dataset for CM$^2$Net based on the 3D-LSV model (Eq. (2)) from a set of synthetic volumes. The FOV of each synthetic volume follows a uniform random distribution over [6.5 mm, 7.5 mm]. The degree of view-multiplexing is determined by the FOV [18]. For FOV < 2.7 mm, no view-multiplexing is present. As the FOV increases, the overlap between neighboring views increases approximately linearly. At the largest FOV = 7.5 mm, ∼64% overlap is present. The depth range is fixed at 800 µm. The volumes are sampled at 4.15 µm laterally (matching the effective pixel size of CM$^2$ V2) and 10 µm axially (10× higher sampling than the physical scanning step size). We randomly place spherical emitters into the volumes by the following steps. Due to the large sampling grid size, we first generate each ground-truth emitter on a 5× finer grid (0.83 µm × 0.83 µm × 2 µm). Next, we perform 5 × 5 × 5 average binning so that the ground-truth volume has the same grid size as the final reconstruction. The emitter's diameter follows a uniform random distribution [8 µm, 20 µm], which approximately matches the typical size of neuronal cell bodies. The emitter's intensity is set by the surface area, i.e., proportional to the diameter squared, which matches our experimental measurements. The size range used in our data leads to a 6.25× intensity variation range. To further vary the emitter's intensity at a given size, a random scaling factor following a uniform distribution [0.8, 1.2] is applied, which approximately matches the contrast from one-photon fluorescence microscopes on calcium indicators [28]. The emitter density in each volume follows a uniform random distribution [10, 100] (number of emitters / mm$^2$), which simulates different fluorescence labeling densities used in cortex-wide neuronal imaging applications [3,4].
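The sampling rules in this paragraph can be sketched with the standard library. The function names `sample_emitter` and `sample_volume_params` are hypothetical, intensities are in arbitrary units, and placing emitters on the fine grid and binning them is omitted.

```python
import random

def sample_emitter(rng):
    """Draw one emitter: diameter in [8, 20] um, intensity proportional to diameter squared."""
    diameter_um = rng.uniform(8.0, 20.0)
    scale = rng.uniform(0.8, 1.2)            # extra random intensity scaling
    intensity = scale * diameter_um ** 2     # arbitrary units (assumption)
    return diameter_um, intensity

def sample_volume_params(rng):
    """Draw per-volume parameters following the ranges quoted in the text."""
    return {
        "fov_mm": rng.uniform(6.5, 7.5),
        "density_per_mm2": rng.uniform(10.0, 100.0),
        "depth_um": 800.0,
    }

# Intensity variation caused by size alone: (20 / 8)^2.
size_driven_range = (20.0 / 8.0) ** 2
# 5x finer grid used to rasterize each emitter before 5 x 5 x 5 binning.
fine_grid_um = (4.15 / 5, 4.15 / 5, 10.0 / 5)

rng = random.Random(0)
params = sample_volume_params(rng)
emitters = [sample_emitter(rng) for _ in range(1000)]
print(size_driven_range, fine_grid_um, params)
```

The size-driven factor evaluates to 6.25, which agrees with the intensity variation range stated above.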
We first generate noise-free measurements using the 3D-LSV model. We then add realistic levels of mixed Gaussian and Poisson noise. The parameters for the additive Gaussian noise (normalized mean = 0.048, standard deviation = 0.017) are estimated by multiple dark measurements taken with the same acquisition parameters as the real experiments (30 ms exposure time, 40 dB gain). The Poisson noise is added by estimating the expected photon budget (∼500 peak number of photons, and a unit effective gain) in typical widefield one-photon imaging [4]. To train the view demixing-net, we generate the ground-truth non-overlapping views using the same 3D-LSV model with the single-microlens PSF.
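A mixed Poisson and Gaussian noise model of this kind can be sketched in NumPy as below. It is an illustration under assumptions that the text does not spell out: the function name `add_mixed_noise` is hypothetical, the clean measurement is normalized so that its peak corresponds to the stated photon budget, and the Gaussian parameters are applied in the same normalized units.

```python
import numpy as np

def add_mixed_noise(clean, peak_photons=500.0, gauss_mean=0.048, gauss_std=0.017, rng=None):
    """Apply Poisson shot noise at a given photon budget, then additive Gaussian noise.

    clean: non-negative noise-free measurement (any shape).
    Returns a noisy image in units where the clean peak maps to 1 (assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.clip(np.asarray(clean, dtype=float), 0.0, None)
    peak = clean.max()
    norm = clean / peak if peak > 0 else clean
    # Unit effective gain: photon counts are divided back by the photon budget.
    shot = rng.poisson(norm * peak_photons) / peak_photons
    read = rng.normal(gauss_mean, gauss_std, size=clean.shape)
    return shot + read

rng = np.random.default_rng(0)
dark = add_mixed_noise(np.zeros((100, 100)), rng=rng)
bright = add_mixed_noise(np.ones((50, 50)), rng=rng)
print(dark.mean(), bright.mean())
```

With a zero input only the Gaussian term remains, which mirrors how the text estimates those parameters from dark measurements.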
After synthesizing the measurements, we crop the overlapped views (1920 × 1920 pixels) based on the chief ray location of each microlens at the in-focus image plane. Next, we stack the 9 cropped views to form a 1920 × 1920 × 9 multi-channel input to CM$^2$Net. Finally, CM$^2$Net is trained on 9700 uniformly cropped patches (320 × 320 pixels) from 270 synthetic objects.
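The patch count is consistent with tiling each 1920 × 1920 view stack into non-overlapping 320 × 320 patches. That tiling is an assumption (the text says only "uniformly cropped"), and the helper `tile_origins` below is hypothetical.

```python
# Sketch: non-overlapping tiling of a view stack into training patches (assumed scheme).
VIEW_PX = 1920
PATCH_PX = 320
N_OBJECTS = 270

def tile_origins(view_px, patch_px):
    """Top-left corners of non-overlapping square patches covering the view."""
    steps = range(0, view_px - patch_px + 1, patch_px)
    return [(row, col) for row in steps for col in steps]

origins = tile_origins(VIEW_PX, PATCH_PX)
patches_per_object = len(origins)
total_patches = patches_per_object * N_OBJECTS
print(patches_per_object, total_patches)
```

Under this assumption each object yields 36 patches, or 9720 in total, close to the ∼9700 quoted above.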

3D-LSV simulator enables accurate 3D reconstruction across a wide FOV
To demonstrate that our 3D-LSV model is essential for accurate 3D reconstruction across a wide FOV, we compare two CM$^2$Nets trained with two different forward models. The first network, termed LSV-CM$^2$Net, is trained with our 3D-LSV model. The second network, termed LSI-CM$^2$Net, is trained with our previous depth-wise LSI model [18], which assumes that the PSF is shift-invariant at each depth and equal to the on-axis PSF. We also benchmark the network reconstructions against the depth-wise LSI model-based deconvolution algorithm [18].
The 3D reconstructions on a cylindrical volume (∼7-mm diameter, 0.8-mm depth) from LSV-CM$^2$Net, LSI-CM$^2$Net, and model-based deconvolution are shown in Fig. 5a, 5b, and 5c, respectively. In each figure, we overlay the reconstruction (in red) onto the ground truth (in green) and visualize the XY and XZ maximum intensity projections (MIPs). When the reconstruction matches the ground truth (i.e., True Positives), the overlaid region appears in yellow. When the reconstruction misses certain particles (i.e., False Negatives), the region appears in green. When the reconstruction creates false particles (i.e., False Positives) or suffers from axial elongations, the region appears in red. By visual inspection, LSV-CM$^2$Net can accurately reconstruct the entire 7-mm FOV throughout the 0.8-mm depth range, as highlighted by the three zoom-in regions of the XZ MIPs taken from the central and two peripheral regions. In contrast, the LSI-CM$^2$Net suffers from severe artifacts, especially beyond the central 3-mm diameter region. The model-based reconstruction matches well with the ground truth across the entire volume, but suffers from severe axial elongations [18]. A major improvement of CM$^2$Net over the model-based deconvolution is the significantly reduced axial elongation, as also shown in our experiments in Section 3.3. In addition, CM$^2$Net dramatically reduces the reconstruction time and memory burden. To perform the large-scale reconstruction in Fig. 5 (∼230 million voxels), the model-based method requires ∼1.4 hours and ∼150 GB of RAM. In contrast, CM$^2$Net takes only ∼3.6 seconds on an entry-level GPU (Nvidia RTX 2070, 8 GB RAM), which is a ∼1400× speed-up and a ∼19× reduction in memory cost.
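The quoted speed-up, memory ratio, and voxel count follow from the numbers in this paragraph. The short check below assumes the voxel count refers to the square bounding box of the 7-mm FOV sampled at 4.15 µm laterally and 10 µm axially; the variable names are illustrative.

```python
# Sketch: arithmetic behind the reported speed-up, memory reduction, and volume size.
model_based_seconds = 1.4 * 3600      # ~1.4 hours
network_seconds = 3.6
speedup = model_based_seconds / network_seconds

memory_ratio = 150 / 8                # ~150 GB RAM vs. an 8 GB GPU

side_px = round(7000 / 4.15)          # 7 mm at 4.15 um per pixel (bounding box, an assumption)
depth_planes = round(800 / 10)        # 800 um at 10 um per plane
n_voxels = side_px * side_px * depth_planes

print(round(speedup), memory_ratio, n_voxels)
```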
To demonstrate the potential applications of CM$^2$Net to reconstruct complex brain structures, we perform simulation studies on imaging 3D neuronal populations and mouse brain vessels in Supplement 1. Our results on neuronal imaging show that CM$^2$Net can achieve high reconstruction performance on both sparsely (20 neurons / mm$^2$) labeled and densely (100 neurons / mm$^2$) labeled neuronal populations across a cortex-wide (7.5 × 6.6 mm) FOV and is robust to the complex brain geometry. Our result on a mouse blood vessel network highlights a few key properties of the particle-dataset trained CM$^2$Net. First, the view-demixing module can perform reliable demultiplexing on axially overlapping small vessels. The demixing result closely matches the ground truth, demonstrating that the demixing network is robust to overlapping views from continuous objects, even though it is trained entirely on sparse fluorescent beads. Second, the lightfield refocused volume on demixed views can correctly resolve complex 3D geometry, which lays the foundation for the final 3D reconstruction. Third, the reconstruction module can correctly reconstruct the vessel network, albeit with discontinuity artifacts. We attribute the artifacts to the sparsity constraint implicitly enforced by the particle-dataset trained CM$^2$Net on the 3D reconstruction. Overall, CM$^2$Net can provide high-quality reconstruction on brain vasculature across a wide (6 × 4 mm) FOV and can resolve the complex 3D geometry.

Quantitative analysis of CM$^2$Net performance
We quantitatively show in simulation that the trained CM$^2$Net can provide high-quality reconstruction and is robust to variations in the emitter's lateral location (FOV), seeding density, depth, size, and intensity. To perform the evaluation, we simulate a testing set of 180 volumes distributed uniformly across nine density ranges [10:10:100] (number of emitters/mm$^2$). The data synthesis follows the procedure in Section 2.4.
We quantify the detection capability of CM$^2$Net using recall, precision, F1 score, and Jaccard index. Recall measures the sensitivity / detection rate by the ratio between the correctly reconstructed and the actual total number of emitters. Precision measures the specificity by the ratio between the correctly reconstructed and the total reconstructed number of emitters. F1-score and Jaccard index combine these two complementary metrics. In addition, we quantify the 3D localization accuracy by lateral and axial root mean squared localization error (RMSE) [29]. A global threshold needed to binarize the reconstructed volume when computing the metrics is set by maximizing the F1-score on the testing set [23]. More details on the quantitative metrics are provided in Supplement 1. We compute the statistics of each metric at a given condition (e.g., a lateral location), when all other parameters (e.g., emitter's depth, density, size) are randomized.
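The detection metrics named above have standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN). The sketch below uses those standard formulas; how reconstructed emitters are matched to ground-truth emitters is described in Supplement 1 and is not reproduced here, and the function names and example counts are illustrative.

```python
import math

def detection_metrics(tp, fp, fn):
    """Precision, recall, F1 score, and Jaccard index from matched-emitter counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}

def rmse(errors_um):
    """Root mean squared localization error over matched emitters (errors in um)."""
    return math.sqrt(sum(e * e for e in errors_um) / len(errors_um))

# Made-up counts for illustration only: 70 matched, 5 spurious, 30 missed emitters.
example = detection_metrics(tp=70, fp=5, fn=30)
example_rmse = rmse([3.0, 4.0])
print(example, example_rmse)
```

Because precision and recall penalize false positives and false negatives separately, reporting both (plus F1 or Jaccard) shows which failure mode dominates in each condition.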
First, the performance at different lateral locations is evaluated in Fig. 6a. We aggregate the emitters into 7 bins [0 mm: 0.5 mm: 3.5 mm] (distance from the center) and compute the statistics. The averaged precision and recall (blue) remain >0.93 and >0.68 when the distance is <3 mm (i.e., FOV < 6 mm). Precision and recall reduce to ∼0.85 and ∼0.37, respectively, when the distance is ∼3.5 mm (FOV = 7 mm). Lateral / axial RMSE (orange) is less than 5 µm / 15 µm within the 6-mm FOV and degrades to 8.7 µm / 21 µm at the edge. The standard deviation (the error bar) increases with the distance, indicating that the reconstruction is more consistent at the central FOV. To better visualize the detection performance, we calculate recall and precision maps in Fig. 6a (details in Supplement 1). The precision map shows that CM$^2$Net provides nearly isotropic, high specificity within the 7-mm FOV. The recall map shows that CM$^2$Net provides a high detection rate in the central 6-mm FOV, which degrades at the outer regions.
To understand the origin of the degradation in the peripheral FOV, we perform ablation studies by feeding the CM$^2$Net's reconstruction module with the ground-truth demixed views (see Supplement 1). The result shows consistently high recall (>0.89) for the entire 7-mm FOV, showing the robustness of the reconstruction module. This implies that the degraded recall is due to imperfect view-demixing at outer FOV regions. To further diagnose the system, we compare the intensity distribution of the point source for PSF calibration and the recall map, and find qualitative correspondence. We hypothesize that the training of view-demixing net is impacted by the rapid intensity fall-off (∼85% drop at the 7-mm FOV edge) of the imperfect point source.
We evaluate the metrics for different emitter densities at [10: 10: 100] (emitters / mm$^2$) in Fig. 6b. As expected, both precision and recall decrease, whereas both lateral and axial RMSEs increase with the density. Precision remains >0.92 for all emitter densities, indicating very few false positives in the reconstruction despite the large (10× span) density variations. Recall decreases approximately linearly from ∼0.83 at 10 emitters/mm$^2$ to ∼0.61 at 100 emitters/mm$^2$. Lateral / axial RMSE increases approximately linearly from 2 µm / 9.3 µm at the lowest density to 6.4 µm / 14 µm at the highest density. This means that as the density increases, CM$^2$Net suffers from more false negatives and lower localization accuracy.
We m. To explain the decrease of recall at these large defocus depths, we visualize the on-axis PSF and show that it degrades more severely as the source moves closer to the MLA. This results in lower SNRs in the measurement, which leads to more false negatives. Lateral / axial RMSE is < 5 µm / 14 µm for all depths, and degrades only slightly at large defocus. The slight drops in the first and last bins are attributed to the smaller sample size (from the combination of the smaller bin size and fewer true positives), which introduces errors in the statistics. We observe that the minimum RMSE is centered around the 100-µm bin, which suggests that there may be a slight defocus between our nominal and the actual focal plane. Overall, the RMSE analysis shows that 3D localization is generally robust to defocus within the 800-µm depth range.
Finally, we quantify the metrics for different emitter diameters at [7 µm: 2 µm: 21 µm]. For diameters ranging from 11 to 20 µm, precision is >0.9 and recall is >0.71. As the diameter decreases, both precision and recall drop approximately linearly to 0.55 and 0.48 for 8-µm emitters, respectively. Lateral / axial RMSE decreases from 5.9 µm / 12.7 µm for 8-µm emitters to 2.8 µm / 11.5 µm for 20-µm emitters. We attribute the worse performance for smaller emitters to two factors. First, since the emitter's intensity is proportional to the size squared, the SNR rapidly decreases as the size reduces. Second, due to the coarse sampling in the reconstruction, the number of voxels for each emitter is <5 when the diameter is <11 µm (see the top panel of Fig. 6d).
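Both factors can be checked numerically. The sketch below estimates the voxel count per emitter as the sphere volume divided by the reconstruction voxel volume (4.15 µm × 4.15 µm × 10 µm), which is an assumed approximation of how the figure counts voxels, and computes the relative signal from the diameter-squared intensity rule; the function names are hypothetical.

```python
import math

VOXEL_UM3 = 4.15 * 4.15 * 10.0   # reconstruction voxel volume in cubic microns

def voxels_per_emitter(diameter_um):
    """Approximate voxel count: sphere volume divided by voxel volume."""
    return (math.pi / 6.0) * diameter_um ** 3 / VOXEL_UM3

def relative_intensity(diameter_um, reference_um=20.0):
    """Signal relative to a reference emitter under the diameter-squared rule."""
    return (diameter_um / reference_um) ** 2

for d in (8.0, 11.0, 20.0):
    print(d, round(voxels_per_emitter(d), 2), round(relative_intensity(d), 3))
```

Under this approximation an 11-µm emitter occupies about four voxels and an 8-µm emitter fewer than two, while the 8-µm emitter carries only 16% of the signal of a 20-µm emitter.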
The averaged recall and precision for the entire testing set are ∼0.7 and ∼0.94, respectively, comparable to the state-of-the-art deep learning neuron detection algorithm [23]. The averaged lateral and axial RMSEs are 4.17 µm and 11.2 µm, respectively, close to the reconstruction grid size (lateral 4.15 µm and axial 10 µm), which indicates that the localization accuracy is close to one voxel. This study establishes that CM$^2$Net can detect emitters with few "hallucinated" sources (∼4% false-positive rate on average) and high detection rates (∼30% false-negative rate on average) with good localization accuracy in a broad range of conditions.
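The detection and localization metrics used in this section can be sketched as follows. This is only a minimal illustration under stated assumptions: the greedy nearest-neighbor matching, the `max_dist` matching radius, and the function names `match_emitters` and `detection_metrics` are hypothetical and are not taken from the paper, which does not specify how reconstructed emitters are paired with the ground truth.

```python
import math

def match_emitters(pred, truth, max_dist):
    """Greedily pair predicted and true centroids (x, y, z) closer than max_dist.

    Assumption: closest pairs are matched first and each emitter is used once.
    """
    candidates = []
    for i, p in enumerate(pred):
        for j, t in enumerate(truth):
            d = math.dist(p, t)
            if d <= max_dist:
                candidates.append((d, i, j))
    candidates.sort()
    used_pred, used_truth, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_pred and j not in used_truth:
            used_pred.add(i)
            used_truth.add(j)
            pairs.append((i, j))
    return pairs

def detection_metrics(pred, truth, max_dist):
    """Return precision, recall, F1, and lateral/axial RMSE over matched pairs."""
    pairs = match_emitters(pred, truth, max_dist)
    tp = len(pairs)
    fp = len(pred) - tp    # reconstructed emitters with no true counterpart
    fn = len(truth) - tp   # true emitters that were missed
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    if tp:
        lat_sq = sum((pred[i][0] - truth[j][0]) ** 2 + (pred[i][1] - truth[j][1]) ** 2
                     for i, j in pairs)
        ax_sq = sum((pred[i][2] - truth[j][2]) ** 2 for i, j in pairs)
        lateral_rmse, axial_rmse = math.sqrt(lat_sq / tp), math.sqrt(ax_sq / tp)
    else:
        lateral_rmse = axial_rmse = float("nan")
    return {"precision": precision, "recall": recall, "f1": f1,
            "lateral_rmse": lateral_rmse, "axial_rmse": axial_rmse}

# Toy example (coordinates in µm): two matches, one false positive, one miss.
truth = [(0, 0, 0), (100, 0, 0), (0, 100, 50)]
pred = [(3, 4, 10), (100, 0, 0), (500, 500, 0)]
metrics = detection_metrics(pred, truth, max_dist=20)
```

In this toy case precision and recall are both 2/3; the false-positive and false-negative rates quoted in the text correspond to 1 − precision and 1 − recall, respectively.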

CM 2 Net achieves high 3D resolution, wide-FOV reconstruction in experiments
We demonstrate that the generalization capability of simulator-trained CM 2 Net enables high 3D resolution reconstruction in experiments with high detection performance.
We first image a cylindrical volume embedded with 10-µm green-fluorescent beads (details about the sample preparation and experimental setup are provided in Supplement 1). The phantom is estimated to have 10-20 emitters/mm$^2$. First, to remove the non-uniform background and match the intensity statistics with the simulation data, we preprocess the raw experimental measurements with histogram matching. Next, we manually crop 9 views as the input to CM$^2$Net for 3D reconstruction (more details in Supplement 1).
The CM$^2$Net reconstruction is shown in Fig. 7a and is validated against widefield measurements from a standard table-top epi-fluorescence microscope in Fig. 7b. First, we validate the full-FOV reconstruction by comparing the XY MIPs of the reconstruction and the widefield z-stack from a 2×, 0.1 NA objective. To further assess the reconstruction at greater resolution, two zoom-in XY MIPs of the reconstruction from the center and edge of the FOV are compared with the high-resolution z-stack from a 20×, 0.4 NA objective. By visual inspection, the reconstruction matches the widefield measurements well. The reconstruction quality is maintained at peripheral FOV regions, a marked improvement over our previous model-based reconstruction [18].
A major goal we aim to achieve using CM$^2$ is 3D high-resolution imaging across a wide FOV. To highlight this capability, we compare the FOV achieved by CM$^2$ V2 (∼7 mm) with the 2× objective (∼8 mm) and 20× objective (∼800 µm) (marked in Figs. 7a and 7b). Representative axial profiles of the 10-µm beads reconstructed by CM$^2$Net, model-based deconvolution, and 2× and 20× widefield measurements are compared in Fig. 7d. CM$^2$Net achieves an axial elongation of ∼24 µm, which is ∼8× better than the model-based deconvolution (∼184 µm) and outperforms the 20×, 0.4 NA measurement (∼39.7 µm).
To quantify the detection performance, we compute recall, precision, and F1-score by comparing the XY MIPs of the CM$^2$Net reconstruction and the widefield 2× measurement. CM$^2$Net achieves recall ∼0.78 and precision ∼0.80. In comparison, the recall and precision in simulation at the corresponding density are ∼0.83 and ∼0.97, respectively. The simulator-trained CM$^2$Net degrades slightly in experiments, with a ∼5% higher false-negative rate and a ∼17% higher false-positive rate. We attribute the reduced performance to the undesired extra views in the experimental measurements.
To quantify spatial variations in performance, we construct the recall and precision maps in Supplement 1. Recall in most regions is >0.75, indicating <25% false-negative rates. Precision is >0.8 except for a few patches with <2 beads, indicating only a few false positives in the reconstruction. In Fig. 7c, we show that the F1-score map generally achieves a high value of >0.75. An overlay between the full-FOV reconstruction and the widefield 2× measurement is shown in Supplement 1.
This experiment shows that CM$^2$Net provides high 3D resolution reconstruction across a wide FOV with high sensitivity and precision. The 24-µm axial elongation achieved by CM$^2$Net is ∼4× better than the 100-µm axial spacing in the PSF calibration. This shows that the axial interpolation in our 3D-LSV model is effective to achieve axial "super resolution" in real experiments. Both the recall and precision agree with the simulation, validating that the 3D-LSV simulator-trained CM$^2$Net can generalize well to experimental measurements.
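As a worked check of the detection numbers quoted for this experiment, and assuming (as the text's usage suggests) that the false-negative rate is $1-\mathrm{recall}$ and the false-positive rate is $1-\mathrm{precision}$, the experiment-versus-simulation gaps and an indicative F1-score from the averaged values are:

$$\Delta_{\mathrm{FN}} = (1-0.78)-(1-0.83) = 0.05, \qquad \Delta_{\mathrm{FP}} = (1-0.80)-(1-0.97) = 0.17,$$

$$F_1 = \frac{2\,P\,R}{P+R} = \frac{2(0.80)(0.78)}{0.80+0.78} \approx 0.79.$$

The F1 value here is computed from the FOV-averaged recall and precision only; the map in Fig. 7c is evaluated patch by patch and need not equal this single number.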

Experimental demonstration on mixed size fluorescent beads
Fluorescent emitters with different sizes and brightness result in different local contrast and SNRs in the CM 2 measurement [18]. This is an important consideration as we develop CM 2 towards realistic biological applications. To demonstrate this capability, we conduct proof-of-concept experiments on mixed size fluorescent beads. Our result shows that CM 2 Net can robustly handle such sample variations in real experiments.
We image a cylindrical volume (diameter ∼6.5 mm, depth ∼0.8 mm) embedded with mixed 10-µm and 15-µm beads (details in Supplement 1). The phantom is estimated to have 10-20 emitters/mm$^2$. In the CM$^2$ measurements, the 15-µm beads are ∼2.2× brighter than the 10-µm beads, matching their surface area ratio and our synthetic data model in Sec. 2.4. The CM$^2$Net reconstruction is shown in Fig. 8a. The 3D reconstruction is validated against widefield measurements in Fig. 8b. First, we assess the full-FOV reconstruction by comparing the XY MIPs of the CM$^2$Net reconstruction and the 2× z-stack measurement. We further compare two zoom-in regions from the center and corner FOVs with the high-resolution 20× z-stack measurement. By visual inspection, CM$^2$Net reliably reconstructs both 10-µm and 15-µm beads. The XZ MIPs of the CM$^2$Net 3D reconstruction are in good agreement with the 20× z-stack measurements. The axial confinement of both 10-µm and 15-µm beads by CM$^2$Net is better than that in the 20× measurements.
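As a quick check of the stated brightness ratio, assuming that bead brightness scales with the diameter squared (the surface-area scaling named in the text):

$$\left(\frac{15\ \mu\mathrm{m}}{10\ \mu\mathrm{m}}\right)^{2} = 2.25 \approx 2.2.$$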
To quantitatively assess the CM$^2$Net reconstruction, we compute recall, precision, and F1-score maps in Supplement 1 by comparing the XY MIPs from the CM$^2$Net reconstruction and the widefield 2× measurement. CM$^2$Net achieves averaged recall ∼0.73 and precision ∼0.84 across the 6.5-mm FOV. As compared to the mono-10-µm bead experiment, we attribute the slightly decreased recall to the greater intensity and SNR variations. We attribute the increased precision to the reduced FOV and less contamination from the extra views (see analysis in Supplement 1). An overlay between the full-FOV reconstruction and the 2× measurement is shown in Supplement 1 to provide further visual inspection. The results show that CM$^2$Net is robust to the emitter size and intensity variations in the experiment.
This experiment again highlights the wide-FOV and high-resolution 3D imaging capability of CM 2 V2. Our training data containing randomized emitter sizes and intensities are effective to make CM 2 Net robust to experimental variations. As a result, CM 2 Net can provide high-quality 3D reconstruction with good sensitivity and precision on mixed size emitters that have large differences in the feature size and local SNR.

Conclusion
In summary, we have presented a new Computational Miniature Mesoscope (CM$^2$) system, which is a deep learning-augmented miniaturized microscope for single-shot 3D high-resolution fluorescence imaging. The system reconstructs emitters across a ∼7-mm FOV and 800-µm depth with high sensitivity and precision, and achieves ∼6-µm lateral and ∼25-µm axial resolution.
The main hardware advancement in CM 2 V2 includes a novel 3D-printed freeform illuminator that increases the excitation efficiency by ∼3×. Each 3D-printed LED collimator can provide up-to 80% light efficiency yet weighs only 0.03 grams. It is low-cost and rapidly fabricated on a table-top 3D printer. In addition, we adapted a hybrid emission filter design that suppresses the excitation leakage and improves the measurement SBR by more than 5×.
The computational advancement includes three main parts. First, we developed an accurate and computationally efficient 3D-LSV forward model that characterizes the spatially varying PSFs across the large (cm$^2$ × mm-scale) imaging volume supported by the CM$^2$. Second, we developed a multi-module CM$^2$Net that achieves robust, high-resolution 3D reconstruction from a single-shot CM$^2$ measurement. Third, using the 3D-LSV simulator to generate the entire training dataset, CM$^2$Net provides high detection sensitivity and precision and good localization accuracy on fluorescent emitters across a wide FOV, and generalizes well to experiments. In addition, our numerical studies show that CM$^2$Net can achieve high reconstruction performance on both neuronal populations and vascular structures across a cortex-wide FOV and is robust to the complex mouse brain geometry.
Our demonstration on the utility of freeform optics fabricated by 3D printing may be a fruitful area for future research, especially for miniature microscopes and other miniature optical devices. In recent years, freeform optics has emerged as the ideal solution to bypass many limitations in conventional optics, such as compactness and imaging performance [30]. At the same time, non-conventional optics has been enabled by novel 3D printing process, such as micro- [14,31], diffractive [32], and volume optics [32]. We envision that novel 3D-printed freeform optics can be incorporated in future CM 2 platforms to enhance its imaging capability.
The CM$^2$ V2 platform is built on a back-side illuminated (BSI) CMOS sensor, which significantly improves the measurement's SNR and dynamic range over the conventional CMOS sensor in the V1 platform. The size and weight of the CM$^2$ V2 prototype are limited by the availability of miniature BSI CMOS sensors. However, we do not anticipate this to be a major roadblock for future development, as encouraged by the recent development of the BSI CMOS-based miniscope MiniFAST [33]. With further advancement in high-speed data transmission and high pixel-count BSI CMOS sensor platforms, we expect that CM$^2$ can be further miniaturized to be suitable for "wearable" in-vivo neural recordings on mice and other small animals.
Our 3D-LSV model is essential to achieve high 3D resolution reconstruction across a large imaging volume. A notable result we have demonstrated is that the axial resolution is not limited by the axial step used for the 3D PSF calibration. This allowed us to bypass the large data requirement in the alternative depth-wise LSV framework [14,16,21] and to perform data-efficient PSF calibrations across a centimeter-scale FOV and millimeter-scale depth range. We expect the same sparse 3D PSF calibration, low-rank decomposition, and 3D interpolation procedure are applicable to other computational 3D microscopy techniques, such as LFM and lensless imaging. In addition, it may be possible to develop hybrid 3D PSF calibration procedures by combining physical measurements and numerical modeling to further improve the model accuracy, as recently shown in high-resolution LFM imaging [34,35].
Our CM$^2$Net incorporates both the view-multiplexing and lightfield information in the CM$^2$ image formation. We have shown that the view-demixing module significantly suppresses the false positives in the 3D reconstruction. The simulator-training scheme was essential to enable the training of the view-demixing sub-network. This highlights several key advantages of simulator-based training over experiment-based training schemes. It not only forgoes the laborious physical data collection process, but also enables access to novel data pairs that are impractical to collect experimentally. The reconstruction module combining the lightfield-refocusing enhancement and view-synthesis branches is shown to learn complementary information from the demixed views and enable highly accurate 3D reconstructions, and is readily applicable to other LFM modalities.
Our numerical study on reconstructing brain vasculature indicates that the emitter-dataset-trained CM$^2$Net implicitly enforces a sparsity constraint on the 3D reconstruction, which produces discontinuity artifacts. We trained CM$^2$Net on individual emitters since our targeted application is to image neurons labeled with genetically encoded calcium indicators in mouse brains [36]. To better adapt CM$^2$Net to other complex structures such as blood vessels, one can perform transfer learning on a dataset tuned to the specific application. Conveniently, our 3D-LSV simulator is directly applicable to generating non-emitter data, as shown in our study.
An outstanding challenge to expand the utility of the CM 2 is tissue scattering [18]. There are several promising solutions we envision that are applicable to CM 2 , such as miniature structured illumination technique [37] and scattering-incorporated 3D reconstruction frameworks [38,39], which will be investigated in our future work.

Deep learning-augmented Computational Miniature Mesoscope: supplementary material
This document provides supplementary information for the manuscript "Deep learning-augmented Computational Miniature Mesoscope". First, we provide more details on the CM$^2$ V2 hardware design, prototype assembly, and lateral resolution characterization. Next, we further explain the detailed computational pipeline used in CM$^2$ V2, including the sparse PSF calibration process, the 3D-LSV forward model, and the implementation details of CM$^2$Net. In addition, we elucidate the distinct role of each network module in CM$^2$Net by performing a series of ablation studies, as well as by visualizing the feature maps from the sub-networks. We then provide additional quantitative analysis for the experimental results. Finally, we provide simulation studies on applying the trained CM$^2$Net to imaging complex neuronal structures.

A. Additional Details on CM 2 V2 Hardware Design
The CM 2 V2 platform consists of two main parts, including the imaging module and the illumination module. The overall design is visualized in Figs 1a and 2a in the main text. Built from 3D-printed components and off-the-shelf optics and electronics, the CM 2 V2 platform has an overall dimension of 36 mm × 36 mm × 15 mm including the backside-illuminated (BSI) CMOS PCB board (IDS Imaging, monochrome BSI CMOS IMX178LLJ, 2076 × 3088 pixels, 2.4-µm pixel size, 12-bit dynamic range, 58 fps). The entire assembled CM 2 V2 prototype weighs 11.5 grams, of which the CMOS PCB takes most of the size (36 mm × 36 mm × 4 mm) and the weight (9 grams) whereas the custom parts weigh only 2.5 grams in-total.
In the imaging module, we use an off-the-shelf rectangular MLA with a 100% fill factor (Fresnel Technologies Inc., no. 630, focal length = 3.3 mm, lens pitch = 1 mm, thickness = 3.3 mm). The MLA is directly placed on top of the CMOS sensor to achieve single-shot 3D imaging with a compact form factor. The MLA is diced into a 3 × 3 array using a high-precision automatic dicing saw (Disco Dad, no.3220) to keep the sides of the MLA clear in order to reduce vignetting and edge effect. Fig. S1a shows the top and side views of the 3D printed MLA housing and how it is mounted onto the CMOS PCB board (IDS Imaging, board-level IMX178). The red region indicates the sensor region of the CMOS. The green regions are the PCB that host extra electronics.
The hybrid emission filter set consists of two filters: a glass interference filter (Chroma Technology, no. 535/50, 1 mm thickness) and a thin-film absorption filter (Edmund Optics, Wratten color filter no. 12, 0.1 mm thickness). We place the interference filter in front of the MLA and the thin-film absorption filter behind the MLA. This design reduces the maximum angle of incidence onto the interference filter to achieve better filtering whereas the remaining leakage is suppressed by the absorption filter. All the optical elements and the CMOS sensor are held by a 3D-printed housing (designed in TinkerCAD, printed on Formlabs Form 2, black resin, no. RS-F2-GPBK-04). The assembled imaging module is mounted to the CMOS PCB by four mini set screws and hex nuts (Thorlabs, no. HW-KIT3). The assembled CM 2 V2 prototype has a calibrated de-magnification of 1.7× which results in an effective pixel size of 4.15 µm. The CM 2 V2 prototype does not require precise alignment of the optics. The field varying PSFs are experimentally calibrated and later computationally analyzed by our 3D-LSV model.
The illumination module consists of four freeform LED-illumination units held together by a 3D-printed dome-shaped base plate, which also blocks the ambient light. Each freeform LED-illumination unit contains the following parts: a surface mounted LED (Lumileds, LXML-PB01-0040), a 3D-printed freeform collimator (printed on Formlabs Form 2, clear resin, no. RS-F2-GPCL-04), a glass excitation filter (Chroma Technology, no. 470/40, 4 mm), and a 3D-printed LED housing (printed on Formlabs Form 2, black resin, no. RS-F2-GPBK-04). To achieve better surface quality, we post-process the 3D-printed collimator by the following procedure. A thin layer of clear resin (diluted 5 times with isopropyl alcohol) is applied to the outer parabolic surface of the collimator and cured under a UV lamp. The four illumination units are wired sequentially and connected to an LED driver (LED-dynamics Inc., 3021-D-E-350, 350 mA). In Fig. S1b, we show the exact mechanical design and the ray tracing diagram of the freeform illuminator. The ray tracing in Zemax incorporates the LED source spectrum and the incidence-dependent transmittance profiles of the interference-excitation filter from the manufacturer (Chroma Technology). The ray tracing result shows that the freeform illuminator efficiently refracts and reflects light from the LED source and allows it to pass through the interference coating. The small divergence of the outgoing light arises because the LED source has a finite emitting area of 1 mm$^2$.
The four LED units are placed symmetrically around the imaging path with a lateral offset of 6.7 mm from the optical axis and tilted by 45 degrees. The positions and orientations of the four illumination units are modeled and optimized in Zemax to achieve the maximum light delivery efficiency and overall excitation uniformity at the imaging plane. The assembled illuminator units (including the 3D-printed housing, LED source, freeform collimator, and excitation filter) are shown in Fig. S1c (top). The four units are installed into the LED base plate. The two photos in Fig. S1c (bottom) show the wired illumination array and the fully assembled CM$^2$ V2 prototype, respectively. For better illustration, the CM$^2$ V2 assembly schematic is provided in Fig. S1d.
After the CM 2 V2 is fully assembled, the excitation is validated on a green fluorescence calibration slide (Thorlabs, no. FSK2). The total excitation power across the designed 8-mm region is up-to 75 mW (at maximum driving current of 350 mA), measured by a power meter (Thorlabs no.PM121D). The illuminator is turned on at 300 mA continuously for 1 hour and no overheating issues were observed.

B. Spectral Properties and the Design of Hybrid Emission Filter Set
The non-normal incidence of uncollimated light in CM$^2$ V1 results in a substantial wavelength shift of the transmitted spectrum of the emission filter. In Fig. S2a, we plot the spectral profiles of the LED source (blue solid region), green fluorophore emission (green solid region), excitation filter (red curve), emission filter at 0, 20, 40, 60-degree incidence (yellow, purple, green, and blue curves, respectively), and the absorption filter (black curve). The transmittance window of the interference-based emission filter shifts to shorter wavelengths with a degraded profile (the oscillating curves) at non-normal incidence. It then has a wide spectral overlap with the excitation spectrum, which results in a background in CM$^2$ V1 measurements. We therefore add an incidence-independent absorption filter, whose long-pass spectral profile is plotted in Fig. S2a (black curve). Experimentally, we find that adding the absorption filter to the imaging path provides better suppression of the background at the cost of slightly reduced transmittance at the emission wavelength. Fig. 2e in the main article shows that the hybrid emission filter provides much enhanced image contrast and reduces the background haziness by ∼5 times.
We experimentally characterize the lateral resolution of CM$^2$ V2 by imaging a fluorescent resolution target. Fig. S3 shows the captured raw measurement of the fluorescent resolution target. The zoom-in region shows that the features at both Group 6 Element 3 (6.2 µm) and Group 6 Element 4 (5.52 µm) can be resolved. For better illustration, we further plot the horizontal and vertical line profiles of the selected elements on the resolution target. Based on this measurement, we conclude that the lateral resolution of CM$^2$ V2 is ∼6 µm.
To acquire the experimental data, we mount the CM$^2$ V2 onto the setup shown in Fig. S4. The position of the sample is adjusted by the XY translation stage, and the Z translation stage controls the axial focus of the CM$^2$ V2. To make sure the measurement is within the range of the calibrated PSFs, a stack of images is collected at different working distances, and the image with the best reconstruction quality is chosen. The camera parameters are slightly sample-dependent and are generally set to a 30-ms exposure time, 0 gain, 12-bit dynamic range, and 30 FPS to satisfy our targeted imaging requirements.

CM$^2$ V2 COMPUTATIONAL PIPELINE

A. Sparse PSF calibration process
The spatially varying PSFs of the assembled CM$^2$ V2 are experimentally calibrated. The calibration process collects the 3D-LSV PSFs at a set of sparse 3D locations within the targeted imaging volume. To perform the calibration, we first build a point source setup consisting of two parts: a point source and a 3-axis high-precision automatic scanning stage. The point source is built by first stacking multiple layers of thin diffusing films (Parafilm) on top of a bright green LED source (Thorlabs, M530L4), whose spectrum approximately matches the green fluorescence. A 5-µm mounted pinhole (Thorlabs, P5D, stainless steel) is placed immediately after the diffusing films. The multi-layer diffusing film homogenizes the LED illumination before it enters the pinhole. The resulting source provides a large divergence angle (>60 degrees), i.e. a large illumination NA, and is approximated as a point source. The angular intensity distribution from this point source is characterized in Fig. S12b. The point source is mounted on a 3-axis scanning stage (Thorlabs, MT3Z8) that is automatically controlled by a custom MATLAB script.
To calibrate the 3D-LSV PSFs, we empirically choose a sparse sampling grid and scan the point source across the grid. The 3D grid spans [-4 mm, 4 mm] along both lateral dimensions and [-500 µm, 500 µm] along the axial dimension. The scanning step size is 1 mm laterally and 100 µm axially. In total, we collect PSFs on a 9 × 9 × 11 grid, i.e. 891 measurements.
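The sparse calibration grid described above can be enumerated in a few lines. This is only an illustrative sketch in Python/NumPy: the actual scan is driven by the authors' custom MATLAB script, and the variable names and the scan order (x slowest, z fastest) used here are assumptions.

```python
import numpy as np

# Lateral positions: -4 mm to 4 mm in 1-mm steps (9 points per axis).
x_mm = np.linspace(-4.0, 4.0, 9)
y_mm = np.linspace(-4.0, 4.0, 9)
# Axial positions: -500 µm to 500 µm in 100-µm steps (11 points), converted to mm.
z_mm = np.linspace(-500.0, 500.0, 11) / 1000.0

# All (x, y, z) stage positions, one row per PSF measurement.
grid_mm = np.stack(np.meshgrid(x_mm, y_mm, z_mm, indexing="ij"), axis=-1).reshape(-1, 3)

n_measurements = grid_mm.shape[0]  # 9 * 9 * 11 = 891
```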
The sampling grid is essential for the 3D-LSV model to accurately characterize the system. The grid should be dense enough to capture the key variations present in the 3D PSF across the 3D volume while being as sparse as possible to allow efficient physical calibration, data storage, and computation. To demonstrate the effect of the sampling grid, we compare the reconstruction accuracy of the 3D-LSV model using the current PSF sampling grid and that from a down-sampled grid in Fig. S5. By comparing the reconstructed PSF at an "unseen" 3D location, the result shows that the reconstruction accuracy degrades when the sampling grid becomes sparser. Specifically, the reconstruction with laterally (xy) down-sampled PSFs suffers a more severe loss of accuracy than that with axially (z) down-sampled PSFs. This is because the PSF variation is more prominent in the xy plane due to the severe off-axis aberrations and PSF truncation at the peripheral FOV. We can further increase the 3D-LSV model accuracy at the cost of longer PSF calibration time, longer computational time, and more memory storage.
After determining the PSF sampling grid, the MATLAB script controls the stage scanning and provides the triggers for the CM$^2$ V2 image acquisition. To further account for the non-uniform angular profile of the point source, the MATLAB script adaptively adjusts the exposure times at different scanning locations. This ensures that none of the measured PSFs is saturated or too dark. The exposure time is recorded to later normalize the PSF measurements before the 3D-LSV modeling. The whole calibration process takes about 2 hours to complete.
We then perform TSVD on the registered array PSFs to obtain the basis array PSFs and their coefficients at the sparse grid of calibrated locations. The coefficients are then interpolated to the entire imaging volume with 3D bilinear interpolation. Next, the basis PSF patches are placed back at their original locations. Lastly, the CM$^2$ measurement is computed by K weighted 2D convolutions between the object volume and the basis PSFs, followed by a summation along the axial dimension z.

We provide additional details of the 3D-LSV model and experimentally validate the accuracy of the model. Fig. S6 shows a diagram of the 3D-LSV decomposition and interpolation process. The experimentally calibrated PSFs (2076 × 3088 pixels) are first normalized by the exposure time recorded in the calibration process. Next, small PSF patches (160 × 160 pixels) are extracted from the PSF measurements based on the chief ray locations. Due to the finite conjugate imaging geometry of the CM$^2$ system, the PSF at a different depth exhibits a different amount of lateral shift that increases approximately linearly with the depth (i.e. axial shearing). To enable efficient PSF decomposition, we estimate the amount of axial shearing based on the on-axis PSF stack and then align the PSF patches with the in-focus PSF by numerically undoing the lateral shift. The 9 aligned PSF patches are regrouped into a 3 × 3 foci array (480 × 480 pixels). Next, the 891 calibrated array PSFs are decomposed using the singular value decomposition (SVD) and truncated to the leading 64 terms. There are two products from this TSVD process: the 64 basis PSFs H (480 × 480 pixels) and the 64 coefficient volumes M (9 × 9 × 11 voxels). To match our reconstruction sampling, the coefficient volumes are further interpolated onto a dense 1920 × 1920 × 80 grid using the 3D bilinear interpolation method. To construct the full-sized (2076 × 3088 pixels) CM$^2$ measurement, the basis PSF patches are then put back at their original pixel locations. The locations are determined by the chief ray locations and the estimated axial shearing. Lastly, the measurement is computed by K weighted depth-wise 2D convolutions between the object volume and the basis PSFs, followed by a summation along the axial dimension z. In Fig. 3c of the main article, we visualize the first 5 basis PSFs and their coefficient volumes. Fig. S7 shows the next 10 basis PSFs and their interpolated coefficient volumes.
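The decomposition and forward computation described above can be sketched on toy-sized data as follows. This is a simplified illustration in Python/NumPy and not the authors' implementation: the function names are hypothetical, the coefficient volumes are assumed to be already interpolated onto the object grid, and the depth-dependent re-placement of the PSF patches (axial shearing) and the embedding into the full sensor frame are omitted.

```python
import numpy as np

def tsvd_psfs(psf_stack, n_basis):
    """Truncated SVD of calibrated PSFs.

    psf_stack: (N, h, w) aligned PSFs, one per calibrated 3D location.
    Returns basis PSFs (K, h, w) and per-location coefficients (N, K).
    """
    n, h, w = psf_stack.shape
    u, s, vt = np.linalg.svd(psf_stack.reshape(n, h * w), full_matrices=False)
    basis = vt[:n_basis].reshape(n_basis, h, w)
    coeffs = u[:, :n_basis] * s[:n_basis]
    return basis, coeffs

def conv2d_same(image, kernel):
    """Linear 2D convolution via FFT, cropped to the size of `image`."""
    (ny, nx), (ky, kx) = image.shape, kernel.shape
    shape = (ny + ky - 1, nx + kx - 1)
    full = np.fft.irfft2(np.fft.rfft2(image, s=shape) * np.fft.rfft2(kernel, s=shape), s=shape)
    y0, x0 = (ky - 1) // 2, (kx - 1) // 2
    return full[y0:y0 + ny, x0:x0 + nx]

def lsv_forward(obj, coeff_vols, basis):
    """Low-rank shift-variant forward model.

    obj: (Z, Y, X) object volume; coeff_vols: (K, Z, Y, X) interpolated
    coefficient volumes; basis: (K, h, w) basis PSFs.
    Each depth slice is weighted by a coefficient map, convolved with the
    matching basis PSF, and the results are summed over k and z.
    """
    meas = np.zeros(obj.shape[1:])
    for k in range(basis.shape[0]):
        for z in range(obj.shape[0]):
            meas += conv2d_same(coeff_vols[k, z] * obj[z], basis[k])
    return meas

# Toy check 1: keeping all singular vectors reproduces the calibrated PSFs.
rng = np.random.default_rng(0)
psfs = rng.random((18, 8, 8))
basis_full, coeffs_full = tsvd_psfs(psfs, n_basis=18)
recon = (coeffs_full @ basis_full.reshape(18, -1)).reshape(psfs.shape)

# Toy check 2: one delta-function basis PSF with unit weights sums the object over depth.
obj = rng.random((2, 16, 16))
delta = np.zeros((1, 5, 5))
delta[0, 2, 2] = 1.0
meas = lsv_forward(obj, np.ones((1, 2, 16, 16)), delta)
```

In the real system K = 64, the basis PSFs are 480 × 480 pixels, and the coefficient volumes live on the 1920 × 1920 × 80 grid, so an efficient (e.g. GPU-based) implementation would be needed.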

To visually validate the synthetic measurement generated by this 3D-LSV model, we compare a simulated image and an experimental measurement taken from a 5 × 5 × 0.8 mm$^3$ volume with 10-µm fluorescent beads at a similar seeding density in Fig. S8. The forward model used in our V1 system is based on a depth-wise linear shift invariant (LSI) approximation [1]. To highlight the improvement of the proposed 3D-LSV model over the depth-wise LSI model, we compare the synthetic measurements from the same object using both models in Fig. S8a. In the zoom-in panels taken from the same regions at the peripheral FOV, where the aberrations are more apparent, we highlight that the 3D-LSV model can synthesize the key image features in the real experiment. This is essential for training CM$^2$Net on simulation data only while keeping the trained network directly generalizable to real experimental measurements.

C. Analysis on the denoising capability of the 3D-LSV model
In this section, we analyze the denoising capability of the 3D-LSV model with different numbers of basis PSFs. We first decompose the raw calibration PSF stacks with different numbers of basis PSFs using the truncated singular value decomposition (TSVD), and then compute the corresponding 3D-LSV model and simulate the PSF at a "seen" position. Next, we compare the absolute error between the simulated and the measured PSFs and plot the background intensity profile. Since only the simulated PSF is used in our 3D-LSV model, even at the calibrated "seen" 3D locations, this procedure allows assessing the effective denoising capability of the 3D-LSV model. Fig. S9 shows that all simulated PSFs have smoother background profiles, validating that the low-rank model can suppress the noise, particularly in the background region. Moreover, the denoising performance of the 3D-LSV model is related to the number of basis PSFs. When the number of basis PSFs is too large, the LSV model not only approximates the system aberrations well, but also inadvertently captures the random noise present in the raw calibration PSF measurements. This shows a trade-off between the reconstruction accuracy and the denoising capability. Since the noise level of our raw PSF measurements is low, we choose a relatively large number of basis PSFs (K = 64) to provide a good reconstruction accuracy with a moderate denoising effect.
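The trade-off described above can be illustrated with a small synthetic experiment. This sketch is not based on the calibration data: it assumes an exactly rank-3 "clean" matrix plus Gaussian noise, which is a much simpler setting than real PSF stacks, and all sizes and noise levels are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rank, n_psf, n_pix = 3, 40, 400

# Synthetic noise-free "PSF matrix" (one flattened PSF per row) and a noisy measurement of it.
clean = rng.standard_normal((n_psf, true_rank)) @ rng.standard_normal((true_rank, n_pix))
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

u, s, vt = np.linalg.svd(noisy, full_matrices=False)

def rank_k(k):
    """Rank-k TSVD approximation of the noisy measurements."""
    return (u[:, :k] * s[:k]) @ vt[:k]

ranks = [1, 3, 10, 40]
# Error against the measured (noisy) data: always shrinks as more basis terms are kept.
fit_err = {k: np.linalg.norm(rank_k(k) - noisy) for k in ranks}
# Error against the noise-free data: grows again once extra terms start fitting noise.
true_err = {k: np.linalg.norm(rank_k(k) - clean) for k in ranks}
```

Here `fit_err` decreases monotonically with the rank, mirroring the improved agreement with the measured PSFs, while `true_err` is smallest at the true rank and increases for larger ranks because the additional terms mostly encode noise.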

The absolute error between the simulated and measured PSF shows that the 3D-LSV model accuracy improves when the number of basis PSFs becomes larger. The enlarged profiles for all simulated PSFs show smoother background intensity profiles compared to the measured PSF, which indicates that the low-rank model can suppress the noise in the raw PSF measurement.

D. Implementation details of the CM 2 Net and its building block
In this section, we describe the details of the network implementation and its training and testing procedure. The CM$^2$Net contains three sub-networks: the view demixing-net, the view-synthesis-net, and the lightfield-refocusing enhancement-net. The three sub-networks all use the same residual network structure and only differ in the input/output dimensions. The detailed structure of a residual block is provided in Fig. S10. The input tensor first goes through a 2D convolution layer which has 64 convolution kernels of size 3 × 3. After the 2D convolution layer, the intermediate tensor is fed into a batch normalization layer, followed by a nonlinear activation layer using the Parametric ReLU (PReLU), a nonlinear function with a learnable slope on the negative side of the axis. The tensor then goes through another pair of 2D convolution and batch normalization layers to further increase the receptive field. Lastly, the tensor is element-wise added to the input tensor of this block, which forms a residual connection. The numbers of input and output channels, denoted by (a, b), for the demixing-net, view-synthesis-net, and enhancement-net are (9, 9), (9, 80), and (32, 80), respectively.
The CM$^2$Net takes a stack of 9 cropped views from a CM$^2$ measurement as the input. In the training phase, the input data dimension to the demixing-net is 320 × 320 × 9, where the third dimension denotes the number of channels. The output size from the demixing-net remains the same. The non-learnable lightfield refocusing module applies a "shift-and-add" refocusing operation on the 9 demixed views: each view is laterally shifted by an amount that depends on the defocus distance, and the shifted views are then added. Here, RFV denotes the refocused slice at the defocus distance ∆z, (u, v) are the coordinates of the 3 × 3 microlens array, and VS is the demixed view stack. The amount of shift is determined by the system magnification M, the nominal focal distance z$_0$, the size of the microlens d, the microlens index (u, v), and the defocus distance ∆z.
To implement this, we convert the amount of shift to be [-18 : +17] pixels (+ means shifting towards the center view, - means the opposite), which ensures that the refocused stack consisting of 36 planes covers the targeted 800-µm depth range. Note that this refocusing algorithm generates artifacts around the image boundaries. Therefore, we discard the outermost 32 pixels in both lateral dimensions, resulting in a 256 × 256 × 36 refocused volume. The enhancement-net is trained to first resample the refocused volume onto an axially finer sampling grid (256 × 256 × 80) matching the ground-truth volume, and then enhance the 3D reconstruction. The resampling is performed by a 2D convolution layer of kernel size 3 with 80 channels, which increases the number of depth planes from 36 to 80. To perform the view synthesis, similarly, only the central 256 × 256 × 9 region is extracted from the demixed views and fed into the view-synthesis-net. The output dimension from the view-synthesis-net is 256 × 256 × 80, the same as the ground-truth volume.
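The shift-and-add refocusing with integer pixel shifts can be sketched as below. This is a minimal illustration in Python/NumPy under several assumptions that are not confirmed by the text: the function name `shift_and_add` is hypothetical, the views are indexed by microlens position (v, u) in {-1, 0, 1}, a positive shift moves each off-center view towards the center view along the negative index direction, circular shifts (`np.roll`) stand in for the actual shifting, and the sum is normalized by the number of views.

```python
import numpy as np

def shift_and_add(views, shifts, crop):
    """Refocus a 3 x 3 set of views by shifting and averaging.

    views:  (3, 3, H, W) demixed views, indexed by microlens row and column.
    shifts: iterable of integer pixel shifts, one per refocused plane.
    crop:   number of boundary pixels discarded on every side.
    Returns an array of shape (H - 2*crop, W - 2*crop, n_planes).
    """
    _, _, h, w = views.shape
    planes = []
    for s in shifts:
        acc = np.zeros((h, w))
        for iv, v in enumerate((-1, 0, 1)):
            for iu, u in enumerate((-1, 0, 1)):
                # Off-center views move by s pixels per unit microlens index;
                # the center view (u = v = 0) is never shifted.
                acc += np.roll(views[iv, iu], shift=(-v * s, -u * s), axis=(0, 1))
        planes.append(acc[crop:h - crop, crop:w - crop] / 9.0)
    return np.stack(planes, axis=-1)

rng = np.random.default_rng(0)
demixed_views = rng.random((3, 3, 320, 320))
# 36 planes with shifts from -18 to +17 pixels; 32-pixel margins are discarded.
refocused = shift_and_add(demixed_views, range(-18, 18), crop=32)
```

Because the largest shift (18 pixels) is smaller than the discarded margin (32 pixels), the wrap-around introduced by the circular shift never reaches the retained 256 × 256 region.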
In the inference phase, we directly feed the full-FOV measurement to perform a single-pass 3D reconstruction, which bypasses the stitching artifacts suffered by patch-wise reconstructions.

E. Ablation study on the demixing module of the CM$^2$Net
In this section, we show in simulation that the view-demixing task can be reliably performed and that the demixing-net module substantially improves the quality of the downstream 3D reconstruction. First, we demonstrate the effectiveness of view-demixing by comparing the views demixed by the demixing-net against the ground truth in simulation. Both the CM$^2$ measurement and the nine individual crosstalk-free views are simulated with our 3D-LSV model, using the PSFs of the full MLA and of each individual microlens, respectively. In Fig. S11, we show results on a testing volume consisting of fluorescent beads with random sizes between 10 µm and 20 µm in a cylindrical volume (diameter ∼7 mm, thickness ∼0.8 mm). The large FOV results in strong view multiplexing in the raw CM$^2$ measurement, as evident in the example "input view" centered around the central microlens. The "demixed view" from the demixing-net closely matches the ground-truth view from the central microlens without crosstalk artifacts, as highlighted in the three zoom-in panels taken from widely separated regions. The demixing-net can robustly perform this task because the images of a fluorescent bead formed by different microlenses contain distinct aberrated features.
Next, we show that the view-demixing step significantly improves the 3D reconstruction quality. Specifically, we perform the following ablation study in Fig. S11b. The same dual-branch reconstruction network (including the lightfield-refocusing enhancement module and the view-synthesis module) is used to process either the views demixed by the demixing-net or the multiplexed views from the raw measurement, on the same simulated data. The XY maximum intensity projections (MIPs) of the 3D reconstructions from the two networks and of the ground-truth volume are shown in Fig. S11b. As highlighted in the two zoom-in regions, the reconstruction from the network with the demixing-net closely matches the ground truth, whereas the one without view-demixing suffers from a large number of false positives (marked by the yellow arrows).
We further quantify the reconstruction quality by recall and precision. The CM$^2$Net with the demixing-net achieves an average recall and precision of 0.7 and 0.94, respectively, while the one without the demixing-net achieves a much lower recall of 0.57 and precision of 0.24. This comparison highlights that although the CM$^2$Net without the demixing-net is sensitive enough to recover 57% of the emitters, the 3D reconstruction suffers from very low precision. The comparison of the precision values implies that the demixing-net helps to reduce the false-positive rate (the fraction of reconstructed emitters that are false positives, i.e. 1 − precision) from 0.76 to 0.06, a ∼13× improvement.
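The arithmetic behind the quoted 13× figure can be checked in a few lines. The helper name `false_positive_fraction` is introduced here for illustration only; it computes 1 − precision, which is how the false-positive rate is used in the paragraph above.

```python
def false_positive_fraction(precision):
    """Fraction of reconstructed emitters that are false positives."""
    return 1.0 - precision

with_demixing = false_positive_fraction(0.94)     # about 0.06
without_demixing = false_positive_fraction(0.24)  # about 0.76
improvement = without_demixing / with_demixing    # about 12.7, reported as 13x
```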

F. Ablation study on the reconstruction module of the CM$^2$Net
In this section, we perform an ablation study to analyze the potential reasons for the decrease of recall near the edges of the FOV. We separate the CM$^2$Net into the demixing network and the reconstruction network (consisting of the enhancement-net and the view-synthesis-net). First, we train another network consisting of only the reconstruction network, which takes the ground-truth demixed views as the input. The recall for emitters at different lateral locations is then evaluated on the testing set using the same method as in Section 3.2. The reconstruction results achieve recalls close to 1 consistently over the entire 7-mm FOV range (Fig. S12a, yellow curve), showing the effectiveness and robustness of our reconstruction network design. To further evaluate the trained CM$^2$Net, we quantify the recall of the reconstruction network in the pre-trained CM$^2$Net by directly inputting the ground-truth demixed views. The result (orange curve) shows a slight degradation compared to the re-trained reconstruction network but remains >0.89 across the 7-mm FOV. Both results are much higher than that from the CM$^2$Net predictions on the raw CM$^2$ measurement (blue curve). This indicates that the CM$^2$Net's reconstruction module provides superior performance, and that the degraded performance at the outer FOV is due to imperfect view-demixing in these regions.
Upon visual inspection, we hypothesize that the view-demixing network is impacted by the low light intensity and reduced SNR of the point source near the edge of the FOV. To show this, we calculate the intensity map of the point source under the central microlens in Fig. S12b and show that the point source's intensity drops by as much as ∼85% near the edge compared to the intensity at the central FOV. Moreover, we compare the point-source intensity map and the recall map on the same pixel grid. Both maps exhibit similar non-isotropic distributions and significant drops at the outer FOV regions. Finally, we compare the ground-truth and the view-demixing-net predicted demixed views (1-mm$^2$ patches) for both a central and an edge image patch in Fig. S12c. The view-demixing result for the central FOV is highly accurate, whereas the result near the edges suffers from a large number of missing particles. We also visually observe that the SNR of the demixed views at the edge FOV is much worse than at the central FOV.
Overall, this ablation study shows that the combination of the view-synthesis-net and enhancement-net can achieve superior reconstruction performance and is robust to large SNR variations and the high dynamic range of the measurements. We provide evidence that the low recall near the edge is possibly due to the uneven intensity of the point source used in the PSF calibration.
Fig. S12. (c) Comparison of the 3×3-view overlay view-demixing results between patches from the edge and central FOV. The central patch, which achieves ∼1 recall, has higher SNR and shows a nearly perfect view-demixing result. However, the view-demixing prediction on the patch from the edge FOV suffers from low SNR and missing particles.

G. Ablation study on the reconstruction sub-modules of the CM$^2$Net
In this section, we perform ablation studies on the two branches in the reconstruction module to show how the light-field refocusing (LFR) enhancement branch and the view-synthesis (VS) branch complement each other. First, we compare the network performance after removing either the LFR or the VS branch from the CM$^2$Net to show the effectiveness of the combined network structure. To do this, we build two new reconstruction networks with only the LFR or only the VS branch (the network structure is shown in Fig. S13a). Both networks use the same data set and follow the same training scheme as the full CM$^2$Net. After training each network for 48 hours, we test and evaluate the performance using the quantitative analysis procedure described in Section 3.2. The results are summarized in Fig. S13b. This ablation study shows that the CM$^2$Net achieves the best F1-score under all conditions by combining both the VS and LFR branches.
Next, we visualize the final feature maps from the two branches of the CM$^2$Net. We directly extract the feature maps from the trained CM$^2$Net (as shown in Fig. S14a) and compare the MIP of the feature maps with the MIP of the ground truth at the central and edge FOVs. In Fig. S14b, we show that the feature maps from the LFR branch closely match the ground truth at the central FOV but miss a large number of particles at the edge. This is because the input to the LFR network (the lightfield refocused volume based on the demixed views) suffers from artifacts in the peripheral regions, caused by inexact view matching from the demixed views and by the boundary artifacts of the "shift-and-add" operation. To address this challenge, CM$^2$Net adds the VS branch to the reconstruction module to help improve the performance in the peripheral FOV regions. In Fig. S14b, we compare the ground-truth volume, the CM$^2$Net reconstruction, the feature maps from the LFR branch, and the feature maps from the VS branch to visualize the improvement owing to the VS branch. We show that at the FOV edges, the VS branch can detect particles that are missed by the LFR branch, and thus improves the CM$^2$Net's recovery of emitters across the large FOV.
Fig. S14. (b) The overlay between the MIPs of the ground truth and the feature maps from the LFR network. The contrast of the feature maps is enhanced for visualization. ROI 1 and ROI 2 show that the LFR feature maps closely match the ground truth at the central FOV (yellow indicates matched particles). Bottom: The overlay of the MIPs of the ground truth and the CM$^2$Net reconstruction, the LFR feature maps MIP, and the VS feature maps MIP at the FOV edges. The particles detected by the CM$^2$Net (in light blue) are the combination of the particles detected by the LFR branch (in yellow) and the VS branch (in white). The CM$^2$Net learns the complementary information provided by the VS and LFR branches to achieve a high detection rate across a wide FOV.

PERFORMANCE EVALUATION METRICS
We use particle-wise detection metrics, including recall, precision, and F1-score, to quantitatively evaluate the CM$^2$Net 3D reconstruction. We choose this set of metrics because they focus on quantifying the errors made on the reconstructed emitters, whereas voxel-value averaged metrics, e.g. mean squared error, are biased by the background for sparse objects like ours. We describe the computation of these metrics as implemented in MATLAB. Starting from the reconstructed volumes from the CM$^2$Net, we first detect all the recovered particles within the volume. The detection is done by binarizing the recovered 3D volume with a global threshold (imbinarize) and then extracting the locations and sizes of the connected 3D components (recovered particles) from the binarized volume (bwconncomp, regionprops3). We then compute a distance matrix by calculating the Euclidean distance between every recovered and ground-truth particle, and solve a linear assignment problem based on the distance matrix to assign the recovered particles to the corresponding ground truth (matchpairs).
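If the Euclidean distance between the recovered and the matched ground-truth particles is larger than a pre-defined distance threshold (i.e. 2× the particle size in our simulation), we count it as a False Negative (FN), meaning that the CM$^2$Net fails to correctly reconstruct the particle. If a recovered particle does not find a match in the ground truth, we count it as a False Positive (FP). A True Positive (TP) is when the distance between the recovered and the matched ground-truth particles is smaller than the distance threshold (examples are shown in Fig. S16a). Lastly, the recall, precision, F1-score, Jaccard index (JI), and lateral and axial root mean squared localization error (RMSE) are computed as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{F1} = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}, \quad \mathrm{JI} = \frac{TP}{TP + FP + FN},$$
$$\mathrm{RMSE}_{\mathrm{lateral}} = \sqrt{\frac{1}{TP} \sum_{i=1}^{TP} \left[ (x_i^r - x_i^g)^2 + (y_i^r - y_i^g)^2 \right]}, \quad \mathrm{RMSE}_{\mathrm{axial}} = \sqrt{\frac{1}{TP} \sum_{i=1}^{TP} (z_i^r - z_i^g)^2},$$
where $(x_i^r, y_i^r, z_i^r)$ is the reconstructed emitter's position and $(x_i^g, y_i^g, z_i^g)$ is the corresponding ground-truth position.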
Recall and precision quantify the detection performance in two complementary aspects. Recall quantifies the CM$^2$Net's sensitivity as the fraction of emitters it correctly reconstructs out of the total in the ground-truth volume. Precision measures the fraction of emitters it correctly reconstructs out of the total reconstructed emitters. Both the F1-score and JI combine recall and precision, and the two are highly correlated. Lateral and axial RMSE measure the localization accuracy. The global threshold for binarizing the recovered volume is set by maximizing the F1-score on the evaluation set [2]. The recall, precision, and lateral and axial RMSE under different conditions are reported in the main text (Section 3.2 and Figure 6). The F1-score and JI under the same set of conditions are shown in Fig. S15.
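The pairing and scoring steps described above can be sketched in Python as follows. This is an illustrative sketch rather than the MATLAB code used for the reported numbers: `scipy.optimize.linear_sum_assignment` stands in for `matchpairs`, the function name `detection_metrics` is hypothetical, and a matched pair that exceeds the distance threshold is assumed to contribute one FN and one FP.

```python
import math
import numpy as np
from scipy.optimize import linear_sum_assignment

def detection_metrics(recovered, truth, dist_thresh):
    """Pair recovered and ground-truth centroids, then score the detection.

    recovered, truth: non-empty sequences of (x, y, z) positions.
    dist_thresh: maximum distance for a pair to count as a true positive.
    """
    rec = np.asarray(recovered, dtype=float)
    gt = np.asarray(truth, dtype=float)
    # Distance matrix: rows are recovered particles, columns are ground truth.
    dist = np.linalg.norm(rec[:, None, :] - gt[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(dist)  # minimum total-distance pairing
    close = dist[rows, cols] <= dist_thresh
    tp = int(close.sum())
    fp = len(rec) - tp  # assumption: unmatched or too-distant recovered particles
    fn = len(gt) - tp   # assumption: unmatched or too-distant ground-truth particles
    diff = rec[rows[close]] - gt[cols[close]]
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),  # equals 2PR / (P + R)
        "ji": tp / (tp + fp + fn),
        "rmse_lateral": float(np.sqrt(np.mean(diff[:, 0] ** 2 + diff[:, 1] ** 2))) if tp else math.nan,
        "rmse_axial": float(np.sqrt(np.mean(diff[:, 2] ** 2))) if tp else math.nan,
    }

# Toy example: two emitters recovered close to the truth, one far off.
truth = [(0, 0, 0), (10, 0, 0), (50, 50, 50)]
recovered = [(0, 3, 4), (10, 0, 0), (200, 200, 200)]
scores = detection_metrics(recovered, truth, dist_thresh=20.0)
```

In this toy case two of the three pairs fall within the threshold, so TP = 2, FP = 1, and FN = 1.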
The metric maps reported in Sections 3.2 and 3.3 are computed from the metric values on non-overlapping patches (250 µm × 250 µm in Section 3.2 and 500 µm × 500 µm in Section 3.3). When quantifying the metrics on experimental data (Sections 3.3 and 3.4), we first co-register the CM$^2$Net reconstruction and the widefield measurement by manually selecting a few matched feature points in the central FOV. We then extract particles by binarizing both the widefield measurement and the CM$^2$Net reconstruction. Next, we follow the same evaluation pipeline as above. Since the amounts of image distortion in the widefield and CM$^2$ measurements are different, good registration can only be achieved around the central region, while particle pairs in the peripheral region are expected to have larger separations. Therefore, the distance threshold we choose for the experimental data is ∼5× the particle diameter, which is larger than the one used for the simulation (2× the particle diameter). To maintain the same distance threshold for particles at the edge of each patch, we perform particle pairing on slightly expanded patches (see Fig. S16b). Specifically, when computing the recall map on a 500 × 500-µm$^2$ patch, we select a larger CM$^2$Net reconstruction patch (700 × 700-µm$^2$) and zero-pad the corresponding widefield measurement patch, since recall requires detecting all the true positives. We visually show that without considering the distance threshold at the edge, the recall value decreases, since some matched pairs (yellow circle) between the reconstruction and the ground truth are separated by the edge of the patch (red box) even though they are within the distance threshold. When calculating the precision map, we select a larger widefield measurement patch (700 × 700-µm$^2$) and zero-pad the corresponding CM$^2$Net reconstruction patch in order to identify all the reconstructed particles that have a match in the widefield measurement. We visually show that without considering the distance threshold at the edge, some matched pairs (yellow circle) are separated by the edge of the patch (red box) even though they are within the distance threshold, which contaminates the precision quantification.
Fig. S16. (b) Example patch padding procedure for computing the recall (left) and precision (right) maps. The red box shows the effective 500 × 500-µm$^2$ non-overlapping FOV used to quantify the metrics. To maintain the distance threshold at the FOV edge, we choose a larger CM$^2$Net reconstruction patch (700 × 700-µm$^2$) when calculating the recall map, and a larger widefield measurement patch (700 × 700-µm$^2$) when calculating the precision map. The yellow circle shows the particles around the edges, which require the patch padding to find a proper match within the defined distance threshold.
The binarized MIP image pairs are shown in Figs. S20b and S22b for the 10-µm and mixed-size bead phantoms, respectively. Moreover, since the particles in the experiment are not uniformly distributed, we observe NaN values (when both the widefield measurement and the reconstruction in an image patch are empty) when performing the quantification on the experimental data. For fair evaluation, we exclude the NaN values when calculating the mean and standard deviation. When making the metric maps, we linearly interpolate the pixels having NaN values and apply a binary mask to indicate the sample region.
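Excluding empty patches from the summary statistics can be sketched as below. The function name `nan_excluded_mean_std` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which normalization is used.

```python
import math
import statistics

def nan_excluded_mean_std(values):
    """Mean and standard deviation over the non-NaN patch metrics."""
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        return math.nan, math.nan
    # Assumption: population standard deviation (normalized by N).
    return statistics.fmean(valid), statistics.pstdev(valid)

# Toy example: two patches are empty in both images and yield NaN.
patch_recall = [1.0, math.nan, 0.5, math.nan]
mean_recall, std_recall = nan_excluded_mean_std(patch_recall)
```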
In Fig. 6, the line plots show the quantitative metrics under different conditions. Each point and the associated error bar represent the mean and standard deviation, respectively. The testing data set contains a total of 180 random volumes and ∼5×10$^5$ emitters. Each point in Fig. 6b is computed on ∼20 volumes for each range of emitter density. Each point in Fig. 6a, c, and d is computed on the emitters within each range (labeled on the horizontal axis) of lateral location, depth, and emitter size.

A. Preparation of phantom objects and table-top widefield measurements
Fluorescent 3D objects are prepared according to the following protocol. Green fluorescent particles (Thermo Fisher Scientific, Fluoro-Max Dry Fluorescent Particles, 10-µm and 15-µm) are added to ∼1 mL of clear resin (Formlabs, no. RS-F2-GPCL-04). The mixture is then diluted, transferred onto a standard 3 × 1 inch microscope slide (Thermo Fisher Scientific, no. 125493), and cured under a UV lamp for ∼30 minutes. We use a micropipette (Thermo Fisher Scientific, Adjustable Volume Pipette, 10-100 µL, no. FBE00100) to control the transferred volume of the mixture (∼40 µL) so that the object is ∼800 µm thick and ∼7 mm wide.
To verify the 3D reconstructions from CM 2 Net, we experimentally collect axial focal stacks (z-stack) on a commercial table-top epi-fluorescence microscope (Nikon TE2000-U) with GFP filter sets (Thorlabs, no. MDF-GFP) and a scientific CMOS camera (PCO Imaging, Pco.edge 5.5). To acquire the full-FOV widefield measurements, we use a low-magnification objective lens (Nikon, CFI Plan Apo Lambda 2×, 0.1 NA) with 50-µm axial step size across the 800-µm depth range. To acquire a high-resolution axial scan at zoom-in regions, we use a high-magnification objective lens (Nikon, CFI Plan Achromat 20×, 0.4 NA) with 25-µm axial step size across the 800-µm depth range.

B. Preprocessing steps on experimental data
In this section, we provide additional details on the preprocessing of experimental CM$^2$ measurements. The goal is to match the intensity distribution between the simulated and the experimental data so that our simulator-trained CM$^2$Net can generalize well to real measurements. The main discrepancy between the measured and simulated images is the low-frequency fluorescent background due to the reflection from the microscope slide. To remove the background and match the intensity statistics, we preprocess the experimental measurements with histogram matching. The reference histogram is estimated from the entire training dataset. The histogram matching builds a monotonic mapping from the experimental data to the simulated data. A comparison of the simulated data and the experimental data with and without histogram matching is provided in Fig. S17, where one can clearly see that the background is much suppressed after the preprocessing. This is further verified by their histograms: the distribution of the preprocessed experimental data matches the simulated data much better than that of the raw experimental measurements. We find that other commonly used background removal methods (such as thresholding, morphological opening, etc.) do not generalize well to real experimental measurements. Another preprocessing step is to adjust the centers of the 9 cropped views to compensate for the small displacement of the chief ray positions between the simulation and the experiments. The adjustment is done by manually aligning a few on-axis particles in each view.
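A minimal sketch of histogram matching as a monotonic intensity mapping is given below. It uses the standard CDF-matching construction and is not necessarily the routine used for the reported results; the function name `match_histogram` and the toy data are assumptions for illustration.

```python
import numpy as np

def match_histogram(source, reference):
    """Map source intensities so that their CDF follows the reference CDF.

    The mapping is a non-decreasing function of the source intensity.
    """
    src = np.asarray(source, dtype=float)
    flat = src.ravel()
    ref = np.asarray(reference, dtype=float).ravel()
    # Empirical CDFs evaluated at the unique intensity values.
    src_values, inverse, src_counts = np.unique(flat, return_inverse=True, return_counts=True)
    ref_values, ref_counts = np.unique(ref, return_counts=True)
    src_cdf = np.cumsum(src_counts) / flat.size
    ref_cdf = np.cumsum(ref_counts) / ref.size
    # For each source quantile, look up the reference intensity at that quantile.
    mapped_values = np.interp(src_cdf, ref_cdf, ref_values)
    return mapped_values[inverse].reshape(src.shape)

# Toy example: a measurement with a constant background offset is mapped
# onto a mostly dark reference, which suppresses the background.
measured = np.array([[100.0, 110.0], [120.0, 200.0]])
simulated = np.array([0.0, 0.0, 0.0, 10.0])
matched = match_histogram(measured, simulated)
```

In practice the reference would be pooled from the training data, as described above; here a four-element array stands in for it.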

C. Recall, precision, and F1-score maps in phantom experiments
In this section, we report the details of recall, precision, and F1-score evaluations on the two experiments in the main article. The recall and precision maps are computed on non-overlapping patches (500 µm × 500 µm) across the entire FOV.
For the 10-µm bead phantom in Fig. 7, the recall and precision maps are shown in Fig. S20a. The CM$^2$Net reconstruction achieves a recall of ∼0.78 and a precision of ∼0.80. In comparison, the recall and precision in simulation at the corresponding density range are ∼0.83 and ∼0.97, respectively (Fig. 6b). This shows that the simulator-trained CM$^2$Net degrades slightly on experiments, with ∼5% higher mis-detection rate and ∼7% higher false-positive rate at this imaging condition. To further analyze the potential reason for the performance decrease in the experiment, we compare the widefield measurement, the 3×3 overlay of the predicted view-demixing results, and the CM$^2$Net reconstruction MIP. We find that the decreased performance originates from the view-demixing results (see Fig. S18). We attribute the reduced performance to the undesired extra views in the experimental measurements due to an extra column of partial microlenses adjacent to the main 3 × 3 microlens array (see Fig. S19).
Fig. S20. (b) In region 1, the binarized map of the widefield measurement matches perfectly with the CM$^2$Net reconstruction. Patch 2 is from the edge FOV and has a low precision but perfect recall (i.e. all ground-truth particles are matched, but there are unmatched, falsely reconstructed particles). Patch 3 is from the edge FOV and has a low recall but perfect precision (all detected particles are matched, but there are unmatched ground-truth particles). Note that since image warping is unavoidable in the real experimental data and the registration is based only on the central particle pairs, the particle pairs at the peripheral FOV are expected to exhibit slight separations.
For the mixed-size bead phantom in Fig. 8, the recall and precision maps are shown in Fig. S22a. The CM$^2$Net achieves an averaged recall of ∼0.73 and precision of ∼0.84 across the entire 6.5-mm FOV. Compared to the mono-10-µm bead experiment, we hypothesize that the decreased recall is due to the greater intensity and SNR variations caused by the bead size variation in the measurement. The increased precision is due to the reduced FOV and less contamination from the extra views (see Fig. S21).
For better visualization, we further show the overlays of the registered and paired binarized maps between the widefield measurement and the CM$^2$Net reconstruction. We zoom in on three patches (size: 1 mm$^2$) in the metric maps and the corresponding overlay map. Patch 1, from the central FOV, achieves the highest F1-score of 1.0. Patches 2 and 3 are extracted from the edge FOV and have either a low precision with perfect recall (all ground-truth particles are matched, but there are redundant detections) or a low recall with perfect precision (all detected particles are matched, but some ground-truth particles are missed). Note that since image warping is unavoidable in the real experimental data and the registration is based only on the central particle pairs, the particle pairs at the peripheral FOV are expected to exhibit slight separations. For direct visualization, we enlarge the corresponding overlay maps in Figs. S20b and S22b. In patch 1, the particles detected from the widefield measurement match perfectly with the particles reconstructed by the CM$^2$Net. In patches 2 and 3, we show unpaired particles detected from either the CM$^2$Net reconstruction (low precision) or the widefield measurement (low recall), which is consistent with our metric maps and validates the accuracy of our evaluation procedure.
Fig. S21. Extra views in the mixed-size bead experimental measurement. Compared to the mono-10-µm bead experiment, the reduced FOV (6.7 mm to 6.5 mm) reduces the level of contamination from the extra views, which results in fewer false positives and higher precision in the reconstruction.
Fig. S22. (a) The recall, precision, and F1-score maps for the object in Fig. 10 of the main text (sample: 10-µm and 15-µm mixed fluorescent beads in a cylindrical volume with 6.5-mm diameter and 0.8-mm depth). The x labels indicate that the metric value is zero, and the dashed circles show the expected 6.5-mm diameter region of the phantom object. (b) The overlay of the registered and binarized widefield measurement and CM$^2$Net reconstruction. Patches 1, 2, and 3 (size: 1 mm$^2$) show example regions having high F1-score, low recall, and low precision, respectively.

NUMERICAL STUDY ON COMPLEX OBJECTS
A. Simulation on cortex-wide 3D neuronal imaging
In this section, we perform a pilot numerical study on imaging fluorescently labeled neurons across the entire mouse cortex to show the potential of applying CM$^2$ V2 and the trained CM$^2$Net to large-scale neural imaging. One challenge for in-vivo neural imaging is that the neurons are distributed in the highly curved cortex, which requires the imaging device to be robust to the complex 3D surface geometry. In the main text, we performed a thorough analysis on cylindrical volumes embedded with spherical fluorescent emitters. Here, to demonstrate that the computational pipeline can be applied to image neurons with irregular shapes, we take the CM$^2$Net trained using the dataset in the main text and directly apply it to CM$^2$ measurements simulated from an open-source dataset of a mouse brain with labeled VIP neurons, imaged by the mesoscale selective plane-illumination microscope (mesoSPIM) [3].
To generate the testing data, we first segment the cerebral cortical region (7.5 × 6.6 × 0.8 mm$^3$) within an 800-µm depth range from the whole mouse brain volume. We then scale the segmented volume to match the CM$^2$ sampling grid. Next, we apply a threshold to remove the fluorescence background and extract the locations and sizes of the connected 3D components (neurons) from the brain volume in MATLAB (bwconncomp, regionprops3). The original dataset aggregates the neuronal activity across 7-8 min in a single 3D volume, resulting in a high neuron density of ∼530/mm$^2$. We randomly select neurons from the brain volume to down-sample the number of neurons in each measurement, which simulates the sparse neuronal activity expected at each time point in practice. We simulate two imaging volumes with neuron densities of 20/mm$^2$ and 100/mm$^2$. Next, we apply the 3D-LSV model to these two volumes to synthesize the CM$^2$ measurements (shown in the insets of Figs. S23b and S24b, respectively). Finally, the CM$^2$ measurement is input to the CM$^2$Net to perform the 3D reconstruction. The imaging volume and the CM$^2$Net reconstruction for the sparsely labeled neurons are shown in Fig. S23. First, to visualize the curved brain geometry, we register the ground-truth volume with the cerebral cortex contour extracted from the Allen Mouse Brain Atlas [4] (shown by the white dotted line). Next, we assess the full-FOV reconstruction by comparing the depth-encoded MIPs of the ground-truth volume and the CM$^2$Net 3D reconstruction. We further compare four zoom-in regions from the center (blue and yellow patches) and edge (green and red patches) FOVs. By visual inspection, the CM$^2$Net successfully reconstructs the irregular 3D brain geometry and robustly reconstructs the neurons with different intensities and sizes with isotropic high resolution. The reconstruction FOV is limited to around 6.6 mm (labeled by the white circle).
There are several missed detections beyond the effective FOV (shown in the green patch). These results agree well with the FOV analysis in Section 3.2.
The synthesized CM$^2$ measurement is shown as the inset. At this high neuron density, the CM$^2$Net provides high-quality reconstruction in the central FOV (around 6.6 mm, shown in the blue, yellow, and green patches). The performance largely degrades at the edges (shown in the red patch), which follows the observation in Section 3.2.
The imaging volume and the CM$^2$Net reconstruction for a densely labeled neuronal volume are shown in Fig. S24. The left and right cerebral cortex contours are labeled in the ground-truth volume (white dotted line). First, we validate the full-FOV reconstruction by comparing the XY and XZ MIPs of the depth-coded ground-truth volume and the CM$^2$Net 3D reconstruction. To further assess the reconstruction in greater detail, four regions from the center (green and yellow patches) and corner (blue and red patches) FOVs are enlarged for comparison. By visual inspection, the reconstruction matches the ground truth well in the central 6.6 × 6.6 mm$^2$ FOV (labeled with the white circle). The performance largely degrades beyond the effective FOV (example shown in the red patch), and the degradation is more severe than in the low-density case, which follows the quantitative analysis in Section 3.2.
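The random down-sampling of neurons to a target areal density, described earlier in this section, can be sketched as follows. The function name `subsample_neurons`, the fixed seed, and the use of integer neuron labels are illustrative assumptions; the lateral area of 49.5 mm$^2$ is simply 7.5 mm × 6.6 mm.

```python
import random

def subsample_neurons(neuron_ids, area_mm2, target_density, seed=0):
    """Randomly keep enough neurons to reach target_density (neurons per mm^2)."""
    n_keep = round(target_density * area_mm2)
    if n_keep > len(neuron_ids):
        raise ValueError("target density exceeds the density of the source volume")
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    return rng.sample(list(neuron_ids), n_keep)

area_mm2 = 49.5                          # 7.5 mm x 6.6 mm lateral extent
all_neurons = range(round(530 * area_mm2))  # about 530 neurons per mm^2
sparse = subsample_neurons(all_neurons, area_mm2, target_density=20)
dense = subsample_neurons(all_neurons, area_mm2, target_density=100)
```

Under these assumptions, the two simulated volumes contain 990 and 4950 neurons, respectively.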

B. Simulation on 3D imaging of mouse brain vessels
In this section, we perform a numerical study on a mouse vasculature to demonstrate the ability of CM$^2$ to image more complex structures. A major concern for deep-learning-based reconstruction is the lack of generalization ability. Although we trained the CM$^2$Net on only sparse particles to mimic neuron-like structures, here we study how the CM$^2$Net can be applied to more complex, non-particle objects. We test the CM$^2$Net on the open-source whole mouse brain vessel segmentation data set [5] to show that the particle-trained CM$^2$Net can still recover the volumetric mouse vasculature. The preprocessing steps are similar to those in the previous section. First, we segment the cerebral cortical region within the 800-µm depth range (volume size: 6 × 4.2 × 0.8 mm$^3$) and extract the vessels from the brain volume. Next, we keep only the small vessels within the volume, with sizes ranging from 800 to 4000 pixels, to form the imaging volume. We then scale the volume to match the CM$^2$ sampling grid and simulate the CM$^2$ measurement with our 3D-LSV model (first figure in Fig. S25b). Finally, the measurement is input to the CM$^2$Net for 3D reconstruction.
Fig. S25. Numerical study on a mouse vessel network. (a) Visualization of the ground-truth volume. The volume size is 6 × 4 × 0.8 mm$^3$. The small vessels are axially overlapped and distributed in the curved mouse brain. (b) Overlap between the ground truth (green) and the network prediction (red) of the demixed views. The first figure is the synthesized CM$^2$ measurement. The labeled patches are the corresponding inputs for extracting the demixed views. The result shows that the network view-demixing result matches the ground truth well (yellow). (c) Overlap between the ground truth (green) and the network prediction (red) of the lightfield refocused volume (generated by the shift-and-add operation from the network-demixed view stack). The slices shown in the figure are uniformly distributed across the entire refocused depth range. The orange and magenta arrows indicate the small vessels focused at the 18th and 27th slices. The result shows that the refocused volume provides an accurate estimate of the 3D information. Results from (b) and (c) show that the particle-trained view-demixing network generalizes well to the continuous dense object. (d) Depth-encoded MIPs of the ground-truth volume and the CM$^2$Net reconstruction. The CM$^2$Net can accurately recover the volumetric distribution of the vessels, but with discontinuity artifacts due to the sparsity enforced by the particle-trained CM$^2$Net.
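The size-based selection of small vessels (connected components of 800 to 4000 pixels) can be sketched with `scipy.ndimage`. This is an illustration only: the function name `keep_components_by_size` is hypothetical, and `ndimage.label` uses face connectivity by default, which may differ from the connectivity used in the original processing.

```python
import numpy as np
from scipy import ndimage

def keep_components_by_size(mask, min_size=800, max_size=4000):
    """Keep the connected components whose voxel count lies in [min_size, max_size]."""
    labels, _ = ndimage.label(mask)       # label 0 is the background
    sizes = np.bincount(labels.ravel())   # sizes[k] = voxel count of label k
    keep = (sizes >= min_size) & (sizes <= max_size)
    keep[0] = False                       # never keep the background
    return keep[labels]

# Toy volume with three separate blobs of 8, 1000, and 8000 voxels.
mask = np.zeros((64, 64, 64), dtype=bool)
mask[2:4, 2:4, 2:4] = True        # too small
mask[10:20, 10:20, 10:20] = True  # kept
mask[30:50, 30:50, 30:50] = True  # too large
small_vessels = keep_components_by_size(mask)
```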
The ground-truth volume and the CM$^2$Net view-demixing and reconstruction results are shown in Fig. S25. First, we show the complexity of the imaging volume in Fig. S25a, where axially overlapped small vessels are distributed in the curved mouse brain. Next, we assess the view-demixing results by visualizing the overlay between the ground-truth demixed views (green) and the network-demixed results (red). Figure S25b shows that the network-demixed views match the ground truth well (overlay shown in yellow), demonstrating that the view-demixing network can robustly demultiplex the overlapped views for dense continuous objects, even though it is trained entirely on the sparse discrete particle data set. The successful generalization of the CM$^2$Net view-demixing module lays the foundation for robust reconstruction of the dense object. On the one hand, the demixed views are the images captured by different microlenses, which capture the information within different FOV regions. With an accurately demixed view stack, the VS branch can learn high-quality information across the full FOV. On the other hand, the lightfield refocused volume generated from an accurate view stack can provide focus information for the LFR enhancement branch, which helps the enhancement branch to accurately recover the object's depth information. To demonstrate this, we show two small vessels digitally refocused at different depths in Fig. S25c. Figure S25c shows the closely matched overlay between the ground truth and the predicted lightfield refocused volumes at different depth planes. Finally, we validate the final CM$^2$Net reconstruction by comparing the XY, XZ, and YZ MIPs of the depth-encoded ground truth and the CM$^2$Net reconstruction in Fig. S25d. By visual inspection, the CM$^2$Net reconstruction matches the ground truth well in all projections, showing that the CM$^2$Net can perform high-quality volumetric reconstruction for such a dense object.
However, we also observe discontinuity artifacts in the reconstruction result. This is attributed to the fact that the CM$^2$Net was trained only on sparse particle objects, which implicitly enforces a sparsity prior on the reconstructed volume. We expect that the discontinuity artifacts can be removed by fine-tuning the CM$^2$Net on a more complex training dataset.