High frame rate video mosaicking microendoscope to image large regions of intact tissue with subcellular resolution

High-resolution microendoscopy (HRME) is a low-cost strategy to acquire images of intact tissue with subcellular resolution at frame rates ranging from 11 to 18 fps. Current HRME imaging strategies are limited by the small microendoscope field of view (∼0.5 mm²); multiple images must be acquired and reliably registered to assess large regions of clinical interest. Image mosaics have been assembled from co-registered frames of video acquired as a microendoscope is slowly moved across the tissue surface, but the slow frame rate of previous HRME systems made this approach impractical for acquiring quality mosaicked images from large regions of interest. Here, we present a novel video mosaicking microendoscope incorporating a high frame rate CMOS sensor and optical probe holder to enable high-speed, high-quality interrogation of large tissue regions of interest. Microendoscopy videos acquired at >90 fps are assembled into an image mosaic. We assessed registration accuracy and image sharpness across the mosaic for images acquired with a handheld probe over a range of translational speeds. This high frame rate video mosaicking microendoscope enables in vivo probe translation at >15 millimeters per second while preserving high image quality and accurate mosaicking, increasing the size of the region of interest that can be interrogated at high resolution from 0.5 mm² to >30 mm². Real-time deployment of this high frame rate system is demonstrated in vivo, and source code is made publicly available.


Introduction
The high-resolution microendoscope (HRME) is a fiber optic fluorescence microscope designed to image superficial epithelial cell nuclei in vivo to aid in visualization and detection of dysplasia and cancer in mucosal surfaces [1,2]. HRME images of dysplasia and cancer show characteristic changes in the size, shape and density of nuclei in a wide variety of tissues, including the oral cavity [3,4]; uterine cervix [5,6]; esophagus [7,8]; and colon [9]. Images may be interpreted visually by a clinician or analyzed quantitatively with software algorithms to generate an automated diagnostic prediction in real time [10][11][12][13].
One important limitation of HRME is its small field of view (FOV), which is determined by the optically transmitting area of the fiber bundle (typically ∼0.5 mm² for an 800 µm diameter bundle). Therefore, in practical application, multiple images must be acquired when evaluating tissue regions of interest larger than the probe FOV. Video mosaicking is a promising approach to computationally increase the FOV of the system while preserving high resolution. This is accomplished by co-registering successive frames in the video sequence and inserting them into a mosaicked image as the probe is smoothly translated across the specimen. However, video mosaicking of images acquired with a handheld microendoscope has several technical and practical challenges. First, camera exposure times need to be short enough to avoid motion blur as the probe is translated. Second, the acquisition frame rate needs to be fast enough that successive frames overlap sufficiently in imaged area to allow co-registration; this criterion is more challenging to meet for systems with a small FOV. Finally, video mosaicking requires good contact with the specimen to be maintained as the probe is smoothly translated across the region of interest. If these requirements are not met, video mosaicking yields poor quality mosaics or fails altogether.
Several studies have demonstrated the potential and limitations of video mosaicking using microendoscopes, and relatively few have demonstrated real-time mosaicking in vivo. Bedard et al. presented a real-time video mosaicking approach using HRME instrumentation with a 4 µm lateral resolution, 800 µm circular FOV, and acquisition frame rate of 11 frames per second (fps) [14]. Bedard demonstrated both in vivo and ex vivo mosaics ranging from ∼2-5 mm 2 without any physical attachments to grasp the optical probe and stabilize its movement. Maximum probe translation speed was 1 millimeter per second (mm/s), limited primarily by a 10 ms exposure and 11 fps frame rate. In practice, it is challenging to successfully translate a handheld probe across tissue surfaces at such slow speeds.
Yin et al. reported real-time video mosaicking using a handheld dual-axis confocal microendoscope on ex vivo specimens [15]. Using a probe with a 1 µm lateral resolution and 350 µm circular FOV, their system achieved real-time mosaicking visualization at ∼30 fps. Ten individuals were tasked with surveying a 3 × 2 mm section of mouse kidney with and without feedback from real-time video mosaicking. A ∼35% improvement in area coverage was observed when using the mosaic visualization to guide probe movement. The maximum translational speed of their system was not specified, but given a minimum frame-to-frame overlap of 80% and a 30 fps acquisition rate, the maximum translation speed is ∼2 mm/s. Hughes et al. presented a system which achieved faster probe translation speeds through increased acquisition frame rates [16]. Their line-scanning confocal system had a lateral resolution of 1 µm, 240 µm circular FOV, and achieved a frame rate of 120 fps. They assessed average mosaic lengths within a 3 min video of freehand motion across porcine colon during a simulated surgical procedure. They were able to register 86% of frames into smaller scale mosaics that were on average 2.3 mm in length. They demonstrated successful mosaicking at translational speeds up to 5 mm/s, but were only able to construct relatively small-area mosaics (∼0.6 mm² on average) due to the limited FOV of their line-scanning confocal system. Although their system achieved a high frame rate and could theoretically support faster translational speeds, they did not demonstrate real-time software deployment.
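The ∼2 mm/s estimate for the system of Yin et al. follows from the FOV, the minimum overlap, and the frame rate. The following is a minimal sketch of that arithmetic, assuming the overlap is treated as a linear fraction of the FOV diameter along the direction of travel (a simplification for a circular FOV); the function name is hypothetical.

```python
def max_translation_speed_mm_s(fov_diameter_um, min_overlap_fraction, frame_rate_fps):
    """Upper bound on probe speed that keeps the requested frame-to-frame overlap.

    Assumption: overlap is a linear fraction of the FOV diameter along the
    direction of travel, which simplifies the geometry of a circular FOV.
    """
    if not 0.0 <= min_overlap_fraction < 1.0:
        raise ValueError("min_overlap_fraction must be in [0, 1)")
    # Largest displacement allowed between two successive frames.
    max_shift_um = fov_diameter_um * (1.0 - min_overlap_fraction)
    # Displacement per frame times frames per second, converted from um to mm.
    return max_shift_um * frame_rate_fps / 1000.0


# Values quoted in the text for Yin et al.: 350 um FOV, 80% overlap, 30 fps.
yin_speed = max_translation_speed_mm_s(350, 0.80, 30)
print(f"{yin_speed:.1f} mm/s")
```

With these inputs the bound is about 2.1 mm/s, consistent with the ∼2 mm/s figure quoted above. The same function shows why a small FOV makes the overlap criterion harder to meet: for a fixed frame rate and overlap, the permissible speed scales linearly with FOV diameter.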
While these studies demonstrate capability to perform video mosaicking of tissues at speeds ranging from 1-5 mm/s, much of the data presented has been acquired from ex vivo specimens which do not present the same physical challenges as imaging intact tissue from live subjects in real clinical practice. In practice it is difficult to achieve stable and continuous movement of a thin, flexible, handheld optical fiber at such slow speeds, ranging from 1-5 mm/s. By way of comparison to freehand writing, average pen tip velocities recorded by 17 healthy participants asked to write a Japanese character ranged from ∼10 mm/s when writing in a 2 cm frame up to ∼50 mm/s when writing in a 15 cm frame [17].
We sought to design a system that could overcome these practical limitations by increasing the speed at which the probe could be translated and enabling stable handheld manipulation of the optical probe. Here, we present a novel microendoscope system which incorporates a high frame rate CMOS sensor and optical probe holder to achieve scanning of oral mucosa at >15 mm/s and construction of video mosaics >30 mm². Real-time deployment of our system is demonstrated in vivo at 90 fps, and source code is made publicly available.
Figure 1 shows an optical diagram (Fig. 1(A)) and SolidWorks illustration (Fig. 1(B)) of the high frame rate microendoscope. The optical assembly is an adaptation of previous designs which utilize an optical fiber (FIGH-30-850N; Fujikura), LED illumination source, objective lens (RMS10X; Olympus), dichroic mirror, and filter set suitable for fluorescence imaging of proflavine stained specimens [1,2,11]. The primary change to the system is the integration of a high frame rate CMOS sensor (BFS-U3-04S2M-CS; FLIR Systems Inc., Richmond, BC, Canada). This 0.4 megapixel sensor uses a global shutter and has a maximum frame rate of 522 frames per second (fps). To account for the smaller sensor format, the magnification of the system was reduced using a shorter focal length tube lens (AC254-100-A-ML; Thorlabs Inc., Newton, NJ, USA). The optical assembly is mounted inside a 3D printed enclosure and the CMOS sensor is connected to a laptop (Surfacebook 2; Microsoft, Redmond, WA, USA) via USB connection (Fig. 1(C)). The LED is powered via the USB connection through an LED driver circuit mounted on an Arduino Uno microcontroller (Fig. 1(D)). When driven at full brightness, the system delivers ∼4 mW of blue light at the distal tip of the optical fiber. The LED brightness can be modulated using a potentiometer on the driver circuit.

Optical fiber holder
Smooth movement of the optical probe is crucial for video mosaicking. We sought to design a probe holder which would preserve the thin profile at the distal end of the probe while providing a thicker handle to firmly grip the probe. For optical target imaging experiments a fiber holder with a 200 mm long handle and a subtle curve tapering towards a 2 mm diameter tip was fabricated ( Fig. 1(E)). This design was printed in two pieces out of PC-ISO polymer using a Fortus 450mc printer (Stratasys Ltd., Eden Prairie, MN, USA) and then assembled onto an optical fiber using epoxy to adhere the two pieces together. The resolution of the first design was not adequate to create a smooth contact surface at the distal end, so a spherical tip was added using epoxy and then polished to create a smooth contact surface. For in vivo imaging experiments, an optimized fiber holder design with a smooth contact surface was fabricated on a higher resolution stereolithography printer (Form 2; Formlabs, Somerville, MA, USA). The second design is a linear cylinder with a 100 mm long handle tapering to a 4 mm tip and contains channels to insert the optical fiber and secure it using nylon tipped set screws so that it is flush at the distal end ( Fig. 1(F)). This allows for the fiber tip to be aligned more precisely with the end of the holder, improving contact with the imaging target, and thus, mosaic quality. Both optical fiber holder designs were created in SolidWorks 2019 (SolidWorks Corp., Waltham, MA, USA).

Optical target imaging
To assess performance of the video mosaicking microendoscope in the lab, we first acquired high frame rate videos while translating the probe by freehand across an optical target with a plastic guide. A plastic guide was fabricated using vector graphics software (Adobe Illustrator 2020) and a CO 2 laser cutter (Universal Laser Systems VLS3.60) to precisely cut the guide out of an acetate film. The plastic guide was then adhered on top of lens paper that had been dyed using a yellow highlighter. The lens paper was backed with packing foam to provide some mechanical flexibility as the probe was pressed against the template during translation. Video files were recorded using SpinView (FLIR Systems Inc., Richmond, BC, Canada) with a 100 µs exposure, 0 dB gain, and 200 Hz framerate.

In vivo imaging
In vivo imaging was performed in the oral cavity of healthy adult volunteers. Proflavine at a concentration of 0.01% was applied topically to the tissue site prior to imaging. For in vivo imaging of proflavine stained tissue, exposure and gain were increased to 1 ms and 18 dB, respectively. Videos were acquired using SpinView at 100 fps. A circular disk was laser cut from water-activated kraft paper tape (KGPTI-0200; Alanson Products, Miami, FL, USA) and adhered to the tissue in order to demarcate a region of interest. Video documentation of the oral cavity imaging procedure was acquired using a smartphone with magnifying optics (EVA 3 Plus, MobileODT, Tel Aviv, Israel). Data collection was approved by the Institutional Review Board of Rice University (ID#: IRB-FY2016-172) and written informed consent was obtained from the participant.
Figure 2 shows the sequence of image processing steps for construction of mosaics from videos post acquisition. Software to perform these processing steps using a video file as input was developed using Mathematica 12.1 (Wolfram Research Inc., Champaign, IL, USA). Video files were processed on a MacBook Pro running macOS 10.14 (Apple, Cupertino, CA, USA). Videos were reviewed to identify a start and end timepoint for processing. Preprocessing of video frames prior to mosaic construction was performed in five steps (Fig. 2(A)). First, raw frames between the specified time points were extracted as 8-bit grayscale images. Next, high-frequency intensity patterns imposed by the fiber bundle were removed by downsampling the image by a factor of 2.5 to 288 × 216 pixels using the Lanczos method. Then, intensities across all frames were averaged to compute a mean intensity image for each video. The mean intensity image was binarized to create a mask for the region within the video sequence corresponding to the fiber bundle. Additionally, the mean intensity mask was used to identify any consistently dim pixels (arising, for example, from damaged fiber cores or unresponsive camera pixels) and interpolate them using neighboring pixels. Finally, brightness equalization was performed using the mean intensity image to reduce boundary effects when overlaying images during mosaic construction.
Figure 2(B) is a flow diagram of the sequence of steps used in mosaicking the pre-processed video frames. The first two steps in constructing the mosaic were to extract a 2D feature descriptor and then register the translational shift between successive frames. For these steps, a Mathematica function called 'ImageCorrespondingPoints' was used to determine keypoints contained in successive images. The descriptor used was a combination of feature extractors available in Mathematica and included KAZE and SURF features [18,19]. Pixel locations of matched keypoints were then used to solve for a rigid translation (i.e., no rotation or non-rigid transformations) between keypoints and to compute the pixel shift between each pair of images. Next, the cumulative sum of the pixel shift coordinates was used to determine the total dimensions of the mosaic, and a blank image of the required dimensions was initialized. Finally, frames were inserted sequentially based on the registered shifts.
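The last three mosaicking steps (translation estimate, cumulative sum of shifts, and sequential insertion) can be sketched in a few lines. This is an illustrative Python sketch, not the Mathematica implementation used in this work. It assumes the translation is taken as the median displacement of matched keypoints (the text does not specify the solver), it ignores the fiber bundle mask, and the function names `estimate_shift` and `build_mosaic` are hypothetical.

```python
from itertools import accumulate
from statistics import median


def estimate_shift(points_prev, points_next):
    """Translation-only estimate from matched keypoints given as (x, y) pairs.

    Assumption: the shift is the median per-match displacement, rounded to
    whole pixels. A feature seen at p in the previous frame and at q in the
    next frame places the next frame at offset p - q.
    """
    dx = median(p[0] - q[0] for p, q in zip(points_prev, points_next))
    dy = median(p[1] - q[1] for p, q in zip(points_prev, points_next))
    return round(dx), round(dy)


def build_mosaic(frames, shifts):
    """Insert equally sized frames (2D lists) into a blank canvas.

    shifts[i] is the (dx, dy) offset of frames[i + 1] relative to frames[i].
    """
    height, width = len(frames[0]), len(frames[0][0])
    # Cumulative sums give the position of every frame relative to the first.
    xs = list(accumulate((dx for dx, _ in shifts), initial=0))
    ys = list(accumulate((dy for _, dy in shifts), initial=0))
    x0, y0 = min(xs), min(ys)
    # Canvas dimensions follow from the extent of the cumulative shifts.
    canvas_w = max(xs) - x0 + width
    canvas_h = max(ys) - y0 + height
    mosaic = [[0] * canvas_w for _ in range(canvas_h)]
    # Later frames overwrite earlier ones where they overlap.
    for frame, x, y in zip(frames, xs, ys):
        for r, row in enumerate(frame):
            mosaic[y - y0 + r][x - x0:x - x0 + width] = row
    return mosaic


# Two 2 x 2 frames; the second lies one pixel to the right of the first.
shift = estimate_shift([(5, 3), (7, 8), (9, 1)], [(4, 3), (6, 8), (8, 1)])
mosaic = build_mosaic([[[1, 1], [1, 1]], [[2, 2], [2, 2]]], [shift])
print(shift, mosaic)
```

In this toy case the recovered shift is (1, 0) and the canvas grows to 3 × 2 pixels, with the second frame overwriting the overlapping column.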

Real-time deployment software
Software to perform video acquisition and real time mosaicking was developed in NI LabVIEW 2020 (National Instruments, Austin, TX, USA) and run on a consumer-grade laptop (Dell 210-AROH, Round Rock, TX, USA). A custom graphical user interface provides a live preview of the camera feed, a button to trigger video mosaicking, and panels to visualize the resulting mosaic at two scales (16 × 16 mm and 2 × 2 mm). In order to preserve a high acquisition frame rate while simultaneously analyzing frames and writing them to permanent storage, a producer/consumer architecture was utilized. In order to support real-time registration and mosaic insertion, individual frames are down sampled to 0.04 MP (240 × 180 pixels) using bilinear interpolation; however full-resolution frames are enqueued on a separate loop and written to permanent storage for post-acquisition analysis.
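The producer/consumer pattern described above can be illustrated with a short sketch. The deployed software is written in LabVIEW; the Python version below is only an analogy using the standard library, and the names `run_pipeline`, `process`, and `store` are hypothetical. It assumes one acquisition loop feeding two independent queues, one for real-time analysis and one for writing to storage.

```python
import queue
import threading


def run_pipeline(frames, process, store, max_queue=256):
    """Feed every acquired frame to an analysis consumer and a storage consumer."""
    analysis_q = queue.Queue(maxsize=max_queue)
    storage_q = queue.Queue(maxsize=max_queue)
    sentinel = object()  # marks the end of acquisition

    def consumer(q, handler):
        while True:
            item = q.get()
            if item is sentinel:
                break
            handler(item)

    workers = [
        threading.Thread(target=consumer, args=(analysis_q, process)),
        threading.Thread(target=consumer, args=(storage_q, store)),
    ]
    for worker in workers:
        worker.start()
    # Producer: the acquisition loop only enqueues, so slow analysis or disk
    # writes do not stall it until a queue fills up.
    for frame in frames:
        analysis_q.put(frame)
        storage_q.put(frame)
    for q in (analysis_q, storage_q):
        q.put(sentinel)
    for worker in workers:
        worker.join()


# Integers stand in for camera frames in this toy example.
processed, saved = [], []
run_pipeline(range(5), processed.append, saved.append)
print(processed, saved)
```

Because each queue has a single consumer, frame order is preserved on both paths. In this sketch a full queue blocks the producer; a real acquisition loop might instead drop frames, which is a design choice the text does not specify.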
Prior to beginning video mosaicking, a blank 18 megapixel (4200 × 4200) image is initialized with the starting location for frame insertion at the center. The user draws a region of interest over a camera preview inside the fiber bundle to designate which pixels are used for building the mosaic. The real-time mosaicking procedure is similar to that described in Section 2.5, but is implemented using OpenCV with an AKAZE feature detector and called using a LabVIEW Python node. For computational efficiency, the real-time implementation omits the static pixel interpolation and brightness equalization steps. For each frame acquired, rigid registration is performed using the prior frame and the calculated shift returned to LabVIEW. In the case that real-time registration fails, then the shift is interpolated using the last shift which was successfully registered. Once imaging is finished and the frames are saved to permanent storage, a command line node is used to call a Mathematica executable. This executable takes in the file path of the saved images and implements the full workflow described in section 2.5 to generate a higher quality mosaic image. The executable reports progress via the command prompt and logs processing time required for reconstruction. Benchmarking of computational time required for high quality mosaic reconstruction was performed using videos acquired from in vivo imaging experiments, with average times reported from five runs. Code 1 contains the software implementation, which is publicly available on GitHub [20].
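The fallback behavior when real-time registration fails can be sketched as follows. This is a simplified illustration, assuming that the last successfully registered shift is reused unchanged for a failed frame; the `register` callback stands in for the OpenCV AKAZE-based registration and, like `track_position`, is a hypothetical name.

```python
def track_position(register, frames, start=(0, 0)):
    """Accumulate frame positions, reusing the last good shift on failure.

    register(prev, cur) must return a (dx, dy) tuple, or None when the two
    frames could not be registered.
    """
    if not frames:
        return []
    x, y = start
    last_shift = (0, 0)
    positions = [(x, y)]
    prev = frames[0]
    for cur in frames[1:]:
        shift = register(prev, cur)
        if shift is None:
            shift = last_shift  # assumed fallback: repeat the last good shift
        else:
            last_shift = shift
        x, y = x + shift[0], y + shift[1]
        positions.append((x, y))
        prev = cur
    return positions


def fake_register(prev, cur):
    """Stand-in registration that fails whenever the current frame is 'bad'."""
    return None if cur == "bad" else (2, 1)


positions = track_position(fake_register, ["f0", "f1", "bad", "f3"])
print(positions)
```

The probe is assumed to keep moving at its most recent registered velocity through the failed frame, so the insertion point continues along the same path rather than stalling.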

Quantitative metrics for assessing video mosaic quality
Errors in mosaic registration can occur when probe movement is too fast which can result in motion blurring and/or insufficient overlap between successive frames. Poor probe contact, tissue deformations, low signal, and oversaturation can also make microendoscopy video registration challenging. Therefore, three quantitative metrics were computed frame by frame during mosaic construction to assess frame registration accuracy, image sharpness, and probe speed during acquisition. These metrics were exported at each time point and plotted as a time series for mosaics presented.
Mosaic registration accuracy was assessed using root-mean-square error (RMSE) of the overlapping regions for each frame inserted into the mosaic. This metric was used to reject frames with high registration error. Two error thresholds for rejection were considered by computing the registration error between randomly shuffled frames in the video sequence and taking the 1 st and 10 th percentiles of that distribution as possible rejection thresholds. These thresholds and number of frames rejected are plotted on the registration error line plot for each mosaic. Statistical testing for differences in registration error between real-time and post-processed mosaicking methods was performed using a non-parametric signed rank test ('SignedRankTest' function) in Mathematica.
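The rejection thresholds can be illustrated with a short sketch. It is a simplification rather than the implementation used in this work: it compares whole frames of randomly drawn pairs instead of the registered overlap region, it uses a nearest-rank percentile, and the function names and default parameters are hypothetical.

```python
import math
import random


def rmse(a, b):
    """Root-mean-square error between two equally sized flat pixel sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def rejection_threshold(frames, percentile, n_pairs=1000, seed=0):
    """Low percentile of the RMSE distribution between randomly paired frames.

    Assumptions: whole frames are compared (the text uses overlapping regions
    after registration) and the percentile is taken by nearest rank.
    """
    rng = random.Random(seed)
    errors = sorted(rmse(*rng.sample(frames, 2)) for _ in range(n_pairs))
    index = min(len(errors) - 1, int(len(errors) * percentile / 100))
    return errors[index]


def is_rejected(registration_error, threshold):
    """A frame is rejected when its registration error exceeds the threshold."""
    return registration_error > threshold


# Toy frames: ten flat images with intensities 0.0, 0.1, ..., 0.9.
frames = [[i / 10.0] * 4 for i in range(10)]
t1 = rejection_threshold(frames, 1)
t10 = rejection_threshold(frames, 10)
print(t1, t10, is_rejected(0.5, t10))
```

The idea is that a correctly registered frame should match the mosaic much better than two unrelated frames match each other, so an error that reaches even the best few percent of the shuffled distribution indicates a failed registration. The 1st percentile threshold is the stricter of the two.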
In order to assess loss of contact or motion blurring during acquisition, image sharpness was also assessed. The image sharpness was computed as the standard deviation in image intensities after high-frequency filtering using a 5 × 5 pixel Laplacian of Gaussian kernel. Because oversaturated regions can heavily skew this sharpness metric, only non-saturated pixel regions were used in computing sharpness.
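A minimal sketch of this sharpness metric is given below. The kernel size (5 × 5) is taken from the text; the Gaussian width `sigma`, the zero-sum normalization, and the rule of skipping any window that touches a saturated pixel are assumptions made for illustration, and the function names are hypothetical.

```python
import math
from statistics import pstdev


def log_kernel(size=5, sigma=1.0):
    """Zero-sum Laplacian of Gaussian kernel (sigma is an assumed value)."""
    half = size // 2
    kernel = [
        [
            (x * x + y * y - 2 * sigma**2) / sigma**4
            * math.exp(-(x * x + y * y) / (2 * sigma**2))
            for x in range(-half, half + 1)
        ]
        for y in range(-half, half + 1)
    ]
    mean = sum(map(sum, kernel)) / size**2
    # Subtract the mean so that perfectly flat regions give a zero response.
    return [[value - mean for value in row] for row in kernel]


def sharpness(image, saturation=1.0, kernel=None):
    """Standard deviation of the LoG response over non-saturated windows.

    image: 2D list of intensities scaled to [0, 1].
    """
    kernel = kernel or log_kernel()
    half = len(kernel) // 2
    offsets = range(-half, half + 1)
    responses = []
    for r in range(half, len(image) - half):
        for c in range(half, len(image[0]) - half):
            window = [[image[r + i][c + j] for j in offsets] for i in offsets]
            if any(v >= saturation for row in window for v in row):
                continue  # assumed rule: skip windows touching saturated pixels
            responses.append(
                sum(k * v for k_row, w_row in zip(kernel, window)
                    for k, v in zip(k_row, w_row))
            )
    return pstdev(responses) if responses else 0.0


flat = [[0.5] * 9 for _ in range(9)]
edge = [[0.2] * 4 + [0.8] * 5 for _ in range(9)]
saturated = [[1.0] * 9 for _ in range(9)]
print(sharpness(flat), sharpness(edge), sharpness(saturated))
```

A featureless frame scores essentially zero, a frame containing a sharp edge scores higher, and a fully saturated frame contributes no pixels to the metric.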
Lastly, probe speed was quantified by converting the Euclidean distance of the pixel shift to millimeters, and then multiplying by the acquisition frequency to obtain velocity in millimeters per second. At time points where frames were rejected for high registration error, linear interpolation was used to determine probe speed. As frame-to-frame velocity can be quite variable, a moving average for probe speed using 50 frame intervals was also plotted.
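The speed calculation can be sketched as follows. The pixel scale `mm_per_pixel` is a hypothetical parameter (its value is not stated here), rejected frames are represented as `None`, and the moving average is assumed to be a trailing window; the text specifies only that 50 frame intervals were used.

```python
import math


def probe_speeds_mm_s(shifts_px, mm_per_pixel, frame_rate_fps):
    """Per-frame speed: Euclidean shift in mm multiplied by the frame rate."""
    return [
        None if shift is None
        else math.hypot(*shift) * mm_per_pixel * frame_rate_fps
        for shift in shifts_px
    ]


def interpolate_rejected(speeds):
    """Fill None entries by linear interpolation between valid neighbours."""
    valid = [i for i, v in enumerate(speeds) if v is not None]
    if not valid:
        raise ValueError("no valid speeds to interpolate from")
    filled = list(speeds)
    for i, value in enumerate(speeds):
        if value is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None or right is None:
            # At the ends of the series, hold the nearest valid value.
            filled[i] = speeds[right if left is None else left]
        else:
            t = (i - left) / (right - left)
            filled[i] = speeds[left] + t * (speeds[right] - speeds[left])
    return filled


def moving_average(values, window=50):
    """Trailing moving average; shorter windows are used at the start."""
    averaged, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]
        averaged.append(running / min(i + 1, window))
    return averaged


# Hypothetical scale of 0.01 mm per pixel at 100 fps; the middle frame was rejected.
raw = probe_speeds_mm_s([(3, 4), None, (6, 8)], mm_per_pixel=0.01, frame_rate_fps=100)
speeds = interpolate_rejected(raw)
smoothed = moving_average(speeds, window=2)
print(speeds, smoothed)
```

A 5 pixel shift at this assumed scale and frame rate corresponds to 5 mm/s, and the rejected frame is assigned the midpoint of its neighbours before smoothing.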

Graphical visualizations
Mosaic images were exported from the mosaicking software in PNG format. Annotations of these files, including scale bars and magnified insets at various points in the mosaic, were created using Pixelmator Pro 1.6 (Pixelmator Team Ltd., Vilnius, Lithuania). Mosaic images were rotated, cropped, and exported using Pixelmator. No brightness or contrast adjustments were made. Time series plots for registration accuracy, image quality, and probe speed were generated in GraphPad Prism 8.4 (GraphPad Software, La Jolla, CA, USA).
Figure 3 shows the mosaic constructed from a video sequence obtained by freehand imaging of a curved target on fluorescent lens paper. The microendoscope probe was moved through a rectangular plastic guide fixed atop the lens paper. The target shape was cursive letters of the word 'Rice' with dimensions of 53.5 × 22 mm. The optical probe was moved in a single motion from the top left of the target by tracing out the letters. The video sequence used for mosaic construction contains 4,124 frames and was acquired in 20.62 seconds. The resulting mosaic image has dimensions of 53.6 × 22.0 mm and a filled in area of 132 mm². Average registration error for the video sequence was 0.027 (min-max: 0.006-0.171). One frame was above the 10th percentile rejection threshold and excluded from the mosaic. Average image sharpness for frames in the sequence was 0.034 (min-max: 0.013-0.044), but some motion blur was observed in frames with the lowest sharpness. Average registered probe speed during acquisition was 9.2 mm/s (min-max: 0-111 mm/s). After excluding frames in the bottom quartile for image sharpness (i.e., retaining frames with sharpness >0.031), the maximum probe speed registered was 95 mm/s.
Figure 4 shows the mosaic constructed from a video sequence obtained by freehand imaging of an 8 mm circular region of oral mucosa. Cell nuclei are visible as bright dots within the magnified image insets (Fig. 4(C)). The optical probe was moved in a spiral motion from the outside of the circle to the center (Visualization 3). The video sequence used for mosaic construction contains 1,114 frames and was acquired in 11.17 seconds (99.7 fps). The resulting mosaic image has dimensions of 8.1 × 8.1 mm and a filled in area of 21 mm². Average registration error for the video sequence was 0.028 (min-max: 0.016-0.099). Two frames were above the 1st percentile rejection threshold and excluded from the mosaic. Average image sharpness for frames in the sequence was 0.021 (min-max: 0.013-0.025). Comparing frames with the lowest and highest sharpness, both are in focus and free from motion artifacts. The primary difference is that the lower sharpness frames are from regions of the tissue with lower overall brightness. Average probe speed during acquisition was 2.8 mm/s (min-max: 0-15 mm/s).
Figure 5 demonstrates the workflow of the real-time software deployment and compares the quality of real-time vs post-processed mosaics. Panel A contains a diagram of the user interface used for real-time mosaic acquisition and visualization. The user interface contains three panels: one for the raw video feed of the camera sensor and two for the real-time mosaic visualization (a zoomed in view to examine fine details and quality of the mosaic and a zoomed out view to observe the overall structure and path of the video acquisition). Visualization 4 shows a screen capture of the software interface during acquisition with the corresponding widefield video embedded for reference. The widefield view shows the optical probe moving across a rectangular region of oral mucosa. Panel B contains the mosaicked image using the post-processed mosaic reconstruction routine. The resulting mosaic image has dimensions of 14.1 × 14.1 mm and a filled in area of 33 mm². Panel C contains magnified insets illustrating the qualitative differences between the real-time and post-processed mosaics. The real-time mosaic contains residual "dead leaves" artifacts from overlaying frames with uneven illumination. Average registration error was lower using the post-processed reconstruction approach (real-time vs post-processed: 0.043 vs 0.027, p<0.001). Averaged over five runs using this 1,668 image sequence, post-processed reconstruction required 93 milliseconds of additional processing time per frame (155 seconds total, 11 fps).

Discussion
In this manuscript, we present an improved widefield microendoscopy system capable of imaging larger surface areas at high speed, and make our methods publicly available to the research community. The improved mosaicking capability is achieved through incorporation of a high frame rate CMOS sensor into the optical design as well as an optical probe holder to support stable, handheld translation for in vivo imaging. During optical target and in vivo imaging experiments, we obtained frame rates between 90-200 fps over USB connection to a laptop. These frame rates supported accurate mosaic registration at speeds of at least 50 mm/s in optical targets and 15 mm/s in oral mucosa. Prior work has demonstrated freehand mosaics ranging from 2-6 mm² [14][15][16]21]. Our system achieves mosaics that are over 100 mm² in optical targets and 30 mm² in oral mucosa. Qualitatively, the mosaics presented accurately reproduce template geometries with minimal artifacts introduced from the mosaicking process. Quantitatively, we defined a rejection error threshold for successful mosaic registrations by comparing actual registration errors to those of a random distribution from the same video sequence. For video sequences ranging from 4 to 20 seconds long, we observed very low frame rejection rates (0.2-2% for in vivo imaging). The oral mucosa mosaics presented contain clearly visible nuclei structures with a total area that is 60 times larger than a single image FOV. This improved microendoscope system achieves high-speed acquisition with real-time mosaicking while maintaining the low cost and complexity of previous HRME systems. The resolution of the current system is limited by the 4 µm core spacing of the optical fiber used. While a higher resolution could be achieved through the use of a fiber with a smaller core spacing, this improved resolution would be accompanied by a reduction in the FOV of the system, necessitating longer times to scan equivalent areas of tissue. 
Numerous clinical studies involving HRME have demonstrated that the resolution of the fiber used in this work is sufficient for real-time diagnostic image analysis approaches using nuclei morphology assessment. Therefore, the potential diagnostic capability of the system is at least as good as previous HRME systems and may be better due to mitigation of motion blur and ability to survey larger areas of tissue.
The software implementation of our system is designed to combine the advantages of the real-time and post-processed mosaic generation approaches. The real-time visualization of mosaic generation allows users to see changes in nuclear morphology throughout the area being imaged and can help guide probe placement. Since the real-time implementation sacrifices image quality for speed, our software also saves full-resolution files to permanent storage so that the post-processing pipeline can then be used to generate higher quality mosaic images. The post-processing mosaicking software used in this work can process a 1000 frame mosaic (∼20 mm²) in approximately 90 seconds (90 milliseconds per frame).
Our mosaicking approach did not account for non-linear tissue deformations which occur when translating over a soft tissue surface [22,23]. The high frame rate of our system helped mitigate the impact of non-linear deformations on frame registration. Average overlap between frames in the video sequences analyzed ranged from 92% in optical targets to 96% in oral mucosa, and future developments could utilize this increased sampling to perform more sophisticated mosaic registration methods, including super-resolution approaches [24][25][26].
The system demonstrated in this work lacks optical sectioning and multi-spectral imaging capabilities, which presents an opportunity for future development. Several recent works have demonstrated confocal imaging using handheld microendoscopes, which can improve resolution of nuclei morphology and downstream diagnostic image analysis [15,21,27]. Other works have also achieved multi-spectral imaging through fiber microendoscopes at high frame rates [28][29][30]. It is worth noting, however, that these approaches may have additional trade-offs in the cost/complexity of the optical instrumentation as well as achievable translational speeds. Thrapp et al. recently demonstrated a high frame rate confocal microendoscope system with mosaicking capability at 120 fps [21]; however, the approach has only been demonstrated with ex vivo specimens and without real-time visualization. Combining the real-time software workflow demonstrated here with developments in easy-to-use, low-cost confocal systems [27] is an important area for future work.
Lastly, this work provides a proof-of-principle for effectively combining widefield and high-resolution imaging for multimodal diagnostic imaging through the use of a simple, monolithic probe holder design. The rigidity and length of the optical probe holder enable high-resolution contact-based imaging at a sufficient distance to collect simultaneous widefield video. The probe holder design made available can be readily printed and assembled using commercially available 3D printing systems and medical grade resins. The design is well suited for diagnostic imaging of oral and cervical tissues, two applications for which HRME has been extensively demonstrated. Due to size and rigidity constraints, it is not suitable for endoscopic imaging applications.
High-speed imaging with video mosaicking can significantly reduce the time required to survey tissues, with the potential to better guide biopsy selection, and strengthen correlation with subsequent tissue biopsies. Video mosaicking microendoscopy is well suited for integration with other new technological developments such as in vivo projection of biopsy guidance maps and deep-learning based image analysis algorithms [13,31]. Multimodal imaging approaches that incorporate these advances alongside video mosaicking microendoscopy would provide clinicians with powerful and versatile new diagnostic imaging tools to aid in the detection, surveillance, and treatment of dysplasia and cancer.