Automated video-mosaicking approach for confocal microscopic imaging in vivo: an approach to address challenges in imaging living tissue and extend field of view

We describe a computer-vision-based mosaicking method for in vivo videos of reflectance confocal microscopy (RCM). RCM is a microscopic imaging technique that enables users to rapidly examine tissue in vivo. Providing cellular-level resolution of morphology, RCM imaging combined with mosaicking has been shown to be highly sensitive and specific for non-invasively guiding skin cancer diagnosis. However, current RCM mosaicking techniques with existing microscopes have been limited to two-dimensional sequences of individual still images, acquired in a highly controlled manner along a specific predefined raster path, and covering a limited area. The recent advent of smaller handheld microscopes is enabling acquisition of videos in a relatively uncontrolled manner, along an ad hoc, free-form, non-rastered path. Mosaicking of video images (video-mosaicking) is necessary to display large areas of tissue, and our video-mosaicking method addresses this need. The method can handle unique challenges encountered during video capture, such as motion-blur artifacts due to rapid motion of the microscope over the imaged area, warping in frames due to changes in contact angle, and varying resolution with depth. We present test examples of video-mosaics of melanoma and non-melanoma skin cancers to demonstrate potential clinical utility.

Figure S1. An example DoG representation: at each octave, the frame is convolved with Gaussian filters of increasing variance (called scales), and consecutive convolved images are subtracted from each other to create the DoG representation at that particular octave. For illustrative purposes we show three DoG levels at three octaves, different from what we actually used in practice.
The second stage of SIFT is to calculate a descriptor for each keypoint. We used the standard SIFT descriptor 1 . These SIFT descriptors are feature vectors that concatenate histograms of the gradients at different angles in a spatial region around the keypoint. See Fig. S2 for a detailed illustration of this process.
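To make the descriptor construction concrete, here is a minimal numpy sketch of a SIFT-style histogram-of-gradients descriptor for a single 16 × 16 patch around a keypoint. The function name is ours, and the sketch omits the Gaussian weighting, orientation normalization, and trilinear interpolation of the full SIFT descriptor; it is an illustration, not the paper's implementation:

```python
import numpy as np

def sift_like_descriptor(patch):
    """Build a 128-D histogram-of-gradients descriptor from a 16x16 patch.

    Simplified sketch: no Gaussian weighting, rotation normalization, or
    trilinear interpolation, which the full SIFT descriptor includes.
    """
    assert patch.shape == (16, 16)
    # Image gradients via finite differences.
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)  # gradient angle in [0, 2*pi)

    desc = []
    for i in range(0, 16, 4):            # 4x4 grid of subregions...
        for j in range(0, 16, 4):        # ...each subregion is 4x4 pixels
            m = mag[i:i + 4, j:j + 4].ravel()
            a = ang[i:i + 4, j:j + 4].ravel()
            # 8-bin orientation histogram, weighted by gradient magnitude.
            hist, _ = np.histogram(a, bins=8, range=(0, 2 * np.pi), weights=m)
            desc.append(hist)
    desc = np.concatenate(desc)          # 16 subregions x 8 bins = 128 values
    # Normalize to unit length for robustness to contrast changes.
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```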

Registration
Registration parameters between two consecutive frames can be calculated by analyzing the displacement of the keypoints in these images (optical flow). First, we matched the keypoints between two consecutive frames. The initial tentative matching is done by calculating a similarity metric between keypoint descriptors using the ℓ2 (Euclidean) distance. Each keypoint in each frame was matched with the 2 most similar keypoints in the following frame. Keypoints for which the ratio of the distance to the second-closest match to the distance to the closest match was larger than 1.5 were counted as strong matches. The rest of the pairs (and the respective keypoints) were eliminated from the list of potential matches for subsequent processing.

Figure S2. Histograms of gradients around each keypoint are used as descriptors. For each of the 16 subregions (each subregion is 4 × 4 pixels) around the keypoint, an 8-bin histogram of gradients is formed. These 16 histograms are concatenated to form a length-128 SIFT descriptor. For visual purposes, we only show a few of the keypoints.
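The ratio-test matching step can be sketched as below. The function name and the brute-force nearest-neighbour search are illustrative choices, not the actual implementation; the 1.5 ratio follows the text:

```python
import numpy as np

def match_keypoints(desc_a, desc_b, ratio=1.5):
    """Match descriptors from frame t (desc_a) to frame t+1 (desc_b).

    Keeps a match only when the second-closest neighbour is at least
    `ratio` times farther away than the closest one (the ratio test
    described in the text). Returns index pairs (i, j) into desc_a/desc_b.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # l2 distances
        order = np.argsort(dists)
        best, second = order[0], order[1]            # 2 most similar keypoints
        # Strong match: second-best is clearly worse than the best.
        if dists[second] > ratio * dists[best]:
            matches.append((i, int(best)))
    return matches
```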
Unfortunately, matching keypoints based only on the similarity between their descriptors is unreliable, as microscopic images often contain repeated textural patterns with similar appearance. For any microscope motion, all keypoints should move in tandem. Therefore, the relative displacement between the matched keypoint pairs must be similar, and thus should be well approximated by a single transformation matrix:

K_{t+1} = H_t K_t, (1)

where H_t is a 3-by-3 transformation matrix and K_t, K_{t+1} are triplets of homogeneous coordinates of matched keypoints between consecutive frames t and t+1. In general, for more than 3 matches, solutions to (1) will be inconsistent, as we have many more constraints than free parameters. To robustly find a representative transformation, we used a procedure called RANSAC 4 to find an H_t that describes the motion between all matched keypoints with small error. In a single iteration, RANSAC solves (1) for one randomly chosen triplet of matches, resulting in an estimate Ĥ_t^j, where j is the iteration number. The resulting Ĥ_t^j is then used to calculate the expected locations in the next frame of all remaining keypoints from the earlier frame. An estimation error E_j associated with the transformation matrix Ĥ_t^j is then calculated by summing the ℓ2 distances between the estimated keypoint locations (Ĥ_t^j K_t) and their actual locations (K_{t+1}) in the following frame.

By solving (1) multiple times on randomly selected triplets, RANSAC calculates a vector of error values E_j, whose ℓ2 norm is taken as a measure of the robustness of Ĥ_t^j. The Ĥ_t^j matrix that results in the smallest error over all iterations is chosen as the transformation matrix Ĥ_t. Testing many subsets is computationally costly, while testing a smaller number of subsets leads to poor generalizability and thus poor registration. We observed experimentally that for RCM images 3000 trials gave a good balance between computational complexity and generalizability.
Even with this choice of Ĥ_t, there are typically keypoint matches that are significantly inconsistent with the resulting transformation. We therefore made a second pass through the keypoint matches to discard those whose projected location in the second frame was not consistent with their actual location. Our experiments on RCM images suggested that discarding matches for which this location error was larger than 50 pixels gave a good trade-off between consistency and having too few matches. We will refer in the sequel to the remaining set of matches as inlier matches.
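The RANSAC loop and the second inlier pass can be sketched in numpy as follows. This is a simplified illustration assuming an affine model fit by least squares on each triplet; the trial count (3000) and the 50-pixel inlier threshold follow the text, but all function names are ours:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map (3x3, last row [0, 0, 1]) sending src -> dst.
    src, dst: (n, 2) arrays of matched keypoint coordinates."""
    n = len(src)
    A = np.hstack([src, np.ones((n, 1))])        # homogeneous coordinates
    # Solve A @ M ~= dst for the top two rows of the matrix.
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    H = np.eye(3)
    H[:2, :] = M.T
    return H

def ransac_affine(src, dst, n_trials=3000, inlier_thresh=50.0, seed=0):
    """RANSAC over random triplets of matches, as described in the text.

    Returns the transformation with the smallest total error and a boolean
    inlier mask (reprojection error below `inlier_thresh` pixels)."""
    rng = np.random.default_rng(seed)
    src_h = np.hstack([src, np.ones((len(src), 1))])
    best_H, best_err = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(src), size=3, replace=False)
        H = fit_affine(src[idx], dst[idx])
        proj = src_h @ H.T                        # expected locations in frame t+1
        err = np.linalg.norm(proj[:, :2] - dst, axis=1)
        total = np.linalg.norm(err)               # l2 norm of the error vector
        if total < best_err:
            best_H, best_err = H, total
    # Second pass: discard matches inconsistent with the chosen transformation.
    proj = src_h @ best_H.T
    inliers = np.linalg.norm(proj[:, :2] - dst, axis=1) < inlier_thresh
    return best_H, inliers
```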
After RANSAC, the final step in registration is the refinement of Ĥ_t to best fit all inlier matches. In addition, in our application we are only interested in registering consecutive frames that were collected during smooth motion of the microscope. In such cases the change in scale and warping (distortion) between the registered frames should be relatively small. Moreover, when the inlier matches are mostly concentrated in one portion of the frames, we want to prevent a transformation matrix that is overfit to that portion of the image, which can lead to distortion in the rest of the frame. These goals can be characterized by an optimization problem that aims to find the registration transformation H_t minimizing the total mismatch over all inlier matches while keeping the frame-to-frame warping and scale change small.
The total mismatch between the inlier matches can be calculated using the ℓ2 error metric described before. The distortion induced by the registration can be quantified by analyzing the coefficients of the affine transformation matrix H. More specifically, the upper-left 2-by-2 portion of the matrix controls the rotation and shearing in the registration, and the entries in the first two rows of the last column (H(1,3) and H(2,3)) control the translation. Finally, the last row (H(3,1), H(3,2), H(3,3)) controls the projective geometry, which represents how the matches deviate from a 2D geometry. In order to avoid transformation matrices that lead to too much distortion, we introduced an additional cost parameter that penalizes the scale change and shear in the final transformation matrix. This parameter serves as a regularization term that forces the final affine transformation matrix to favor translation and rotation between frames and limits shearing that may warp and distort the frame. Given the inlier matches K̃_t and K̃_{t+1} that we found through RANSAC, we solve the following optimization problem

min_H ||K̃_{t+1} − H_t K̃_t||_2 · (1 + pen), (2)

pen = (|1 − H_t(1,1)| + |1 − H_t(2,2)| + |H_t(1,2) + H_t(2,1)| + |H_t(3,1) + H_t(3,2)|) × 10^{−2}. (3)

The first two terms in the penalty limit scale changes, the third term limits shearing and warping, and the last term limits the projective warping (deviation from 2D). Here we optimize Eq. (2) over all inlier matches. As an aside, we noted that, as expected, the effect of the additional penalty parameter was minimal when the inlier matches were homogeneously spread across the frames.
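A sketch of the penalized objective follows. We assume the garbled weight in Eq. (3) is 10^−2 (the extraction appears to have dropped minus signs elsewhere in the equation), and the function names are illustrative, not the paper's code:

```python
import numpy as np

def warp_penalty(H):
    """Regularization penalty in the spirit of Eq. (3): small for near
    translation/rotation, large for scale change, shear, and projective
    warping. The 1e-2 weight is an assumption about the garbled factor."""
    return (abs(1 - H[0, 0]) + abs(1 - H[1, 1])
            + abs(H[0, 1] + H[1, 0])          # shear (cancels for rotation)
            + abs(H[2, 0] + H[2, 1])) * 1e-2  # projective terms

def penalized_mismatch(H, src, dst):
    """Objective of Eq. (2): total l2 mismatch over the inlier matches,
    inflated by (1 + pen)."""
    src_h = np.hstack([src, np.ones((len(src), 1))])
    proj = src_h @ H.T
    mismatch = np.linalg.norm(proj[:, :2] - dst)
    return mismatch * (1 + warp_penalty(H))
```

Note that for a pure rotation H(1,2) = −H(2,1), so the shear term vanishes, consistent with the text's goal of favoring translation and rotation.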

Scene Cut Identification
As described in Methods, our framework was designed to automatically detect sudden and unwanted motion of the microscope and then cut the longer sequence into internally consistent subsequences as needed. We developed a method that uses the coefficients of the transformation matrix H_t to determine where to cut. Since, as described in detail in the Frame Registration section of the manuscript, the diagonal coefficients of H_t encode how much warping and distortion is introduced in the final registration, we quantify warping and distortion as

d_t = |1 − H_t(1,1)| + |1 − H_t(2,2)| + |H_t(1,2) + H_t(2,1)|,

which is similar to the regularization term in Eq. (3). This factor quantitatively determines the degree to which the final transformation matrix imposes translation, rotation, and shearing. We place cuts between frames when this deformation factor is larger than 1. In addition, we calculate the ratio between the area covered by the earlier frame before and after the registration transformation. If this ratio is larger (smaller) than an experimentally determined threshold of 2.5 (1/2.5), then we also place a scene cut between those two frames.
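The two scene-cut tests can be combined into one check, sketched below. The exact form of the deformation factor is a reconstruction from context (the source equation did not survive extraction), and we take the area-change ratio to be the determinant of the affine part of H_t; the function name and both assumptions are ours:

```python
import numpy as np

def should_cut(H, area_ratio_thresh=2.5, deform_thresh=1.0):
    """Decide whether to place a scene cut between two registered frames.

    Test 1: a deformation factor built from the same coefficients as the
    registration penalty; cut when it exceeds 1.
    Test 2: the factor by which the transformation scales area; cut when
    it falls outside [1/2.5, 2.5]."""
    deform = (abs(1 - H[0, 0]) + abs(1 - H[1, 1])
              + abs(H[0, 1] + H[1, 0]))
    # |det| of the affine 2x2 block = factor by which frame area changes.
    area_ratio = abs(H[0, 0] * H[1, 1] - H[0, 1] * H[1, 0])
    return (deform > deform_thresh
            or area_ratio > area_ratio_thresh
            or area_ratio < 1.0 / area_ratio_thresh)
```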
Stitching

In the exposition here, we refer to the current state of the mosaic, composed from the previous frames in the video sequence, as M, and to an "incoming" frame that we wish to stitch as F. After registration we obtain a transformed version of F, which we denote F′, which overlaps with some pixels of M. Thus at each location of the overlap area we have two candidate pixels to choose from, one from M and the other from F′. The stitching step seeks a single continuous stitching boundary going through the overlap area, one side of which comes from M and the other side from F′, as shown in Fig. 6 in the manuscript. We model the problem of finding the optimal stitching border as a graph-cut-based labeling problem, where the label L(p) of each pixel p in the new video-mosaic is either M or F′. This pixel-labeling problem can be approached as a graph partitioning by representing the pixels of the two images as the nodes of the graph. In this graph topology, labeling two neighboring pixels with different labels means the stitching boundary cuts the edge between these two pixels. The total cost of choosing a certain stitching boundary is the sum of the costs of the individual edges that the boundary cuts through (Fig. 6).
In our implementation, the total cost is composed of two terms,

C = C_u + C_b,

where C_u is the "unary" cost and C_b is the "binary" cost. Since we want to ensure that the stitching boundary goes through the overlap area, we defined the unary cost as

C_u(p) = 0 if p ∈ {M ∩ F′}, and C_u(p) = W otherwise,

where {M ∩ F′} is the set of pixels in the overlap area of M and F′ and W is a very large cost value, which always dominates the binary cost.
In the overlap region, the unary cost is zero for all pixels, as noted above, because each pixel is equally likely to end up in the final mosaic; therefore, only the binary cost determines where the stitching border goes within the overlap area. The binary cost for cutting the connection between two neighboring pixels p and q is calculated in two terms as

C_b(p, q) = C_s(p, q) + C_h(p, q),

where C_s is the "similarity" cost and C_h is the "homogeneity" cost. The similarity cost is the absolute difference between the two candidate pixel intensity values from M and F′, weighted by a factor of 2 when q is a horizontal neighbor of p and 1 otherwise; it forces the stitching boundary to go through pixels that are similar in both M and F′. The homogeneity cost is the absolute intensity difference between candidate pixels and their spatial neighbors (Fig. 6, left column); it forces the stitching boundary to go through areas with small intensity variation. Once the cost is calculated for each pixel as a function of the labels, we minimize the overall cost with respect to the labels using Boykov's graph-cut method 5 . In the highly unlikely case that this solution is not unique, the algorithm randomly picks one of the candidate stitching borders. In this way, the pixels of the final mosaic are uniquely chosen from the pixels of either of the frames, as shown in Fig. 6.
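For intuition, the two cost terms can be computed on an overlap region as sketched below. As a simple stand-in for the graph-cut solver (which can find arbitrary boundaries), we also include a dynamic-programming vertical seam; the horizontal-neighbor weighting and Boykov's solver are not reproduced here, and all names are illustrative:

```python
import numpy as np

def seam_costs(M, F):
    """Per-pixel cost of routing the stitching border through the overlap.

    C_s: absolute difference between the two candidate pixels (the border
    prefers places where M and F already agree). C_h: local intensity
    variation within each image (the border prefers flat regions).
    M, F: equal-shaped float arrays covering the overlap region only."""
    c_s = np.abs(M - F)

    def local_var(img):
        # Mean absolute difference to right/down neighbours (edge-replicated).
        dx = np.abs(np.diff(img, axis=1, append=img[:, -1:]))
        dy = np.abs(np.diff(img, axis=0, append=img[-1:, :]))
        return dx + dy

    c_h = 0.5 * (local_var(M) + local_var(F))
    return c_s + c_h

def vertical_seam(cost):
    """Minimum-cost top-to-bottom seam through the cost map via dynamic
    programming. Unlike a true min-cut, it only finds single-valued
    vertical borders, so it is purely an illustration."""
    h, w = cost.shape
    acc = cost.copy()
    for i in range(1, h):
        left = np.r_[np.inf, acc[i - 1, :-1]]
        right = np.r_[acc[i - 1, 1:], np.inf]
        acc[i] += np.minimum(np.minimum(left, acc[i - 1]), right)
    # Backtrack from the cheapest bottom cell.
    seam = np.empty(h, dtype=int)
    seam[-1] = int(np.argmin(acc[-1]))
    for i in range(h - 2, -1, -1):
        j = seam[i + 1]
        lo, hi = max(0, j - 1), min(w, j + 2)
        seam[i] = lo + int(np.argmin(acc[i, lo:hi]))
    return seam
```

In the toy case where M and F′ agree along a single column of the overlap, the seam settles on that column, as the similarity cost there is zero.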