Placental vessel-guided hybrid framework for fetoscopic mosaicking

ABSTRACT Fetoscopic laser photocoagulation is used to treat twin-to-twin transfusion syndrome; however, this procedure is hindered because of difficulty in visualising the intraoperative surgical environment due to limited surgical field-of-view, unusual placenta position, limited manoeuvrability of the fetoscope and poor visibility due to fluid turbidity and occlusions. Fetoscopic video mosaicking can create an expanded field-of-view image of the fetoscopic intraoperative environment, which could support the surgeons in localising the vascular anastomoses during the fetoscopic procedure. However, classical handcrafted feature matching methods fail on in vivo fetoscopic videos. An existing state-of-the-art method on fetoscopic mosaicking relies on vessel presence and fails when vessels are not present in the view. We propose a vessel-guided hybrid fetoscopic mosaicking framework that mutually benefits from a placental vessel-based registration and a deep learning-based dense matching method to optimise the overall performance. A selection mechanism is implemented based on vessels’ appearance consistency and photometric error minimisation for choosing the best pairwise transformation. Using the extended fetoscopy placenta dataset, we experimentally show the robustness of the proposed framework, over the state-of-the-art methods, even in vessel-free, low-textured, or low illumination non-planar fetoscopic views.


Introduction
Twin-to-twin transfusion syndrome (TTTS) is a rare foetal anomaly that affects twins sharing a monochorionic placenta. It is caused by abnormal vascular anastomoses on the placenta, leading to uneven flow of blood between the two foetuses (Baschat et al. 2011). Fetoscopic Laser Photocoagulation (FLP) is used to treat TTTS; however, this procedure is hindered because of difficulty in visualising the intraoperative surgical environment due to limited surgical field-of-view (FoV), unusual placenta position, limited manoeuvrability of the fetoscope and poor visibility due to fluid turbidity and occlusions (as shown in Figure 1). This adds to the surgeon's cognitive load and may result in increased procedural time and missed treatment, leading to persistent TTTS. Fetoscopic video mosaicking can create a virtual expanded FoV image of the fetoscopic intraoperative environment, which may provide computer-assisted support in localising the vascular anastomoses during FLP procedures.
Classical video mosaicking methods (Reeff et al. 2006; Daga et al. 2016) that used handcrafted features (e.g. SIFT, SURF) perform poorly on in vivo fetoscopic videos due to low resolution, poor visibility, honeycomb or blur effects caused by the fibre-based fetoscope, floating particles, and the texture paucity or repetitive texture that is inherent to fetoscopy. Hence, classical computer vision methods for video mosaicking are not suitable for fetoscopic video mosaicking. Fusion of visual tracking with electromagnetic pose sensing has also been studied, but only in ex vivo experiments. A direct registration method for mosaicking has also been presented, but it was validated on only a single in vivo fetoscopic video. Recently, deep learning-based methods have also been reported (Bano et al. 2019, 2020b; Alabi et al. 2022; Casella et al. 2022) for fetoscopic video mosaicking. The approach of Bano et al. (2019, 2020b) restricted the model to estimating only a Euclidean transformation, thus bounding the drifting error that led to inaccuracies. A recent intensity-based image registration method (Bano et al. 2020a) relies on placental vessel segmentation maps for registration. This method helped overcome some visibility challenges, but it failed when the predicted segmentation map was inaccurate or inconsistent across frames, or in views with thin or no vessels. Recent computer vision literature has introduced deep learning-based interest point descriptors (Sarlin et al. 2020) and detector-free dense feature matching (Sun et al. 2021) techniques, showing robustness in multiview feature matching. Such techniques can be explored to improve the performance of fetoscopic mosaicking.
To overcome the limitations of the existing literature, we propose a vessel-guided hybrid framework for creating robust and reliable mosaics. The framework optimises performance by fusing the state-of-the-art in fetoscopic mosaicking and in computer vision, namely, vessel-based registration (Bano et al. 2020a) and a detector-free local feature matcher (Sun et al. 2021), respectively. Our framework introduces a selection mechanism based on appearance consistency of placental vessels and photometric error minimisation for choosing the best pairwise transformations. Through both qualitative and quantitative comparison performed using the extended fetoscopy placenta dataset (Bano et al. 2020a), we show the robustness of the proposed framework over the existing methods. Our key contribution lies in proposing a mosaicking framework for computer-assisted intervention applications which is robust even in the absence of vessels and in the presence of heavy floating particles, low illumination, non-planar views and a spotlight light source. The existing fetoscopic mosaicking methods do not show robustness to all these challenging conditions in a single framework. The proposed hybrid framework brings us closer to translating fetoscopic mosaicking into clinical settings, which in turn could help reduce the surgeon's cognitive load during fetoscopic procedures.

Method
The proposed framework consists of two parallel registration methods, namely, placental vessel-based direct registration (Bano et al. 2020a) and the detector-free dense matcher LoFTR (Sun et al. 2021), as shown in Figure 2. Each method matches two consecutive frames $I_t$ and $I_{t+1}$ and then registers them to estimate the pairwise transformations $H^V_{t,t+1}$ and $H^L_{t,t+1}$ from the vessel-based and the dense matcher methods, respectively. A vessel-guided transformation selection strategy is then proposed that also minimises the photometric errors, thus enabling robust mosaicking. The proposed framework allows generation of mosaics from long fetoscopic video sequences without drift accumulation.

Placental vessels registration
Placental vessel segmentation helps overcome visibility-related challenges such as moving occlusions (floating amniotic fluid particles) and specular view-dependent illumination that can result in inaccurate feature matches. We utilise the placental vessel segmentation and registration method from (Bano et al. 2020a) as this is the state-of-the-art method in in vivo fetoscopic mosaicking.
Figure 2. Pairwise transformations $H^V_{t,t+1}$ and $H^L_{t,t+1}$ are estimated between RGB frames ($I_t$, $I_{t+1}$) using vessel-based registration and LoFTR-based dense feature matches' registration, respectively. Photometric errors are computed between the RGB frame $I_t$ and the reprojected frame $\tilde{I}_t$, and between the vessel map $S_t$ and the reprojected vessel map $\tilde{S}_t$. The final transformation is then selected based on vessel segmentation consistency and minimum photometric errors. Finally, pairwise transformations are sequentially registered to form an expanded FoV image.
Given two consecutive frames $I_t$ and $I_{t+1}$, a U-Net (Ronneberger et al. 2015) with a ResNet50 (He et al. 2016) backbone, pretrained on the fetoscopy placenta dataset, is used to obtain the predicted vessel maps $S_t$ and $S_{t+1}$. Similar to (Bano et al. 2020a), the registration between $I_t$ and $I_{t+1}$ is approximated with an affine transformation, as the use of projective transformations in fetoscopy data has empirically been shown to lead to poor results. This is because the fetoscopic scene is only piece-wise planar. Intensity-based direct registration is applied using a pyramidal Lucas-Kanade framework that minimises the photometric error between $S_t$ and $S_{t+1}$ through Levenberg-Marquardt optimisation. A circular FoV mask of the fetoscopic image is used to perform registration while neglecting the black background regions. This gives the estimated affine transformation $H^V_{t,t+1}$ between $I_t$ and $I_{t+1}$.
Since the placental vessel-based registration method is driven by predicted vessel maps, it tends to fail when the predicted maps are inaccurate or inconsistent across frames or in views with thin or no vessels.

Detector-free dense feature matching for registration
Unlike classical feature matching methods that perform feature detection and description followed by their matching, the recently proposed LoFTR (Sun et al. 2021) method takes a hierarchical approach and first establishes pixel-wise dense matches at a coarse level and later refines the good matches at a fine level.
Given $I_t$ and $I_{t+1}$, a standard convolutional neural network architecture is used to extract dense features at coarse and fine levels from both frames. Coarse local features are fed into the LoFTR module, which uses a transformer with positional encoding and self- and cross-attention layers to transform coarse features into position- and context-dependent local feature descriptors. A confidence matrix is obtained by matching these descriptors using a differentiable matching layer. Matches in the confidence matrix that are higher than a predefined threshold and that satisfy the mutual nearest neighbour criterion are selected as coarse-level matches. Coarse-to-fine feature matches $M_f$ are then obtained by taking a local window from the fine-level features at each coarse match position and applying the LoFTR module to it. For more detail, refer to (Sun et al. 2021), in which it is shown that LoFTR produces high-quality matches even in regions with low texture, motion blur or repetitive patterns, making it a well-suited matching module for fetoscopic mosaicking.
For registration, a circular mask covering only the fetoscopic FoV is first used to obtain matches $M'_f$ only in the visible fetoscope region. Registration is then approximated as an affine transformation using the RANdom SAmple Consensus (RANSAC) method. The obtained transformation is refined by using only the inliers with Levenberg-Marquardt optimisation, which further reduces the transformation error. This gives the affine transformation estimate $H^L_{t,t+1}$ that defines the alignment between $I_t$ and $I_{t+1}$ through the LoFTR-based matching.
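The registration step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names (`inside_fov`, `fit_affine`, `ransac_affine`), the 3-pixel inlier threshold and the iteration count are all assumptions, and the final refinement uses linear least squares on the inliers as a simple stand-in for the Levenberg-Marquardt refinement. Source points are taken from $I_{t+1}$ and destination points from $I_t$, so the estimated matrix maps frame $t+1$ into frame $t$.

```python
import numpy as np

def inside_fov(points, centre, radius):
    """Boolean mask of the points lying inside a circular field-of-view."""
    return np.linalg.norm(points - np.asarray(centre), axis=1) <= radius

def fit_affine(src, dst):
    """Least-squares 2x3 affine A such that dst ~= A @ [x, y, 1]^T."""
    X = np.hstack([src, np.ones((len(src), 1))])   # (n, 3)
    sol, *_ = np.linalg.lstsq(X, dst, rcond=None)  # (3, 2)
    return sol.T                                   # (2, 3)

def ransac_affine(src, dst, thresh=3.0, iters=500, seed=0):
    """Estimate an affine transform from noisy matches (hypothetical defaults).

    Returns the 2x3 affine refitted on the largest inlier set and the
    boolean inlier mask.
    """
    rng = np.random.default_rng(seed)
    X = np.hstack([src, np.ones((len(src), 1))])
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        # Three correspondences are the minimal sample for an affine model.
        idx = rng.choice(len(src), size=3, replace=False)
        A = fit_affine(src[idx], dst[idx])
        residual = np.linalg.norm(X @ A.T - dst, axis=1)
        inliers = residual < thresh
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:
        raise ValueError("not enough inliers to fit an affine transform")
    # Refit on inliers only (least squares here, not Levenberg-Marquardt).
    return fit_affine(src[best], dst[best]), best
```

In use, matches would first be filtered so that both endpoints satisfy `inside_fov` (an assumed reading of the circular-mask step), and the surviving point pairs passed to `ransac_affine`.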
We note through empirical experimentation that LoFTR matching is affected by the light source intensity and the resulting view-dependent reflectance in the surgical scene that can result in drift error during sequence registration.

Vessel-guided transformation selection
Vessel-guided transformation selection aims at finding the best affine transformation from $H^V_{t,t+1}$ and $H^L_{t,t+1}$ based on the vessels' appearance consistency, the percentage of vessels with respect to the fetoscopic FoV and the minimum reprojection error. Let $\tilde{I}_t = H_{t,t+1} I_{t+1}$ be the reprojected frame obtained by warping $I_{t+1}$ using the estimated transformation $H_{t,t+1}$. The photometric error between $\tilde{I}_t$ and $I_t$ is computed over all $n$ pixels, where $n$ is the total number of pixels in a frame. Four photometric errors are computed using the input frames, segmentation maps and two estimated transformations. $E_V(\tilde{I}_t, I_t)$ measures the photometric error between $I_t$ and the reprojected $\tilde{I}_t$ obtained using $H^V_{t,t+1}$; $E_L(\tilde{I}_t, I_t)$, $E_V(\tilde{S}_t, S_t)$ and $E_L(\tilde{S}_t, S_t)$ are defined analogously for the LoFTR-based transformation and for the vessel maps (Figure 2).
A rule-based strategy is defined for transformation selection based on the qualitative observations made from the vessel-based and LoFTR matcher-based registration methods. Let $x_{t+1}$ denote the percentage of vessel class pixels with respect to the total number of pixels in the FoV mask in $S_{t+1}$, and let $y_{t+1}$ denote the percentage difference between vessel class pixels in $S_t$ and $S_{t+1}$. We empirically found that vessels can be treated as consistent when $x_{t+1} > 15\%$ and $y_{t+1} < 25\%$ of $x_{t+1}$. In pairs of frames where vessels are consistent across frames, the affine transformation estimate from the method that gives the lower of the two errors $E_V(\tilde{S}_t, S_t)$ and $E_L(\tilde{S}_t, S_t)$ is selected as the final transformation $H^F_{t,t+1}$. When the vessel consistency conditions are not satisfied, error measurements based on vessel maps become inaccurate. In this case, we select the transformation estimate of the method that reports the lower of the two errors $E_V(\tilde{I}_t, I_t)$ and $E_L(\tilde{I}_t, I_t)$.
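The selection rule can be written compactly as follows. This is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, $y_{t+1}$ is interpreted as the absolute difference between the vessel percentages of the two frames, "25% of $x_{t+1}$" is read as $0.25\,x_{t+1}$, and ties between the two errors are resolved in favour of the vessel-based estimate. The photometric errors are assumed to have been computed beforehand.

```python
import numpy as np

def vessel_percentage(seg_mask, fov_mask):
    """Percentage of vessel pixels among the pixels inside the FoV mask."""
    fov = np.asarray(fov_mask, dtype=bool)
    vessels = np.asarray(seg_mask, dtype=bool) & fov
    return 100.0 * float(vessels.sum()) / float(fov.sum())

def vessels_consistent(x_t, x_t1, min_cover=15.0, max_rel_change=0.25):
    """Consistency test: x_{t+1} > 15% and y_{t+1} < 25% of x_{t+1}.

    y_{t+1} is assumed to be |x_{t+1} - x_t| (an interpretation of the text).
    """
    y_t1 = abs(x_t1 - x_t)
    return bool(x_t1 > min_cover and y_t1 < max_rel_change * x_t1)

def select_transform(H_V, H_L, err_seg, err_rgb, x_t, x_t1):
    """Pick the pairwise transformation with the lower relevant error.

    err_seg = (E_V, E_L) measured on vessel maps,
    err_rgb = (E_V, E_L) measured on RGB frames.
    """
    # Vessel-map errors are trusted only when vessels are consistent.
    e_v, e_l = err_seg if vessels_consistent(x_t, x_t1) else err_rgb
    return H_V if e_v <= e_l else H_L  # tie goes to vessel-based (assumption)
```

Keeping the consistency test separate from the error comparison makes the two thresholds easy to retune for a different segmentation model.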

Sequential registration
Once the pairwise transformations are obtained, the next step is to compute the relative transformations with respect to a reference frame for mosaic generation. The relative transformation of $I_l$ with respect to a reference frame $I_k$ is computed by applying left-hand matrix multiplication, where $l > k$ and $l$ is the length of the fetoscopy sequence. This gives an expanded FoV image of the placental surface. To create seam-free mosaics, blending is applied using the Enblend software.
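As an illustration of the chaining step, the sketch below accumulates pairwise affine transformations into transformations relative to the reference frame. It assumes the convention used for the reprojected frame, namely that each pairwise matrix maps coordinates of frame $t+1$ into frame $t$; under that assumption each new pairwise matrix is multiplied on its left by the transform accumulated so far. The function names are hypothetical, and the multiplication order would reverse if the opposite mapping convention were used.

```python
import numpy as np

def to_homogeneous(A):
    """Embed a 2x3 affine matrix into a 3x3 homogeneous matrix."""
    H = np.eye(3)
    H[:2, :] = A
    return H

def chain_to_reference(pairwise):
    """Accumulate pairwise transforms relative to the first (reference) frame.

    pairwise[i] is assumed to map frame k+i+1 into frame k+i.
    Returns a list rel where rel[j] maps frame k+j into reference frame k;
    rel[0] is the identity.
    """
    rel = [np.eye(3)]
    for A in pairwise:
        # The accumulated transform is the left-hand operand of the product.
        rel.append(rel[-1] @ to_homogeneous(A))
    return rel
```

Each `rel[j]` can then be used to warp frame $k+j$ onto the mosaic canvas before blending.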

Dataset and experimental setup
For experimental analysis, we use an extended version of the publicly available fetoscopy placenta dataset that was introduced in (Bano et al. 2020a). The dataset contains six in vivo fetoscopic video sequences from six different FLP procedures. In the extended version, each video sequence contains an additional 100 frames, which introduced frames having either weak or no vessels. The number of frames in each video is reported in Figure 3. We note that there are large inter- and intra-case variabilities in the fetoscopic videos. These videos are of varying visual quality, with low resolution, poor visibility due to floating amniotic fluid particles, artefacts due to the spotlight source, texture sparsity and non-planar views (Video 3 and 5) due to anterior placenta imaging.
Since ground-truth transformations are not available for in vivo fetoscopy, we use the quantitative metric, referred to as N-frame SSIM, proposed by (Bano et al. 2020a). The N-frame SSIM quantifies the accumulated drift error in N frames by computing the structural similarity index measure (SSIM) between the current frame and the warped frame at a distance of N frames. N can take values from 2 to 5 frames. For comparison, we report the 1 to 5-frame distance SSIM for the state-of-the-art vessel-based (Bano et al. 2020a), LoFTR (Sun et al. 2021) matcher-based and the proposed hybrid methods in Figure 4. Additionally, we present the qualitative results in Figure 3. These results are discussed in detail in Sec. 4.
The segmentation model is implemented in PyTorch and trained on the vessel segmentation dataset using the same hyperparameters reported in (Bano et al. 2020a) on a single Tesla V100-DGXS-32GB GPU of an NVIDIA DGX Station. For the LoFTR matcher, we use the pretrained model (trained on the ScanNet dataset (Dai et al. 2017)) to obtain the fine-level matches between two consecutive frames. We observe that matches returned by this network on fetoscopy data are already robust and repeatable. Retraining or fine-tuning of this network on fetoscopy data was not possible because of the lack of ground-truth data.

Results and discussion
Qualitative and quantitative comparisons are presented in Figure 3 (refer to the supplementary video) and Figure 4, respectively, using the 6 in vivo fetoscopy videos from the extended fetoscopy placenta dataset. Note that classical feature matching based (Reeff et al. 2006;Daga et al. 2016) and RGB intensity based (Bano et al. 2020a) techniques do not work on fetoscopic videos (as mentioned in Sec. 1). Hence, comparison is mainly performed with the existing state-of-the-art method (Bano et al. 2020a) of fetoscopic mosaicking.
From Figure 3, we observe that the vessel-based method performed best in Video 1 and 2 due to the strong vessel appearance in these videos. The LoFTR matcher-based method also performed well, but introduced some registration errors (marked with red circles in Figure 3). Our proposed hybrid method converged towards selecting the transformations from the vessel-based method because of the consistent appearance and dominant presence of vessels throughout these videos. Video 3 to Video 6 show more challenging scenarios where the vessels are either very thin (Video 3 and 6) or thick (close-up view in Video 4) or are not present in some frames. Moreover, Video 3 and 5 show an anterior placenta, making these views highly non-planar. This negatively influenced the vessel-based method, resulting in increased drifting errors and tracking failures in Video 3, 4, 5 and 6 at the 110th, 105th, 160th and 145th frames, respectively. The errors are mostly because of insufficient vessels present in the scene or false negatives in the predicted vessel maps where the segmentation network failed to properly segment thin vessels. On the other hand, the LoFTR matcher-based method provided stable mosaics, but resulted in some inconsistencies in registration (marked with red circles in Figure 3). Our hybrid approach selects the better of the two transformations for each frame pair, hence the registration errors are visibly reduced, giving reliable mosaics.
The quantitative results presented in Figure 4 show the 1 to 5-frame SSIM measurements for the three methods under comparison on the six in vivo video clips. We make similar observations here as from Figure 3, where the performance of all the methods is comparable in Video 1 and Video 2. Vessel-based registration failed for some frames in Video 3 to Video 6, hence we observe low SSIM values with increasing frame distance. This also shows that the drifting error is large in Video 3 to Video 6 for the vessel-based method. In these videos, we observe a considerably lower interquartile range and a higher median 5-frame SSIM for the LoFTR matcher-based and hybrid methods compared to the vessel-based method. The hybrid method performs considerably better than the LoFTR matcher-based method in Video 3, and is comparable to it in Video 4 to Video 6. This is because the majority of the pairwise transformations from the LoFTR-based method are better than those from the vessel-based method in these videos.
The experimental results show that the proposed vessel-guided hybrid framework selects the best pairwise transformations from the vessel-based and LoFTR matcher-based methods, overcoming the limitations of these methods. As a result, the proposed hybrid framework is robust even in the absence of vessels and in the presence of heavy floating particles, low illumination, non-planar views and a spotlight light source. Our method advances the literature on fetoscopic mosaicking and paves the way towards translating such a framework into clinical settings for assisting surgeons during fetoscopic procedures. Future work involves further reducing the registration error through loop closure and bundle adjustment (Li et al. 2021), designing a real-time application based on the hybrid approach and testing its usability through in-lab and clinical trials.

Conclusions
We propose a vessel-guided hybrid fetoscopic video mosaicking framework for generating a reliable virtual expanded field-of-view image of the intraoperative fetoscopic environment. The proposed framework benefits from both the placental vessel-based registration method (Bano et al. 2020a) and the detector-free feature matching with transformers (LoFTR) method (Sun et al. 2021), used as a robust matcher for registration, thereby overcoming the limitations of the individual methods. Using an extended version of the publicly available fetoscopy placenta dataset (Bano et al. 2020a), we experimentally showed that the proposed hybrid framework selects the best pairwise transformations from the two methods, hence showing a clear performance improvement over the existing state-of-the-art (Bano et al. 2020a) on fetoscopic mosaicking. The proposed framework is robust even in vessel-free, low-textured or low illumination non-planar views, which shows its potential for clinical translation to assist surgeons during the TTTS procedure.