Automated instrument-tracking for 4D video-rate imaging of ophthalmic surgical maneuvers.

Intraoperative image-guidance provides enhanced feedback that facilitates surgical decision-making in a wide variety of medical fields and is especially useful when haptic feedback is limited. In these cases, automated instrument-tracking and localization are essential to guide surgical maneuvers and prevent damage to underlying tissue. However, instrument-tracking is challenging and often confounded by variations in the surgical environment, resulting in a trade-off between accuracy and speed. Ophthalmic microsurgery presents additional challenges due to the nonrigid relationship between instrument motion and instrument deformation inside the eye, image field distortion, image artifacts, and bulk motion due to patient movement and physiological tremor. We present an automated instrument-tracking method by leveraging multimodal imaging and deep-learning to dynamically detect surgical instrument positions and re-center imaging fields for 4D video-rate visualization of ophthalmic surgical maneuvers. We are able to achieve resolution-limited tracking accuracy at varying instrument orientations as well as at extreme instrument speeds and image defocus beyond typical use cases. As proof-of-concept, we perform automated instrument-tracking and 4D imaging of a mock surgical task. Here, we apply our methods for specific applications in ophthalmic microsurgery, but the proposed technologies are broadly applicable for intraoperative image-guidance with high speed and accuracy.


Introduction
Rapid development of imaging technologies has facilitated the use of intraoperative image-guidance in a broad range of medical fields including neurology [1][2][3][4], ophthalmology [5][6][7], hepatology [8][9][10], and oncology [11]. Image-guided surgery provides enhanced visualization of instrument-tissue interactions and is especially useful when haptic feedback is limited such as in ophthalmic microsurgery, minimally-invasive surgery, and robotic surgery [12]. In these cases, endoscopic or microscopic video feeds are used to relay information regarding instrument position back to the surgeon. However, intraoperative imaging tends to suffer from nonuniform illumination, noise, specular reflections, and variations in the surgical environment that ultimately limit the accuracy of instrument pose estimation and instrument-tracking [13,14]. Instrument fiducials can be used to facilitate registration and tracking, but these methods assume a rigid relationship between the instrument tip and physical space [15]. These factors are also often confounded by surgical dynamics, which result in changes in instrument shape and orientation as well as movement of the instrument out-of-focus during surgery.

Ophthalmic surgery
Image-guided ophthalmic microsurgery has been demonstrated using intraoperative optical coherence tomography (iOCT) and provides significant benefits compared to conventional microscopic surgery, which offers limited contrast and feedback for submillimeter-thick tissues and precludes visualization of subsurface features [16]. The development of iOCT technology through the use of microscope-mounted handheld probes [5,17] and microscope-integrated OCT systems [6,18] addresses these limitations by enabling high-resolution volumetric imaging of supine patients during surgery. Recent studies have shown that iOCT helps to visualize structural changes during surgery that ultimately guide intraoperative decision-making, result in modification of surgical management, and enable verification of surgical goals [19][20][21][22]. Furthermore, preliminary results show that patients undergoing iOCT-assisted macular hole surgery have a higher single-operation success rate and significantly improved visual acuity post-operation [23]. Despite the aforementioned benefits, the broad adoption of iOCT technology is hindered by slow imaging speeds and a lack of automated instrument-tracking, which prevents video-rate 4D visualization and requires manual adjustment of the OCT field-of-view (FOV) [24].

4D video-rate iOCT
The safety, utility, and efficacy of commercial iOCT systems such as the Rescan 700 (Carl Zeiss Meditec) and the Enfocus (Leica Microsystems) have been well-established for ophthalmic surgery [25,26]. However, these systems operate at line-rates between 10 kHz and 32 kHz, which limits current visualization to static cross-sectional images and prevents 4D imaging of surgical dynamics [27]. Recent developments in high-speed swept-source laser technology have enabled 4D video-rate OCT imaging. These research-grade systems are over an order of magnitude faster than current commercial iOCT systems and are capable of imaging at line-rates between 100 kHz and 1.67 MHz and at volume rates over 10 Hz [28][29][30][31]. 4D imaging of surgical maneuvers provides enhanced feedback of instrument-tissue interactions, which can be used to monitor tissue deformation and prevent damage to underlying ocular tissues. However, these systems suffer from an inherent trade-off between speed, sampling density, and FOV. Imaging at higher speeds reduces detector integration time per scan position, thus limiting system sensitivity [32]. On the other hand, it is possible to maintain high sensitivity by increasing sampling density at the cost of speed or FOV.

Automated instrument-tracking
Currently, iOCT imaging is performed over a static FOV, thus necessitating custom scan software to offset iOCT scan positions to regions-of-interest (ROIs) and requiring a trained technician to be able to accurately follow instrument positions intraoperatively [33]. Manual adjustment of the OCT FOV during surgery disrupts surgical workflow and has been shown to increase operating times by up to 25 minutes despite the presence of a trained technician [34]. Several automated instrument-tracking methods have been previously developed to segment axial position and estimate instrument pose from OCT B-scans and volumes [35][36][37][38]. However, these methods assume that the instrument is continuously localized within the OCT FOV and are thus unable to track lateral movements of the instrument across the surgical field. This issue is confounded by the fact that iOCT imaging is often limited to small ROIs in order to achieve 4D visualization. Our group has previously demonstrated a stereo-vision tracking system that decouples lateral instrument-tracking from axial OCT imaging, enabling tracking with high speed and accuracy across a larger FOV [39]. However, stereo-vision-based tracking methods [40][41][42] require fiducial markers to be placed on the back of instruments and therefore do not account for field distortion and instrument deformation inside the eye, which result in a nonlinear relationship between fiducials and tracked features. Furthermore, movement of the patient and of the eye dynamically changes the frame of reference during operation.
Conventional instrument-tracking methods for surgery involve the use of color features, instrument geometry, or gradient-based edge detection [43][44][45][46][47]. These methods perform well in controlled lab environments, but tend to suffer in vivo due to changes in appearance and lighting conditions, motion blur, and specular reflections. Recently, deep-learning algorithms have been shown to be more robust to variations in the surgical environment while also maintaining a high tracking accuracy on the order of several pixels of error and a high tracking speed of over 30 Hz [48][49][50].
Here, we present an automated instrument-tracking method for ophthalmic microsurgery that leverages multimodal imaging and recent advances in deep-learning-based object detection. We use a high-speed spectrally encoded coherence tomography and reflectometry (SECTR) system that combines en face spectrally encoded reflectometry (SER) and cross-sectional OCT imaging for automated instrument-tracking and video-rate 4D imaging [51,52]. A convolutional neural network (CNN) is trained to detect 25-gauge internal limiting membrane (25G ILM) forceps from SER images, and OCT scanning is automatically updated based on instrument position. We also present an updated method for adaptive-sampling by optimizing input scan waveforms to densely sample instrument-tissue interactions without sacrificing speed or FOV [53]. We show that we are able to achieve resolution-limited detection accuracy across the OCT imaging range, and as proof-of-concept, we demonstrate 4D tracking of surgical maneuvers in a phantom at 16 Hz volume rate.

Automated instrument-tracking framework
The framework for our proposed automated instrument-tracking method (Fig. 1) can be divided into three main processes: simultaneous acquisition of OCT and SER images, detection of instrument position from SER images, and scan waveform modification based on instrument position outputs to re-center the OCT FOV. Although these processes occur in series, the overall workflow must be highly parallelized in order to minimize latency between steps. In particular, automated instrument-tracking requires dynamic modification of SECTR scan signals at the high speeds necessary for video-rate 4D imaging.

SECTR acquisition
A custom-built SECTR engine was used to simultaneously generate en face SER and cross-sectional OCT images. OCT imaging is performed by raster scanning a galvanometer pair (X-Y) separated by a unity-magnification 4f optical relay to maintain telecentricity. SER uses a transmissive grating to spectrally-disperse broadband illumination, which is then optically relayed using a 4f relay across the OCT fast-axis scanning galvanometer (X). SER illumination on the sample consists of a focused line extended source in the spectrally-encoded axis that is aligned to the OCT slow-axis (Y). SER and OCT beam paths use a shared swept-source laser and optics to ensure collinearity and spatiotemporal co-registration, allowing for the acquisition of a single en face SER image (X-Y) and a single OCT B-scan (X-Z) for each sweep of the fast-axis (X) galvanometer mirror. As a result, we are able to use SECTR imaging to track en face features, such as surgical instrument position, from SER images and directly correlate those features with respect to the position of the OCT scan FOV.
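The raster geometry described above, a fast-axis sweep per line with the slow axis held constant per frame, can be sketched as galvanometer voltage waveforms. This is an illustrative sketch only, not the system's actual C++ waveform generation; the line counts and the voltage range are placeholder values.

```python
import numpy as np

def raster_waveforms(n_lines=500, n_frames=50, fov_v=1.0):
    """Galvanometer voltage waveforms for one raster-scanned volume:
    the fast axis (X) repeats a linear sweep once per frame, while the
    slow axis (Y) is constant within each frame and steps between frames."""
    x_line = np.linspace(-fov_v / 2, fov_v / 2, n_lines)    # one fast sweep
    x = np.tile(x_line, n_frames)                           # repeat per frame
    y_steps = np.linspace(-fov_v / 2, fov_v / 2, n_frames)  # one level per frame
    y = np.repeat(y_steps, n_lines)                         # hold during frame
    return x, y

x, y = raster_waveforms()
```

Because each fast-axis sweep yields both one SER image and one OCT B-scan, any en face feature found in the SER image maps directly onto a position along these same waveforms.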
Here, we modified our previous benchtop design [51] by moving engine components into a compact optical enclosure placed on a portable cart for clinical application. An optical buffer was used to double the laser sweep rate in order to achieve a 400 kHz line rate for rapid volumetric imaging. In addition, fiber connections were fusion-spliced with each other to minimize insertion loss and optimize the throughput of the system. A high-speed digitizer (ATS-9373, AlazarTech) was used to simultaneously acquire SER and OCT data at 2 gigasamples/second for real-time processing and display. Custom C++ software was used to synchronize scan waveform generation via a DAQ (USB-6351, National Instruments) with data acquisition.

CNN detection
Similar to intraoperative imaging variability, SER images acquired using SECTR tend to suffer from image artifacts that preclude the use of traditional instrument-tracking and segmentation methods (Supplement 1 Fig. 1).
Here, we leverage deep-learning-based instrument-tracking techniques by using a GPU-accelerated CNN [54] for detection of surgical instruments from SER images. In particular, the network implementation utilizes the OpenCV Deep Neural Network library and the NVIDIA GPU Computing Toolkit (CUDA) for rapid training and detection. The open-source network was trained using 4730 manually-labelled SER images of 25G ILM forceps (Supplement 1 Fig. 1, Visualization 1). SER images were acquired in paper phantoms (N = 4290), retinal phantoms (N = 254), and ex vivo bovine eyes (N = 186). In addition, data augmentation was performed by applying a horizontal flip and a 90° clockwise rotation to each image. The augmented data was then split between training (80%) and validation (20%) sets. CNN training was performed on a computer with an Intel i9-10900X CPU, an NVIDIA GeForce RTX 2080 Ti GPU, and 64 GB RAM. Training was run for 400,000 epochs over the course of 24 hours, resulting in a mean average precision value of > 99% based on a complete intersection over union (CIoU) loss function [55].
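The augmentation and 80/20 split described above might be sketched as follows. This is a hedged illustration, not the paper's OpenCV/CUDA pipeline: bounding-box labels (which would need the same geometric transforms) are omitted, and keeping the unaugmented originals alongside the two transforms is our assumption.

```python
import numpy as np

def augment(image):
    """Return the original image plus its horizontal flip and its
    90-degree clockwise rotation (the two augmentations described)."""
    return [image, np.fliplr(image), np.rot90(image, k=-1)]

def train_val_split(samples, train_frac=0.8, seed=0):
    """Shuffle augmented samples and split into training/validation sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    n_train = int(train_frac * len(samples))
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:]]
    return train, val

# Placeholder arrays standing in for labelled SER images.
images = [np.zeros((64, 64)) for _ in range(10)]
augmented = [a for img in images for a in augment(img)]
train, val = train_val_split(augmented)
```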
The CNN model and weights were integrated directly into the SECTR C++ acquisition software for real-time detection on SER images. Multithreaded programming was used to decouple SECTR acquisition from CNN detection by copying acquired SER images to a buffer. A parallel thread then passed the stored SER image through the trained network for high-speed detection at over 120 Hz.
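The decoupling of acquisition from detection via a frame buffer can be sketched as a single-slot producer/consumer queue in which stale frames are dropped so the detector always sees the newest SER image. This Python sketch stands in for the paper's multithreaded C++; `run_cnn` is a placeholder for the network forward pass.

```python
import queue
import threading

frame_buffer = queue.Queue(maxsize=1)   # holds only the newest SER frame

def acquire(frames):
    """Acquisition thread: copy each incoming frame into the buffer,
    discarding the previous one if detection has not yet consumed it."""
    for frame in frames:
        try:
            frame_buffer.put_nowait(frame)
        except queue.Full:
            try:
                frame_buffer.get_nowait()   # drop the stale frame
            except queue.Empty:
                pass                        # consumer took it first
            frame_buffer.put_nowait(frame)
    frame_buffer.put(None)                  # sentinel: acquisition finished

def detect(results, run_cnn):
    """Detection thread: run the network on frames as they arrive."""
    while True:
        frame = frame_buffer.get()
        if frame is None:
            break
        results.append(run_cnn(frame))

results = []
producer = threading.Thread(target=acquire, args=(range(100),))
consumer = threading.Thread(target=detect,
                            args=(results, lambda f: 2 * f))  # stand-in CNN
producer.start(); consumer.start()
producer.join(); consumer.join()
```

Dropping stale frames keeps latency bounded: detection always operates on the most recent image rather than a backlog.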

Scan waveform modification
Finally, scan waveforms must be updated on a volume-by-volume basis for smooth 4D visualization, which corresponds to update rates of at least 16-20 Hz. Typical DAQ devices used to generate scan waveform signals operate in regeneration mode, where scan waveforms are preallocated to an onboard waveform buffer (Fig. 2, red). As a result, scanning must be paused and restarted in order to update scan waveforms. Here, we take advantage of a nonregenerative operating regime that allows us to dynamically modify waveform values per scan volume. An additional thread is used to continuously monitor the size of the waveform buffer and automatically modify and write new values immediately when the buffer is empty (Fig. 2, blue). The CNN outputs bounding box coordinates (x, y, width, height) corresponding to the top left coordinate (x, y) of the detected surgical instrument for right-handed cases or the top right coordinate (x + width, y) for left-handed cases (Visualization 1). These values were sent directly to the waveform generation thread and used to update scanning outputs. Calibration between CNN position outputs and galvanometer scan waveform voltage offsets was accomplished by fitting the measured offsets used to center the OCT FOV around the instrument from different initial positions (Fig. 3(a)). Due to adaptive-sampling of the fast-axis (see the Adaptive-sampling section below), a higher-order fit was required to calibrate the nonlinear offset-to-position curve. On the other hand, the slow-axis was linearly-sampled and therefore required only a linear fit. The design of the SECTR system enabled independent calibration of the fast- and slow-axes since each scan mirror is conjugate to the other through a 4f imaging relay. Furthermore, in order to optimize imaging performance, galvanometer scanners (Saturn 5B, ScannerMax) were tuned using a previously-reported method to maximize scan speed and linear FOV for 4D imaging [56].
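The axis calibration described above, a higher-order fit for the adaptively sampled fast axis and a linear fit for the slow axis, might be sketched as below. The measurement arrays are synthetic placeholders (a cubic chosen purely for illustration), not the paper's calibration data.

```python
import numpy as np

# Hypothetical calibration data: detected instrument pixel position vs. the
# galvanometer voltage offset that re-centers the OCT FOV on the instrument.
pix = np.linspace(0, 500, 11)
# Fast axis: adaptive-sampling makes the offset-to-position curve nonlinear.
volt_fast = 1e-8 * (pix - 250) ** 3 + 4e-3 * (pix - 250)
# Slow axis: linearly sampled, so a linear relationship suffices.
volt_slow = 2e-3 * (pix - 250)

fast_fit = np.polyfit(pix, volt_fast, 3)   # higher-order fit (fast axis)
slow_fit = np.polyfit(pix, volt_slow, 1)   # linear fit (slow axis)

def offset_voltages(x_pix, y_pix):
    """Map a CNN bounding-box position (pixels) to X/Y voltage offsets."""
    return np.polyval(fast_fit, x_pix), np.polyval(slow_fit, y_pix)
```

Because the two scan mirrors are conjugate through the 4f relay, each axis can be fit independently, as in this sketch.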

Adaptive-sampling
Additionally, input scan waveforms were modified by using an adaptive-sampling protocol (Fig. 3(b)-3(d)), which allowed us to dynamically re-center the densely sampled region of each OCT volume on the automatically tracked instrument position to enhance visualization of instrument-tissue interactions. The original linear scan waveform prior to adaptive-sampling can be described mathematically as

X[n] = (FOV × V / N) · n, n = 0, 1, ..., N − 1,    (1)

where the linear X-axis waveform input is defined as a line with slope equal to the FOV (mm) scaled by a voltage scale factor V (V/mm) and divided by the number of lines N. The slow Y-axis scan waveform is constant per frame to generate a raster scan pattern. For adaptive-sampling, the slope of the linear X-axis waveform (Eq. 1) is scaled by v_dense = 0.25 to densely sample the center of the FOV between X_start and X_end by a factor of 4. In addition, the slope of the scan waveform is scaled by v_sparse = 1.50 (calculated based on v_dense so that the waveform still spans the full FOV) to sparsely sample the periphery of the frame. Scale factors were chosen to maximize sampling density at the center of the frame while maintaining a wide FOV for tracking as well as a high frame rate for 4D volumetric imaging. For visualization and CNN detection, acquired SER images were resampled (Fig. 3(c)) based on the fast-axis galvanometer position output measured via the galvanometer controller (MachDSP, ScannerMax).
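A minimal sketch of such an adaptive-sampling waveform: per-line voltage increments are scaled by v_dense in the central region and by v_sparse at the periphery, with v_sparse computed so the full FOV is still covered. With v_dense = 0.25 and a dense region spanning 40% of the lines, this works out to v_sparse = 1.50 as above; the dense fraction, line count, and voltage range are illustrative assumptions.

```python
import numpy as np

def adaptive_waveform(n_lines=500, fov_v=1.0, v_dense=0.25, dense_frac=0.4):
    """Fast-axis waveform with a densely sampled center: per-line voltage
    increments are scaled by v_dense in the central region and by v_sparse
    at the periphery, where v_sparse is chosen so the waveform still spans
    the full FOV in n_lines samples."""
    n_dense = int(round(dense_frac * n_lines))
    n_sparse = n_lines - n_dense
    v_sparse = (n_lines - v_dense * n_dense) / n_sparse  # 1.50 for defaults
    dv = fov_v / n_lines                                 # linear increment
    half = n_sparse // 2
    scale = np.concatenate([np.full(half, v_sparse),
                            np.full(n_dense, v_dense),
                            np.full(n_sparse - half, v_sparse)])
    return np.cumsum(dv * scale) - fov_v / 2             # center about 0 V

wave = adaptive_waveform()

# SER lines acquired with this waveform are warped along X; they can be
# resampled onto a uniform grid (for display and CNN input) by
# interpolating against the galvanometer position, as described above.
signal = np.sin(2 * np.pi * 3 * np.linspace(0, 1, wave.size))  # dummy line
uniform_x = np.linspace(wave[0], wave[-1], wave.size)
resampled = np.interp(uniform_x, wave, signal)
```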

Experimental setup
The SECTR engine was interfaced with an ophthalmic imaging probe placed in a microscope configuration with axial (Z) translation freedom. A pair of 25G ILM forceps were clamped and mounted to a two-axis (XY) motorized translation stage (MLS203, ThorLabs) with 0.1 µm position resolution for precise control of instrument position and speed.
In order to quantify the accuracy of the CNN for basic surgical maneuvers, a series of 3 experiments were performed by varying probe axial/depth (Z) position, instrument orientation, and instrument translation speed (Table 1). Instrument speeds were chosen to cover and extend beyond the range of typical ophthalmic surgical maneuver speeds between 0.1-0.5 mm/s [57][58][59]. For each set of experimental parameters, SER images of the forceps placed over a paper sample were acquired (Supplement 1 Fig. 2). For Experiment 1, the imaging probe was translated between 0 mm (in-focus) and 21 mm at increments of 3 mm, which corresponds to the measured Rayleigh range of the system (Supplement 1 Fig. 3), in order to simulate motion of the instrument out-of-focus (Fig. 4(a)). For Experiment 2, both instrument depth and translation speed were varied in order to simulate out-of-plane motion and motion blur. Lastly, for Experiment 3, instrument orientation and speed were varied to imitate dynamic movement of the instrument during operation (Fig. 4(b)). For each set of parameters, manual annotations of instrument position from tracked SER images (N = 200) were compared to corresponding CNN outputs to determine pixel error and tracking accuracy.

A one-sample t-test was used to compare each distribution of calculated errors to the resolution limit of the system. Comparing manual annotations of instrument position to CNN outputs at various axial positions, we are able to achieve resolution-limited tracking accuracy (p << 1E-7) between 0 mm and 9 mm, which is beyond the ∼7 mm full axial range of our OCT imaging (Fig. 5, top). Comparing tracking accuracy in this depth range with varying instrument translation speed, we maintain resolution-limited performance up to velocities of 10 mm/s (Fig. 5, middle). Similar performance can also be achieved when varying instrument orientation and speed (Fig. 5, bottom).
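The per-condition comparison against the resolution limit can be illustrated with a one-sample t statistic computed by hand. The error samples below are synthetic stand-ins for the N = 200 annotated frames, and the 3-pixel limit is a placeholder rather than the system's measured resolution.

```python
import numpy as np

def one_sample_t(errors, limit):
    """One-sample t statistic comparing the mean tracking error to a
    resolution limit; a strongly negative t favors sub-resolution error."""
    errors = np.asarray(errors, dtype=float)
    n = errors.size
    return (errors.mean() - limit) / (errors.std(ddof=1) / np.sqrt(n))

# Synthetic per-frame pixel errors standing in for N = 200 annotations.
rng = np.random.default_rng(1)
errors = rng.normal(1.0, 0.5, 200)        # ~1 pix mean error (assumed)
t = one_sample_t(errors, limit=3.0)       # placeholder 3-pix limit
```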
Despite the presence of error extrema and outliers, distribution means were statistically significantly lower than the resolution limit. These results suggest that the CNN is particularly robust to intraoperative variability, including movement of the instrument in and out-of-focus, high-speed maneuvers, and dynamic maneuvers involving rotation and translation of the surgical instrument. Beyond tracking accuracy, it is also important to continuously center the instrument within the imaging FOV in order to provide 4D visualization of instrument-tissue interactions. For each combination of experimental parameters, the standard deviation of CNN outputs was calculated to determine how precisely the instrument was re-centered despite changes in focus, speed, and orientation. Comparing CNN instrument position outputs, we are able to localize the instrument at the center of the image within the resolution limit for typical ophthalmic maneuver speeds below 1 mm/s regardless of instrument orientation or depth position (Fig. 6). However, a linear trend in deviation from the center of the frame is observable at higher speeds due to update rate limits. Here, we chose to update scan position waveforms per volume (16 Hz) instead of at the CNN output rate (120 Hz) in order to prevent discontinuities in volumetric rendering as well as to maintain anatomical accuracy within each volume. Nevertheless, a minimal deviation of ∼15 pix. can be seen even at translation speeds of 10 mm/s, which is beyond the speed of conventional ophthalmic surgical maneuvers.

Fig. 6. Instrument deviation from center for tracked SER frames using CNN position outputs for changes in depth and speed (left) and orientation and speed (right). Resolution-limited localization of the forceps at the center of the frame is shown for typical ophthalmic surgical maneuver speeds (shaded box).
Finally, we validated our CNN-based tracking method by performing automated instrument-tracking and 4D video-rate imaging of a mock surgical task. Free-hand maneuvers of 25G ILM forceps by an untrained volunteer were used to evaluate CNN tracking performance in the presence of physiological tremors. In particular, a metallic ring was moved between 4 phantom quadrants of varying height (0 mm, 1 mm, 2 mm, 3 mm). Additional features, such as holes and a scale bar (3 mm increments), were included in order to better visualize lateral and axial changes as the instrument moves (Supplement 1 Fig. 4). Volumetric imaging was performed using 2560 (Z) x 500 (X) x 50 (Y) pix. (pixels per line x lines per frame x frames per volume) for a frame rate of 800 Hz for en face SER and cross-sectional OCT B-scans and a volume rate of 16 Hz for 3D OCT. CNN detection of forceps from SER images was performed simultaneously at a rate of 120 Hz. SER images were acquired over a 25 mm (Y) x 7 mm (X) FOV and tracking was performed over a maximum FOV of 25 mm x 25 mm. Following the acquisition, standard OCT post-processing methods were used to generate images from spectral data and a 4D rendering of OCT data was produced using 3D Slicer [60]. Despite lateral movement across the entire sample, changes in depth/focus, opening and closing of the forceps tip, as well as the presence of specular reflections from the metallic ring, the forceps remain localized in the OCT FOV (Fig. 7, Visualization 2).

Discussion and summary
Ophthalmic microsurgery is conventionally performed under a surgical microscope, which precludes visualization of subsurface features and underlying instrument-tissue interactions. iOCT is an emerging technology that provides depth-resolved visualization of surgical maneuvers but is currently limited by slow scan speeds and a lack of automated instrument-tracking, which necessitates manual adjustment of the static imaging FOV. Here, we propose an automated instrument-tracking method that takes advantage of a high-speed SECTR system to simultaneously acquire en face SER and cross-sectional OCT images. Co-registration between the two imaging modalities allows for dynamic updates of OCT scan position based on instrument position detected from SER images. Using a GPU-accelerated CNN, we are able to detect surgical instrument positions at over 120 Hz with resolution-limited accuracy despite changes in focus, orientation, and speed. In addition, the proposed method has significant advantages over previously-reported OCT-based, color-based, and gradient-based detection methods by providing widefield lateral tracking in the presence of instrument deformation, soft-tissue movement, mechanical manipulation of the eye, and artifacts such as specular reflections. Furthermore, our method enables tracking for both anterior and posterior segment operations due to the network's insensitivity to distortions and aberrations induced by the optical system and the optics of the eye. We demonstrated the efficacy of our method by performing automated instrument-tracking and 4D video-rate imaging of a mock surgical task in a phantom.
Currently, the main limitation of our method is the update rate, which occurs once per volume in order to prevent intravolume updates that would degrade 4D visualization. As a result, instrument position deviation from the center of each image increases linearly with instrument velocity. However, ophthalmic microsurgery requires delicate manipulation of tissue, and instrument maneuver speeds are typically between 0.1-0.5 mm/s. Within this speed range, our method achieves high tracking accuracy and localization of the instrument. In addition, we show that our method is robust to extreme changes in speed (0.25-10 mm/s) and defocus (0-9 mm) that are beyond traditional use cases. We also account for changes in instrument orientation, which may occur during manipulation of tissue as well as from differences in surgeon preference (left-handed vs. right-handed). At higher speeds beyond ophthalmic surgical use cases, it is also possible to use extrapolation methods, such as Kalman filtering, to achieve better localization of the surgical instrument despite limited update rates.
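A constant-velocity Kalman filter of the kind mentioned above could extrapolate instrument position one volume ahead of the 16 Hz waveform updates. This is a sketch under stated assumptions, not part of the implemented system: the filter is run at the volume-update rate for simplicity (a real implementation could ingest the 120 Hz detections), and the noise covariances are illustrative tuning values.

```python
import numpy as np

dt = 1.0 / 16.0                         # one step per volume update (16 Hz)
F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity state transition
H = np.array([[1.0, 0.0]])              # we measure position only
Q = np.eye(2) * 1e-4                    # process noise (tuning assumption)
R = np.array([[1e-6]])                  # measurement noise (assumed small)

def kalman_step(x, P, z):
    """One predict/update cycle for state x = [position; velocity]."""
    x = F @ x                           # predict
    P = F @ P @ F.T + Q
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.zeros((2, 1)), np.eye(2)
for k in range(32):                     # instrument moving at 1 mm/s
    x, P = kalman_step(x, P, np.array([[k * dt]]))

predicted = (F @ x)[0, 0]               # extrapolate one volume ahead
```

Centering the FOV on the predicted rather than the last detected position would reduce the velocity-proportional lag observed at high instrument speeds.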
Furthermore, we believe the proposed method can be extended for the identification and tracking of multiple instruments which are often present simultaneously in the surgical field. The implemented CNN is capable of predicting instrument positions across multiple classes at once and can be trained to detect a variety of ophthalmic surgical tools including forceps, picks, loops, membrane scrapers, and light pipes. In addition, the existing 25G ILM forceps model can potentially be extended to facilitate training through transfer learning to generate a robust model for multiple instruments [61,62]. By using an open-source network, we also eliminate the need for extensive hyperparameter tuning and optimization that is typical for many CNN implementations and, thus, broaden the applicability of our automated instrument-tracking framework. Similarly, we can extend our adaptive-sampling protocol to be able to target individual surgical instruments as specified by the surgeon as well as to switch between multiple instruments by leveraging multi-class detection outputs.
In addition, our technology can be directly integrated into the surgical microscope and potentially used for 4D in vivo imaging and automated instrument-tracking [63,64]. We believe our method ultimately enables intraoperative guidance for ophthalmic microsurgery and will facilitate the adoption of iOCT technology. Furthermore, our method can benefit iOCT-guided surgery by lowering the learning curve for surgeons and allowing them to perform an operation normally in comparison to the use of commercial iOCT systems, which require manual tracking and alignment of the static OCT FOV to ROIs. SECTR-based tracking can also be extended beyond ophthalmic applications for use in fields such as dermatology, where imaging power and speed can be significantly increased for enhanced 4D tracking and visualization [65,66].
Funding. T32-EB021937. The content was solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Disclosures. The authors declare no conflicts of interest.
Data Availability. CNN model and accuracy quantification data are available upon request.

Supplemental document. See Supplement 1 for supporting content.