Automatic registration with continuous pose updates for marker-less surgical navigation in spine surgery

Established surgical navigation systems for pedicle screw placement have been proven to be accurate, but still reveal limitations in registration or surgical guidance. Registration of preoperative data to the intraoperative anatomy remains a time-consuming, error-prone task that includes exposure to harmful radiation. Surgical guidance through conventional displays has well-known drawbacks, as information cannot be presented in-situ and from the surgeon's perspective. Consequently, radiation-free and more automatic registration methods with subsequent surgeon-centric navigation feedback are desirable. In this work, we present an approach that automatically solves the registration problem for lumbar spinal fusion surgery in a radiation-free manner. A deep neural network was trained to segment the lumbar spine and simultaneously predict its orientation, yielding an initial pose for preoperative models, which is then refined for each vertebra individually and updated in real-time with GPU acceleration while handling surgeon occlusions. Intuitive surgical guidance is provided through integration into an augmented reality based navigation system. The registration method was verified on a public dataset with a mean of 96% successful registrations, a target registration error of 2.73 mm, a screw trajectory error of 1.79° and a screw entry point error of 2.43 mm. Additionally, the whole pipeline was validated in an ex-vivo surgery, yielding a 100% screw accuracy and a registration accuracy of 1.20 mm. Our results meet clinical demands and emphasize the potential of RGB-D data for fully automatic registration approaches in combination with augmented reality guidance.


Introduction
Complex orthopedic procedures, such as pedicle screw placement, can benefit from computer assistance with regard to safety and accuracy (Gelalis et al. [2012], Perdomo-Pantoja et al. [2019]). Nevertheless, computer-assisted orthopedic surgery (CAOS) only accounts for an estimated 5% of all orthopedic surgeries performed in North America, Europe and Asia (Joskowicz and Hazan [2016]). Many state-of-the-art navigation systems for pedicle screw placement comprise three main components: planning, registration and navigation. The latter two strongly contribute to the low clinical adoption (Härtl et al. [2013], Nadeau et al. [2015], Joskowicz and Hazan [2016]). Existing registration approaches tend to be time-consuming, cumbersome and often involve radiation. This hinders real-time application for surgical navigation, which itself suffers from limitations caused by conventional visualization techniques.
State-of-the-art CAOS systems commonly require intraoperative imaging or manual anatomy digitization by the surgeon for registration (Markelj et al. [2012]). Both are time-consuming processes. In addition, except for ultrasound, intraoperative imaging, e.g. fluoroscopy or cone-beam CT, always comes with bulky equipment and radiation exposure for the patient as well as OR personnel. Despite a multitude of techniques for 2D/3D registration (Sundar et al. [2006], Esfandiari et al. [2019], Miao et al. [2016]), which reduce radiation, registration failures remain a problem even when reference markers are used (Zhang et al. [2019]).
Besides the registration issue, most state-of-the-art navigation systems provide visualizations on 2D monitors in the OR periphery, which can cause attention shift and may increase the cognitive load for the surgeons, e.g. with respect to hand-eye coordination (Brendle et al. [2020], Qian et al. [2017], Léger et al. [2017]). Given the recent advent of augmented reality (AR) solutions and their potential in the realm of medicine, their application in intraoperative settings to provide surgical guidance should be considered in hopes of alleviating the aforementioned limitations (Eckert et al. [2019], Birlo et al. [2022]). The use of AR for spine surgery, and pedicle screw placement in particular, has been investigated thoroughly in the past few years, showing the benefits that the technology could bring to the field (Ma et al. [2017], Elmi-Terander et al. [2019], Gibby et al. [2019], Liebmann et al. [2019], Molina et al. [2019], Farshad et al. [2021a,b], Liu et al. [2021], Uddin et al. [2021], von Atzigen et al. [2022]). While there are mitigation strategies tailored to the visualization challenges of the existing computer-assisted surgery (CAS) solutions (e.g. Wolf et al. [2023]), the question about an ideal registration remains.
As a potential remedy for the aforementioned registration difficulties, one approach towards a more automatic and radiation-free registration is the intraoperative 3D reconstruction of the target anatomy using depth sensing hardware and the associated computer vision software. Ji et al. [2015] used two co-calibrated RGB cameras mounted to a surgical microscope for registration of preoperative spine CT images in a clinical trial. In their work, the authors pursued a semi-automatic segmentation approach to localize the anatomy of interest in the 2D RGB images based on manual surgeon annotations and region growing followed by a 3D reconstruction module. They reached a registration accuracy of 1.43 mm, but the manual annotations make real-time use in a surgical setting cumbersome. The path of stereo feature matching in open spine surgery was further demonstrated to be promising in recent work by Manni et al. [2020], who achieved a 3D triangulation error below 0.5 mm on grayscale images, as evaluated on data from 23 patients. Besides methods that have been reported and evaluated in academic settings, a commercial navigation system for surface-reconstructing, radiation-free spine surgery navigation is available that operates based on a structured light sensor integrated into OR lamps (7D Surgical Inc., Toronto, ON, Canada, Faraji-Dana et al. [2020]). The authors have reported the registration time to be less than 20 s, and re-registration (in case of perceived registration inaccuracies) to be even faster. However, the lamp-integrated hardware is not versatile. Furthermore, registration starts with manual point sampling and has to be performed for each vertebral level individually. Motion detection and compensation relies on markers clamped to the anatomy, which is the standard approach. While such techniques can diminish some of the concerns related to radiation exposure, time and cost of a common registration pipeline, they still require a residual registration process between the reconstructed anatomy and the patient, making them susceptible to issues such as sub-optimal manual input, poor initialization and small capture range.
More recent algorithms have looked into achieving a higher level of registration autonomy through the utilization of artificial intelligence (AI) concepts. In Félix et al. [2021], an RGB-D sensor was used for automatic registration of preoperative femoral and tibial 3D models to intraoperative cadaveric anatomy. The RGB images were segmented with a neural network allowing for the corresponding 3D segmentation of the reconstructed anatomy. The reconstructed 3D models were then automatically registered to the anatomy using a RANSAC-based method. Through the analysis of the results, the authors have reported that a considerable part of the error was attributed to the infrared-based RGB-D sensor, an observation that was affirmed in other studies (e.g., Gu et al. [2021a]). Hu et al. [2022] investigated a femur registration and tracking approach using point cloud data that finds a global alignment between a preoperative 3D reference model and intraoperative depth camera data based on RANSAC, followed by an iterative closest point (ICP) refinement. A PointNet-based network was proposed in this study to restore the surface of the unmodified bone captured by the depth camera before using it for registration, coping with intraoperative bone surface modification. The employment of the network reduced the registration RMSE from 2.40 mm to 2.07 mm, but the improvement did not reflect significantly on the pose error when compared to a ground truth tracking. An early prototype of a complete AR-based navigation approach using an optical see-through head-mounted display (HMD) for total shoulder arthroplasty evaluated on synthetic bone models was presented in Gu et al. [2021b]. The 3D model counterpart of the synthetic bone model is first aligned manually as a movable AR rendering, followed by an ICP refinement. Thereby, the intraoperative data was a point cloud, originating from a co-calibrated external RGB-D sensor. The point cloud was computed from a disparity map which was generated by a transformer-based network with the two RGB images of the depth sensor as input. After registration, an additional fiducial marker clamped to the anatomy is responsible for motion compensation. An average pin placement accuracy of 4.66° and 3.8 mm was achieved. These recent research contributions show that AI-based algorithms using surface data, potentially in combination with RGB, could advance registration approaches in CAOS. What stands out is that there remains a dependency on a coarse initial alignment as well as reference markers for motion detection and/or compensation.
The goal of this study was to tackle the aforementioned drawbacks of the current navigation systems by developing an efficient, radiation-free and real-time approach for automatic registration of intraoperative RGB-D data to the underlying patient with the potential downstream goal of providing a more accurate and faster pedicle screw placement alternative under AR guidance. Our method comprises a registration module for automatic piecewise registration of preoperative lumbar spine 3D models to intraoperative RGB-D data with pose updates during surgical interaction as well as a navigation module for AR-guided pedicle screw placement. The registration module was developed and evaluated on the public SpineDepth dataset (Liebmann et al. [2021]) of pose-annotated cadaveric surgery RGB-D recordings. Data collected from simulated pedicle screw placement interventions was used to evaluate the registration success. Finally, the full pipeline (registration + navigation) was validated in an ex-vivo setup, where a surgeon placed ten pedicle screws in a cadaveric lumbar spine under AR guidance.

Material and methods
The proposed hardware setup consists of the following components: an RGB-D sensor, a GPU-enabled workstation and an HMD. In our case, a ZED Mini (Stereolabs Inc., San Francisco, CA, USA), an HP Z2 (HP Inc., Palo Alto, CA, USA) with an Nvidia GeForce RTX 2080 SUPER (Nvidia Corporation, Santa Clara, CA, USA) and a Microsoft HoloLens 2 (Microsoft Corporation, Redmond, WA, USA) were used (Fig. 1).
The main contribution of this work lies in the registration and navigation method. It can be subdivided into two modules: the registration module and the navigation module (Fig. 1). The RGB-D sensor observes the surgical site from the top and serves as input for both modules. The registration module (Section 2.1) is responsible for the automatic segmentation and registration of lumbar vertebrae L1-L5 and outputs five rigid 3D transformation estimations, one for each vertebra $V_i$, $i \in \{1, 2, \ldots, 5\}$, and incoming RGB-D frame $f$: $\hat{T}_{V_i}(f)$, with the hat denoting that the transformation is estimated. The navigation module (Section 2.2) is responsible for tracking a surgical drill sleeve (rigidly attached to the drill) and finding its 3D transformation in each frame: $T_D(f)$. For each frame, the navigation module streams the aforementioned six transformations ($\hat{T}_{V_i}(f)$, $i \in \{1, \ldots, 5\}$, and $T_D(f)$) to the HMD via a UDP connection. The second part of the navigation module consists of AR guidance for pedicle screw placement on the HMD, based on the received vertebra and drill sleeve poses. Finding the relative transformation ${}^{HMD}T_S$ between the coordinate frame of the RGB-D sensor $S$ and the HMD device is required upon re-positioning of the RGB-D sensor and is based on standard chessboard detection (Bradski [2000]).
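The per-frame streaming of six poses can be illustrated with a minimal Python sketch. The wire format used here (a little-endian frame index followed by six row-major 4 × 4 float32 matrices) and the names `pack_poses` and `unpack_poses` are assumptions made for illustration only; the datagram layout of the actual C++ server app and Unity client app is not specified in this text.

```python
import struct

# Hypothetical wire format (an assumption, not taken from the paper):
# uint32 frame index, then six row-major 4x4 float32 matrices
# (vertebrae L1-L5 followed by the drill sleeve), all little-endian.
POSE_COUNT = 6
PACKET = struct.Struct("<I" + "f" * (16 * POSE_COUNT))

IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def pack_poses(frame_index, poses):
    """Serialize one frame's six 4x4 poses into a single datagram payload."""
    if len(poses) != POSE_COUNT:
        raise ValueError("expected five vertebra poses and one drill sleeve pose")
    flat = [value for pose in poses for row in pose for value in row]
    return PACKET.pack(frame_index, *flat)


def unpack_poses(payload):
    """Inverse of pack_poses: returns (frame_index, list of six 4x4 matrices)."""
    values = PACKET.unpack(payload)
    frame_index, flat = values[0], values[1:]
    poses = [
        [list(flat[16 * p + 4 * r:16 * p + 4 * r + 4]) for r in range(4)]
        for p in range(POSE_COUNT)
    ]
    return frame_index, poses
```

Under this assumed layout, one frame occupies 388 bytes, so it fits into a single UDP datagram and the payload could be handed directly to `socket.sendto`. Because UDP does not guarantee ordering, carrying the frame index would allow a receiver to discard stale packets.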
The registration module and the first part of the navigation module were implemented as a real-time capable C++ application with an OpenGL (Woo et al. [1999]) window for live visualization and controlling purposes, referred to as the server app. Libraries and implementation details are provided throughout the following sections. The second part of the navigation module, the AR guidance for pedicle screw placement, was implemented in Unity (2019.4.39f1, Unity Technologies, San Francisco, CA, USA) and will hereafter be referred to as the client app.

Registration module
The registration module has two goals: first, given an unoccluded RGB-D frame, referred to as the initial frame, it estimates the 3D pose of each lumbar vertebra L1-L5 with respect to the sensor coordinate frame. Second, given subsequent frames of the same viewpoint with surgeon interaction, referred to as the interaction frames, it updates the poses of L1-L5, if visible. In our workflow, the surgeon positions the RGB-D sensor above the incision without any occlusion by personnel or instruments and initiates the process.
The registration method is illustrated in Fig. 2. It is divided into segmentation/pose initialization and pose refinement. The first stage relies on a deep neural network that combines the concepts of 2D U-Net (Ronneberger et al. [2015]) and regression-based orientation prediction (Mahendran et al. [2017]). For a given initial frame at inference time, the network outputs a 2D binary segmentation mask for the lumbar spine (segmentation path) and an estimate of the spine's rotation in the camera coordinate system (orientation path), represented in the form of a quaternion (R in Fig. 2). The segmentation output is used to mask the corresponding depth image, leading to a segmented point cloud of the lumbar spine. The preoperative 3D models are transformed with the predicted orientation as the rotation and the center of mass of the segmented point cloud as the translation (T in Fig. 2). In the second stage, an en-bloc registration of the combined preoperative 3D models is performed using ICP (Besl and McKay [1992]) registration (general alignment), followed by a piecewise ICP registration of each vertebra (piecewise refinement). Using the accurate pose determined from an initial frame, efficient motion compensation in subsequent interaction frames is achieved by iterative application of segmentation and piecewise refinement steps, indicated as dotted arrows in Fig. 2.
Figure 1: Setup overview. Solid lines denote a wire, dotted lines denote a wireless connection and dashed lines denote a transformation. The RGB-D sensor observes the surgical site from the top and serves as input for the registration and navigation module. The former outputs five poses, one for each vertebra $V_i$ and frame $f$: $\hat{T}_{V_1}(f), \ldots, \hat{T}_{V_5}(f)$. The latter tracks the surgical drill sleeve and outputs its pose $T_D$ for each frame: $T_D(f)$. All poses are streamed to the HMD wirelessly. The second part of the navigation module comprises AR guidance for pedicle screw placement, based on the received poses. ${}^{HMD}T_S$ denotes the transformation between the RGB-D sensor's coordinate frame and the one of the HMD. It is found using standard chessboard detection after the RGB-D sensor has been positioned.

Data preparation
The SpineDepth dataset that was created in our preceding publication (Liebmann et al. [2021]) provides pose-annotated RGB-D recordings of mockup spine surgeries performed on ten cadaveric specimens. The extent of anatomical exposure in Specimen 1 was significantly less than in the other nine. It was therefore excluded from this study. Furthermore, we excluded Specimen 10 as its anatomy differs markedly from the remaining eight, i.e. it is much smaller. When the dataset was created, in each surgery, ten pedicle screws were placed bilaterally into vertebrae L1-L5. The placement of each screw was divided into four surgical steps, each captured in a separate recording by two downward facing RGB-D sensors simultaneously. After each screw placement, the sensors were repositioned to capture the surgical site from a new perspective. This resulted in a total of 80 recordings within the SpineDepth dataset from 20 different viewpoints per specimen. Within the same dataset and for each specimen, preoperative 3D models of vertebrae L1-L5, referred to as PreOp models, are available that are spatially aligned to the actual position of the anatomy. In other words, the SpineDepth dataset includes the aforementioned transformation $\hat{T}_{V_i}$ between each vertebra and the RGB-D frame observing it. However, in the dataset it is the ground truth transformation and therefore referred to as $T_{V_i}$ (without the hat). Applying it to the PreOp 3D models transforms them to their ground truth location in the camera coordinate system of the respective RGB-D sensor.
Figure 2: For an initial frame, all arrows are executed once: the network outputs result in a segmented point cloud and an initial pose (R, T) for the preoperative 3D models, which are then registered to the segmented point cloud (general alignment), followed by an individual registration (piecewise refinement). For subsequent interaction frames, only the dotted arrows are executed, updating the poses of visible vertebrae individually.
In order to investigate the generalizability of our method to unseen anatomy, the data was prepared to support a leave-one-out cross-validation strategy with eight folds, one for each specimen. For each screw and surgical step three frames were selected, the first frame (initial frame) and two random frames (interaction frames), resulting in twelve frames per screw. The sensor viewpoints did not change within the four recordings. The initial frame never contained surgeon interaction, while the interaction frames had a chance thereof. The resulting 240 frames (10 screws × 12 frames × 2 RGB-D sensors) per specimen were used for training the network. They are referred to as training folds. For later testing of the entire registration module, only the first two surgical steps, i.e. recordings, for each screw were considered, as they represent the relevant steps for navigation, i.e. entry point preparation and trajectory drilling with a surgical awl. These 40 recordings of a specimen were used in their full lengths with the first frame as initial frame and all subsequent frames as interaction frames. They are referred to as testing folds.
The pose annotations provided by the SpineDepth dataset cannot be used directly for training our network, as the segmentation and orientation path of our network require a binary segmentation mask and a quaternion as ground truth, respectively. The former was generated as follows. A depth image of vertebrae L1-L5, transformed according to their ground truth poses, was rendered, using the method of Guney and Geiger [2015] and Geiger and Wang [2015], and subtracted from the corresponding sensor depth image for all non-zero pixels in Matlab (R2021a, MathWorks, Portola Valley, CA, USA). The resulting mask consisted of all pixels with an absolute difference below 10 mm. This threshold correctly handled clear occlusions, while maintaining contiguous mask regions, despite measurement noise. Further smoothing of those regions was achieved by applying a 2D convolution with a kernel size of 15 and uniform weights of 1/225, followed by a thresholding at 0.5, resulting in a binary mask again. The network orientation path only predicts the overall lumbar spine orientation (Fig. 2), and not the one of each vertebra. Therefore, the rotation of vertebra L3 was stored as a quaternion for each frame, representing the overall spine orientation.
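The mask generation described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Matlab implementation used in the study: depth images are nested lists in millimetres, a pixel counts as non-zero only if both the rendered and the sensor depth are positive, the convolution is zero-padded, and a smoothed value of exactly 0.5 is kept. The function names are hypothetical.

```python
def box_filter(img, k):
    """Mean filter with a k x k uniform kernel (weights 1/k^2), zero-padded."""
    h, w, r = len(img), len(img[0]), k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for yy in range(max(0, y - r), min(h, y + r + 1)):
                for xx in range(max(0, x - r), min(w, x + r + 1)):
                    total += img[yy][xx]
            out[y][x] = total / (k * k)
    return out


def spine_mask(rendered_mm, sensor_mm, tol_mm=10.0, k=15):
    """Binary ground truth mask from a rendered and a measured depth image."""
    # Step 1: keep pixels whose rendered and measured depth agree within tol_mm.
    raw = [
        [1.0 if r > 0 and s > 0 and abs(r - s) < tol_mm else 0.0
         for r, s in zip(row_r, row_s)]
        for row_r, row_s in zip(rendered_mm, sensor_mm)
    ]
    # Step 2: smooth with a uniform kernel and binarize again at 0.5.
    smooth = box_filter(raw, k)
    return [[1 if v >= 0.5 else 0 for v in row] for row in smooth]
```

With the defaults, the kernel has 15 × 15 = 225 entries, which matches the uniform weight of 1/225 stated above. The smoothing closes isolated holes caused by depth noise and trims thin protrusions at the mask border.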

Network architecture and training
The network architecture is indicated in Fig. 2. The main structure is inspired by U-Net, taking downsampled RGB images of size 144 × 256 (H × W) as input. It consists of four downsampling blocks, each of the form Conv-BN-ReLU-Conv-BN-ReLU-MaxPooling, with 64, 128, 256 and 512 filters of size 3 × 3. The bottleneck block is of the same form, with 512 filters, but without the MaxPooling layer. The upsampling blocks mirror the downsampling ones, but with a leading upconvolution layer instead of a MaxPooling layer at the end. Skip connections connect the down- and upsampling blocks. The output Conv layer has 1 filter and sigmoid activation, for which a dice loss $L_D$ is minimized. An additional branch is appended to the bottleneck where the 9 × 16 × 512 feature representation is used for regression-based orientation prediction. After flattening, two blocks of FC-BN-ReLU, with 512 and 64 units, follow. Another FC layer with 4 units then predicts the spine orientation. The quaternion codomain of [−1, 1] is accounted for with a tanh activation. For frame $i$, the geodesic loss $L_{G_i}$ is computed between the ground truth quaternion $q_{t_i}$ and its normalized predicted counterpart $q_{p_i}$, as given in Equation (1). Quaternions should have magnitude 1 in order to represent valid rotations. As suggested in Langlois et al. [2018], a penalization term $L_{N_i}$ helped the network predict quaternions of unit magnitude.
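The two orientation terms can be illustrated with a minimal sketch. It assumes the common quaternion form of the geodesic distance, 2·arccos(|⟨q_t, q_p⟩|), and a squared deviation of the raw prediction's norm from 1 as the penalty. Both forms are assumptions of this sketch and may differ from Equation (1) and from the exact penalty used in the study.

```python
import math


def normalize(q):
    """Scale a quaternion [w, x, y, z] to unit magnitude."""
    n = math.sqrt(sum(c * c for c in q))
    return [c / n for c in q]


def geodesic_angle(q_true, q_pred):
    """Rotation angle in radians between two quaternions.

    The absolute value of the dot product treats q and -q as the same rotation.
    """
    dot = abs(sum(a * b for a, b in zip(normalize(q_true), normalize(q_pred))))
    return 2.0 * math.acos(min(1.0, dot))


def norm_penalty(q_pred):
    """Assumed penalty: squared deviation of the raw prediction's norm from 1."""
    n = math.sqrt(sum(c * c for c in q_pred))
    return (1.0 - n) ** 2
```

Normalizing the prediction before computing the angle means that the geodesic term alone cannot constrain the magnitude of the network output, which is why a separate magnitude penalty is useful.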
Including $L_D$, the network's total loss for a batch of size $B$ combines the segmentation loss with the per-frame orientation terms $L_{G_i}$ and $L_{N_i}$.
Augmentations were employed to enrich the SpineDepth dataset, which includes a limited number of specimens and viewpoints. A 2D image rotation of $\alpha$ radians was applied to the input images and to the corresponding ground truth quaternion by multiplication with the quaternion $[\cos\frac{\alpha}{2}, 0, 0, \sin\frac{\alpha}{2}]$, which represents a rotation around the camera's z-axis. Each frame was augmented by rotations of 30, 90, 150, 210, 270 and 330 degrees, resulting in 1680 frames (240 + 6 × 240) per training fold. The 240 frames are: 10 screws × 12 frames × 2 RGB-D sensors (Section 2.1.1).
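The quaternion side of this augmentation can be sketched as follows. The sketch assumes the [w, x, y, z] component order, the Hamilton product, and left-multiplication of the z-rotation onto the ground truth quaternion; the multiplication order and the sign convention relative to the image rotation are not stated in the text and are assumptions here. The function names are hypothetical.

```python
import math


def quat_mul(a, b):
    """Hamilton product a * b for quaternions given as [w, x, y, z]."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return [
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ]


def z_rotation(alpha):
    """Quaternion [cos(a/2), 0, 0, sin(a/2)]: rotation by alpha about the z-axis."""
    return [math.cos(alpha / 2), 0.0, 0.0, math.sin(alpha / 2)]


AUGMENTATION_DEGREES = (30, 90, 150, 210, 270, 330)


def augment_orientation(q):
    """Ground truth quaternions for the six rotated copies of one frame."""
    return [quat_mul(z_rotation(math.radians(d)), q) for d in AUGMENTATION_DEGREES]
```

Each original frame thus yields six additional quaternion labels, consistent with the 240 + 6 × 240 = 1680 frames per training fold.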
The resulting network with ∼58M trainable parameters was implemented in Keras (2.7.0, Chollet et al. [2015]). For each specimen, it was trained from scratch on the remaining eight training folds for 30 epochs with a batch size of 32 using the Adam optimizer (Kingma and Ba [2014]). The learning rate was set to $10^{-4-\lfloor \mathrm{epoch}/10 \rfloor}$. Training took roughly 30 minutes on an NVIDIA Quadro P6000. Three trainings were performed per specimen. For each specimen, the best results in terms of $L_D$ are presented.
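The step-wise learning rate can be written as a one-line function. The sketch assumes that epochs are counted from zero; the function name is hypothetical.

```python
def learning_rate(epoch):
    """Learning rate 10^(-4 - floor(epoch / 10)) for a zero-based epoch index."""
    return 10.0 ** (-4 - epoch // 10)
```

Over 30 epochs this gives three plateaus: 1e-4 for epochs 0 to 9, 1e-5 for epochs 10 to 19 and 1e-6 for epochs 20 to 29.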

Registration and pose update
The registration for an initial frame consists of an initial pose, a general alignment and a piecewise refinement (Fig. 2). They are denoted as $\hat{T}_{init}$, $\hat{T}_{gen}$ and $\hat{T}_{ref_{V_i}}$. The following sections elaborate on each part, followed by an explanation of the pose update for the interaction frames. Note that only the points visible from an orthogonal posterior view were selected from the PreOp models (3-matic, Materialise NV, Leuven, Belgium) and used for registration and pose updates (Fig. 3).

Initial pose estimation
For an initial frame's RGB image, our network predicts a resized binary segmentation mask $M_p$ (1080 × 1920 pixels) and a normalized quaternion $q_p$. Using $M_p$, the corresponding full point cloud $PC_f$ is masked to produce a segmented point cloud $PC_s$. The initial pose $\hat{T}_{init}$ for the PreOp models is constructed from the center of mass of the largest connected component (Bradski [2000]) of $PC_s$ of size $N$ as the translation and the inferred $q_p$ as the rotation. The trained network was converted to Open Neural Network Exchange (ONNX, Foundation [2022]) format and integrated into the server app using TensorRT (8.0.3, NVIDIA [2021]). This way, the RGB-D sensor data, which is available on GPU memory, can be directly fed to the network. Furthermore, $PC_f$ is masked by $M_p$ on the GPU using CUDA (8.0.3.4, NVIDIA [2022]), resulting in $PC_s$, consisting of $N$ points.
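The construction of the initial pose from a quaternion and a center of mass can be sketched as follows. The sketch assumes the [w, x, y, z] component order and a 4 × 4 homogeneous matrix acting on column vectors, and it takes the points of the largest connected component as given (the component extraction itself is omitted). Function names are hypothetical.

```python
import math


def quat_to_matrix(q):
    """3x3 rotation matrix of a quaternion [w, x, y, z] (normalized first)."""
    n = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / n for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def centroid(points):
    """Center of mass of N equally weighted 3D points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(3)]


def initial_pose(segmented_points, q_pred):
    """4x4 pose with the predicted rotation and the centroid as translation."""
    rot = quat_to_matrix(q_pred)
    t = centroid(segmented_points)
    return [rot[r] + [t[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
```

Using the centroid as the translation implicitly assumes that the PreOp models are expressed relative to their own center, so that the models land inside the segmented point cloud before ICP starts.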

General alignment
The general alignment consists of point-to-point ICP between the combined PreOp models of L1-L5, transformed by $\hat{T}_{init}$, and $PC_s$. It is denoted as $T_{ICP_{combined}}$. The pose after general alignment, $\hat{T}_{gen}$, results from applying $T_{ICP_{combined}}$ on top of $\hat{T}_{init}$. The Point Cloud Library (1.12.0, Rusu and Cousins [2011]) implementation with a maximum correspondence distance of 5 mm and stopping criteria of 50 iterations or a transformation epsilon of $< 10^{-8}$ was used.

Piecewise refinement
[...] The Umeyama (Umeyama [1991], Guennebaud et al. [2010]) method was then used to find the optimal transformation, which was applied to the respective PreOp model. The two steps were repeated 50 times, or until the average distance between point pairs no longer decreased ($\epsilon = 10^{-8}$). The ICP result for vertebra $i$ is denoted as $T_{ICP_{piecewise_i}}$. The pose after piecewise refinement, $\hat{T}_{ref_{V_i}}$, results from applying $T_{ICP_{piecewise_i}}$ on top of $\hat{T}_{gen}$. As the same functionality is employed for real-time pose updates during interaction frames (see next paragraph), the nearest neighbor search is performed in parallel on the GPU using CUDA. The Eigen (Guennebaud et al. [2010]) implementation of the Umeyama method is computationally inexpensive, and was therefore performed serially for each PreOp model on the CPU.
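The chain of transformations up to this point can be summarized as below. This is a sketch under the assumption that poses are 4 × 4 homogeneous matrices acting on column vectors, so that each ICP result is left-multiplied onto the preceding pose; the composition order is inferred from the description and is not taken from the original equations.

```latex
\begin{align}
  \hat{T}_{gen} &= T_{ICP_{combined}} \, \hat{T}_{init} \\
  \hat{T}_{ref_{V_i}} &= T_{ICP_{piecewise_i}} \, \hat{T}_{gen},
  \qquad i \in \{1, \ldots, 5\}
\end{align}
```

Under this reading, all five vertebrae share $\hat{T}_{init}$ and $\hat{T}_{gen}$, and only the last factor is vertebra-specific.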

Pose update
After performing the registration based on the initial frame, the poses of PreOp models L1-L5 are updated individually once a new frame is available. After network inference on the new frame, the same technique as for the piecewise refinement is used. Given that the change in the vertebra poses is minimal for a single patient during an intervention, and in order to improve runtime performance, only a single iteration was performed, denoted as $\hat{T}_{update_i}(f)$ for vertebra $i$ in frame $f$. Furthermore, due to surgeon interactions, the anatomy may be partially obscured during interaction frames. Consequently, a PreOp model is only updated if its number of inliers is at least 90% of the number found after the last iteration of the piecewise refinement. The pose $\hat{T}_{V_i}(f)$ (Fig. 1) in any interaction frame $f$ is then obtained by applying the pose updates $\hat{T}_{update_i}$ accepted up to frame $f$ on top of $\hat{T}_{ref_{V_i}}$.
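The occlusion gate for pose updates can be sketched as follows. The sketch assumes that an inlier is a model point whose nearest neighbor in the segmented point cloud lies within a maximum correspondence distance; the 5 mm default is borrowed from the general alignment and is an assumption for the piecewise step. The brute-force search stands in for the GPU implementation, and the function names are hypothetical.

```python
import math


def count_inliers(model_points, cloud_points, max_dist_mm=5.0):
    """Number of model points with a cloud point within max_dist_mm."""
    if not cloud_points:
        return 0
    count = 0
    for m in model_points:
        nearest = min(math.dist(m, c) for c in cloud_points)
        if nearest <= max_dist_mm:
            count += 1
    return count


def should_update(inliers_now, inliers_after_refinement, ratio=0.9):
    """Accept a pose update only if enough of the vertebra is still visible."""
    return inliers_now >= ratio * inliers_after_refinement
```

A vertebra that is covered by the surgeon's hand or an instrument loses correspondences, fails the gate and simply keeps its last accepted pose until it becomes visible again.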

Navigation module
The navigation module comprises two parts: tracking of a surgical drill sleeve and AR guidance for pedicle screw placement on the HMD. Both parts are explained in the next sections.

Drill sleeve tracking
Tracking of the surgical drill sleeve was based on a custom-made, 3D printed component attached to the drill sleeve (Fig. 4). The component was equipped with three nonplanar, sterile markers (Clear Guide Medical, Baltimore, MD, USA) showing unique AprilTag (Olson [2011], Wang and Olson [2016]) patterns. The tracking was integrated into our server app and was performed on a separate thread, which is initiated upon app startup. Whenever a new frame was available from the RGB-D sensor, the undistorted left and right grayscale images were made available to the tracking thread. In both images, the markers were detected by the ArUco (Muñoz-Salinas et al. [2018], Garrido-Jurado et al. [2014,2016]) library. If at least two corresponding markers were found in both images, the respective 2D corner coordinates were used for triangulation (Bradski [2000]), yielding eight or twelve 3D corner coordinates. The actual pose was found by applying the Umeyama method between the ground truth corner coordinates, which are known by design, and their estimated counterparts resulting from the triangulation. A Kalman filter (Kalman [1960], Bradski [2000]) with a constant acceleration model was used for noise reduction on the final drill sleeve pose.
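The triangulation of a marker corner from its left and right image coordinates can be illustrated for the simplest case. The sketch assumes rectified images (corresponding points share the same image row), identical pinhole intrinsics for both cameras and a known baseline; the study relied on an OpenCV-based triangulation, which this closed-form version does not reproduce. All parameter names are hypothetical.

```python
def triangulate_rectified(u_left, v_left, u_right, fx, fy, cx, cy, baseline_mm):
    """3D point (mm, left camera frame) from a rectified stereo correspondence."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("non-positive disparity: point at or behind infinity")
    z = fx * baseline_mm / disparity
    x = (u_left - cx) * z / fx
    y = (v_left - cy) * z / fy
    return (x, y, z)


def corner_count(marker_count):
    """Each square marker contributes four corners."""
    return 4 * marker_count
```

With two or three markers visible in both images, this yields the eight or twelve 3D corner coordinates mentioned above, which are then passed to the rigid alignment step.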
Figure 4: Drill sleeve with three nonplanar, sterile markers for tracking.

AR guidance
The goal of the AR guidance for pedicle screw placement is the accurate and fast navigation of the screw entry point and trajectory. Upon startup of the client app, the HMD establishes a UDP connection to the server app and continuously receives the poses of PreOp models L1-L5 as well as the drill sleeve pose. The surgeon positions the RGB-D sensor such that a reasonable initial pose, which is visualized on a monitor in the periphery of the OR, is estimated by the server app. A standard chessboard (Bradski [2000]) is used to co-calibrate the coordinate frame of the RGB-D sensor and the one of the HMD. The surgeon can trigger the detection in the client app by speech command. As soon as the chessboard is removed, the surgeon initiates the registration on an initial frame, which is followed by pose updates in subsequent interaction frames (Section 2.1.3).
In the client app, the surgeon is provided with three different visualization components. The most important component is the virtual twin presented in the work of Wolf et al. [2023], who investigated different user interfaces for AR-guided pedicle screw placement. Their virtual twin approach, where PreOp models and navigation information are not directly overlaid onto the anatomy, but rendered in an axis-aligned fashion with only a translational offset from the anatomy (Fig. 5), was integrated into the client app. Wolf et al. [2023] showed that this approach allows for accurate screw placement, while both ease of use and cognitive load were well rated by surgeons. On the virtual twin, the current drill sleeve pose was visualized with respect to the preoperatively planned screw entry point and trajectory. In addition, the angular 3D deviation between the drill sleeve and the screw trajectory was shown. Besides the virtual twin, a direct overlay of the entry point in form of an aiming cross could be shown/hidden by the surgeon. Lastly, the PreOp models could also be visualized on the anatomy upon request. This was particularly useful to qualitatively check the overall registration accuracy. For a more detailed verification of the registration, the surgeon was asked to touch certain anatomical landmarks using the drill sleeve on the anatomy and confirm their correspondence on the virtual twin before starting the actual navigation of a screw. If the registration was unsatisfactory, the RGB-D sensor was repositioned and the process was repeated from the co-calibration on. After successful navigation of a level, the surgeon selects the next level in the client app menu.

Evaluation
The method was evaluated in two stages. First, the registration module was evaluated separately on the SpineDepth dataset, referred to as the verification. The entire prototype (registration and navigation module) was then evaluated in a cadaveric experiment on an unseen lumbar spine anatomy, where ten real pedicle screws were placed under AR guidance, referred to as ex-vivo validation. The two evaluation stages and the respective outcome measures are described in the following sections.

Verification
For verification, the eight trained networks (Section 2.1.2) were employed. For each specimen, each of the 40 recordings (Section 2.1.1) in the respective testing fold was evaluated using the server app, with the first frame as the initial frame and all subsequent frames as interaction frames.
The recordings in the SpineDepth dataset were made from a broad variety of viewpoints, some of which provide a strongly inclined lateral view of the anatomy, thus potentially influencing the registration quality. Preliminary analysis had shown that the 3D angle between the RGB-D sensor's forward axis and the coronal plane normal correlates positively with the target registration error (TRE) after applying our proposed method. The correlations (Pearson correlation coefficient: PCC) are reported for each specimen in the testing fold (viewpoint-error correlation, VpErCo). Only recordings with a 3D angle below 30° were considered. The number of recordings fulfilling this criterion is also part of the results (acceptable viewpoints, AcVp). Note that for generalization purposes, these viewpoints were not excluded from network training.
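The viewpoint criterion can be sketched as a simple angle test. The sketch assumes that the sensor's forward axis and the coronal plane normal are given as 3D vectors in the same coordinate frame and point in the same general direction; the function names are hypothetical.

```python
import math


def angle_deg(a, b):
    """Angle in degrees between two 3D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


def acceptable_viewpoint(forward_axis, coronal_normal, limit_deg=30.0):
    """True if the viewing direction deviates less than limit_deg from the normal."""
    return angle_deg(forward_axis, coronal_normal) < limit_deg
```

Counting the recordings of a testing fold for which `acceptable_viewpoint` holds would correspond to the AcVp value reported per specimen.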
A single threshold for the accuracy of pedicle screw placement with respect to the optimally planned screw could not be defined, as the required accuracy generally depends on anatomical and surgical factors such as the anatomical morphology and pathology of the patient, the underlying bone quality, or the surgical approach used. Furthermore, with an automated evaluation based on the given dataset, tapping landmarks on the anatomy to confirm the registration accuracy, as done during AR guidance (Section 2.2.2), is not possible. Therefore, a successful registration was measured by the established clinical criteria according to Modi et al. [2008], who define a screw perforation of less than 2 mm as safe. To this end, optimal pedicle screws (diameter: 5 mm) were planned bilaterally using an in-house developed preoperative planning software (CASPA, University Hospital Balgrist, Zurich, Switzerland). For the assessment of pedicle perforation, a 3D model of the pedicle was extracted from the PreOp model and imported into MATLAB. The screws were represented as cylinders (diameter: 5 mm). For each frame of a recording, the screws were transformed according to the corresponding vertebra pose found by our method, while the pedicle 3D model was transformed according to the respective ground truth pose. For all points of the pedicle 3D model, it was verified whether they are located inside the cylinder representing the pedicle screw. If no points were inside the cylinder, there was no perforation. Otherwise, the perforation was quantified as the maximum distance from all points inside the cylinder to the cylinder surface.
Note that the registration success was defined on a per-frame basis and for the target screw only (the screw that the surgeon works on in the respective recording). For example, if the surgeon prepares the entry point of L2 left in the recording, the registration for a frame was considered successful when the previously described perforation assessment using the estimated pose for L2 (T_V2) revealed that the perforation would have been below 2 mm. The success rate of a single recording was defined as the number of successful frames divided by the total number of frames. The success rate of an entire specimen equals the average success rate over all recordings in the testing fold. In the same way, the average 3D angular deviation between the optimal and estimated screw trajectory (trajectory error: TrEr) as well as the average 3D distance between the optimal and estimated screw entry point (entry point error: EpEr) are reported, except that for an entire fold the median over all 40 recordings is reported. As the TRE considers the registration for an entire vertebra, it can only be computed on a per-vertebra level. The average TRE for exemplary vertebra L2 in a recording of F frames with K = 3 landmarks (L_1: spinous process, L_2 and L_3: left and right transverse processes) and d(p_1, p_2) as the 3D Euclidean distance between two points p_1 and p_2 is defined in Equation (7). Again, the median over all 40 recordings is reported.
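The perforation assessment and the per-recording success rate described above can be sketched in Python as follows. This is an illustrative reimplementation, not the MATLAB code used in the study. It assumes that the screw cylinder (entry point, tip, radius) and the pedicle points have already been transformed into a common coordinate frame, and that the distance to the cylinder surface is measured radially to the lateral surface only; all names are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def perforation_mm(pedicle_points, entry, tip, radius):
    """Maximum radial depth of pedicle points lying inside the screw cylinder.

    Returns 0.0 when no point is inside the cylinder (no perforation).
    Assumption: depth is measured to the lateral cylinder surface only.
    """
    axis = _sub(tip, entry)
    length = math.sqrt(_dot(axis, axis))
    unit = tuple(c / length for c in axis)
    worst = 0.0
    for point in pedicle_points:
        rel = _sub(point, entry)
        along = _dot(rel, unit)  # position along the screw axis
        if along < 0.0 or along > length:
            continue  # beyond the cylinder end caps
        radial = math.sqrt(max(0.0, _dot(rel, rel) - along * along))
        if radial < radius:
            worst = max(worst, radius - radial)
    return worst


def success_rate(perforations_mm, limit_mm=2.0):
    """Fraction of frames whose perforation stays below the safe limit."""
    successful = sum(1 for p in perforations_mm if p < limit_mm)
    return successful / len(perforations_mm)
```

For a 5 mm screw, `radius` would be 2.5; `success_rate` takes one perforation value per frame of a recording.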
Preliminary analysis showed that the alignment of each PreOp model can improve during the first few interaction frames, due to the slightly varying 3D reconstructions provided by the RGB-D sensor. Therefore, the TRE is reported as of frame 61 (∼2 s).
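As an illustration of the TRE definition given in the text (average landmark distance over frames and landmarks, evaluated from frame 61 onward), a minimal Python sketch is shown below. It follows the textual description only and is not a transcription of the referenced equation. It assumes that the landmark positions under the ground truth pose and under the estimated pose are already available per frame; the function name and the 1-based frame counting are assumptions.

```python
import math


def mean_tre(gt_frames, est_frames, first_frame=61):
    """Average 3D landmark distance over all frames from `first_frame` onward.

    gt_frames / est_frames: one entry per frame, each a list of K landmark
    positions (x, y, z) under the ground truth and estimated pose.
    Frames are counted from 1 (assumption of this sketch).
    """
    total, count = 0.0, 0
    for index, (gt, est) in enumerate(zip(gt_frames, est_frames), start=1):
        if index < first_frame:
            continue  # skip the initial frames in which the alignment still settles
        for p_gt, p_est in zip(gt, est):
            total += math.dist(p_gt, p_est)
            count += 1
    return total / count
```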
As an additional result, the percentage of updated poses (% update) for the vertebra of interest in interaction frames is reported, i.e. the number of actual pose updates according to our method (Section 2.1.3) divided by the number of possible frames.
Besides the outcome measures related to registration and pose updates, the performance of the eight trained networks is reported. Segmentation accuracy was evaluated with the Dice similarity coefficient (DSC). As in Tulsiani and Malik [2015] and Mahendran et al. [2017], the orientation prediction was evaluated with the median geodesic angle error (MGAE), which equals the median loss defined in Equation (1) over an entire fold, expressed in degrees. Note that these outcome measures are based on the number of frames defined for the training folds, i.e. not full recordings but 240 frames per fold/specimen.
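The two network metrics can be sketched in Python as follows. The sketch assumes binary masks given as flat sequences and orientations given as 3×3 rotation matrices, and it uses the standard geodesic distance on SO(3), arccos((tr(R_pred^T R_true) - 1) / 2). Whether this matches the exact form of Equation (1) is an assumption, as are all function names.

```python
import math
from statistics import median


def dice(pred_mask, true_mask):
    """Dice similarity coefficient of two binary masks given as flat sequences."""
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


def geodesic_angle_deg(r_pred, r_true):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(r_pred^T r_true) equals the sum of element-wise products
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    cosine = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cosine))


def mgae_deg(rotation_pairs):
    """Median geodesic angle error over (predicted, true) rotation pairs."""
    return median(geodesic_angle_deg(p, t) for p, t in rotation_pairs)
```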

Ex-vivo validation
The goal of the ex-vivo validation was to place ten pedicle screws (L1-L5, left and right) under AR guidance using the herein presented method (registration and navigation module) on an unseen lumbar spine. A fresh frozen specimen was used. Ethical approval was obtained from the ethical committee of Canton Zurich (Basec-Nr. 2017-00874). The specimen was CT scanned using a NAEOTOM Alpha device (Siemens Healthineers, Erlangen, Germany) with a 0.8 mm slice thickness and a 0.41 × 0.41 mm in-plane resolution (x-y). 3D models of L1-L5 were extracted using the global thresholding, region growing and wrapping functionalities of the Mimics software (Materialise NV, Leuven, Belgium). Again, the points visible from an orthogonal posterior view were selected as described in Section 2.1.3 (Fig. 3). Optimal pedicle screws (diameter: 5 mm) were planned in CASPA. In preparation of the experiment, the specimen was thawed and dissected to remove soft tissues, e.g. paravertebral muscles, without damaging the interspinous ligament, the ligamentum flavum or the facet joint capsule. The specimen was fixated to a wooden board with surgical pins through spinal levels T6/7 and S1.
The network was trained in the same way as described in Section 2.1.2, but using the training folds of all eight specimens in the SpineDepth dataset. As there was no ground truth available for the unseen specimen, the number of epochs was reduced to ten to mitigate overfitting to the experimental setup, e.g. the spine orientation w.r.t. the table, of the SpineDepth dataset. Due to the fact that the preoperative CT was taken from the frozen specimen, the inter-vertebral deformation between the pre- and intraoperative states was higher than in the SpineDepth dataset, where the CT was conducted in a fully thawed state. Therefore, the piecewise refinement (Section 2.1.3) was performed for 50 iterations, without a stopping criterion, to overcome local minima due to the 2 mm inlier threshold.
During the experiment, the RGB-D sensor was placed above the surgical site (Fig. 6) and the server and client apps were started. After that, the workflow was as described in Section 2.2.2: the sensor viewpoint was adjusted such that the initial pose was reasonable, followed by the co-calibration of sensor and HMD. The registration and pose updates were then initiated. For each vertebra, the surgeon checked the registration accuracy and inserted the respective pedicle screw (right side) according to the AR guidance. For screw insertion on the left side, the specimen was rotated by 180° (the other side of the table was not a suitable position for the surgeon to stand), followed by a re-registration.
For the ex-vivo validation, TrEr, EpEr and TRE are reported. TrEr and EpEr were quantified following the same procedure as described in Liebmann et al. [2019]. A postoperative CT of the specimen was acquired with the same imaging device and protocol as for the preoperative scans. 3D models of the bone anatomy and the screws were extracted. In the CASPA software, the PreOp models along with the planned screw trajectories were registered to the postoperative bone anatomy using point-to-plane ICP (Rusinkiewicz and Levoy [2001]). In the same fashion, generic cylindrical 3D models were aligned to the postoperative screw 3D models. The cylinders' main axes were compared to the planned screw trajectories, yielding the 3D angular deviation TrEr. The 3D Euclidean distance EpEr was determined by comparing the planned entry points to the intersection point of the cylinders' main axes with the registered preoperative 3D model.
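The two screw-related measures can be sketched in Python as follows. This is not the CASPA procedure: the sketch assumes the registered model is available as a list of triangles, uses the Möller-Trumbore test to intersect the (infinite) cylinder axis with the surface, and takes the intersection nearest to the planned entry point as the achieved entry point. These choices and all names are assumptions for illustration.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def trajectory_error_deg(planned_dir, axis_dir):
    """Angle between planned trajectory and fitted cylinder axis (axis sign ignored)."""
    cosine = abs(_dot(planned_dir, axis_dir)) / math.sqrt(
        _dot(planned_dir, planned_dir) * _dot(axis_dir, axis_dir))
    return math.degrees(math.acos(min(1.0, cosine)))


def line_triangle_hit(origin, direction, triangle, eps=1e-9):
    """Moller-Trumbore test for an infinite line; returns the line parameter or None."""
    v0, v1, v2 = triangle
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(direction, e2)
    det = _dot(e1, p)
    if abs(det) < eps:
        return None  # line is parallel to the triangle plane
    s = _sub(origin, v0)
    u = _dot(s, p) / det
    if u < 0.0 or u > 1.0:
        return None
    q = _cross(s, e1)
    v = _dot(direction, q) / det
    if v < 0.0 or u + v > 1.0:
        return None
    return _dot(e2, q) / det


def entry_point_error_mm(planned_entry, axis_point, axis_dir, triangles):
    """Distance from the planned entry point to the nearest axis/surface intersection."""
    hits = []
    for triangle in triangles:
        t = line_triangle_hit(axis_point, axis_dir, triangle)
        if t is not None:
            hits.append(tuple(axis_point[i] + t * axis_dir[i] for i in range(3)))
    if not hits:
        return None  # the axis does not intersect the model
    return min(math.dist(planned_entry, hit) for hit in hits)
```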
In contrast to the SpineDepth verification, where the ground truth vertebral poses were available, the data collected in the ex-vivo validation experiment lacked the registration ground truth; therefore, the experiment was captured as an RGB-D recording and the TRE was quantified retrospectively in a static manner. Six push-pins were inserted into the spinal levels T12 and S1 (three each) before the experiment (Fig. 5). The 3D positions of the push-pin head centers were determined in the postoperative CT using the Mimics software as well as in the left and right RGB images of the RGB-D initial frame using blob detection and triangulation techniques (Bradski [2000]). The best fit in a least-squares sense between the two point sets was found in the CASPA software and allowed for transforming the preoperative 3D models into the coordinate frame of the RGB-D sensor. The TRE is based on the same three landmarks per vertebra as for the verification and is also reported for the 61st frame (∼2 s) after the initial frame.
Figure 6: Setup during the ex-vivo validation. After inserting the five pedicle screws on the right side, the specimen was rotated by 180°. After re-registration, the surgeon could insert the five screws on the left side.
Additional outcome measures are: the registration time, the time for pose updates, and the navigation time. The latter was defined according to Farshad et al. [2021a], as the time from picking up the drill sleeve until the drilling process was started.

Ablation study
To further understand the capabilities of the proposed registration method and the mechanisms leading to our results, an ablation study was conducted. For both verification stages, the server app was run three times with the following modifications (the names below are used to refer to each modification hereinafter):
• General: Registration only included general alignment, no piecewise refinement, no pose updates
• Refinement: Registration included general alignment and piecewise refinement, but no pose updates
• First-60: Registration included general alignment, piecewise refinement and pose updates during the first 60 interaction frames of a recording
For the ablation study, only the TRE is considered. Note that, for the ex-vivo validation, First-60 is equal to our primary results by definition and is therefore not reported.
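The ablation variants can be thought of as switches on the registration pipeline. The following Python sketch expresses them as a small configuration object. The class, field and function names are hypothetical and do not reflect the actual server app.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistrationConfig:
    """Hypothetical switches for the registration stages named in the text."""
    general_alignment: bool = True
    piecewise_refinement: bool = True
    # None: pose updates throughout the recording; 0: no pose updates
    max_update_frames: Optional[int] = None


ABLATIONS = {
    "General": RegistrationConfig(piecewise_refinement=False, max_update_frames=0),
    "Refinement": RegistrationConfig(max_update_frames=0),
    "First-60": RegistrationConfig(max_update_frames=60),
    "Primary": RegistrationConfig(),
}


def updates_enabled(config, interaction_frame):
    """Whether a pose update may run in the given interaction frame (counted from 1)."""
    if config.max_update_frames is None:
        return True
    return interaction_frame <= config.max_update_frames
```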

Results
Table 1 summarizes the results of the verification and the ex-vivo validation. For the verification, the PCC between viewpoint angle and TRE was 0.47 ± 0.23. The number of acceptable viewpoints was 28 ± 6, with an average of 96 ± 4% of registrations being successful. The median trajectory error was 1.79 ± 0.47° and the median entry point error was 2.43 ± 1.03 mm. The median TRE was 2.73 ± 1.13 mm. An average of 17 ± 5% of poses were updated during interaction frames. The mean DSC was 0.72 ± 0.04 and the MGAE was 15 ± 3°.
An exemplary case from the verification is illustrated and explained in Fig. 7. It shows segmentation and occlusion handling, initial pose, general alignment, piecewise refinement and pose updates as well as a comparison of ground truth models and screws to their counterparts estimated by the proposed registration module.
During the ex-vivo validation, three registrations were necessary. After placing the first screw (L1, right), the client app crashed unexpectedly. Therefore, a second registration became necessary. After placement of the remaining four screws on the right side, the specimen was rotated by 180°, followed by the third registration, such that the surgeon could operate on the left side. The mean trajectory error was 2.55 ± 1.67°, while the mean entry point error was 1.95 ± 1.07 mm. All screws were of grade 0, i.e. fully contained within the pedicle (Modi et al. [2008]). The mean TRE was 0.87 ± 0.31 mm for the first, 0.94 ± 0.19 mm for the second, and 1.80 ± 0.62 mm for the third registration, respectively. The mean navigation time per screw was 28 ± 11 s.
For the verification, the ablation study showed a median TRE of 2.66 ± 1.05 mm for First-60. This was slightly more accurate than Refinement (2.68 ± 1.14 mm), which in turn was slightly more accurate than our primary results (2.73 ± 1.13 mm). The least accurate was General with a median TRE of 2.89 ± 1.07 mm.
For the ex-vivo validation, our primary results were the most accurate with a TRE of 1.20 ± 0.22 mm, followed by Refinement (1.38 ± 0.45 mm) and General (2.53 ± 0.23 mm).

Discussion
Although CAOS can increase accuracy as well as safety in complex orthopedic procedures, such as pedicle screw placement (Gelalis et al. [2012], Perdomo-Pantoja et al. [2019]), the clinical adoption of such methods is arguably low (Joskowicz and Hazan [2016], Härtl et al. [2013], Nadeau et al. [2015]). Besides economic reasons, one major barrier to the widespread adoption of existing CAOS solutions is their interference with the standard surgical workflow. More specifically, the main limiting factors associated with current CAOS systems for surgical navigation are cumbersome and time-consuming handling, exposure to ionizing radiation, lengthy registration procedures and unintuitive visualization of spatial navigation information on 2D monitors in the OR periphery. In this work, we intended to tackle these drawbacks and presented a simple and radiation-free approach for automatic, accurate and fast pedicle screw placement in cadaveric lumbar spines under AR guidance.
The verification on the SpineDepth dataset showed a registration success rate of 96%, meaning that the target screw would have been placed successfully within the clinical safe zone in 96% of the frames. In the study of Félix et al. [2021], who pursued a similar approach for femur and tibia, the success of a registration was defined based on the percentage of inliers, which had to be at least 80%. They reached a success rate of 37.7% for the femur and 35.2% for the tibia. The required registration accuracy, defined as 3° rotational error and 3-4 mm translational error, was only met in terms of translation. The surface-based femur registration and tracking approach of Hu et al. [2022] achieved a root-mean-square error of 2.40 mm on real-time captures of a bone phantom, which reduced to 2.07 mm when the bone phantom data was processed with the suggested PointNet-based restoration network.
For spine surgery, a wide range of acceptable registration errors can be found in the literature, which depend on various factors. Rampersaud et al. [2001] defined the maximum rotational and translational deviations for the lumbar spine to range from 2.1°/0.65 mm (L1) to 12°/3.8 mm (L5) for screws with a diameter of 6.5 mm. As the TRE reported in this work comprises both the rotational and the translational aspect, a comparison to TrEr and EpEr is more meaningful. While the TrEr (1.79°) is within the aforementioned limits in our case, the EpEr (2.43 mm) exceeds the limit for L1. Besides the targeted spine level, different methods for error calculation can affect the reported values (Holly and Foley [2007]). The TRE is a well-known measure to characterize the accuracy of navigation approaches (Ershad et al. [2014]). Guha et al. [2019] investigated the error propagation of clinical-grade navigation systems w.r.t. a dynamic reference frame (DRF) attached to the anatomy, which is a common motion compensation technique, on four human cadavers. They compared intraoperative tip positions of a tracked awl (mimicking a bone screw) to typical pedicle screw entry points in a postoperative CT. An average 3D navigation error of 2.71 mm at DRF level was found (note that this error may differ from the registration error). This error increased with a larger distance to the DRF level. Although the respective registration error must have been lower, the fact that the use of a DRF is the gold standard makes it eligible for comparison to our registration error, assuming tracking errors in current navigation systems are minimal: the EpEr (2.43 mm) of our verification was superior and the TRE (2.73 mm) practically equal. When comparing to the TRE of 1.43 ± 0.35 mm in the semi-automatic microscopic RGB stereo method of Ji et al. [2015], three out of eight specimens in the verification can be considered within the range of their standard deviation. The required 2 mm maximum acceptable registration error for cranial and spine procedures (Faraji-Dana et al. [2020]) is reached for three of our specimens. However, it should be considered that the dataset already comes with certain inaccuracies (ground truth TRE of 1.5 mm).
The first (0.87 mm) and second (0.94 mm) registration in our ex-vivo validation showed sub-millimetric accuracy, which is equal or close to studies using navigation systems with manual point sampling for pedicle screw navigation (0.9 mm in Papadopoulos et al. [2005], 0.7 mm in Nottmeier and Crosby [2007]) or a cutting-edge intraoperative CT device for cranial procedures (0.93 mm in Carl et al. [2018]). The screw accuracies with a TrEr of 2.55° and an EpEr of 1.95 mm in the ex-vivo validation are in line with other studies investigating surgical navigation for pedicle screw placement. The AR system used in Felix et al. [2022] achieved a 3D accuracy of 2.5° and 1.9 mm for open surgery in cadavers. In van Dijk et al. [2015], the accuracy of 178 minimally invasive screws using a robotic system was assessed, resulting in a mean 2D in-plane error of 2.55° and a 3D entry point deviation of 2.0 mm. An even lower mean angular deviation of 1.53° can be found in the cadaveric study of Lamartina et al. [2015]. However, again, the values originate from 2D in-plane measurements.
Besides showing similar accuracy, our registration method has two advantages over clinically established navigation systems. First, the registration is fully automated and is performed for all targeted levels simultaneously. The computation time required by our method was less than 2 s (and real-time thereafter), which is considerably lower than that of other clinical-grade systems based on surface data (less than 20 s in Faraji-Dana et al. [2020]) or manual point sampling (117 s in Nottmeier and Crosby [2007], 125 s in Farshad et al. [2021a]). Our ablation study shows that the piecewise refinement improves accuracy, especially when the preoperative images were acquired in a different patient positioning. Second, anatomy displacement induced by surgical manipulation or respiration can be as high as 1.85 ± 1.48 mm and 1.09 ± 0.44 mm, respectively (Guha et al. [2019]), and our method addresses such motion with level-wise pose updates. However, our ablation study could not show superior accuracy when applying real-time pose updates in interaction frames throughout entire recordings (primary results) as opposed to a registration based on an initial frame only (Refinement) or applying updates for the first 60 interaction frames (First-60). One of the main reasons could be the frame rate of the RGB-D sensor as well as motion blur in the images, leading to an insufficient 3D reconstruction. The surgical interactions in the SpineDepth dataset are fast. However, slower motions, such as breathing, could be compensated with the method at hand (subject to proper investigation in the future). For faster motion, different sensor types, not based on RGB or grayscale images, could further improve the performance of our method in this regard. Finally, the anatomical part that is moved the most is (partly) hidden by the surgeon, and therefore challenging to track. Nevertheless, we see our approach as a foundation for developing automatic level-wise motion compensation in real-time without needing a DRF clamped to the anatomy.
The average time for pre-drilling a screw trajectory with our method was 28 s per screw, which can be considered very fast. This is superior to navigation using a C- or O-arm (248 s for C-arm and 134 s for O-arm in Liu et al. [2017]), as well as to other studies using AR guidance in cadaveric specimens (57.5 s in Müller et al. [2020], 67 s in Farshad et al. [2021a]) or a first in-human study (312 s in Elmi-Terander et al. [2019]).
In terms of network performance, the DSC of the segmentation path with a mean of 0.72 on the SpineDepth dataset was comparable to Félix et al. [2021] (DSC for tibia: 0.73), although segmentation of the spinal anatomy might be considered more challenging. The accuracy of the orientation prediction (16.33°) is comparable to the ones reported in the two publications inspiring our method (16.63° in Mahendran et al. [2017], 13.59° in Tulsiani and Malik [2015]).
Further analysis revealed a PCC of -0.78 between TRE and DSC, suggesting that the segmentation quality plays a key role in finding an accurate registration. The TRE also correlates (PCC of 0.74) with the visible bone surface error (VBSE) reported in the SpineDepth publication (Liebmann et al. [2021]), which essentially describes the reconstruction quality of the RGB-D sensor in use. While the dataset was recorded based on stereo calibrations created with a manufacturer-provided application, for our ex-vivo validation, standard OpenCV stereo calibration functionality (Bradski [2000]) was employed, leading to a much lower mean TRE (1.20 mm). This potential of accurate spine 3D reconstruction was confirmed in the study of Manni et al. [2020], where features in stereo grayscale images were matched with a 3D triangulation error below 0.5 mm. Stereo calibration and reconstruction quality may not be the only factors influencing the accuracy of the proposed registration approach, but they can be seen as key factors. Other such factors could be the presence of soft tissue and the missing facet joints/mamillary processes in the dataset specimens, not only regarding accuracy but also regarding convergence during general alignment, as more soft tissue flattens important bony surfaces, as well as the presence of previously inserted screws. The latter is suspected to be the reason for the increase in error from the first and second registration in the ex-vivo validation to the third, for which the specimen was rotated by 180° and all screws on the right side had already been inserted. This imbalance is not accounted for, which is a limitation of our method. More importantly, the full anatomical exposure in the cadaveric specimens is unrealistic within a clinical setting, and the high visibility facilitates the registration as well as the navigation. Furthermore, our method did not generalize to all specimens in the verification: specimen 10 had to be excluded due to its much smaller size compared to the other eight considered specimens.
For future work, the method needs to be evaluated on specimens with surgical approaches of varying sizes, i.e. less visibility of anatomical structures. Furthermore, transformer-based depth reconstruction, as proposed in Gu et al. [2021b], could be a promising way to increase registration accuracy, while feature-based tracking (Manni et al. [2020]) should be investigated as a motion compensation strategy.

Conclusions
Our results suggest that fast, radiation-free, and fully automatic level-wise registration with real-time pose updates from RGB-D data for pedicle screw navigation under augmented reality guidance is feasible and meets clinical accuracy demands.

Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Prof. Dr. med. Mazda Farshad, MPH is shareholder and member of the board of directors of Incremed AG, a company developing mixed-reality applications. All other authors declare that they have no conflict of interest.

Figure 3: Points used for registration and pose updates (blue).

Figure 5: Surgeon's view of AR navigation. The PreOp model is rendered as an axis-aligned virtual twin (top) and the current drill sleeve pose is visualized with respect to the preoperatively planned screw entry point and trajectory. In addition, the angular 3D deviation between the drill sleeve and the screw trajectory is shown. The direct overlay of the entry point (green cross) can be shown/hidden by the surgeon. Note that white push-pins are only used for postoperative evaluation of target registration error.

Figure 7: Exemplary recording from verification. (a) Point cloud. (b) Segmentation with screw occlusion handling (L4 left & right, L5 left). (c) Segmentation with surgeon occlusion handling. (d) Initial pose in initial frame. (e) General alignment in initial frame. (f) After piecewise refinement and 60 pose updates. (g) Estimated (blue) and ground truth (green) vertebra poses. (h) Estimated (blue) and ground truth (green) simulated screws. (i) Estimated simulated screws on point cloud.