Patient-specific Cardio-respiratory Motion Prediction in X-ray Angiography using LSTM Networks

Objective. To develop a novel patient-specific cardio-respiratory motion prediction approach for X-ray angiography time series based on a simple long short-term memory (LSTM) model. Approach. The cardio-respiratory motion behavior in an X-ray image sequence was represented as a sequence of 2D affine transformation matrices, which provide the displacement information of contrasted moving objects (arteries and medical devices) in a sequence. The displacement information includes translation, rotation, shearing, and scaling in 2D. A many-to-many LSTM model was developed to predict 2D transformation parameters in matrix form for future frames based on previously generated images. The method was developed with 64 simulated phantom datasets (pediatric and adult patients) using a realistic cardio-respiratory motion simulator (XCAT) and was validated using 10 different patient X-ray angiography sequences. Main results. Using this method we achieved less than 1 mm prediction error for complex cardio-respiratory motion prediction. The following mean prediction error values were recorded over all the simulated sequences: 0.39 mm (for both motions), 0.33 mm (for only cardiac motion), and 0.47 mm (for only respiratory motion). The mean prediction error for the patient dataset was 0.58 mm. Significance. This study paves the road for a patient-specific cardio-respiratory motion prediction model, which might improve navigation guidance during cardiac interventions.


Introduction
Cardiovascular disease (CVD) is a leading cause of death worldwide today (Scott et al 2013). Image-guided interventional procedures are minimally invasive treatments for CVD and have been gaining popularity in the last two decades. Catheter-based procedures using fluoroscopy are prevalent within image-guided interventions and are becoming increasingly complex modern tools for planning and guiding cardiac interventions (Ackermann and Ender 2019). Fluoroscopy is commonly used in the navigation of cardiac interventions. In the process, 2D X-ray images are acquired in real-time during the visualization of contrasted arteries. It provides adequate temporal and spatial information from the visible targets in images (contrasted arteries, medical devices, etc) (Dauer 2011).
During X-ray acquisitions, multiple organs, including the arteries, move with the heartbeat, respiration and with occasional unexpected patient movements. Such movements cause artifacts in the image acquisition process and render the navigation process challenging. Overall, image-guided cardiac interventions are greatly affected by cardio-respiratory motion (Shechter et al 2004, McClelland et al 2013. The cardiac motion, being distinct for each patient, and hardly perceptible on angiographies, generates additional complexity for cardiologists. Moreover, respiratory motion degrades the detection and quantification capabilities of imaging modalities. Mismatches between some imaging techniques due to respiratory motion also result in additional attenuation correction (AC) artifacts and inaccurate localization.
Accurate image-guided treatments in CVDs are very important. Thus, an accurate method ensuring the reduction of the effects of respiratory, cardiac, and unexpected motions during cardiology interventions is needed. In recent years, numerous methods have been proposed for controlling respiratory and cardiac motion, as well as for minimizing its effects.
One approach that has been put forward to overcome the problems caused by organs moving during image acquisitions and image-guided interventions is called breath-holding. With this, the respiratory motion is suppressed by having a patient to practice multiple short breath-holds. However, when a breath-hold ends, the patients chest and internal organs move due to realignment. There are other limitations, such as scan time constraints in data acquisition that can limit the image signal-to-noise ratio and accordingly spatial resolution and poor steady breath-holding with this technique (Oshinski et al 1996). Given these limitations of breath-holding techniques, free-breathing techniques are more popular these days. Thus, there has been more focus on improving free-breathing methods (Taylor et al 1997, Nehrke et al 2001.
Other methods propose solutions to estimate and predict cardio-respiratory motion during image acquisition. Image-based cardio-respiratory motion estimation approaches are mainly based on surrogate data tracking. Diaphragm tracking for respiratory motion and catheter tracking for cardiac movements are the most popular image-based motion estimation approaches aimed at making up for the effects of cardio-respiratory motion during imageguided interventions (Schweikard et al 2000, Shechter et al 2005, Timinger et al 2005, Kesner and Howe 2010, Ma et al 2011. However, these approaches usually require multi-

Proposed contribution
We present a novel patient-specific cardio-respiratory motion prediction approach using a simple supervised long short-term memory (LSTM) network for angiography sequences. The 2D displacements of the moving objects in an angiography sequence are extracted from the images and represented through 2D affine transformation matrices. Then, a many-to-many LSTM model is trained for every sequence to predict the next geometrical transformation of the arteries in the future frames of the sequence from previous ones.

Materials and methods
2D transformation parameters (translation, rotation, scaling, and shearing) are extracted from the X-ray angiography sequences representing the motion features in matrix form. Thus, a sequence of affine transformation matrices resulting from frame-by-frame image registrations of the original X-ray sequence is fed into the sequential LSTM model for training. Then, the future displacements of the moving targets (contrasted arteries and medical devices) in the upcoming frames are predicted.

Data description
This approach was developed with a simulated X-ray angiography dataset using a realistic XCAT computational phantom simulator (Segars et al 2010(Segars et al , 2013 and was validated with real patient X-ray angiography sequences from Sainte-Justine Hospital. Computerized phantoms play a major role in medical imaging research today. They are very helpful in that they provide a practical means for evaluating and improving imaging techniques and devices. In this study, we employed realistic XCAT computational phantoms with the cardio-respiratory motion for both adult and pediatric patients. Initially, our experiments were done on three different adult patients (Azizmohammadi et al 2019).
The length of the sequences used to capture at least two to five heart and/or respiratory cycles varied between 75 and 150 frames. For each patient, three different types of motions (only cardiac, only respiratory, and both motions) were generated. Respiratory motion is not gated with cardiac motion, and the misalignment between the motions makes the motion estimation more complicated. Therefore, in this method, we simulated different motion modes to assess the predictions based on the motion complexity. Simulations were also carried out for different circumstances, with the patient having normal and abnormal respiratory and heartbeat cycles. A normal heartbeat cycle is 1 a second long, while the respiratory cycle, is 5 seconds. These values can vary between patients and change if the patient is under stress or not breathing normally. We then simulated 64 pediatric and adult patients (36 male and 28 female) falling within the 8-month newborn to 85-year age range, including the left and right coronary arteries. The pediatric simulated dataset included 112 sequences (2 sequences per patient, showing either the left coronary artery or right coronary artery), while our adult simulated dataset included 12 sequences.
For the real patient dataset, we selected 10 different sequences with lengths ranging between 84 and 166 frames and having visible moving objects (contrasted arteries and/or medical devices such as catheters and guide wires).

Preprocessing: segmentation and centerline extraction of simulated and patient Xrays sequences
To track the motion signal in an X-ray sequence, pre-processing steps were applied to segment and extract the centerlines of the moving targets. Using the Frangi filter and the connected components, we segmented the arteries and skeletonized the segmentation to extract the centerlines. The 2D motion features were extracted by registering the X-ray images, frame-by-frame in the sequence.
For the simulated data, the vascular structures were extracted from the image frames for each sequence and segmented by applying image processing filters and connected components to remove the small objects. Figures 1 and 2 show the vessel structures extracted from the original X-rays (simulated and patient data, respectively). The Frangi filter parameters were found experimentally. For the scale parameter (sigma), an interval of [1, 6] was considered with a step size of 0.1. These values for the parameters were set based on the visual inspection of the images in both simulated and patient datasets.
We applied the same segmentation steps for the patient data, although the segmentation was noisy in the background as compared to the simulated images. Moreover, since the arteries' structures were not continuously contrasted in all the image frames of a given sequence, we segmented medical devices as moving objects, where the arteries were not visible or faded in some frames. We changed the Frangi filter parameters as a function of the visible objects in the patient sequences.

Coherent point drift (CPD) registration-
The point set registration algorithm is widely used in computer vision problems such as image registration. The registration can be rigid or non-rigid. Given two point sets (centerlines of two consequent frames), we applied CPD as a non-rigid registration to map one point set to the other, yielding a non-rigid transformation. Non-rigid transformations include affine transformations such as scaling and shear, as well as translation and rotation mapping.
The CPD algorithm is a Gaussian Mixture Model (GMM) based algorithm that assigns correspondence points between two sets of point clouds. It retrieves the transformations for mapping each point cloud to the other using a specified registration (Myronenko and Song 2010). The alignment of the two point clouds is a probability density estimation problem. The first point set is centered on the second one by fitting the GMM algorithm and maximizing the likelihood. The GMM moves coherently and retains the topological structure of the point sets (Myronenko and Song 2010). A coherence constraint is imposed for affine registration by re-parametrizing the GMM centroid locations with affine transformation parameters (translation, rotation, shearing, scaling). These parameters are concatenated to build the Affine Transformation (AT) matrix as follows: While A 00 = s x cos(θ), A 01 = s y sin(θ), A 10 = −s x sin(θ), A 11 = s y cos(θ), T x = x − c x s x cos(θ) − c y s y sin(θ) and T y = y + c x s x sin(θ) − c y s y cos(θ). We used notations A 00 , A 01 , A 10 , A 11 , T x , T y for the predicted parameters. The extracted centerlines of the arteries are considered as bright point sets. Every centerline point set in each frame is registered to the previous frame in a sequence using the CPD algorithm.

LSTM models for sequence prediction
LSTM is a recurrent neural network (RNN)-based architecture with optimized memory which can solve the vanishing and exploding gradient problem in conventional RNNs. LSTM structures have memory blocks which include memory cells that can store the temporal information of sequential data as well as specific multiplicative units, called gates, to control the flow of information. Each memory block contains an input gate to control the flow of input activations into the memory cell, an output gate to control the output flow of cell activations into the rest of the network, and a forgetting gate. Therefore, a LSTM network can keep only the necessary information from the past and forget the rest, thus optimizing its memory.
Initially, the model could predict the transformation matrix for a single future frame given the previous frames using a many-to-one LSTM structure such that given a sequence of frames as an input, we were expecting one single frame as output. The new proposed model can forecast multiple values in the future after receiving multiple inputs to improve time complexity. Figures 3 and 4 show the structures of many-to-one and many-to-many frame prediction respectively. In both models, the input for the model is a sequence of transformation matrices extracted from the X-ray images by registering the consequent frames in pairs, and the LSTM network is trained to predict the arteries' transformation in the next frame/s from the previous ones.
Considering N = 6 as the number of 2D affine transformation parameters representing translation, rotation, shearing, and scaling (Tx, Ty, A00, A01, A10, A11), and T as the number of transformation matrices. To effectively feed the models we sort the parameters in a vector X t of size N*T. This vector is called the transformation parameter vector (TP). Then, the values in the TP vector are normalized to be fed into the network. The normalization was required since the range of values for some parameters is so small or big and in that case, the network can not learn or converges slowly. Then, at the end of the prediction, they can be de-normalized to have the actual values. Now, the prediction problem is defined as solving the predictor of X t (denoted by X t ) via a series of previously measured TP vectors.
Compared to the primarily (many-to-one) model, the new (many-to-many) model was developed to predict all the N = 6 transformation parameters at the same time, for multiple matrices. In the primarily (many-to-one) model, we assumed that all the parameters were independent of each other. Thus, to predict the vector X t , the model was able to predict only one element or parameter x n t at a time by feeding the LSTM one vector x 0 t , x 1 t , …, x n t of size n = T at a time. Yet, in the new model, the many-to-many model could predict all N = 6 parameters simultaneously. To feed the input matrix sequences to the model, a fixed number of previous frames (matrices) including enough information (at least one heart cycle) was considered as a time window from which to learn to predict the new TPs in the future (figures 5). Figure 6 shows the deep LSTM model structure. First, the input image sequence is preprocessed and the TP vectors are extracted and then the deep LSTM model is fed by multiple TP vectors. The output of the N-layer LSTM model is passed by a linear regression layer and the output of the model is multiple future TP vectors.

Experiments and results
We used 64 patients (pediatric and adult) simulated in normal and abnormal modes for different motions (cardiac only, respiratory only, and both motions). The sequences simulated with both motions were 75 frames in length, while other sequences having only cardiac or only respiratory motion were generated using a 150 frame length. Additionally, 10 sequences from the patient dataset with between 84 and 166 frames in length were selected based on the visibility of contrasted arteries or medical devices through the sequences.
The motion features extracted from the centerlines of the segmented arteries were represented as 6 affine transformation parameters collectively called the TP vector translations Tx, Ty, and rotation, shearing, scaling was considered as 15 matrics to capture at least two heart cycles and one breathing cycle in average. The number of predictions with the model was set to 5 (one over third of the window size). For each sequence, the LSTM model was trained separately with a sequence of TP vectors, while 80 percent of each sequence was considered as the training set and 20 percent as the testing set. TP vectors were normalized by dividing by the maximum value in each TP vector. Since the prediction is considered a regression problem, we used a linear activation function for our model and the RMSProp as an optimizer for compilation. Based on the experiments, the optimal number of epochs within the 100 to 1000 range was 500 for the simulated data with a length of 150 frames and 200 for sequences with 75 frames. We trained the patient data sets with 500 epochs as well.
Keras library was used to build and train the model. The accuracy of the method was evaluated by computing the Mean Absolute Error (MAE) between the predicted values using our model and the results of the CPD registration as the ground truth (tables 1 and 2). Figures 7 and 8 depict the TP predictions for a simulated data sample and figures 9, 10 and 11 show a patient data TP prediction sample. The patient dataset contains motion irregularities, as illustrated in figures 10 and 11. Nevertheless, the imposed patient movement represented as irregularities in the motion signal did not affect the predictions.
To evaluate the overall error of predicted transformed centerlines, we first calculated the distance transform of the original centerline images. For each pixel of the background, we obtained its distance to the closest centerline point. The distance transform or distance field for each white pixel on the extracted centerline assigns a number, which is the distance between that pixel and the nearest nonzero pixel of the vessels. Thus, to calculate the final distance, we projected the predicted transformed centerline on the distance transform matrix and averaged the obtained values as the overall prediction error.
We applied the predicted parameters in matrix form to transform the arteries' centerlines and overlaid the transformed centerlines with predicted parameters on the distance transform of the extracted centerlines from the original images. Figures 12, 13 and 14 show the overlay of the transformed segmented vessels with predicted transformation parameters and the original transformed images for simulated and patient data, respectively.
Therefore, we first evaluated our results by computing the Mean Absolute Error between the predicted TP values and the ground truth TP (resulting from CPD registration) (tables 1 and 2), and then we compared our results to the original images by applying for the CPD registration on the originally extracted and predicted centerlines (3). Apart from the simulated data, we validated our results by applying the method on 10 real patients, with a more realistic and noisy segmentation of the moving objects.
As shown in the results presented in table 3, we obtained a low accumulated error for the prediction of the transformation matrix using the distance transform of the original segmented arteries for both simulated data with different motions as well as the patient data samples.

Discussion
We have presented a learning-based patient-specific cardio-respiratory motion prediction method using a simple LSTM model. This model can predict the temporal and spatial changes for moving objects (contrasted arteries and/or medical devices) in a sequence. This approach is an extension of our previous work (Azizmohammadi et al 2019). We applied the same pre-processing and motion feature extraction while we extended the data for development and validation. Here are the differences between the newly presented work and our previous study: (1) The preliminary results of the first model were obtained by developing and validating the deep LSTM model on only 3 synthetic datasets (only adult patients). The new results were obtained on 64 synthetic datasets (pediatric and adults), including adults and pediatric patients. Then, we validated our final results using actual patient data (10 different sequences from different patients). Hence, the proposed motion prediction method was validated on a wide range of simulated cardiac and respiratory rates for pediatric and adult phantoms as well as a patient dataset. The resulting prediction error of this method (average 0.58 mm) on the patient data is promising since, in a few samples for the patient dataset, we had an unexpected patient motion that appeared as irregularity in the cardio-respiratory motion signal. Figure 11 represents the 2D translation of the moving objects in the sequences, there are irregularities in the cardio-respiratory motion signal extracted from a 150-frame sequence. Nevertheless, the imposed patient movement did not affect the predictions. (2) The preliminary deep LSTM model was a many-to-one structure. The input is a sequence of transformation matrices, and the output is the prediction of the single matrix at position n + 1. In a many-to-one model, to generate the output, the final input must be entered into the model. The performance of the updated model was improved by extending it to have multiple output predictions using a many-to-many structure. The many-to-many model generates the output whenever each input is read. Thus, a many-to-many model can understand the feature of each token (matrix) in the input sequence. Compared to other learning-based predictive methods with a 2mm prediction error (Lee and Motai 2014), our both models have achieved a prediction error of less than 1 mm (0.58 mm-new approach).
For the patient data, the medical devices such as guide wires or catheters were considered as moving objects in a sequence since they are continuously visible and contrasted under the X-ray images. Hence, the visible devices can represent the movement of arteries, while the contrasted arteries gradually become faded through a sequence as they lose their contrast. Motion prediction is thus not completely dependent on the visibility of the contrasted arteries with the contrast agent.
The presented motion tracking approach can be applied for the estimation of future heart trajectory in robotic-assisted operations on a free-beating heart as a robust prediction algorithm. The training time for each sequence depends on the length of the input sequence. We trained the model using CPU, and each step for training took 6ms. Aside from other factors such as the hardware (CPU or GPU) and optimized implementation, the time complexity for training would grow linearly by increasing the length of the input sequence. Moreover, no additional imaging modality and preoperative information are required for motion tracking using this approach. Not only this model-free cardio-respiratory motion prediction approach can facilitate the navigation process of cardiac interventions but also potentially aids in reducing the required amount of contrast agent and radiation dose for cardiac interventions.
The accuracy of the prediction indirectly depends on the accuracy of the segmentation and registration algorithms in the pre-processing steps. Segmentation and centerline extraction for the moving objects (arteries and medical devices) was challenging, especially for the patient data. In our preprocessing steps, we used the fixed parameters for simulated and patient datasets. With the connected component approach, small objects were removed, and the centerline of the segmented objects was extracted using skeletonization. Hence, the segmentation and centerline extraction error accumulated into the final prediction error. In figure 14, the extracted centerlines for the arteries are the edges of the contrasted guide wires and not the actual centerlines of the arteries. Future work will include investigating an accurate state-of-the-art method based on a Convolutional Neural Network (CNN) for automatizing the segmentation in our pre-processing steps. Moreover, we are planning to integrate this motion prediction approach into an End-to-End system for xXray image prediction.

Conclusion
This study investigated an RNN-based motion prediction model that can be used for cardio-respiratory motion tracking in the navigation of cardiac interventions with X-ray angiographies. We developed a simple LSTM model that can predict the 2D transformation of visible moving objects in an X-ray sequence. This patient-specific motion prediction approach can estimate the cardio-respiratory motion signal with a less than 1 mm prediction error. Segars W et al. 2013   Pre-processing steps (segmentation, denoising, and centerline extraction) on simulated data. The first row is a Right Coronary Artery (RCA) branch and the second row shows the Left Coronary Artery (LCA). The many-to-many LSTM model. All TP elements are predicted at the same time. W = 15 is the window size and the number of outputs or predictions is P = 5.  The deep LSTM model structure for multiple TP predictions. The required pre-processing steps to extract the input matrics are shown on the left. The metrics are flattened to be fed into the input layer. The network has N = 500 hidden layers and a linear regression layer to generate multiple outputs.        Two different patient data samples for overlaying the transformed vessels with predicted transformation parameters (blue colored vessel) with the original transformed images. MAE between the ground truth and predicted transformation parameter values for simulated data.