Evaluation of tightly-and loosely-coupled approaches in CNN-based pose estimation systems for uncooperative spacecraft

The relative pose estimation of an inactive spacecraft by an active servicer spacecraft is a critical task in the design of current and planned space missions, due to its relevance for close-proximity operations, such as In-Orbit Servicing and Active Debris Removal. This paper introduces a novel framework to enable robust monocular pose estimation for close-proximity operations around an uncooperative spacecraft, which combines a Convolutional Neural Network (CNN) for feature detection with a Covariant Efficient Procrustes Perspective-n-Points (CEPPnP) solver and a Multiplicative Extended Kalman Filter (MEKF). The performance of the proposed method is evaluated at different levels of the pose estimation system. A Single-stack Hourglass CNN is proposed for the feature detection step in order to decrease the computational load of the Image Processing (IP), and its accuracy is compared to the standard, more complex High-Resolution Net (HRNet). Subsequently, heatmaps-derived covariance matrices are included in the CEPPnP solver to assess the pose estimation accuracy prior to the navigation filter. This is done in order to support the performance evaluation of the proposed tightly-coupled approach against a loosely-coupled approach, in which the detected features are converted into pseudomeasurements of the relative pose prior to the filter. The performance results of the proposed system indicate that a tightly-coupled approach can guarantee an advantageous coupling between the rotational and translational states within the filter, whilst reflecting a representative measurements covariance. This suggest a promising scheme to cope with the challenging demand for robust navigation in close-proximity scenarios. Synthetic 2D images of the European Space Agency’s Envisat spacecraft are used to generate datasets for training, validation and testing of the CNN. Likewise, the images are used to recreate a representative close-proximity scenario for the validation of the proposed filter.


Introduction
Nowadays, key Earth-based applications such as remote sensing, navigation, and telecommunication, rely on satellite technology on a daily basis. To ensure a high reliability of these services, the safety and operations of satellites in orbit has to be guaranteed. In this context, advancements in the field of Guidance, Navigation, and Control (GNC) were made in the past years to cope with the challenges involved in In-Orbit Servicing (IOS) and Active Debris Removal (ADR) missions [1,2]. For such scenarios, the estimation of the relative pose (position and attitude) of an uncooperative spacecraft by an active servicer spacecraft represents a critical task. Compared to cooperative close-proximity missions, the pose estimation problem is indeed complicated by the the lack of range information in these latter [6]. In this framework, pose estimation systems based solely on a monocular camera are recently becoming an attractive alternative to systems based on active sensors or stereo cameras, due to their reduced mass, power consumption and system complexity [7]. However, given the low Signal-To-Noise Ratio (SNR) and the high contrast which characterize space images, a significant effort is still required to comply with most of the demanding requirements for a robust and accurate monocular-based navigation system. Interested readers are referred to Pasqualetto Cassinis et al. [8] for a recent overview of the current trends in monocular-based pose estimation systems. Notably, the aforementioned navigation system cannot rely on known visual markers, as they are typically not installed on an uncooperative target. Since the extraction of visual features is an essential step in the pose estimation process, advanced Image Processing (IP) techniques are required to extract keypoints (or interest points), corners, and/or edges on the target body. In model-based methods, the detected features are then matched with pre-defined features on an offline wireframe 3D model of the target to solve for the relative pose. In other words, a reliable detection of key features under adverse orbital conditions is highly desirable to guarantee safe operations around an uncooperative spacecraft. Moreover, it would be beneficial from a different standpoint to obtain a model of feature detection uncertainties. This would provide the navigation system with additional statistical information about the measurements, which could in turn improve the robustness of the entire estimation process.
Unfortunately, standard pose estimation solvers such as the Efficient Perspective-n-Point (EPnP) [9], the Efficient Procrustes Perspective-n-Point (EPPnP) [10], or the multi-dimensional Newton Raphson Method (NRM) [11] do not have the capability to include features uncertainties. Only recently, the Maximum-Likelihood PnP (MLPnP) [12] and the Covariant EPPnP (CEPPnP) [13] solvers were introduced to exploit statistical information by including feature covariances in the pose estimation. Ferraz et al. [13] proposed a method for computing the covariance which takes different camera poses to create a fictitious distribution around each detected keypoint. Other authors proposed an improved pose estimation method based on projection vector, in which the covariance is associated to the image gradient magnitude and direction at each feature location [14], or a method in which covariance information is derived for each feature based on feature's visibility and robustness against illumination changes [15]. However, in all these methods the derivation of features covariance matrices is a lengthy process which generally cannot be directly related to the actual detection uncertainty. Moreover, this procedure could not be easily applied if Convolutional Neural Networks (CNNs) are used in the feature detection step, due to the difficulty to associate statistical meaning to the IP tasks performed within the network. In this context, another procedure should be followed in which the output of the CNNs is directly exploited to return relevant statistical information about the detection step. This could, in turn, provide a reliable representation of the detection uncertainty.
The implementation of CNNs for monocular pose estimation in space has already become an attractive solution in recent years, also thanks to the creation of the Spacecraft PosE Estimation Dataset (SPEED) [16], a database of highly representative synthetic images of PRISMA's TANGO spacecraft made publicly available by Stanford's Space Rendezvous Laboratory (SLAB) and applicable to train and test different network architectures. One of the main advantages of CNNs over standard feature-based algorithms for relative pose estimation [7,17,18] is an increase in robustness under adverse illumination conditions, as well as a reduction in the computational complexity. Initially, end-to-end CNNs were exploited to map a 2D input image directly into a relative pose by means of learning complex non-linear functions [19][20][21][22]. However, since the pose accuracies of these end-toend CNNs proved to be lower than the accuracies returned by common pose estimation solvers, especially in the estimation of the relative attitude [19], recent efforts investigated the capability of CNNs to perform keypoint localization prior to the actual pose estimation [23][24][25][26]. The output of these networks is a set of so called heatmaps around pre-trained features. The coordinates of the heatmap's peak intensity characterize the predicted feature location, with the intensity and the shape indicating the confidence of locating the corresponding keypoint at this position [23]. Additionally, due to the fact that the trainable features can be selected offline prior to the training, the matching of the extracted feature points with the features of the wireframe model can be performed without the need of a large search space for the image-model correspondences, which usually characterizes most of the edges/corners-based methods [27]. In this context, the High-Resolution Net (HRNet) [28] already proved to be a reliable and accurate keypoint detector prior to pose estimation, due to its capability of maintaining a high-resolution representation of the heatmaps through the whole detection process.
To the best of the authors' knowledge, the reviewed implementations of CNNs feed solely the heatmap's peak location into the pose estimation solver, despite multiple information could be extracted from the detected heatmaps. Only in Pavlakos et al. [23], the pose estimation is solved by assigning weights to each feature based on their heatmap's peak intensities, in order to penalize inaccurate detections. Yet, there is another aspect related to the heatmaps which has not been considered. It is in fact hardly acknowledged how the overall shape of the detected heatmaps returned by CNN can be translated into a statistical distribution around the peak, allowing reliable feature covariances and, in turn, a robust navigation performance. As already investigated by the authors in earlier works [29,30], deriving an accurate representation of the measurements uncertainty from feature heatmaps can in fact not only improve the pose estimation, but it can also benefit the estimation of the full relative state vector, which would include the relative pose as well as the relative translational and rotational velocities.
From a high level perspective, two different navigation architectures are normally exploited in the framework of relative pose estimation. A tightly-coupled architecture, where the extracted features are directly processed by the navigation filter as measurements, and a looselycoupled architecture, in which the relative pose is computed by a pose solver prior to the navigation filter, in order to derive pseudomeasurements from the target features [31]. Usually, a loosely-coupled approach is preferred for an uncooperative tumbling target, due to the fact that the fast relative dynamics could jeopardize feature tracking and return highly-variable measurements to the filter. However, one shortcoming of this approach is that it is generally hard to obtain a representative covariance matrix for the pseudomeasurements. This can be quite challenging when filter robustness is demanded. Remarkably, the adoption of a CNN in the feature detection step can overcome the challenges in feature tracking by guaranteeing the detection of a constant, pre-defined set of features. At the same time, the CNN heatmaps can be used to derive a measurements covariance matrix and improve filter robustness. Following this line of reasoning, a tightlycoupled filter is expected to interface well with a CNN-based IP and to outperform its loosely-coupled counterpart.
In this framework, the objective of this paper is to combine a CNN-based feature detector with a CEPPnP solver whilst evaluating the performance of a proposed tightly-coupled navigation filter against the performance of a loosely-coupled filter. Specifically, the novelty of this work stands in extending the authors' previous findings [29,30] by further linking the current research on CNN-based feature detection, covariant-based PnP solvers, and navigation filters. The main contributions of this work are: 1. To assess the feasibility of a simplified CNN for feature detection within the IP 2. To improve the pose estimation by incorporating heatmapsderived covariance matrices in the CEPPnP 3. To compare the performance of tightly-and loosely-coupled navigation filters.
The paper is organized as follows. The overall pose estimation framework is illustrated in Section 2. Section 3 introduces the proposed CNN architecture together with the adopted training, validation, and testing datasets. In Section 4, special focus is given to the derivation of covariance matrices from the CNN heatmaps, whereas Section 5 describes the CEPPnP solver. Besides, Section 6 provides a description of the tightly-and loosely-coupled filters adopted. The simulation environment is presented in Section 7 together with the simulation results. Finally, Section 8 provides the main conclusions and recommendations.

Pose estimation framework
This work considers a servicer spacecraft flying in relative motion around a target spacecraft located in a Low Earth Orbit (LEO), with the relative motion being described in a Local Vertical Local Horizontal (LVLH) reference frame co-moving with the servicer (Fig. 1a). Furthermore, it is assumed that the servicer is equipped with a single monocular camera. The relative attitude of the target with respect to the servicer can then be defined as the rotation of the target bodyfixed frame B with respect to the servicer camera frame C, where these frames are tied to each spacecraft's body. The distance between the origins of these two frames defines their relative position. Together, these two quantities characterize the relative pose. This information can then be transferred from the camera frame to the servicer's centre of mass by accounting for the relative pose of the camera with respect to the LVLH frame.
From a high-level perspective, a model-based monocular pose estimation system receives as input a 2D image and matches it with an existing wireframe 3D model of the target spacecraft to estimate the pose of such target with respect to the servicer camera. Referring to Fig. 1b, the pose estimation problem consists in determining the position of the target's centre of mass and its orientation with respect to the camera frame C, represented by the rotation matrix . The Perspective-n-Points (PnP) equations, relate the unknown pose with a feature point in the image plane via the relative position of the feature with respect to the camera frame. Here, is the point location in the 3D model, expressed in the bodyframe coordinate system B, whereas and denote the focal lengths of the camera and ( , ) is the principal point of the image.
From these equations, it can already be seen that an important aspect of estimating the pose resides in the capability of the IP system to extract features from a 2D image of the target spacecraft, which in turn need to be matched with pre-selected features in the wireframe 3D model. Notably, such wireframe model of the target needs to be made available prior to the estimation. Notice also that the problem is not well defined for < 3 feature points, and can have up to four positive solutions for = 3 [33]. Generally, more features are required in presence of large noise and/or symmetric objects. Besides, it can also be expected that the time variation of the relative pose plays a crucial role while navigating around the target spacecraft, e.g. if rotational synchronization with the target spacecraft is required in the final approach phase. As such, it is clear that the estimation of both the relative translational and angular velocities represent an essential step within the navigation system.
The proposed tightly-coupled architecture combines the above key ingredients in three main stages, which are shown in Fig. 2 and described in more detail in the following sections. In the CNN-based IP block, a CNN is used to extract features from a 2D image of the target spacecraft. Statistical information is derived by computing a covariance matrix for each features using the information included in the output heatmaps. In the Navigation block, both the peak locations and the covariances are fed into the navigation filter, which estimates the relative pose as well as the relative translational and rotational velocities. The filter is initialized by the CEPPnP block, which takes peak location and covariance matrix of each feature as input and outputs the initial relative pose by solving the PnP problem in Eqs.
(1)- (2). Thanks to the availability of a covariance matrix of the detected features, this architecture can guarantee a more accurate representation of feature uncertainties, especially in case of inaccurate detection of the CNN due to adverse illumination conditions and/or unfavourable relative geometries between servicer and target. Together with the CEPPnP initialization, this aspect can return a robust and accurate estimation of the relative pose and velocities and assure a safe approach of the target spacecraft.
In this work, a rectilinear VBAR approach of the servicer spacecraft towards the target spacecraft is considered, as this typically occurs during the final stages of close-proximity operations in rendezvous and docking missions [1,2]. This assumption is justified by the fact that the proposed method needs to be first validated on simplified relative trajectories before assessing its feasibility under more complex relative geometries. Following the same line of reasoning, the relative attitude is also simplified by considering a perturbation-free rotational dynamics between the servicer and the target. This is described in more detail in Section 6.

Convolutional neural network
CNNs are currently emerging as a promising features extraction method, mostly due to the capability of their convolutional layers to extract high-level features of objects with improved robustness against image noise and illumination conditions. In order to optimize CNNs for the features extraction process, a stacked hourglass architecture has been proposed [23,24], and other architectures such as the U-net [34] and the HRNet [28] were tested in recent years.
Compared to the network proposed by Pavlakos et al. [23], the architecture proposed in this work is composed of only one encoder/ decoder block, constituting a single hourglass module. This was chosen in order to reduce the network size and comply with the limitations in computing power which characterizes space-grade processors. The encoder includes six blocks, each including a convolutional layer formed by a fixed number of filter kernels of size 3 × 3, a batch normalization module and max pooling layer, whereas the six decoder blocks accommodate an up-sampling block in spite of max pooling. In the encoder stage, the initial image resolution is decreased by a factor of two, with this downsampling process continuing until reaching the lowest resolution of 4 × 4 pixels. An upsampling process follows in the decoder with each layer increasing the resolution by a factor of two and returning output heatmaps at the same resolution as the input image. Fig. 3 shows the high-level architecture of the network layers, together with the corresponding input and output.
Overall, the size of the 2D input image and the number of kernels per convolutional layer drive the total number of parameters. In the current analysis, an input size of 256 × 256 pixels is chosen, and 128 kernels are considered per convolutional layer, leading to a total of ∼ 1,800,000 trainable parameters. Compared to the CNNs analysed by Sun et al. [28], this represents a reduction of more than an order of magnitude in network size.
As already mentioned, the output of the network is a set of heatmaps around the selected features. Ideally, the heatmap's peak intensity associated to a wrong detection should be relatively small compared to the correctly detected features, highlighting that the network is not confident about that particular wrongly-detected feature. At the same time, the heatmap's amplitude should provide an additional insight into the confidence level of each detection, a large amplitude being related to large uncertainty about the detection. The network is trained with the -and -image coordinates of the feature points, computed offline based on the intrinsic camera parameters as well as on the feature coordinates in the target body frame, which were extracted from the wireframe 3D model prior to the training. During training, the network is optimized to locate 16 features of the Envisat spacecraft, consisting of the corners of the main body, the Synthetic-Aperture Radar (SAR) antenna, and the solar panel, respectively. Fig. 4 illustrates the selected features for a specific target pose.

Training, validation and test
For the training, validation, and test datasets, synthetic images of the Envisat spacecraft were rendered in the Cinema 4D©software. Table 1 lists the main camera parameters adopted. Constant Sun elevation and azimuth angles of 30 degrees were chosen in order to recreate favourable as well as adverse illumination conditions. Relative distances between camera and target were discretized every 30 m in the interval 90 m -180 m, with the Envisat always located in the L. Pasqualetto Cassinis et al.   camera boresight direction in order to prevent some of the Envisat features from falling outside the camera field of view. Although being a conservative assumption, this allows to test the CNN detection under ideal servicer-target geometries during a rectilinear approach. Subsequently, relative attitudes were generated by discretizing the yaw, pitch, and roll angles of the target with respect to the camera by 10 degrees each. Together, these two choices were made in order to recreate several relative attitudes between the servicer and the target. The resulting database was then shuffled to randomize the images, and was ultimately split into training (18,000 images), validation (6,000 images), and test (6,000 images) datasets. Fig. 5 shows a subset of the camera pose distribution for 100 representative training images, whereas Fig. 6 illustrates some of the images included in the training dataset. During training, the validation dataset is used beside the training one to compute the validation losses and avoid overfitting. The Adam optimizer [35] is used with a learning rate of 10 −3 for a total number of 50 epochs. Finally, the network performance after training is assessed with the test dataset.
Preliminary results on the single-stack network performance were already reported by Pasqualetto Cassinis et al. [29]. Above all, one key advantage of relying on CNNs for feature detection was found in the capability of learning the relative position between features under a variety of relative poses present in the training. As a result, both features which are not visible due to adverse illumination and features occulted by other parts of the target can be detected. Besides, a challenge was identified in the specific selection of the trainable features.
Since the features selected in this work represent highly symmetrical points of the Envisat spacecraft, such as corners of the solar panel, SAR antenna or main body, the network could be unable to distinguish between similar features, and return multiple heatmaps for a single feature output. Fig. 7 illustrates these findings. Notably, the detection of wrong features results in weak heatmaps, which can be filtered out by selecting a proper threshold on their total brightness. In order to compare the feature detection accuracy of the proposed Single-stack Hourglass with a more complex CNN architecture, the HRNet proposed by Sun et al. [28] has been selected and trained on the same Envisat datasets. This architecture had already been tested on the SPEED dataset [25] and already proved to return highly accurate features of the TANGO spacecraft. The performance is assessed in terms of Root Mean Squared Error (RMSE) between the ground truth (GT) and the , coordinates of the extracted features, which is computed as Fig. 8 shows the RMSE error over the test dataset for the two CNNs, whereas Table 2 reports the mean and standard deviation of the associated histograms. As expected, the added complexity of HRNet, translates into a more accurate detection of the selected features, thanks to the higher number of parameters: only 4% of the test images are characterized by a RMSE above 5 pixels, as opposed to the 15% in the Single-stack Hourglass case.
Although HRNet proves to return more accurate features, it is also believed that the larger RMSE scenarios returned by the Single-stack Hourglass can be properly handled, if a larger uncertainty can be associated to their corresponding heatmaps. As an example, a large RMSE could be associated to the inaccurate detection of only a few features which, if properly weighted, could not have a severe impact on the pose estimation step. This task can be performed by deriving a covariance matrix for each detected feature, in order to represent its detection uncertainty. Above all, this may prevent the pose solver and the navigation filter from trusting wrong detections by relying more on other accurate features. In this way, the navigation filter can cope with poorly accurate heatmaps while at the same time relying on a computationally-low CNN.

Covariance computation
Compared to the methods discussed in Section 1 [13][14][15], the proposed method derives a covariance matrix associated to each feature directly from the heatmaps detected by the CNN, rather than from the computation of the image gradient around each feature. In order to do so, the first step is to obtain a statistical population around the heatmap's peak. This is done by thresholding each heatmap image so that only the -and -location of heatmap's pixels are extracted. Secondly, each pixel within the population is given a normalized weight based on the grey intensity at its location,  where , , are the components of the coloured image and , , are the weights assigned to each channel in order to get the greyscale intensity. This is done in order to give more weight to pixels which are particularly bright and close to the peak, and less weight to pixels which are very faint and far from the peak. Finally, the obtained statistical population of each feature is used to compute the weighted covariance between , and consequently the covariance matrix , where and is the number of pixels in each feature's heatmap. In this work, the mean is replaced by the peak location = ( , ) in order to represent a distribution around the peak of the detected feature, rather than around the heatmap's mean. This is particularly relevant when the heatmaps are asymmetric and their mean does not coincide with their peak. Fig. 9 shows the overall flow to obtain the covariance matrix for three different heatmap shapes. The ellipse associated to each features covariance is obtained by computing the eigenvalues and of the covariance matrix, where defines the scale of the ellipse and is derived from the confidence interval of interest, e.g. = 2.2173 for a 68% confidence interval. As can be seen, different heatmaps can result in very different covariance matrices. Above all, the computed covariance can capture the different CNN uncertainty over , . Notice that, due to its symmetric nature, the covariance matrix can only represent bivariate normal distributions. As a result, asymmetrical heatmaps such as the one in the third scenario are approximated by Gaussian distributions characterized by an ellipse which might overestimate the heatmap's dispersion over some directions.

Pose estimation
The CEPPnP method proposed by Ferraz et al. [13] was selected to estimate the relative pose from the detected features as well as from their covariance matrices. The first step of this method is to rewrite the PnP problem in Eqs. (1)-(2) as a function of a 12-dimensional vector containing the control point coordinates in the camera reference system, where is a 2 ×12 known matrix. This is the fundamental equation in the EPnP problem [9]. The likelihood of each observed feature location is then represented as where is a small, independent and unbiased noise with expectation [ ] = and covariance [ ] = 2 and is a normalization constant. Here, 2 represents the global uncertainty in the image, whereas is the 2 × 2 unnormalized covariance matrix representing the Gaussian distribution of each detected feature, computed from the CNN heatmaps. After some calculations [13], the EPnP formulation can be rewritten as L. Pasqualetto Cassinis et al. Fig. 9. Schematic of the procedure followed to derive covariance matrices from CNN heatmaps. The displayed ellipses are derived from the computed covariances by assuming the confidence intervals = . and = . . This is an eigenvalue problem in which both and matrices are a function of and . The problem is solved iteratively by means of the closed-loop EPPnP solution for the four control points, assuming no feature uncertainty.
Once is estimated, the relative pose is computed by solving the generalized Orthogonal Procrustes problem used in the EPPnP [10].

Navigation filter
Several navigation filters for close-proximity operations were investigated in recent years in the context of relative pose estimation. The reader is referred to Pasqualetto Cassinis et al. [8] for a comprehensive overview that goes beyond the scope of this work. In the proposed navigation system, the so-called Multiplicative Extended Kalman Filter (MEKF) is used. Remarkably, other works [15,30] adopted a standard formulation of the EKF that propagates the relative pose, expressed in terms of relative position and quaternions, as well as the relative translational and rotational velocities (prediction step), correcting the prediction with the measurements obtained from the monocular camera (correction step). However, the quaternion set consists of four parameters to describe the 3DOF attitude, hence one of its parameters is deterministic. As reported by Tweddle and Saenz-Otero [36] and Sharma and D'Amico [31], this makes the covariance matrix of a quaternion have one eigenvalue that is exactly zero. As a result, the entire state covariance propagated by the filter may become nonpositive-definite and lead to the divergence of the filter. The MEKF, introduced for the first time by Lefferts et al. [37], aims at solving the above issue by using two different parametrizations of the relative attitude. A three element error parametrization, expressed in terms of quaternions, is propagated and corrected inside the filter to return an estimate of the attitude error. At each estimation step, this error estimate is used to update a reference quaternion and is reset to zero for the next iteration. Notably, the reset step prevents the attitude error parametrization from reaching singularities, which generally occur for large angles.

Propagation step
A standard EKF state vector for relative pose estimation is composed of the relative pose between the servicer and the target, as well as the relative translational and rotational velocities and . Under the assumption that the camera frame onboard the servicer is co-moving with the LVLH frame, with the camera boresight aligned with the along-track direction, this translates into is the quaternion set that represents the relative attitude. Notice that the assumption of the camera co-moving with the LVLH is made only to focus on the navigation aspects rather than on the attitude control of the servicer. Therefore, the application of the filter can be extended to other scenarios, if attitude control is included in the system.
In the MEKF, the modified state vector propagated inside the filter becomes where is four times the Modified Rodrigues Parameters (MRP) , The discrete attitude propagation step is derived by linearizingȧ round = 3×1 and assuming small angle rotations [36], As a result, the discrete linearized propagation of the full state becomes where represents the process noise and The terms and in Eq. (17) represent the servicer argument of perigee and true anomaly variation from time 0 to , respectively, whereas the term in Eq. (23) is the inertia matrix of the target spacecraft. In Tweddle and Saenz-Otero [36], the integral terms ins Eqs. (22)-(23) are solved by creating a temporary linear system from Eq. (14), augmented with the angular velocity and the process noise. The State Transition Matrix of this system is then solved numerically with the matrix exponential.

Correction step
At this stage, the propagated statẽis corrected with the measurements to return an update of the statê. In a loosely-coupled filter, these measurements are represented by the relative pose between the servicer and the target spacecraft, obtained by solving the PnP problem with the CEPPnP solver described in Section 5. In this case, a pseudomeasurements vector is derived by transforming the relative quaternion set into the desired attitude error , In Eq. (24), ⊗ denotes the quaternion product. Conversely, in a tightlycoupled filter the measurements are represented by the pixel coordinates of the detected features, Referring to Eqs. (1)- (2), this translates into the following equations for each detected point : where * is the quaternion conjugate. As a result, the measurements update equation can be written as and the Jacobian of the observation model with respect of the state vector is a 2 ×13 matrix whose elements are The partial derivatives of the differential quaternion set with respect to the attitude error are computed from the relation between the attitude error and the differential quaternion set , In the tightly-coupled filter, the measurement covariance matrix is a time-varying block diagonal matrix constructed with the heatmapsderived covariances in Eq. (5), Notice that can differ for each feature in a given frame as well as vary over time. Preliminary navigation results [30] already showed that such heatmaps-derived covariance matrix can capture the statistical distribution of the measured features and improve the measurements update step of the navigation filter. Conversely, in the loosely-coupled filter represents the uncertainty in the pose estimation step and hence it is not directly related to the CNN heatmaps. A constant value is therefore chosen based on the pose estimation accuracy observed for the test dataset.
Finally, the updated state estimatêis obtained from the propagated statẽ, the residuals̃, and the Kalman Gain ,

Reset step
In the reset step, the reference quaternion ref is updated with the attitude error estimatêand the new attitude error is set to zero, The obtained estimated quaternion set̂is then compared to the real quaternion set to assess the angle accuracy of the filter.

Simulations
In this section, the simulation environment and the results are presented. Firstly, the impact of including a heatmaps-derived covariance in the pose estimation step is addressed by comparing the CEPPnP method with a standard solver which does not account for feature uncertainty. The weights in Eq. (4) are selected based on the standard RGB-to-greyscale conversion ( = 0.299, = 0.587, = 0.114). Secondly, the performance of the MEKF is evaluated by comparing the convergence profiles with a heatmaps-derived covariance matrix against covariance matrices with arbitrary selected covariances. Initialization is provided by the CEPPnP for all the scenarios.
Two separate error metrics are adopted in the evaluation, in accordance with Sharma and D'Amico [20]. Firstly, the translational error between the estimated relative position̂and the ground truth is computed as This metric is also applied for the translational and rotational velocities estimated in the navigation filter. Secondly, the attitude accuracy is measured in terms of the Euler axis-angle error between the estimated quaternion̂and the ground truth ,

Pose estimation
Three representative scenarios are selected from the CNN test dataset for a preliminary evaluation of the Single-stack Hourglass   performance. These scenarios were chosen in order to analyse different heatmaps' distributions around the detected features. A comparison is made between the proposed CEPPnP and the EPPnP. Fig. 10 shows the characteristics of the covariance matrices derived from the predicted heatmaps. Here, the ratio between the minimum and maximum eigenvalues of the associated covariances is represented against the ellipse's area and the RMSE between the Ground Truth (GT) and the x, y coordinates of the extracted features, Notably, interesting relations can be established between the three quantities reported in the figure. In the first scenario, the correlation between the sub-pixel RMSE and the large eigenvalues ratio suggests that a very accurate CNN detection can be associated with circularshaped heatmaps. Moreover, the relatively low ellipse's areas indicate that, in general, small heatmaps are expected for an accurate detection. Conversely, in the second scenario the larger ellipses' area correlates with a larger RMSE. Furthermore, it can be seen that the largest difference between the -and -components of the RMSE occurs either for  the most eccentric heatmap (ID 13) or for the one with the largest area (ID 6). The same behaviour can be observed in the last scenario, where the largest RMSE coincides with a large, highly eccentric heatmap. Table 3 lists the pose estimation results for the three scenarios. As anticipated in Fig. 10, the statistical information derived from the heatmaps in the first scenario is uniform for all the features, due to the very accurate CNN detection. As a result, the inclusion of features covariance in the CEPPnP solver does not help refining the estimated pose. Both solvers are characterized by the same pose accuracy.
Not surprisingly, the situation changes as soon as the heatmaps are not uniform across the feature IDs. Due to its capability of accommodating feature uncertainties in the estimation, the CEPPnP method outperforms the EPPnP for the remaining scenarios. In other words, the CEPPnP solver proves to be more robust against inaccurate CNN detections by accounting for a reliable representation of the features covariance.
Next, the previous comparison is extended to the entire test dataset as well as to HRNet, by computing the mean and standard deviation of the estimated relative position and attitude as a function of the relative range, respectively. This is represented in Figs. 11-12. First of all, it can be seen that the pose accuracy of the CEPPnP solver in the Single-stack Hourglass scenario does not improve compared to the EPPnP, as opposed to the ideal behaviour reported in Table 3. There are two potential causes of this behaviour. On the one hand, most of the test images characterized by a large RMSE (Fig. 8) could not return statistically-meaningful heatmaps that would help the CEPPnP solver. This could be due to multiple heatmaps or highly inaccurate detections in which two different corners are confused with each other. On the other hand, this could be a direct consequence of the large relative ranges considered in this work. As already reported by Park et al. [26] and Sharma and D'Amico [31], a decreasing performance of EPPnP is indeed expected for increasing relative distances, due to the nonlinear relation between the pixel location of the detected features and in Eq. (2). In other words, relatively large pixel errors could lead to inaccurate pose estimates for large relative distances, independently of the use of either CEPPnP or EPPnP. Furthermore, it can be seen from a different comparison level that both the mean and standard deviation of the estimated relative pose are improved, when HRNet is used prior to the PnP solver (Figs. 11b-12b). Again, this is a direct consequence of the smaller RMSE reported in Fig. 8. As a result, the above-mentioned degradation of the pose estimation accuracy for increasing relative ranges is less critical for HRNet. Notice also that, despite an actual improvement of CEPPnP over EPPnP can be seen in the HRNet scenario, the improvements in both the mean and standard deviation of the estimation error are relatively small at large relative distances. This is considered to be related to the fact that HRNet returns circular heatmaps for most of the detected features, due to its higher detection accuracy compared to the Single-stack Hourglass.
Notably, it is important to assess how well the pose estimation system can scale when tested on datasets different than the Envisat one. To this aim, the proposed heatmaps-based scheme was benchmarked on the SPEED dataset, in order to compare its pose accuracy against standard as well as CNN-based systems [19,25,26]. The reader is referred to Barad [38,p. 115] for a comprehensive quantitative analysis of such comparison. The results demonstrated that the performance of the proposed pipeline, based on extracting feature heatmaps and using the CEPPnP solver, compares well with the state-of-the-art pose estimation systems.

Navigation filter
To assess the performance of the proposed MEKF, a rendezvous scenarios with Envisat is rendered in Cinema 4D©. This is a perturbationfree VBAR trajectory characterized by a relative velocity ‖ ‖ = 0 m∕s. The Envisat performs a roll rotation of ‖ ‖ = 5 deg/s, with the servicer camera frame aligned with the LVLH frame. Table 4 lists the initial conditions of the trajectory, whereas Fig. 13 shows some of the associated rendered 2D images. It is assumed that the images are made available to the filter every 2 s for the measurement update step, with the propagation step running at 1 Hz. In both scenarios, the MEKF is initialized with the CEPPnP pose solution at time 0 . The other elements of the initial state vector are randomly chosen assuming a standard deviation of 1 mm/s and 1 deg/s for all the axes of terms (̂0 − ) and (̂0 − ), respectively. Table 5 reports the initial conditions of the filter.    Figs. 14-15 show the convergence profiles for the translational and rotational states in the tightly-and loosely-coupled MEKF, respectively. Besides, a Monte Carlo simulation with 1,000 runs was performed to assess the robustness of the filter estimate against varying the initial statê0. Table 6 lists the standard deviation chosen for the deviation from the true initial state of the filter. The distribution follows a Gaussian profile with true-state mean. For the attitude initial error, the initial reference quaternion ) .
(50) Table 7 reports the mean of the steady-state pose estimates together with their standard deviation. From these results, important insights can be gained on two different levels of the comparison. On a CNN performance level, the results in Fig. 14 show that a slightly worse cross-track estimate of the Single-stack Hourglass is compensated by a more accurate estimate of the relative attitude. Given the limited impact of these estimation errors at the relatively large inter-satellite range of 150 m, these results suggest that the Singlestack Hourglass has a comparable performance with the HRNet for the selected scenario. Next, on a filter architecture level, a comparison between Figs. 14-15 illustrate the different convergence pattern between the tightly-and loosely-coupled MEKF. Most importantly, it can be seen that the loosely-coupled estimate of the relative along-track position is characterized by a bias which is not present in the tightlycoupled estimate. This occurs due to the decoupling of the translational and rotational states, reflected in the Jacobian in Eq. (25). As a result, the relative position is estimated without accounting for the attitude measurements and vice versa. In other words, the creation of pseudomeasurements of the relative pose prior to the loosely-coupled filter leads to two separate translational and rotational estimates. Conversely, in the tightly-coupled filter the full statistical information is enclosed in the detected features, and can be used to simultaneously refine both the translational and the rotational states. Moreover, a close inspection of the Single-stack Hourglass attitude estimates in Table 7 suggests that the tightly-coupled MEKF is characterized by a lower standard deviation, highlighting a better robustness with respect to the initial conditions of the filter when compared to the loosely-coupled MEKF. Note that, due to the higher accuracy of HRNet in the feature detection step -and hence also in the pose estimation step, this is not observed for the latter CNN. In conclusion, a tightly-coupled architecture is expected to return higher pose accuracies if simplified CNNs, such as the proposed single-stack hourglass, are implemented at a feature detection level.

Conclusions and recommendations
This paper introduces a novel framework to estimate the relative pose of an uncooperative target spacecraft with a single monocular camera onboard a servicer spacecraft. A method is proposed in which a CNN-based IP algorithm is combined with a CEPPnP solver and a tightly-coupled MEKF to return a robust estimate of the relative pose as well as of the relative translational and rotational velocities. The performance of the proposed method is evaluated at different levels of the pose estimation system, by comparing the detection accuracy of two different CNNs (feature detection step and pose estimation step) whilst assessing the accuracy and robustness of the selected tightly-coupled filter against a loosely-coupled filter (navigation filter step).
The main novelty of the proposed CNN-based pose estimation system is to introduce a heatmaps-derived covariance representation of the detected features and to exploit this information in a tightly-coupled, Single-stack Hourglass-based MEKF. On a feature detection level, the performance of the proposed Single-stack Hourglass is compared to the more complex HRNet to assess the feasibility of a reduced-parameters CNN within the IP. Results on the selected test dataset suggest a comparable mean detection accuracy, despite a larger standard deviation of the former network. Notably, this latter aspect is found to decrease the pose estimation accuracy of the proposed CNN compared to HRNet, despite the adoption of CEPPnP to capture features uncertainty. However, important insights are gained at a navigation filter level, delineating two major benefits of the proposed tightly-coupled MEKF. First of all, the capability of deriving a measurements covariance matrix directly from the CNN heatmaps allows to capture a more representative statistical distribution of the measurements in the filter. Notably, this is expected to be a more complex task if a loosely-coupled filter is used, due to the need to convert the heatmaps distribution into a pose estimation uncertainty through a linear transformation. Secondly, the coupling between the rotational and translational states within the filter guarantees a mutual interaction which is expected to improve the global accuracy of the filter, especially in the alongtrack estimate. Besides, the navigation results for the selected VBAR    scenario demonstrated that the proposed Single-stack Hourglass could represent a valid alternative to the more complex HRNet, provided that its larger detection uncertainty is reflected in the measurements covariance matrix. Together, these improvements suggest a promising scheme to cope with the challenging demand for robust navigation in close-proximity scenarios. However, further work is required in several directions. First of all, more recent CNN architectures shall be investigated to assess the achievable robustness and accuracy in the feature detection step. Secondly, the impact of a reduction in the number of CNN parameter on the computational complexity shall be assessed by testing the CNNs in space-representative processors. Moreover, broader relative ranges between the servicer camera and the target spacecraft shall be considered, most importantly to allow a thorough investigation of the 3D depth perception challenges when approaching the target spacecraft with a single monocular camera. Besides, more close-proximity scenarios shall be recreated to assess the impact of perturbations on the accuracy and robustness of the navigation filter. In this context, other navigation filters such as the Unscented Kalman Filter shall be investigated.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.