Multimodal registration across 3D point clouds and CT-volumes

Multimodal registration is a challenging problem in visual computing, commonly faced in medical image-guided interventions, data fusion and 3D object retrieval. Its main challenge is finding accurate correspondences between modalities, since different modalities do not exhibit the same characteristics. This paper explores how the coherence of different modalities can be exploited for the challenging task of 3D multimodal registration. A novel deep learning registration framework is proposed, built around a siamese architecture specifically designed for aligning and fusing modalities of different structural and physical principles. Its cross-modal attention blocks lead the network to establish correspondences between features of the different modalities. The proposed framework focuses on the alignment of 3D point clouds and micro-CT 3D volumes of the same object. A multimodal dataset consisting of real micro-CT scans and their synthetically generated 3D models (point clouds) is presented and used to evaluate our methodology.

Registration is the process of aligning different sets of spatial data by determining the proper geometrical transformation [13] between them. Multimodal registration is a special case, where the data to be aligned are of different modalities (e.g. capture techniques or sensors) but represent the same object. These data can be 2D images, 2.5D data (image + depth), 3D images acquired by tomographic modalities like CT, MR or PET, 3D point clouds or 3D meshes. Most multimodal registration research has arisen in the medical imaging field, but cultural heritage (CH) and other areas can equally benefit from the visual combination of multiple modalities in order to produce an accurate and useful representation of, e.g., CH assets [9].
Cultural heritage documentation aims at a multimodal record of CH objects that enables a range of operations, such as inspection, virtual reconstruction of fragmented artefacts and fabrication processes [14][15][16][17][18]. An accurate model of an object's surface and inner structure can also contribute to preservation and monitoring, by detecting structural damage and deformations, cracks, blistering or erosion. The detailed representation of both the interior and the external surface can serve as a foundation for future change monitoring of the object: alterations can be accurately recorded, quantified and tracked through the years [18]. While our specific motivation and data have arisen from the CH field, the applications of the proposed method are not limited to CH.
Geometry acquired from 3D surface scanners is a core aspect of a digital model, but is limited by the fact that only surface data are acquired and the inner structure of the object cannot be documented. The penetrative capabilities of CT scanning allow the digitization of the interior of an object without physically invasive actions [18]. By combining 3D surface models and CT imaging techniques, it is possible to produce more precise 3D representations of an object, consisting of an accurate geometric model of the surface along with a detailed representation of its internal structure [19][20][21].
Multimodal registration is a long-standing research area with many challenges. Finding an accurate, robust and fast multimodal alignment is still very challenging, since different modalities come from different acquisition systems and have different representations and properties. In particular, the core difficulty of aligning CT volumes and point clouds comes from their significant difference in physical characteristics and representation; existing approaches typically bridge this gap by converting one modality into the other, but such conversion results in extra computational cost and loss of structural information. This is the gap that we attempt to address in the current paper.
We propose a deep network architecture capable of registering two different modalities, without transforming either of them before feeding them to the network which performs the registration process. The proposed PCD2VOL method aligns 3D surface data with 3D CT volume data. To the best of our knowledge, this is the first time that a deep learning network is trained to register such modalities. The main contributions of this paper can be summarized as follows:
• The problem of multimodal 3D registration of CT volumes and 3D point clouds is formally defined and a framework for such registration is proposed (to be made publicly available upon publication).
• To the best of our knowledge, this is the first deep learning network that combines regular CNNs, suited for data with a standard grid structure, with geometric deep learning, suited for unstructured data.
• The proposed network employs a siamese architecture of novel cross-modal attention blocks for effective multimodality fusion.
• A multimodal dataset for evaluating algorithms that align CT volumes and 3D point clouds is introduced (to be made publicly available upon publication).
The remainder of this paper is organized as follows: Section 2 discusses related work, while Section 3 defines the problem of 3D multimodal registration of CT volumes and point clouds. Section 4 introduces the proposed methodology for 3D multimodal registration. The proposed evaluation benchmark and experimental results on multimodal alignment are presented in Section 5. The paper is concluded in Section 6.

Related work
Multimodal datasets are increasingly being created and exploited, and there has been growing research on the registration of 3D data obtained from different acquisition sensors or data of different structure. Approaches have been proposed for integrating different data modalities so as to produce complete models; however, the modalities and the approach vary considerably with the specific application. Medical imaging [22], remote sensing [23] and cultural heritage documentation [6] have emerged as the most fruitful application areas for 3D multimodal registration. A comprehensive review of 3D multimodal registration methodologies across application domains can be found in [24].
3D multimodal registration has been extensively researched in the medical domain, due to the variety of medical modalities that need to be fused. Medically oriented registration methods focus on specific modality pairs, clinical tasks or body organs. Detailed surveys on medical multimodal registration can be found in [25][26][27].
Registration methodologies can be broadly classified by the type of correspondence established between the datasets (parts, structure or context): they may be feature-based or intensity-based. In feature-based registration, features (such as interest points, contours or lines) are first extracted from each dataset and are subsequently used to determine the proper correspondence and alignment. Intensity-based methodologies attempt to identify context similarity between the datasets based on the correlation between pixel/voxel intensities [28]. Both techniques have been successfully employed for aligning data from different modalities, by identifying salient structures [29] or statistical dependency of the intensities [30][31][32] across the different modalities. Alternatively, some methods try to simplify the multimodal registration problem to a unimodal one by reconstructing or mapping one modality onto the other [33,34].
Over the last few years, there has been a clear predominance of deep learning techniques for registration [35][36][37][38]. However, most of these methods involve the same modality or the specific combination of 2D images and a 3D model, or are restricted to the medical field due to the assumptions made. There is virtually no research on 3D multimodal registration outside the medical field where the modalities differ in both structure and physical principles.
Our work is motivated by the idea of using attention mechanisms for multimodal registration. An attention mechanism enables a model to focus on the information that is important for a task; thus it has been applied widely to various computer vision problems, including image classification [39], object detection [40], image generation [41] and image captioning [42]. Recently this technique has also been used for multimodal registration: [43] fused RGB images and point clouds by learning feature interactions between the modalities with a cross-modal attention scheme, while [44] developed a self-attention mechanism specifically for aligning 3D medical volumes of the MRI and TRUS modalities.
Our problem is generic in that it concerns the alignment of 3D modalities that are complementary, since they jointly describe the interior and the surface of a 3D object. The proposed network exploits cross-attention for the challenging task of aligning 3D modalities with different geometric data structures. The proposed framework combines CNNs and geometric deep learning for feature extraction with a siamese architecture of cross-modal attention blocks, trained to identify correspondences and fuse regular input data formats (like 3D voxels) with irregular 3D geometric data (like 3D point clouds). To the best of our knowledge, this is the first time that registration of such different modalities, without projecting one modality onto the other, is explored.

Problem definition

Given a set of 3D points P = {p_i ∈ R^3 | i = 1, ..., N} and a 3D CT volume V = {v_{lwh} | l = 1, ..., L, w = 1, ..., W, h = 1, ..., H}, the aim is to find the unknown rigid transformation T, so as to align the two input modalities as well as possible.
The registration result is a rigid transformation matrix T(R, t), where T ∈ SE(3). It consists of two components: a rotation submatrix R ∈ SO(3) and a translation vector t ∈ R^3. The rigid transformation T can then be represented by the following homogeneous 4 × 4 matrix:

T = \begin{bmatrix} R & t \\ \mathbf{0}^{T} & 1 \end{bmatrix}

3D point clouds and 3D CT volumes have different geometrical and physical characteristics; hence, identifying a distance measure for alignment is challenging. Parameters like the centroid or the bounding box (orientation and location) could approximately measure whether two instances of these modalities are aligned, but it is inherently difficult to devise a traditional algorithm that could find correspondences across these modalities. Both modalities represent the same object, therefore common features exist to guide the registration. In our methodology and experiments we take advantage of a ground truth in order to train a neural network and evaluate our results.
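As an illustration, the following numpy sketch (the helper names are ours, not part of the proposed framework) assembles the homogeneous matrix from R and t and applies it to a point cloud.

```python
import numpy as np

def to_homogeneous(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Assemble the 4x4 rigid transformation T from R (3x3) and t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def transform_points(T: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Apply T to an (N, 3) point cloud: p' = R p + t."""
    return points @ T[:3, :3].T + T[:3, 3]

# Example: rotate 90 degrees about the z-axis and translate along x.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T = to_homogeneous(Rz, np.array([5.0, 0.0, 0.0]))
P = np.random.rand(13455, 3)          # a point cloud of the size used in our dataset
P_aligned = transform_points(T, P)
```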

Method overview
The proposed framework, as illustrated in Fig. 1, consists of three main components. Initially, the 3D point cloud and the 3D CT volume are fed into two modality-specific feature extraction network blocks to identify regional and geometric features of each modality independently. Then, the modality-based features are passed to a siamese architecture of cross-modal attention blocks, in order to capture local features and their global correspondence across the modalities. Finally, the deep registration block processes the fused feature representation to extract the registration parameters. The details of each component are discussed in the following subsections.

Feature extraction
Each input modality is initially passed to the respective feature extraction network. The feature extraction of the 3D point cloud modality adopts a variant of PointNet [47], which has been chosen for this task due to its efficiency in capturing critical geometric features of point clouds. The architecture is shown in Fig. 2.
The 3D CT volume is passed through CTVolNet, a CNN-based architecture that efficiently represents the CT volume. Based on [48], two sets of convolutional and max-pooling layers are used to capture regional features, as shown in Fig. 3.
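Since the exact layer configurations are given in Figs. 2 and 3 and are not reproduced here, the following PyTorch sketch only illustrates the two kinds of extractors: a shared per-point MLP in the spirit of PointNet, and a small 3D CNN with two convolution/max-pooling stages for the CT volume. The feature dimension C = 32 and the two conv/pool stages follow the text; all other layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """PointNet-style per-point MLP (weights shared across points); sketch only."""
    def __init__(self, out_channels: int = 32):
        super().__init__()
        # Conv1d with kernel size 1 acts as a shared MLP over the N points.
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 64, 1), nn.ReLU(),
            nn.Conv1d(64, out_channels, 1),
        )

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        # pts: (B, 3, N) -> per-point features (B, C, N)
        return self.mlp(pts)

class CTVolumeEncoder(nn.Module):
    """Small 3D CNN with two conv + max-pool stages for the CT volume; sketch only."""
    def __init__(self, out_channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, out_channels, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )

    def forward(self, vol: torch.Tensor) -> torch.Tensor:
        # vol: (B, 1, L, W, H) -> regional features (B, C, L/4, W/4, H/4)
        return self.net(vol)

# Example shapes matching the dataset description (13,455 points, 90 x 90 x 50 voxels):
pc_feats  = PointCloudEncoder()(torch.rand(1, 3, 13455))      # (1, 32, 13455)
vol_feats = CTVolumeEncoder()(torch.rand(1, 1, 90, 90, 50))   # (1, 32, 22, 22, 12)
```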

Cross-modal attention siamese architecture
The proposed cross-modal attention block identifies local features and jointly determines the spatial correspondence between the input modalities. The cross-modal module exploits the correlations between the modalities and adaptively adjusts the modality features for an accurate fusion result. After the representations for each modality have been extracted, the cross-modal attention block captures the distinct parts of one modality given the context features of the other modality, as proposed in [49,50]. Rather than considering the features of each modality equally, the proposed cross-modal attention block estimates a bidirectional relationship between the input modalities: it highlights the information in one modality that is important with respect to the other, and thus captures the inter-modality relationship.
The two input modality feature maps are denoted as F_P = {f_{p_i} | i = 1, ..., N} and F_V = {f_{v_{lwh}} | l = 1, ..., L, w = 1, ..., W, h = 1, ..., H}, where F_P and F_V are the point cloud feature map and the CT volume feature map respectively. The modality feature maps are sent to a siamese architecture of cross-modal attention blocks; each modality feature map is sent as the primary modality to one cross-modal attention block and as the cross-modal modality to the second block (see Fig. 1).
Without loss of generality, we present the cross-modal attention block independently of the input modality context. The block receives a primary input M_1 ∈ R^{C×N} and a cross-modal input M_2 ∈ R^{C×LWH}, where C denotes the number of features identified in the previous steps (we use C = 32 in our experiments) and N and LWH indicate the size of each 3D feature map. The cross-modal attention block computes a new feature map M_Cor that expresses the modality correlation, as the sum of the initial primary feature map M_1 and the cross-modal feature map CM:

M_{Cor} = M_1 + CM

The cross-modal feature map CM encodes the relationship between a position i of the primary input M_1 and all positions j of the cross-modal input M_2, and is computed following [44,51] as an extended non-local operation:

CM_i = \frac{1}{F(M_1, M_2)} \sum_{j} f(M_{1,i}, M_{2,j}) \, g(M_{2,j})

The function f(M_{1,i}, M_{2,j}) computes the relationship between the feature at the ith position of the primary modality and each feature j of the cross-modal modality, while g computes a linear embedding of the cross-modal input at position j:

f(M_{1,i}, M_{2,j}) = e^{\theta(M_{1,i})^{T} \phi(M_{2,j})}, \qquad g(M_{2,j}) = W_g M_{2,j}

θ and φ are also linear embeddings:

\theta(M_{1,i}) = W_\theta M_{1,i}, \qquad \phi(M_{2,j}) = W_\phi M_{2,j}

where W_g, W_θ and W_φ are the weight matrices to be learned during training. F is a normalization factor of the final result and can be calculated as:

F(M_1, M_2) = \sum_{j} f(M_{1,i}, M_{2,j})

Therefore, CM_i can be estimated by a softmax computation along j:

CM_i = \sum_{j} \mathrm{softmax}_j\!\left(\theta(M_{1,i})^{T} \phi(M_{2,j})\right) g(M_{2,j})

This cross-modal attention module plays a vital role when the features to be fused come from different modalities. It preserves the information from each individual modality and makes the modalities complementary to each other, so as to eliminate the modality gap. The module's output M_Cor combines, at each location of the primary modality, the initial primary feature with the cross-modal features weighted by their correlations with that location. By using a siamese network of cross-modal attention blocks, the network investigates the role of each modality as both a primary and a cross-modal input and identifies their respective correlations. Fig. 4 shows the details of the cross-modal attention block.
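A minimal PyTorch sketch of such a cross-modal attention block is given below, assuming the embedded-Gaussian (dot-product plus softmax) form of the non-local operation described above; the actual layer choices of the proposed network may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Non-local cross-modal attention: primary features attend over cross-modal ones.
    Sketch under the assumption of embedded-Gaussian (softmax) attention."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.theta = nn.Linear(channels, channels)  # embeds the primary input M1
        self.phi   = nn.Linear(channels, channels)  # embeds the cross-modal input M2
        self.g     = nn.Linear(channels, channels)  # value projection of M2

    def forward(self, m1: torch.Tensor, m2: torch.Tensor) -> torch.Tensor:
        # m1: (B, N, C) primary features; m2: (B, M, C) cross-modal features
        attn = torch.matmul(self.theta(m1), self.phi(m2).transpose(1, 2))  # (B, N, M)
        attn = F.softmax(attn, dim=-1)            # normalize over cross-modal positions j
        cm = torch.matmul(attn, self.g(m2))       # (B, N, C) cross-modal feature map CM
        return m1 + cm                            # M_Cor = M1 + CM (residual fusion)

# Siamese use: each modality acts once as primary and once as cross-modal input.
block_p = CrossModalAttention()
block_v = CrossModalAttention()
fp = torch.rand(1, 13455, 32)               # point-cloud features, flattened to (B, N, C)
fv = torch.rand(1, 22 * 22 * 12, 32)        # CT-volume features, flattened to (B, LWH, C)
fused_p = block_p(fp, fv)                   # point features enriched by the volume
fused_v = block_v(fv, fp)                   # volume features enriched by the point cloud
```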

Deep registration block
After computing the spatial correspondences between the input point cloud and volume, the registration block fuses the two sets of feature maps and computes the registration parameters. The deep registration block's architecture is shown in Fig. 5.
The network is supervised with the RMSE (root mean square error) between the predicted and the ground-truth transformation as the loss function. The loss of the deep registration module is back-propagated through all three components, allowing the adjustment of the network parameters and the minimization of the error.
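The following sketch illustrates how the registration head and the transformation loss could look; the pooling of the fused feature maps, the layer sizes and the quaternion parameterization of the rotation are our assumptions, since the text only specifies that an RMSE loss between the predicted and ground-truth transformations is back-propagated through all three components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegistrationHead(nn.Module):
    """Fuses the two cross-modal feature maps and regresses rigid-motion parameters.
    Sketch only: pooling, layer sizes and the quaternion output are assumptions."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 7),            # 4 quaternion + 3 translation parameters
        )

    def forward(self, fused_p: torch.Tensor, fused_v: torch.Tensor) -> torch.Tensor:
        # Global max-pool each fused map over its positions, then concatenate.
        gp = fused_p.max(dim=1).values    # (B, C)
        gv = fused_v.max(dim=1).values    # (B, C)
        out = self.mlp(torch.cat([gp, gv], dim=-1))
        quat = F.normalize(out[:, :4], dim=-1)          # unit quaternion for the rotation
        trans = out[:, 4:]
        return torch.cat([quat, trans], dim=-1)         # (B, 7)

def transform_rmse_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """RMSE between predicted and ground-truth transformation parameters (assumed form)."""
    return torch.sqrt(((pred - gt) ** 2).mean())

# Example with dummy fused feature maps of the shapes produced by the attention blocks:
head = RegistrationHead()
params = head(torch.rand(1, 13455, 32), torch.rand(1, 5808, 32))
loss = transform_rmse_loss(params, torch.rand(1, 7))    # ground-truth parameters go here
```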

Dataset
The proposed fully supervised deep learning method depends on sufficient training data with ground truth. The biggest challenge was the lack of a publicly available dataset with ground truth for aligning 3D models from the source modalities of 3D point clouds and 3D micro-CT volumes. The dataset of the PRESIOUS project [52][53][54] is publicly available and contains 3D models of the modalities of interest. It consists of 17 stone slabs, captured in several modalities across accelerated erosion cycles; the modalities involved are 3D geometry scans (point clouds and 3D meshes), micro-CT volumes, 3D microscopy and petrography. In total, 38 pairs of 3D geometry scans and micro-CT volumes of stone slabs exist.
The use of the PRESIOUS dataset presented a number of challenges. First, the amount of data is limited and insufficient for training our deep network. Moreover, the 3D geometry scans and micro-CT captures were performed independently, without the use of any external reference points; thus the data from the two modalities do not possess the ground truth necessary for training our supervised network.
We thus followed a different path in order to expand and augment the PRESIOUS cultural heritage stone dataset for benchmarking and training multimodal 3D registration algorithms. The process for the creation of the '3DPCD-CT' dataset is outlined in Fig. 6. Starting with the micro-CT data of the PRESIOUS stone slabs, we sliced each slab, resulting in a larger dataset of sub-volumes, and then synthetically generated the 3D surface geometry of each piece. Since the generated 3D point clouds exactly correspond to the respective 3D CT volumes, we consider this correspondence as ground truth for training and evaluation purposes.
Every micro-CT volume was divided into smaller volumes of 50 slices each, providing on average 35 new smaller volumes per original volume. From these smaller volumes, we excluded those with high noise content and no useful stone information, resulting in 636 smaller CT volumes, which were then resized to 90 × 90 × 50 voxels each. The corresponding 3D point clouds were then synthetically generated using the marching cubes method of [55]. The outcome consisted of very dense surfaces, so we simplified each model to 13,455 points using the algorithms of [56,57]. The dataset is split into a training set (80% of the dataset) and a test set (20% of the dataset): the training set contains 508 objects and the test set 128 objects. Each object comprises the CT volume, the respective point cloud and the ground truth transformation (see Fig. 7).
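The generation pipeline can be approximated along the following lines; the iso-surface level and the farthest-point subsampling used to reach a fixed point count are stand-ins for the exact algorithms of [55-57].

```python
import numpy as np
from skimage import measure

def volume_to_point_cloud(sub_volume: np.ndarray, n_points: int = 13455,
                          iso_level: float = 0.5) -> np.ndarray:
    """Extract a surface point cloud from a CT sub-volume (sketch of the pipeline)."""
    # Marching cubes yields a dense surface mesh; we keep only its vertices here.
    verts, faces, normals, values = measure.marching_cubes(sub_volume, level=iso_level)
    # Farthest-point sampling down to a fixed point count (stand-in for [56,57]).
    selected = [0]
    dist = np.linalg.norm(verts - verts[0], axis=1)
    for _ in range(1, min(n_points, len(verts))):
        idx = int(np.argmax(dist))
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(verts - verts[idx], axis=1))
    return verts[selected]

# Example: a synthetic spherical "stone" volume, split into 50-slice sub-volumes.
zz, yy, xx = np.mgrid[0:90, 0:90, 0:100]
volume = (np.sqrt((xx - 50) ** 2 + (yy - 45) ** 2 + (zz - 45) ** 2) < 35).astype(float)
sub_volumes = [volume[:, :, i:i + 50] for i in range(0, volume.shape[2] - 49, 50)]
clouds = [volume_to_point_cloud(v, n_points=1024) for v in sub_volumes]
```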

Experimental results
We evaluated our 3D multimodal registration framework on the '3DPCD-CT' dataset. Since there is no established performance measure for the registration error between a volume and a surface geometry, we employed the target registration error (TRE) [58]. TRE measures the effect of the predicted transformation T_pred against the ground truth transformation T_GT on the initial point cloud P = {p_i | i = 1, ..., N}, based on [59]:

TRE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert T_{pred}(p_i) - T_{GT}(p_i) \rVert^2}

All tests were run on a PC with an i7-7700K CPU at 4.20 GHz, an NVIDIA GeForce GTX 1080 Ti GPU and 32 GB of RAM. Table 1 summarizes the quantitative registration results on the challenging '3DPCD-CT' dataset for multimodal 3D alignment; Fig. 8 illustrates some qualitative results.
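Under this definition, TRE can be computed directly from the two transformations and the initial point cloud, as in the following sketch.

```python
import numpy as np

def target_registration_error(points: np.ndarray,
                              T_pred: np.ndarray,
                              T_gt: np.ndarray) -> float:
    """RMS distance between points transformed by the predicted and GT transforms.
    points: (N, 3); T_pred, T_gt: (4, 4) homogeneous rigid transformations."""
    hom = np.hstack([points, np.ones((len(points), 1))])   # (N, 4) homogeneous points
    p_pred = (hom @ T_pred.T)[:, :3]
    p_gt = (hom @ T_gt.T)[:, :3]
    return float(np.sqrt(np.mean(np.sum((p_pred - p_gt) ** 2, axis=1))))
```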
An accurate and fair comparison between our method and other approaches from the literature is not straightforward, because we could not identify any previous registration method that directly aligns point clouds and CT volumes. We thus used the classic ICP [60] as a baseline; in order to do so, we pre-processed the CT volumes, converted them into point clouds, and then ran the ICP algorithm between these point clouds and the point clouds of the '3DPCD-CT' dataset. In general, ICP fails when the rigid transformation difference is large; to succeed, ICP needs a good initial transformation estimate, which is not available in realistic applications. Thus, in most cases, ICP did not converge. Moreover, ICP and other state-of-the-art registration techniques require inputs of the same modality (point clouds in general), necessitating the conversion of one of the inputs in order to bridge the modality gap. This conversion involves loss of information, which can significantly affect the registration result. In addition, such a conversion can be expensive, especially when large 3D volumes are involved, as in CH applications. For example, in our experiments the conversion of a CT volume into a point cloud representation took approximately 1 h. Conversely, after training, our method requires 0.12 s per registration.
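For reference, such an ICP baseline could be reproduced roughly as follows with Open3D, under the assumption that the CT volume has already been converted to a surface point cloud (as in the dataset generation step) and that point-to-point ICP is used; this is not the exact pipeline of our experiments.

```python
import numpy as np
import open3d as o3d

def icp_baseline(source_pts: np.ndarray, target_pts: np.ndarray,
                 threshold: float = 2.0) -> np.ndarray:
    """Point-to-point ICP between two (N, 3) point clouds; returns a 4x4 transform."""
    source = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(source_pts))
    target = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(target_pts))
    result = o3d.pipelines.registration.registration_icp(
        source, target, threshold, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return np.asarray(result.transformation)
```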
We thus opted for a direct comparison of our method against other multimodal registration methods, even though they may register different modalities, as this was the closest we could get to comparing against other methods. Table 2 presents quantitative registration results of the latest state-of-the-art 3D multimodal registration methods. Most of these methods align data of different modalities but of the same structure. Of course, the results are only indicative, since each method registers different modalities and the datasets on which the experiments were conducted differ and are oriented to the specific modalities and task. The table reports the TRE metric, as it is considered a more generic measure of registration accuracy [58]. In general, TRE is the distance between the corresponding points of the inputs, but because the modalities that each method fuses differ, the exact calculation of TRE may also differ.
The methods that align data of different representations are [29] and the proposed one (Table 2). [29] aligns 2D images against a 3D model; however, that method converts one modality to the other as a first step (the 2D images to a 3D model) and then performs a typical unimodal registration. The conversion incurs the penalties of computational cost [29] and information loss, as also attested by its high TRE. The proposed method directly registers data of different modalities and of different structure, which is a more challenging task compared to registering multimodal data of the same structure.
Interestingly, the initial TRE, corresponding to the initial pose of the inputs of the compared methods, varies significantly. The results in Table 2 show that the registration error is associated with the difference in the initial pose of the inputs. When the input modalities start from a pose close to the ideal solution, the initial TRE is lower and so is the final registration error (final TRE). Conversely, many commonly used registration methods may produce insufficient results if the modalities are not initialized properly [61].
In an attempt to measure the improvement in alignment achieved by the compared methods, we also calculated the percentage change (PC) in TRE as in [62]:

PC = \frac{TRE_{initial} - TRE_{final}}{TRE_{initial}} \times 100\%

Higher values of PC denote a larger improvement over the initial pose. We chose a high initial TRE for the evaluation of our method in order to mimic real, challenging situations. Taking into consideration the PC of the proposed method and the fact that it operates on modalities of different data structure, the obtained results can be considered very competitive.
However, there are some cases where our method fails to accurately register the inputs. Such an example is depicted in the last row of Fig. 8, where the initial misalignment of the inputs was considerable, both in terms of rotation and translation; although the method determined the proper rotation, it failed to detect the correct translation.
The registration siamese network proposed here is the first registration mechanism that attempts to align two modalities that differ not only in data type but also in data structure. In this light, the achieved results can be considered satisfactory as well as promising. For example, the work of [44], which also uses a cross-modal attention block to register MRI and TRUS data, achieves comparable registration results and has competitive computational cost: [44] reports a target registration error between the surfaces of 3.63 and a PC of 54%. However, MRI and TRUS have the same structure (sequences of images), so that network uses the same feature extractor for representing both input volumes. Moreover, method [44] appears to be more efficient in terms of run-time; since this comparison involves absolute execution times based on specific experiments and datasets, we do not consider it conclusive. Our method deals with high-resolution input data of different structures, so the search for spatial correspondences through the cross-modal block increases the computational cost.
3D volume modalities (i.e. CT, MRI, TRUS) contain details about the inner structure of the object, like cracks, porosity and voids. Methods like [22,44,45] can detect and use contextual information based on the respective intensities in order to fuse different modalities of 3D volumes. On the other hand, 3D models contain a precise representation of the external surface of the object. A conversion from one modality to the other may therefore result in information loss that significantly affects the registration result. For example, a 3D model of the surface lacks information about the inner details, so a conversion will not contain any valuable contextual information about the interior, and this is likely to affect the registration result. Conversely, a conversion of a 3D volume to a 3D model may add extra computational time without a corresponding benefit in registration accuracy.

Ablation study
To demonstrate the contribution of the proposed framework and to validate the effectiveness of each component, we executed three different trials of our network, excluding a different module each time.
The results are shown in the lower part of Table 1. It can be seen that removing any of the components considerably reduces the registration accuracy; removing the cross-modal attention module results in the largest loss.

Conclusions and future work
In this work, we present a direct solution for the challenging task of 3D multimodal registration between 3D volumes and 3D point clouds. A novel deep network that consumes and fuses different 3D modalities (CT volumes and point clouds) is proposed. These modalities are treated directly (no conversion of one onto the other) to avoid information loss and time penalties. Our network introduces a novel siamese architecture of cross-modal attention blocks that captures and fuses features of two structurally different modalities.
We believe that this approach is an important step forward, as it addresses the non-trivial task of aligning modalities of different structural and physical principles, for which it is extremely challenging to write traditional (non deep learning) code. The method presented can potentially be extended to other computer vision tasks, such as multimodal retrieval and recognition. Moreover, it can be generalized to different modalities thanks to its adjustable framework: using alternative feature extraction methods suitable for each modality, the method can be extended to fuse modalities such as 3D meshes, voxel data or medical imaging modalities such as MRI, 3D TRUS etc.

Fig. 1. Overview of the proposed cross-modal 3D registration framework. The 3D cross-modal registration network consists of three stages. 1. Each input modality (point cloud and 3D CT volume) is fed into an independent feature extractor network that is suitable for that modality. 2. The captured features are fed to a siamese architecture of cross-modal attention blocks. 3. The registration block fuses the cross-modal features into the final registration parameters.

Fig. 2. The adopted PointNet [47] architecture used to extract point cloud features. For each point p_i of the point cloud P = {p_i | i = 1, ..., N}, the network computes C features.

Fig. 4. The detailed architecture of the proposed cross-modal attention module.

Fig. 5. The detailed architecture of the deep registration module.

Fig. 7. Example point clouds in the 3DPCD-CT dataset. Two different object cases are shown: a. the Nidaros GSmall 01 stone and b. the Nidaros BLarge 02 stone. For each case, the whole 3D geometry of the stone is shown on the left and, on the right, point clouds of different stone pieces generated from the respective pieces of the CT volume.

Fig. 8. Example multimodal registration outcomes for the proposed method.

Table 1
Registration results of the proposed method on the '3DPCD-CT' dataset when random rotations and translations are applied to the initial sub-pieces. The metrics evaluated are the target registration error (TRE) and Recall with a threshold of 6.00. The initial TRE of the transformations was 15.34.

Table 2
Performance comparison between multimodal registration methods.