Fast computation of mutual information in the frequency domain with applications to global multimodal image alignment

Multimodal image alignment is the process of finding spatial correspondences between images formed by different imaging techniques or under different conditions, to facilitate heterogeneous data fusion and correlative analysis. The information-theoretic concept of mutual information (MI) is widely used as a similarity measure to guide multimodal alignment processes, where most works have focused on local maximization of MI that typically works well only for small displacements; this points to a need for global maximization of MI, which has previously been computationally infeasible due to the high run-time complexity of existing algorithms. We propose an efficient algorithm for computing MI for all discrete displacements (formalized as the cross-mutual information function (CMIF)), which is based on cross-correlation computed in the frequency domain. We show that the algorithm is equivalent to a direct method while asymptotically superior in terms of run-time. Furthermore, we propose a method for multimodal image alignment for transformation models with few degrees of freedom (e.g. rigid) based on the proposed CMIF-algorithm. We evaluate the efficacy of the proposed method on three distinct benchmark datasets, of aerial images, cytological images, and histological images, and we observe excellent success-rates (in recovering known rigid transformations), overall outperforming alternative methods, including local optimization of MI as well as several recent deep learning-based approaches. We also evaluate the run-times of a GPU implementation of the proposed algorithm and observe speed-ups from 100 to more than 10,000 times for realistic image sizes compared to a GPU implementation of a direct method. Code is shared as open-source at \url{github.com/MIDA-group/globalign}.


Introduction
Multimodal image alignment (also known as registration) [1], involves finding correspondences between images formed by different imaging techniques or under different conditions. The goal is often to enable data fusion of the heterogeneous information of the sources involved. Processes closely related to registration are patch retrieval and template matching. Multimodal image alignment can be a very challenging problem, due to great dissimilarities of the involved modalities. One area where fully automated approaches have been lacking is in alignment of micrographs, and in particular correlative microscopy, where image modalities are often highly distinct and images contain small/thin structures that are difficult to match without relying on fiducial markers or time-consuming manual intervention [2].
Monomodal (or unimodal) alignment can be addressed by a number of existing techniques, often grouped into (i) featurebased methods, such as SIFT [3], where the focus is on finding correspondences between distinct feature points detected in the images, and (ii) intensity-based methods, such as [4], where the alignment is guided by similarity between the whole image functions. Multimodal alignment is more complex and challenging, since similarity in appearance can not be expected for corresponding structures. This reduces the applicability of feature-based methods. Instead, intensity-based methods guided by local optimization of similarity measures which rely on some statistics, such as mutual information (MI) [5,6] or modality independent neighbourhood descriptors [7], are well- Max CMIF gives T Aligned A and T(B) Fig. 1: Illustration of the main steps of the proposed image alignment method. The input consists of two unaligned histological images from SHG (modality A) and BF imaging (B). They are both quantized into 16 categorical labels using k-means clustering (illustrated by pseudo-coloring) followed by computation of dense CMIF maps (MI over all discrete displacements) for each considered rotation angle. The transformation with the max CMIF is identified; it provides the desired alignment. At the end of the pipeline we observe the transformed overlapping images which are successfully aligned. established tools for this task. Another class of methods rely on reducing a multimodal alignment task to a monomodal one, by (typically learning-based) transformation of the image modalities to a common modality [8,9], for which high-performance monomodal methods are applicable.
It has been observed that MI exhibits high performance when used as a similarity measure in local optimization frameworks if the displacements to recover are small, but struggles for larger displacements [4,6,9]. This highlights the need for fast global optimization of MI. However, existing methods [10,11] exhibit run-time complexities which make them unsuited for many practically relevant image sizes.
We here present two algorithms. The first algorithm, and main contribution of this work, efficiently computes MI between two images for all possible discrete displacements on a rectangular domain. The output corresponds to a generalization of the, in the field of 1D (EEG) signal analysis, existing notion of cross-mutual information function (CMIF) [12] and we therefore use the same name in this work. We propose a novel approach for efficient computation of CMIF by using cross-correlation (CC) in the frequency domain, providing an output which is equivalent to exhaustively computing MI over all possible discrete displacements.
Our second contribution is a method that combines the fast CMIF computation with a grid search over additional transformation parameters (e.g. rotation) to facilitate global multimodal image alignment for transformations with reasonably few degrees of freedom (rigid, affine, etc.). Figure 1 illustrates the main steps of global rigid alignment by this method applied on histological images acquired by second harmonic generation (SHG) microscopy, and bright-field (BF) microscopy.
Our third contribution is a theoretical analysis of the proposed CMIF-algorithm. We show that it is equivalent to a direct histogram-based algorithm for computation of MI. Furthermore, we derive the asymptotic computational complexity of the proposed CMIF-algorithm, and show that it is substantially more computationally efficient than the existing methods.
Our fourth contribution is an empirical evaluation of the run-time of the fast CMIF-algorithm in comparison to the direct histogram-based algorithm. We observe speed-ups ranging from hundreds of times to more than 10,000 times for practically relevant image sizes. Our fifth contribution is an evaluation of the proposed global rigid alignment method compared with state-of-the-art methods on three distinct and publicly available multimodal rigid alignment benchmark datasets, following the protocol of [9].
The two proposed algorithms are implemented in PyTorch [13] for accelerated computation on general-purpose graphics processing units (GPGPUs), and the code is shared as opensource at github.com/MIDA-group/globalign .

MI in image processing
MI, introduced in [14], is an information-theoretic measure quantifying the mutual dependence between two random variables. It is commonly used in image processing as a similarity measure between images [15,16].
Given two discrete images A and B with domains X A ⊂ Z n and X B ⊂ Z n , intersecting on X AB , where X AB ≠ ∅, with ranges A and B, MI is defined as where p AB XAB (a, b) denotes the relative frequency of values a and b occurring jointly in X AB , and p A XAB (a) denotes the marginal relative frequency of value a occurring in image A, within X AB . The frequencies p AB XAB (a, b), p A XAB (a), p B XAB (b) can be computed using marginal and joint histograms.
Alternatively, MI can be formulated in terms of the marginal and the joint entropies, where the marginal entropy of image A, on X AB is given by analogous definition holds for image B, and the joint entropy is given by CMIF describes the MI between two images subject to displacement χ ∈ Z n (generalization to nD of definition in [12]),

Image alignment by MI maximization
An image alignment process based on MI maximization can be expressed as finding a transformationT such that where A is the reference image, and T (B) denotes the geometrically transformed floating image, where transformation T is applied to image B to warp it into the space of A, and Ω is a chosen set of transformations. Solutions to (6) are commonly sought locally, using gradient-based methods [5,[15][16][17], or recently through a Gibbs-sampling process [18], or globally through computation of MI for each discrete displacement [10]. Use of local methods requires a good initial guess since MI typically exhibits a large number of local maxima [6]. MI can be combined with gradient information to improve the performance when the modalities exhibit similarity of variation [19].
To reduce the dependence on the size of the images and their overlap, a normalized MI (NMI) [20] was developed. Fast computation of MI is critical in many applications, in particular when MI is used as a similarity measure for alignment tasks, due to the large data and number of similarity computations involved. Stochastic sub-sampling is commonly utilized together with gradient-based local optimization methods [5,6]. Other works include speeding up local optimization approaches by using GPU-based histogram methods [11,21].

Global optimization over discrete displacements
Efficient methods for global optimization over discrete displacements (based on fast convolutions performed in the frequency domain or on use of integral images) exist for absolute mean differences, mean square differences [22], CC and normalized cross-correlation [23], while (to the best of our knowledge) so far not for MI [6,10,11,24].
A direct, histogram-based, approach for computing CMIF for all displacements χ requires, for each χ (considered in isolation), to iterate over X ATχ [B] and increment the joint histogram bin corresponding to the values a and b (and analogously for the marginal histograms). Finally the entropies (Eq. (3) and (4)) are computed using relative histograms, which then directly give the MI through Eq. (2). This method is discussed in [10]. Furthermore, [11,21] discuss maximization of MI by exhaustive search in the context of their fast approaches for GPU-based histogram computation, and conclude that such an approach is too costly to be useful in practice.

Method
We propose a fast and exact algorithm for computing MI for all possible discrete displacements on a rectangular domain (Eq. (5)), enabling fast global translation-based alignment using MI maximization. We show the equivalence to a direct histogram-based method and we analyze the asymptotic computational complexity. Furthermore, we propose a global multimodal image alignment method based on the proposed CMIFalgorithm.

Notation and basic definitions
Consider pairs of images, A∶ X A → A and B∶ X B → B, where X B ⊆ X A ⊂ Z n , together with corresponding given region of interest masks, indicating the user-defined part to be included in the computation of MI. Without loss of generality, we assume that X A and X B are rectangular subsets of Z n , since any other subset can be obtained through appropriately defined masks M A and M B . Let X S denote the set of displacements χ for which the entire domain of the shifted image B intersects with the domain of image A.
The CC, for real-valued functions f and g defined on Z n , is given by Let L a [A] denote an indicator function (level set) on X A , of image A being equal to a ∈ A and within the mask image M A ,

Algorithm for computing CMIF in the frequency domain
Contrary to the direct method, where, given a particular displacement χ, histograms are computed for all image values a and b, we instead reorder the required operations; given a single image value a, (or) b, or a pair of values (a, b), the number of occurrences of the value/pair, and the corresponding entropy contribution, is computed for all χ ∈ X S .
Computation of the joint histogram entries C AB a,b (χ; A, B) can be expressed as and the marginal histogram entries C A a (χ; A, B) and C B b (χ; A, B) respectively, can be expressed as Complete histograms are computed by evaluating Eq. (9) and Eq. (10) for all image values a ∈ A and b ∈ B. Another quantity of interest for the CMIF computation is a map N∶ X S → R ≥0 representing the number of valid points (points where both masks are non-zero) for all χ ∈ X S , which enables computation of relative histograms. N is given by The relative histograms obtained after normalizing C a A , C b B , C a,b AB by (pointwise division by) N can be inserted as probability functions into the shifted versions of Eq. (3) and Eq. (4), which can be used directly to compute CMIF (Eq. (5)). Rounding C A a , C B b , C AB a,b and N for all χ to integers we circumvent possible floating-point errors in the computation and reach an exact result. In Eq. (12), we define that 0 0 = 0 and 0 log 2 0 = 0.

Generalization to spatially weighted MI
Spatially weighted MI (SWMI) [25] is a version of MI that associates each x ∈ X A with a weight, modelling the relative importance of the corresponding image element (and its contribution to the relative frequencies). Our here proposed CMIFalgorithm can be generalized to compute SWMI by replacing the binary mask M A with a weight mask W A ∶ X A → R ≥0 , and modifying the level-set formulation to a weighted version, We may further generalize this approach by considering weight masks for both images A and B where the weight for a given x ∈ X A and χ ∈ X S is taken to be the product,

Equivalence of the correlation-based algorithm and the direct method
The proposed algorithm uses CC to compute CMIF exactly, and identically to the direct method; both involve computing contributions to the same discrete histogram counts N(χ, A, B), C A a (χ, A, B), C B b (χ, A, B) and C AB a,b (χ, A, B) for all a, b and χ ∈ X S . The CC (Eq. (7)) computes a sliding innerproduct between the two functions. For binary functions (images) f and g, Eq. (7) yields the number of elements x where f (x) = g(x + χ) = 1, for a given χ. Hence, the CC of level-set and level-set, level-set and mask, as well as mask and mask, give the histogram entries, for each χ, as given by Eq. (9), (10) and (11). These inner-products provide all the quantities required for exact computation of the shifted entropies (Eq. (12)), and thus for CMIF. The main distinction between the proposed algorithm and the direct method is the order in which the histogram counts are computed; the direct method computes the histogram entries (N(χ; A, B), C A a (χ, A, B), C B b (χ, A, B) for all a, b, for a fixed χ, and the proposed approach computes the histogram entries for all χ, for fixed a, b, before proceeding.

Complexity analysis
The worst-case run-time complexity of the direct approach is given by its requirement of O(1) work for every element in image B for every χ ∈ X S and O(1) work per histogram bin. On the other hand, computation of CC in the frequency domain requires, instead, computation of at most 1 + A + B + A B (forward and inverse) DFTs using the Fast Fourier Transform (FFT) algorithm [26] and element-wise (complex) multiplications, yielding an asymptotic run-time complexity of If A and B are treated as (small) constants, the proposed method is asymptotically efficient cf. the direct approach (w.r.t. the size of the images), except for cases with very small X B . It has been observed [19] that 32 intensity bins provided the best trade-off between (i) flexibility and detail, and (ii) insensitivity to noise, for gradient-based image alignment. For very small images B or sets of displacements X S , or for large value sets A and B, the direct method may be the better choice. In practice, small or medium sized value sets are usually acceptable and the image sizes are often such that the proposed algorithm is several orders of magnitudes faster.

Method for global multimodal alignment
We propose a method for image alignment by global MI maximization for transformations with few degrees of freedom (e.g. rigid or affine) by combining the efficient CMIF-algorithm proposed in Sec. 3.2, with grid search over other transformation parameters such as rotation angle. Relying on global search, the method can find the global maximum without smoothing of the images, which otherwise is commonly performed to increase the size of the region of attraction of the global maximum, but may lead to finding a sub-optimal solution [6].
The method, described in Alg. 1, performs global alignment by taking a set of transformations Ω = {T 1 , . . . , T k } (e.g. selected from a grid in the parameter space, or through random selection), warping the floating image (B) for each transformation using nearest-neighbor (NN) interpolation, computing a CMIF map using the proposed algorithm, and locating the displacement with highest MI. We use NN interpolation to compute T (B) since B is assumed to be categorical. A parameter γ ∈ [0, 1] controls the required fraction of overlap compared to the maximal observed overlap; γ is both used to select transformations with sufficient overlap, and to control how large padding is required. Finally, the resulting transformation is taken as the composition of the transformation from Ω and the T χ , χ∈ X S which leads to the highest MI with required overlap.
The first step of Alg. 1 is to quantize the images into suitably small number of levels using some appropriate quantization approach (unless the ranges are already suitable discrete representations). We use k-means clustering, which has been shown to yield more efficient utilization of the discrete bins for MI-based image alignment [17] than equisized binning. Furthermore, kmeans clustering enables the direct application of the method to multi-channel images, including image pairs with different numbers of channels (denoted m A and m B ). For this work, we use the mini-batch k-means algorithm [27] which is fast and scalable. We denote k the number of clusters used for quantization of both images, such that k = A = B .
In Alg. 1, the reference image A and its mask M A are zeropadded with ⌈ X B i (1 − γ)⌉ units (before and after) along (each) axis i, where X B i denotes the size (diameter) of the (rectangular) domain of image B along axis i.
For the task of global rigid alignment we run the method twice: (i) first using grid search to reach a coarse alignment at an angle θ, (ii) followed by an (optional) refinement step, where random search is employed in an interval around the best angle found by the first grid search, sampling uniformly from the interval θ − 2π Ω , θ + 2π Ω , where Ω denotes the number of angles in the first grid search stage.
Both image alignment tasks, where the reference and floating images are similar in size, as well as patch retrieval (template matching) tasks, where the reference image is larger than the floating image, can be efficiently solved directly by Alg. 1.

Implementation
We implement both the proposed CMIF-algorithm, as well as the global image alignment procedure (Alg. 1) in Python/PyTorch as a way to utilize parallel processing on a GPU. Using Python 3.8.8 and PyTorch 1.8.1, the method runs almost entirely on the GPU, including image warping, padding, level-sets, and CC (FFTs and complex multiplication).
Mini-batch k-means clustering relies on the provided implementation in the sklearn package [28], version 0.24.1.

Performance analysis
We consider three datasets ( [29]): (i) aerial images [30] with infrared (IR) as one modality and color images as the other, (ii) cytological images [31] with quantitative phase images as one modality and fluorescence images as the other, (iii) histological images [32] with SHG images as one modality and BF as the other, (illustrated in Fig. 1). The aerial image dataset and cytological image dataset consist of 864 and 5040 images respectively, and are both divided into three distinct folds; the histological image dataset consists of a single set of 536 images. The aerial and cytological images are of size 300 × 300, and the histological images are of size 834 × 834.

Run-time analysis
We investigate the run-times of the proposed CMIFalgorithm, compared to a direct histogram-based algorithm, implemented in Python/PyTorch, using the built-in PyTorch implementation for computing image histograms on the GPU. The algorithms implemented in PyTorch are very similar to methods 1 and 2 in [11], where the method is chosen dynamically based on the number of requested bins, and thus enabling a fair comparison of the proposed method and the direct method. The sort-and-count algorithm for computing the histograms [11] has been shown to be relevant mostly for larger number of bins, and therefore we do not consider it here, given that we observe that small number of bins is shown to work well in this global optimization context, as seen in Sec. 5.2.2.

Experimental setup
We select one of the histological image pairs at random, and crop/pad it to various sizes such that the side-lengths of the reference image are a power of two {128, 256, 512, 1024, 2048, 4096} and the other (floating) image is of half the size along each dimension, which is the scenario encountered when aligning equally sized images with Alg. 1 and γ = 0.5. We compute the CMIF-map for all χ ∈ X S .
Experiments are run on a Nvidia GeForce GTX 2080 GPU, with 11 GB of memory.
We also apply the image alignment method (Alg. 1) on cytological and histological images, and measure the run-time for a number of configurations, including 50, 100, and 200 grid steps (with 32 additional refinement angles), with k ∈ {8, 16, 32}, with k-means batch size of 1000 and max iterations 25.

Results
The results of the run-time performance experiment regarding computation of a complete CMIF-map are summarized in Fig. 2. We observe that the proposed algorithm is substantially (between 100 times to more than 10000 times) faster than the direct method, and the difference is particularly noteworthy for low numbers of bins, k ∈ {2, 4, 8}. Table 1 shows the run-time for the global rigid alignment method for a number of realistic configurations. For the smaller cytological images, with 8 bins in particular, we observe performance compatible with real-time applications. One rigid image alignment for the histological dataset using the direct method with k = 16 and 200 angles + 32 refinement angles, runs in approximately 19 hours (compared to 32.9s with the fast CMIFalgorithm).

Rigid image alignment
To evaluate the efficacy of the proposed image alignment method, compared to several existing methods, we follow the implemented in PyTorch and executed on a GPU. We observe that the proposed CMIF-algorithm is several orders of magnitude faster than the direct method. The empirical run-times of the proposed algorithm exhibits a strong dependency on k, as expected from Eq. (15). The empirical run-times for the direct method are relatively stable w.r.t. k; we observed at most a difference of 30% between such choices for k ∈ {8, 16, 32, 64}. evaluation protocol of [9], comprising a benchmark for evaluating the effectiveness of 2D rigid image alignment methods on three distinct datasets. The study includes comparison of four image-to-image (I2I) translation methods, as well as a stateof-the-art contrastive learning method for modality transfer, which are all used to attempt to transform the multimodal alignment task into an easier monomodal alignment task, and are then combined with either a high-performance intensity-based method [4] or a well-known feature-based method [3]. Finally, local optimization (gradient-based) MI is also included; it is applied to multimodal data directly. Even though 2D rigid transformation is among the simplest models, solving these alignment tasks still poses a challenge in the presence of multiple distinct modalities. It is also a realistic task in microscopy and aerial settings where relative scale between the images may be known a priori, or readily estimated, while arbitrary rotations can be encountered.

Experimental setup
Following [9], we apply the proposed alignment method on all three datasets. The performance measure is based on the average Euclidean distance between image corners considered as landmarks in the reference image space, and the corresponding recovered landmarks of the aligned floating image. An alignment is considered successful if the error is less than 2% of the width of the images [9].
An important parameter of the proposed method is k, which has a large impact on the run-time (Eq. 15) while also directly influencing how detailed structures can be represented in the quantized representations. We run the alignment task for all the images in the included datasets (aerial, cytological, and histological), with k ∈ {2, 4, 8, 16, 32, 64}, with k-means batch size of 100 and max iterations 100, and measure the success-rates.
The number of angles for the grid search, as well as if including the refinement step or not, are also important consid-  The relationship between success-rate and the number of angles in the grid search, using k = 32, both with, or without, the subsequent refinement step. In both cases, the performance increases substantially up to 150 angles for the three datasets. erations. We evaluate the method on the three datasets using a set of angle counts {25, 50, 100, 150, 200, 250, 300}, with and without refinement with 32 randomly selected angles, and measure the success-rate for each configuration.
We use circular masks, to avoid bias from presence or absence of the signal contents in the corners of the images.

Results
First we present the performance on the three evaluation datasets (as measured by success-rates) for different choices of: k in Fig. 3, and angle count in Fig. 4; The success-rate increases substantially up to k = 16, and 150 angles.
Furthermore, we present the performance of the considered methods in comparison to a number of alternative methods in Tab. 2. The four I2I translation-based methods are combined into a single row, where the maximum success-rate along each column is shown. The proposed method is the best choice on two of the datasets (the cytological and the histological) where it exhibits a substantial improvement over existing methods. CoMIR combined with SIFT reaches 100% success-rate on the Zurich dataset, while the proposed method delivers a successrate of 99.8%, which is better than all included intensity-based methods. Table 2: Success rates (presented as percentages) of the evaluated methods on rigid alignment tasks on three datasets. Results on the aerial and cytological datasets are presented as empirical mean ± std-dev over the three folds. Bold marks the best result on each dataset. Used parameters for the proposed method are k = 32, γ = 0.5, 200 equispaced angles in [−π, π] plus 32 random refinement angles. For details about the reference methods, and their configuration, see [9].

Dataset
Aerial

Discussion and Conclusion
We present a novel fast CMIF-algorithm which computes MI for all discrete displacements of two images (similarly as crosscorrelation). The algorithm works in nD. The proposed algorithm enables fast global alignment (for transformation models with few degrees of freedom), has very few parameters, and is straightforward to configure for new applications. Quantization of image ranges with k-means clustering provides efficient use of a limited number of quantization levels, and enables aligning images with different numbers of channels.
We compare the run-time of the proposed CMIF-algorithm with a direct histogram-method (using a fast GPU-based histogram algorithm), and observe speed-ups ranging from hundreds of times to more than 10,000 times for practically relevant image sizes, indicating the potential for real-time global multimodal alignment. The presented methods are highly parallelizable; the work can readily be distributed over multiple GPUs or computers.
Furthermore, we evaluate the performance of the proposed alignment method on three datasets, and compare it with both gradient-based MI method, and recent methods which rely on deep learning. We observe excellent performance of the proposed method on all three datasets, and we conclude that it is the overall top performing method, even outperforming state of the art deep learning-based methods. Finally, the proposed method does not require aligned image pairs (or any training), and have few parameters to tune, which are advantages in comparison to the deep learning-based methods.
One limitation of the proposed method is that it does not support cost-effective sub-pixel alignment. Future work could involve application of a suitable refinement step based on local optimization towards this end.

Declaration of competing interests
The authors declare no competing interests.