ALPHA DIVERGENCES BASED MASS TRANSPORT MODELS FOR IMAGE MATCHING PROBLEMS

Abstract. Registration methods can be roughly divided into two groups: area-based methods and feature-based methods. In the literature, the Monge-Kantorovich (MK) mass transport problem has been applied to image registration as an area-based method. In this paper, we propose to use the MK mass transport model as a feature-based method. This novel image matching model couples the MK problem with the well-known alpha divergence from probability theory. The optimal matching scheme is the one that minimizes the weighted alpha divergence between two images. A primal-dual approach is employed to analyze the existence and uniqueness/non-uniqueness of the optimal matching scheme. A block coordinate method, analogous to the Sinkhorn matrix balancing method, can be used to compute the optimal matching scheme. We also derive a distance function for image morphing. Similar to the elastic distances proposed by Younes, the geodesic under this distance function has an explicit expression.

1. Introduction. Image matching, or image registration, is an important task in medical imaging and computer vision. The goal is to align two or more pictures taken at different times or from different sensors. Given two (or more) images of a set of objects, one needs to identify the corresponding objects in these images. Due to noise or different imaging methods, these corresponding objects may be located at different positions or have different intensities. A registration process finds a deformation from one image to the other that is optimal in a certain sense. Over several decades, many registration methods have been proposed to search for the optimal deformation over a certain class of transformations (e.g., affine, elastic) such that the similarity among the corresponding objects is maximized; see, for instance, [47], [34], [16], and the references therein. Generally a registration method consists of two parts: the choice of the feature space and the selection of the similarity function. The feature space could include pixel intensity, edges, corners, or points, etc. The similarity function could be the sum of intensity differences, the sum of squared intensity differences, the cross-correlation function, the mutual information, etc. Due to the diversity of images to be registered, it is impossible to design a universal method applicable to all registration tasks. According to whether distinctive objects are detected, registration methods can be divided into two groups: area-based methods and feature-based methods. A survey of registration methods can be found in [61].
In this paper, we are interested in applying the Monge-Kantorovich problem to image registration, especially in developing feature-based methods. In the 1780s, Monge formulated the problem of transporting a pile of soil with the least amount of work. In the 1940s, Kantorovich [27] employed a dual variational principle to convert the original problem into a linear problem. A survey of theoretical work on this problem can be found in [13]. Recently, Monge-Kantorovich (MK) mass transport has been applied to image retrieval, image registration (matching), and image morphing, for instance in [55], [26], [41], [59], [37], [35]. In image registration, the image intensity functions are regarded as piles of soil, and registration becomes the task of moving these piles. The minimal transport cost is also called the earth mover's distance [41] or the Kantorovich distance. The MK problem itself has several advantages: the minimizer is unique, there are no other local minimizers, and the problem is parameter-free and symmetric. However, a well-known undesirable effect called the "fade-in and fade-out" effect might occur, i.e., the matching result could map a small high-intensity region to a large unrelated low-intensity region (see [21]). This intensity change effect mainly originates from the measure-preserving constraint intrinsic to the MK problem. To alleviate this effect, Haker et al. added an image intensity comparison term to the MK cost function to penalize the change of intensity [21]. On the other hand, owing to the measure-preserving constraint, numerically computing the Kantorovich distance for images is not an easy task. Their algorithm is a gradient descent method searching for a minimizer of the transport cost over the set of measure-preserving mappings. The gradient direction is computed through the Helmholtz decomposition. To maintain the measure-preserving property, the stepsize should not be too large.
Each iteration of this algorithm has computational complexity of order $O(N \log N)$, where $N$ is the number of pixels in the two images. This is a significant improvement over previous numerical methods, whose complexity is of order $O(N^2)$ (see, e.g., [26]). Since this method works directly with intensity values without detecting any salient object, it can be regarded as an area-based method. In practical applications, area-based methods are sensitive to intensity changes, which are usually introduced by noise or varying illumination [61].
As mentioned in [61], feature-based methods are recommended if the images contain enough distinctive and easily detectable objects. Traditionally, the feature objects were selected manually by experts. Currently, various feature detection methods are employed to automatically extract salient structures in the images. The extracted features could be region features, line features, or point features, such as the locations of corners and boundary points. For further processing, these features can be represented by their point representatives, which are called control points in the literature. Hence, point-set representations are commonly used in feature-based registration. The matrix with entries $\{\exp(-\|x_i - y_j\|^2/\sigma^2) : i = 1, \ldots, m;\ j = 1, \ldots, n\}$ is known as the proximity matrix [42], and $\sigma$ is the kernel scale. The transformation $T$ is estimated through $\min_T M_{KC}(T(X), Y)$. Their experiments show that the multiple-linked registration is more robust than the ICP method, and it also outperforms the EM-ICP method in terms of registration accuracy. The KC framework was further improved by Jian and Vemuri in [23]. They regarded points as realizations of some mixture of Gaussian distributions, which turns the registration of point-sets into the alignment of two Gaussian mixtures. In this framework, the transformation is estimated by minimizing the $L^2$ distance between the Gaussian mixtures. This framework was later applied to multiple point-set registration [52].
In this paper, we propose another multiple-linked ICP method, called the alpha divergence matching model, in which a kernel correlation is weighted by the alpha divergence¹ of correspondence functions (see Eq. (7)). By maximizing this weighted kernel correlation, both the unknown transformation and the correspondence are estimated. Given two point-sets X and Y as described above, we assign a unit weight $\gamma_i^+ = 1$ to each point $x_i$ and a unit weight $\gamma_j^- = 1$ to each point $y_j$. Represent a matching structure by two nonnegative $m \times n$ matching (or correspondence) matrices $\Gamma^+ := \{\gamma_{i,j}^+ : i, j\}$, $\Gamma^- := \{\gamma_{i,j}^- : i, j\}$ with the property
(4) $\sum_{j=1}^n \gamma_{i,j}^+ = \gamma_i^+ = 1$ for each $i$, and $\sum_{i=1}^m \gamma_{i,j}^- = \gamma_j^- = 1$ for each $j$.
In other words, $\{\gamma_{i,j}^+ : j\}$ is a partition of $\gamma_i^+ = 1$ and $\{\gamma_{i,j}^- : i\}$ is a partition of $\gamma_j^- = 1$. Let
(5) $\sum_{i=1}^m \sum_{j=1}^n (\gamma_{i,j}^+)^{\alpha} (\gamma_{i,j}^-)^{1-\alpha} K(x_i, y_j)$,
where $0 < \alpha < 1$ and $K(x, y) = K(\|x - y\|)$ with $K : [0, \infty) \to (0, 1]$ satisfying
(6) (i) $K(r) = 1$ if and only if $r = 0$; (ii) $K(r)$ is a decreasing function of $r$.
Then we quantify the similarity between the point-sets by the maximum of this cost subject to the $m + n$ marginal constraints in Eq. (4). The matching structure is characterized by the maximizer. Since $\gamma_{i,j}^+ > 0$ if and only if $\gamma_{i,j}^- > 0$ for each pair $(i, j)$ (see Lemma 4.3), we may describe the matching structure by a set of matching pairs, $\{(x_i, y_j) : \gamma_{i,j}^+ > 0,\ \gamma_{i,j}^- > 0\}$, where each pair with $\gamma_{i,j}^+ > 0$ and $\gamma_{i,j}^- > 0$ indicates a link between $x_i$ and $y_j$. We call the maximization problem in Eq. (7) the alpha divergence matching (alpha-D) problem. In this discrete setting, the matching between point-sets is usually not one-to-one, i.e., a fuzzy matching is allowed. Analogous to the KC method, the function $E_\alpha$ is always nonnegative and equals zero if and only if X = Y (see Lemma B.1). Then, using the closest point property, we propose to estimate the transformation by
(9) $\max_T\ E_\alpha(T(X), Y) +$ some regularity term.
¹The α-divergence [39] between two discrete distributions $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ is of fractional order $\alpha \in (0, 1)$.
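The footnote's formula is cut off in the source. For reference, a standard parametrization of the α-divergence consistent with the cost $p^\alpha q^{1-\alpha}$ used throughout (the exact normalization in [39] may differ) is:

```latex
D_\alpha(p \,\|\, q)
  \;=\; \frac{1}{\alpha(1-\alpha)}
    \sum_{i=1}^{n}\Bigl(\alpha\, p_i + (1-\alpha)\, q_i - p_i^{\alpha} q_i^{\,1-\alpha}\Bigr),
  \qquad \alpha \in (0,1).
```

By Young's inequality, each summand is nonnegative, and the divergence vanishes if and only if $p = q$.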
We call the above framework the alpha divergence matching model, or the alpha-D model. When $K(x, y) = \exp(-\|x - y\|^2/\sigma^2)$, this approach can be regarded as a generalization of the KC method; hence it inherits the robustness of the KC method. This framework has several nice properties: 1. The alpha-D problem (Eq. (7)) has the closest point property (see Lemma A.1).
When one point-set differs from another by a sufficiently small perturbation, the correspondence can be identified correctly. 2. The $L^2$ MK mass transport problem is an asymptotic limit of the alpha-D problem as the kernel scale $\sigma$ approaches infinity (see Remark 1). The alpha-D problem also shares some of the measure-preserving property of MK mass transport (see Theorem 3.7). Empirically, these properties reduce the number of local optimal solutions by inhibiting a large number of matching links connecting to one single point. 3. The measure-preserving condition does not appear as a constraint in the alpha-D problem. Each correspondence matrix is optimized subject only to row-sum or column-sum constraints. Numerically, this optimization problem is much easier than the MK problem, in which one correspondence matrix is optimized subject to both row-sum and column-sum constraints. That is, one column-stochastic matrix and one row-stochastic matrix are computed to represent the matching structure in the alpha-D problem, instead of a doubly stochastic matrix as in the MK problem. Besides, thanks to the partially decomposable structure in Eq. (9), a block-coordinate ascent method can be employed to solve the point-set matching problem, optimizing each block subject to its own constraints. Keeping the remaining variables fixed, the maximizer for $\Gamma^+$ or $\Gamma^-$ is uniquely determined with an explicit expression. Thus, we may solve the problem by updating $\Gamma^+$, $\Gamma^-$, and $T$ cyclically.
In this paper, we aim to provide a theoretical analysis of the continuous version of the alpha-D problem (Eq. (7)), including existence and uniqueness of the optimal matching scheme (the continuous counterpart of the correspondence matrices). In the MK problem, the uniqueness of the optimal measure-preserving mapping follows from an absolute continuity condition called the $N^{-1}$ property (see [7], [17]). Analogously, we list several conditions that ensure the uniqueness of the optimal matching scheme in the alpha-D problem, including a bijection condition and an absolute continuity condition. It is worth mentioning that Gaussian kernel functions satisfy the bijection condition. In other words, the continuous counterpart of the above fuzzy correspondence matrix becomes a bijective mapping when the proposed conditions are met. Numerically, an algorithm (with a resemblance to Sinkhorn balancing [45]) is proposed to compute the optimal correspondence matrices.
This paper is organized as follows. In section 2, we start with a brief review of the MK problem and then introduce the alpha-D problem. Several relations between the MK problem and the alpha-D problem are discussed. In section 3, we prove the existence and uniqueness of solutions of the alpha-D problem based on the weak duality theorem. In section 4, we provide an algorithm for the alpha-D problem and prove its convergence. Afterwards, we investigate the related image morphing problem for α = 1/2. An algorithm is provided to compute a geodesic path between two images under the corresponding Riemannian metric. In section 5, we present several preliminary numerical results.

2. Background & problem formulation. The alpha divergence matching (alpha-D) problem proposed in this paper couples the Monge-Kantorovich (MK) mass transport problem with the alpha divergence. The main difference between the alpha-D problem and the MK problem is the treatment of the measure-preserving condition: in the MK problem, the measure-preserving condition is strictly imposed, while in the alpha-D problem the condition is relaxed. Next, we introduce these two problems formally.
2.1. MK mass transport. We list several well-known properties of the MK mass transport problem, following the survey [12].
Consider two nonnegative Radon measures $\nu_X$, $\nu_Y$, both absolutely continuous with respect to Lebesgue measure, on $X := \mathbb{R}^d$, $Y := \mathbb{R}^d$, with supports $\Omega_X := \mathrm{spt}(\nu_X) \subset X$, $\Omega_Y := \mathrm{spt}(\nu_Y) \subset Y$, satisfying the mass balance condition
(11) $\nu_X(\Omega_X) = \nu_Y(\Omega_Y)$.
Given the cost density $c : X \times Y \to [0, \infty)$, we seek an optimal measure-preserving mapping $s : X \to Y$ such that
(12) $\min_s \int_X c(x, s(x)) \, d\nu_X(x)$,
where $s$ is a measure-preserving mapping, i.e.,
$\int_Y g(y)\, d\nu_Y(y) = \int_X g(s(x))\, d\nu_X(x)$ for all continuous functions $g \in C(Y)$.
Then, instead of searching for $s$, we seek a product measure, which implicitly describes the mass transport scheme, in the class of product measures on $X \times Y$ whose marginals are $\nu_X$ and $\nu_Y$. The original problem in Eq. (12) becomes the minimization problem
(15) $\min \{ J[\nu] := \int_{X \times Y} c(x, y)\, d\nu(x, y) \}$,
subject to $\nu$ in the above class of product measures. This formulation circumvents the difficulty caused by the nonlinear structure of the constraint in Eq. (12). By introducing a Lagrangian function, one can derive the dual problem
$\sup \{ \int_X \phi \, d\nu_X + \int_Y \psi \, d\nu_Y : \phi(x) + \psi(y) \le c(x, y) \}$,
and for the quadratic cost $c(x, y) = \|x - y\|^2/2$ the dual problem is equivalent to the problem
(17) $\inf \{ \int_X \phi \, d\nu_X + \int_Y \psi \, d\nu_Y : \phi(x) + \psi(y) \ge x \cdot y \}$.
Based on this formulation, one can show that the optimal mapping $s$ can be expressed as $s = \nabla\phi(x)$ a.e. Besides, the new constraint in Eq. (17) yields that the optimal $(\phi, \psi)$ must be a convex conjugate pair. Thus, the mass transport problem is equivalent to finding a pair of convex conjugate functions such that the mapping $s(x) := \nabla\phi(x)$ is measure preserving. Thirdly, the optimal measure-preserving mapping $s$ is essentially a bijection from $X$ to $Y$. Readers are referred to [7], [17], [32], [38] for more discussion of the MK mass transport problem. When the MK problem is applied to image matching tasks, the image intensity functions are regarded as piles of soil described by the measures $\nu_X$ and $\nu_Y$ [59], and the intensity functions are normalized so that the mass balance condition $\nu_X(\Omega_X) = \nu_Y(\Omega_Y)$ holds. The optimal matching scheme for two images is implicitly described by the optimal product measure $\nu$ with the least mass transport cost.
Lastly, although a variety of cost density functions are discussed in the literature, e.g., $\|x - y\|^p$ with $1 \le p < \infty$, the measure-preserving constraint is strictly enforced in the MK problem.
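As a toy illustration of the relaxed problem in Eq. (15): for uniform discrete marginals an optimal coupling can be taken at a vertex of the Birkhoff polytope, i.e., a permutation matrix, so a tiny instance can be solved by enumeration. This is illustration only (`mk_discrete` is a hypothetical helper); practical solvers use linear programming or flow algorithms.

```python
import itertools
import numpy as np

def mk_discrete(C):
    """Brute-force Monge-Kantorovich with uniform marginal weights 1/n.

    C[i, j] is the cost of moving mass from x_i to y_j.  Since an optimal
    coupling with uniform marginals can be chosen as a permutation matrix,
    we enumerate all permutations (feasible only for tiny n).
    Returns the optimal assignment and the average transport cost.
    """
    n = C.shape[0]
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum(C[i, p[i]] for i in range(n)))
    return best, sum(C[i, best[i]] for i in range(n)) / n
```

For the squared cost on the real line, the optimal assignment is the monotone one, which the enumeration recovers.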
2.2. The alpha-divergence (alpha-D) problem. The setting of the alpha divergence matching (alpha-D) problem is given as follows. Consider two nonnegative Radon measures $\nu_X$ on $X := \mathbb{R}^d$ and $\nu_Y$ on $Y := \mathbb{R}^d$. Let the similarity function $K(x, y) : \Omega \to \mathbb{R}$ be a positive, smooth function (at least twice differentiable), which quantifies the similarity between two vectors $x \in X$, $y \in Y$.
Define the primal problem as follows. Assume that both $\nu_X, \nu_Y$ are absolutely continuous with respect to the Lebesgue measures $\mu_X, \mu_Y$, i.e., $\nu_X \ll \mu_X$, $\nu_Y \ll \mu_Y$, where $\mu_X, \mu_Y$ denote Lebesgue measures on $\Omega_X$ and $\Omega_Y$ respectively. Let $\Omega := \Omega_X \times \Omega_Y$. Let $P = d\nu_X/d\mu_X$ and $Q = d\nu_Y/d\mu_Y$ be the densities of $\nu_X, \nu_Y$ with respect to $\mu_X, \mu_Y$. For simplicity, write $dx = d\mu_X$ and $dy = d\mu_Y$; then $P = d\nu_X/dx$ and $Q = d\nu_Y/dy$. The primal problem is to find an optimal pair of correspondence functions $(p, q)$ (or an optimal matching scheme described by the product measures $(\nu_X, \nu_Y)$ defined in Eq. (20)) for the maximization problem:
(18) $\sup\ L_H(p, q) := \int_{\Omega_Y} \int_{\Omega_X} p^{\alpha} q^{1-\alpha} K \, dx \, dy$, subject to $\int_{\Omega_Y} p(x, y)\, dy = P(x)$ and $\int_{\Omega_X} q(x, y)\, dx = Q(y)$.
Since the constraints are related to the marginal densities P, Q, we would call them marginal constraints.
In this paper, α is assumed to be a scalar in (0, 1). In applications, we use α = 1/2. When α = 1/2, the cost function is symmetric in $p$ and $q$, and it can be viewed as a weighted Hellinger distance⁴.
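For $\alpha = 1/2$ and $K \equiv 1$, the link to the Hellinger distance can be written out explicitly under unit total masses $\iint p\,dx\,dy = \iint q\,dx\,dy = 1$ (conventions for the factor of 2 vary, and Eq. (19) is truncated in the source):

```latex
\iint \bigl(\sqrt{p}-\sqrt{q}\,\bigr)^{2}\,dx\,dy
  \;=\; \iint p \;+\; \iint q \;-\; 2\iint \sqrt{pq}\,dx\,dy
  \;=\; 2 - 2\,L_H(p,q),
\qquad L_H(p,q) = \iint \sqrt{pq}\,dx\,dy \quad (K \equiv 1).
```

A general positive kernel $K$ reweights the affinity term $\iint \sqrt{pq}\,K$, which is the sense in which the cost is a weighted Hellinger-type functional.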
Unfortunately, an optimal solution $(p, q)$ of the primal problem does not exist in general; loosely speaking, the optimal functions $p, q$ are usually Dirac delta functions. To study the primal problem, we introduce the corresponding measures $\nu_X, \nu_Y$, defined for any measurable set through the correspondence functions as in Eq. (20). Then the primal problem becomes
(21) $\sup \int_{\Omega} \left(\frac{d\nu_X}{d\omega}\right)^{\alpha} \left(\frac{d\nu_Y}{d\omega}\right)^{1-\alpha} K \, d\omega$,
subject to the marginal constraints in Eq. (22). Here $\omega$ is some dominating measure on $X \times Y$ with $\nu_X \ll \omega$, $\nu_Y \ll \omega$. The cost is invariant under the selection of $\omega$, so we may choose $\omega = \nu_X + \nu_Y$; the absolute continuity follows from the fact that $\nu_X(A) \le \omega(A)$ and $\nu_Y(A) \le \omega(A)$ for each measurable set $A$. The dual problem is to find an optimal pair of positive functions $(a, b)$ for the minimization problem:
(23) $\inf\ \alpha \int_{\Omega_X} a \, d\nu_X + (1-\alpha) \int_{\Omega_Y} b \, d\nu_Y$,
subject to
(25) $K \le a^{\alpha} b^{1-\alpha}$, $a > 0$, $b > 0$.
The duality relation between these two problems is shown in Theorem 3.2.
The α-divergence has been used as a measure of dissimilarity between two densities; hence, the α-divergence can be applied to image registration, for instance [31]. In applications of our model, consider two images with intensity functions $I_X, I_Y$, and represent features of these images by weight functions $P$ and $Q$ respectively. For instance, in terms of object contours, we may choose the norms of the gradients $P(x) = |\nabla I_X(x)|$ and $Q(y) = |\nabla I_Y(y)|$; in terms of point-sets, we may choose various corner detectors (see references in [61]) and let $P = \sum_{i=1}^m \delta_{x_i}$, $Q = \sum_{j=1}^n \delta_{y_j}$, where $\delta_x$ is the Dirac delta mass and $m, n$ are the numbers of selected points in the two images. The matching scheme is described by the optimal correspondence functions $(p, q)$ solved in the primal problem. The feature similarity function $K(x, y)$ is the Gaussian kernel function $\exp(-\|x - y\|^2/\sigma^2)$, where $x$ and $y$ refer to the spatial positions of feature points.
⁴In Eq. (19), $d_H(\nu_X, \nu_Y)$ (or $d_H(p, q)$) is defined via $L_H$; when $K \equiv 1$ and $\nu_X(X) = \nu_Y(Y) = 1$ in the primal problem, $2 - L_H(p, q)$ is the Hellinger distance between the two densities $p, q$.
In applications of image morphing, we choose cosine functions as feature similarity functions. Then the problem can be
regarded as a combination of the MK problem and the elastic distance proposed by Younes [56] (see Remark 4).
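A minimal sketch of the contour-feature choice $P(x) = |\nabla I_X(x)|$ above, using finite differences (`feature_weights` is an illustrative helper name, and the normalization convention is left to the caller):

```python
import numpy as np

def feature_weights(I):
    """Gradient-magnitude feature weights P(x) = |grad I(x)| for an image I.

    np.gradient returns central differences along each axis (rows, cols);
    the Euclidean norm of the two components gives the contour strength.
    """
    gy, gx = np.gradient(I.astype(float))
    return np.hypot(gx, gy)
```

On a step-edge image, the weights concentrate on the edge and vanish in flat regions, which is the behavior the contour representation relies on.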

2.3. Connection to the MK problem. We point out two relations between the MK problem and the alpha-D problem when $K$ is the Gaussian kernel function. First, the $L^2$ case of the MK problem is an asymptotic limit of the Hellinger distance problem.
As $\sigma \to \infty$, using the Taylor expansion $\exp(-\|x - y\|^2/\sigma^2) \approx 1 - \|x - y\|^2/\sigma^2$, we can approximate the cost in the primal problem by
$\int_{\Omega} \left(\frac{d\nu_X}{d\omega}\right)^{\alpha} \left(\frac{d\nu_Y}{d\omega}\right)^{1-\alpha} d\omega \;-\; \frac{1}{\sigma^2} \int_{\Omega} \|x - y\|^2 \left(\frac{d\nu_X}{d\omega}\right)^{\alpha} \left(\frac{d\nu_Y}{d\omega}\right)^{1-\alpha} d\omega.$
By Hölder's inequality⁵, the first term achieves its maximum when the quotient $\frac{d\nu_Y}{d\omega} / \frac{d\nu_X}{d\omega} =: k$ is constant. When the two images are represented by two piles of equal mass, this implies $k = 1$, and then $\nu_X = \nu_Y =: \nu$. Afterwards, maximizing the second term (which enters with a negative sign) is equivalent to the problem
$\min_{\nu} \int_{\Omega} \|x - y\|^2 \, d\nu(x, y).$
This is exactly the $L^2$-MK mass transport problem.
Secondly, the alpha-D problem with a Gaussian kernel $K$ can be regarded as a robust modification of the MK problem. The Gaussian kernel function in Eq. (9) can be reformulated as a minimization over an auxiliary weight $r$, where the minimum is attained at $r = \exp(-\|T(x) - y\|^2/\sigma^2)$. The cost in the alpha-D problem can then be regarded as an $L^2$ mass transport cost $p^{\alpha} q^{1-\alpha} \|T(x) - y\|^2$ weighted by a positive function $r$. Different weights are assigned according to the distances $\|T(x) - y\|$: a larger distance $\|T(x) - y\|$ receives a smaller weight $r$. When $\|T(x) - y\| \gg \sigma$, then $r \approx 0$, and this pair $(x, y)$ has less influence on the estimate of $T$ than other pairs. Hence, roughly speaking, the alpha-D problem tends to ignore these distant pairs. In contrast, due to the measure-preserving condition, an image matching result based on the MK problem is easily affected by mass changes of the corresponding objects.
⁵Hölder's inequality (see page 622 of [14]): let $u \in L^p$, $v \in L^q$ with $1 < p, q < \infty$ and $p^{-1} + q^{-1} = 1$; then $\int |uv|\, dx \le \|u\|_{L^p} \|v\|_{L^q}$. Let $\alpha = p^{-1}$, $a = |u|^{1/\alpha}$, $b = |v|^{1/(1-\alpha)}$; then $0 < \alpha < 1$ and $\int a^{\alpha} b^{1-\alpha} dx \le (\int a\, dx)^{\alpha} (\int b\, dx)^{1-\alpha}$. The equality holds if and only if the quotient $a/b$ (or $b/a$) is constant a.e. (this condition essentially comes from Young's inequality).
3.1. Primal vs. dual problems. In the following, we establish the existence of an optimal solution of the primal problem. First, we show the existence of an optimal solution of the dual problem and illustrate the duality relation between the two problems.
We shall prove in Lemma 3.4 that the optimal solution $(a, b)$ of the dual problem is a pair of convex (thus continuous) functions. First, in the next theorem we verify that $(a, b)$ exists in the space of measurable functions.
Theorem 3.1. Let $\Omega_X, \Omega_Y, \nu_X, \nu_Y$, and $K$ be defined as in the primal problem. There exist a $\nu_X$-measurable function $a$ on $\Omega_X$ and a $\nu_Y$-measurable function $b$ on $\Omega_Y$ such that $(a, b)$ minimizes the dual problem.
Proof. Clearly the cost in (31) is bounded below by 0, so we take a minimizing sequence of Radon measures $\{(\nu_{a_n}, \nu_{b_n}) : n\}$, given by $d\nu_{a_n} = a_n \, d\nu_X$, $d\nu_{b_n} = b_n \, d\nu_Y$. Here $a_n$ is a positive $\nu_X$-measurable function and $b_n$ is a positive $\nu_Y$-measurable function. By the weak compactness of measures [15], there exists a subsequence, still denoted by $\{(\nu_{a_n}, \nu_{b_n}) : n \in \mathbb{N}\}$, weakly converging to Radon measures $(\nu_a, \nu_b)$, with $\nu_a \ll \nu_X$ and $\nu_b \ll \nu_Y$ on Borel sets. Thus, by the Radon-Nikodym theorem, there exist a $\nu_X$-measurable function $a$ and a $\nu_Y$-measurable function $b$ such that $d\nu_a = a \, d\nu_X$, $d\nu_b = b \, d\nu_Y$. Next, we show that the functions $a, b$ satisfy the constraint. Recall that $a_n, b_n$ are the derivatives of $\nu_{a_n}, \nu_{b_n}$ with respect to $\nu_X, \nu_Y$ respectively. Since the function $\log(\cdot)$ is locally summable, by the Lebesgue-Besicovitch differentiation theorem ([15], p. 43),
(32) $\log a_n(x) = \lim_{r \to 0} \frac{1}{\nu_X(B(x, r))} \int_{B(x, r)} \log a_n \, d\nu_X$,
where $B(x, r)$ is an open ball with center $x$ and radius $r$. Since $\log K \le \alpha \log a_n + (1 - \alpha) \log b_n$ for each $n$, passing to the limit gives, for each $(x, y)$, $\log K(x, y) \le \alpha \log a(x) + (1 - \alpha) \log b(y)$.

Theorem 3.2 (Weak duality). Let $\Omega_X, \Omega_Y, \nu_X, \nu_Y$, and $K$ be defined as in the primal problem. The cost of the dual problem (23) gives an upper bound for the primal problem (21). When the following two conditions hold, the cost values of the two problems are equal.
(i) The support condition: the optimal measures concentrate on the set $\{(x, y) : K(x, y) = a(x)^{\alpha} b(y)^{1-\alpha}\}$. (ii) The weighted measure-preserving condition: $a \, d\nu_X/d\omega = b \, d\nu_Y/d\omega$ $\omega$-a.e.
Proof. The dual cost bounds the primal cost from above through a chain of inequalities, in which the last inequality comes from Young's inequality⁶. The inequalities become equalities only when conditions (i) and (ii) hold, which is recorded in Eq. (37). If $\nu_X(A) = \nu_Y(A) = 0$, then (i) holds, which completes the proof.
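The inequality chain behind the weak duality can be sketched as follows, assuming the dual cost $\alpha\int a\,d\nu_X + (1-\alpha)\int b\,d\nu_Y$: the first inequality uses the constraint $K \le a^{\alpha}b^{1-\alpha}$, the second uses Young's inequality $u^{\alpha}v^{1-\alpha} \le \alpha u + (1-\alpha)v$, and the final equality uses the marginal constraints on $p, q$:

```latex
L_H(p,q) \;=\; \iint p^{\alpha} q^{1-\alpha} K \,dx\,dy
  \;\le\; \iint (a\,p)^{\alpha} (b\,q)^{1-\alpha}\,dx\,dy
  \;\le\; \iint \bigl(\alpha\, a\,p + (1-\alpha)\, b\,q\bigr)\,dx\,dy
  \;=\; \alpha \int_{\Omega_X} a\, d\nu_X \;+\; (1-\alpha)\int_{\Omega_Y} b\, d\nu_Y .
```

Equality throughout forces $K = a^{\alpha} b^{1-\alpha}$ and $a\,p = b\,q$ wherever the measures carry mass, which is how the two conditions in Theorem 3.2 arise.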
This theorem provides an approach to showing existence in the primal problem: if we can find a tuple $(\nu_X, \nu_Y, a, b)$ satisfying both conditions, then $(\nu_X, \nu_Y)$ is an optimal solution of the primal problem and $(a, b)$ is an optimal solution of the dual problem. The first condition says that the support of the optimal measures $\nu_X, \nu_Y$ is concentrated in the set $\{(x, y) : K(x, y) = a(x)^{\alpha} b(y)^{1-\alpha}\}$. In the next section, we examine the relation between $(a, b)$ and $(\nu_X, \nu_Y)$ further and construct a tuple $(\nu_X, \nu_Y, a, b)$ meeting these two conditions, which implies that there is no duality gap between the two problems.

3.2. Uniqueness.
3.2.1. Uniqueness in the dual problem. Introduce the notations
(38) $H := \log K$, $\phi := \alpha \log a$, $\psi := (1 - \alpha) \log b$;
then the dual problem becomes
(39) $\min\ M(\phi, \psi) := \alpha \int_{\Omega_X} e^{\phi/\alpha} \, d\nu_X + (1 - \alpha) \int_{\Omega_Y} e^{\psi/(1-\alpha)} \, d\nu_Y$, subject to $\phi(x) + \psi(y) \ge H(x, y)$.
Next, we examine the uniqueness of $(a, b)$.
Theorem 3.3 states that the optimal solution of (39) is unique. Indeed, if $(\phi, \psi)$ and $(\tilde\phi, \tilde\psi)$ were two different solutions, then, since the exponential function is strictly convex, $((\phi + \tilde\phi)/2, (\psi + \tilde\psi)/2)$ would also satisfy the constraint in Eq. (39) and have a strictly lower cost, a contradiction. Here "two different solutions" means that their difference is not zero a.e. with respect to $\nu_X, \nu_Y$.

Remark 2. In general, uniqueness with respect to the measure $\nu_X$ or $\nu_Y$ does not imply uniqueness with respect to Lebesgue measure: outside the supports of $\nu_X, \nu_Y$, the functions may be modified freely without changing the cost, so the solution is not unique with respect to Lebesgue measure.

3.2.2. Convexity of φ, ψ, a, b. In the next two subsections, we show that under the following conditions an optimal matching scheme $(\nu_X, \nu_Y)$ exists uniquely.
Assumption 1. Let $\Omega_X, \Omega_Y, X, Y, \nu_X, \nu_Y, \mu_X, \mu_Y$ and $K$ be defined as in the primal problem, and define $H$ as in Eq. (38).
• (Convexity condition) $H(x, y)$ is convex in $x$, and in $y$.
• (Bijection condition) For each $x$, the map $h_x(\cdot) := D_x H(x, \cdot)$ is a bijection, and for each $y$, the map $\hat h_y(\cdot) := D_y H(\cdot, y)$ is a bijection.
• (Absolute continuity condition) $\nu_X \ll \mu_X$ and $\nu_Y \ll \mu_Y$.
The first two conditions make the function $H(x, y)$ act like $x \cdot y$ in the constraint Eq. (17) of the MK problem. The third condition is the absolute continuity condition used in the MK problem. Before constructing $(\nu_X, \nu_Y)$, we need to establish the convexity of $(\phi, \psi)$. Let us derive a necessary optimality condition for $(\phi, \psi)$ in problem (39): $(\phi, \psi)$ must be a pair of convex conjugate functions with respect to $\nu_X, \nu_Y$, i.e.,
(41) $\phi(x) = \sup_{y} \{ H(x, y) - \psi(y) \}$ and $\psi(y) = \sup_{x} \{ H(x, y) - \phi(x) \}$.
More precisely, let $\phi^*(y) := \sup_x \{H(x, y) - \phi(x)\}$, $\phi^{**}(x) := \sup_y \{H(x, y) - \phi^*(y)\}$, and define $\psi^*, \psi^{**}$ analogously. Note that the exponential functions are increasing; if $(\phi, \psi)$ does not satisfy Eq. (41), then either $(\phi^*, \phi^{**})$ or $(\psi^{**}, \psi^*)$ will improve the cost $M$. The mathematical properties of $(\phi, \psi)$ depend on the choice of $H(x, y)$. Assume for the moment that $H(x, y)$ is convex in both $x$ and $y$. We could make a weaker assumption (see Remark 5), but adopting the convexity assumption simplifies the discussion.
Lemma 3.4. Assume the convexity condition on $H(x, y)$. Then in problem (39) the optimal functions $\phi(x)$ and $\psi(y)$ are convex with respect to $\nu_X, \nu_Y$: we may find a pair of convex functions $\bar\phi, \bar\psi$, both defined on $\mathbb{R}^d$, with $\phi = \bar\phi$ $\nu_X$-a.e. and $\psi = \bar\psi$ $\nu_Y$-a.e. Therefore, the minimizer $(a, b)$ of the dual problem, $a = \exp(\phi/\alpha)$, $b = \exp(\psi/(1-\alpha))$, is a pair of convex functions with respect to $\nu_X$ and $\nu_Y$.
Hence, $\psi$ is convex with respect to $\nu_Y$; similarly, $\phi$ is convex with respect to $\nu_X$. Since the exponential function is increasing and convex and $\phi$ is convex, $a$ is convex. A similar argument applies to the convexity of $b$.
The first optimality condition in Theorem 3.2 says that the masses of $(\nu_X, \nu_Y)$ are concentrated in the set $\{(x, y) : H(x, y) = \phi(x) + \psi(y)\}$. According to Theorem 3.3, this set is uniquely determined with respect to $\nu_X \times \nu_Y$. This set will be called the set of matching pairs, as follows.
Definition 3.5. Let $X, Y$, and $K$ be defined as in the primal problem and let $(a, b)$ be an optimal solution of the dual problem. Let $(\phi, \psi, H)$ be computed from Eq. (38). We call any pair $(x, y) \in X \times Y$ a matching pair if
(46) $H(x, y) = \phi(x) + \psi(y)$.
The matching relation $s : X \to Y$ is defined by the set of all matching pairs $(x, y)$, i.e., $s(x) = \{y : (x, y) \text{ is a matching pair}\}$ for each $x \in X$.
Next, we investigate the structure of the set of matching pairs and its associated matching relation. In convex analysis (e.g., [40]), given a convex function $f : \mathbb{R}^d \to \mathbb{R}$, a necessary and sufficient condition for unconstrained optimality at $x$ is $0 \in \partial f(x)$, and the sub-gradient $\partial f(x)$ exists whenever the convex function $f$ is finite in a neighborhood of $x$. Differentiability of $f$ at $x$ is equivalent to the existence of a unique sub-gradient $y$, in which case $y = \nabla f(x)$. It is known that differentiability of a convex function $f$ fails only on a Borel set of dimension $d - 1$; thus the gradient of $f$ exists a.e. ([1], or see chapter 25 in [40]).
According to Lemma 3.4, an optimal solution of problem (39) is given by a pair of convex functions $\phi, \psi$ defined on $X$ and $Y$. These two functions are continuous, and their sub-gradients exist and form closed convex sets. From Eq. (41), each matching pair $(x, y)$ satisfies
(47) $D_x H(x, y) \in \partial\phi(x)$ and $D_y H(x, y) \in \partial\psi(y)$.
Since the gradients of the optimal functions $\phi, \psi$ exist a.e. with respect to Lebesgue measure, each matching pair $(x, y)$ satisfies
(48) $D_x H(x, y) = \nabla\phi(x)$ a.e. and $D_y H(x, y) = \nabla\psi(y)$ a.e.
Now we are ready to examine the bijectivity of the matching relation between $X$ and $Y$. Introduce the notations $X_\phi := \{x \in X : \nabla\phi(x) \text{ exists}\}$ and $Y_\psi := \{y \in Y : \nabla\psi(y) \text{ exists}\}$.
Theorem 3.6. Let $(\phi, \psi)$ be an optimal solution of the dual problem (39). Let $h_x, \hat h_y, X_\phi$, and $Y_\psi$ be defined as above. Let $s^{-1}(Y_\psi)$ be the inverse image of $Y_\psi$ under the relation $s$ in Definition 3.5, i.e., $s^{-1}(Y_\psi) := \{x \in X : s(x) \cap Y_\psi \ne \emptyset\}$. Assume the convexity condition on $H$. Although $s$ is not necessarily a function, $s$ and its inverse relation $s^{-1}$ have the following properties: 1. $s$ restricted to $X_\phi$ is a function, and $s^{-1}$ restricted to $Y_\psi$ is a function; 2. $s$ is surjective onto $Y$, and each $y_0 \in Y_\psi$ has at most one preimage in $X_\phi$.
Proof. For the first statement, let $x_0 \in X_\phi$. Since $h_{x_0}$ is bijective, the set $s(x_0)$ has exactly one element; thus $s$ restricted to $X_\phi$ is a function. The second part of the statement is similar. For the second statement, let $y_0$ be any element of $Y$. We need to show that there exists some $x_0 \in X$ such that $y_0 \in s(x_0)$, or equivalently, from Eq. (47), $\hat h_{y_0}(x_0) \in \partial\psi(y_0)$. Since $\hat h_{y_0}$ is surjective, the existence of $x_0$ is ensured. For the second part, suppose that for some $y_0 \in Y_\psi$ there exist $x_0, \tilde x_0 \in X_\phi$ with $y_0 \in s(x_0) \cap s(\tilde x_0)$. Then $y_0 = s(x_0) = s(\tilde x_0)$, and from Eq. (48), $\hat h_{y_0}(x_0) = \nabla\psi(y_0) = \hat h_{y_0}(\tilde x_0)$; the injectivity of $\hat h_{y_0}$ yields $x_0 = \tilde x_0$. In some cases the injectivity of $h_x$ and $\hat h_y$ is not easy to verify; we provide one sufficient condition based on the second derivatives of $H$, see Lemma D.1 in the appendix.

3.3. No duality gap. We are ready to verify that there is no duality gap between the primal problem (21) and the dual problem (23). In this subsection, we assume the absolute continuity condition: $\nu_X \ll \mu_X$ and $\nu_Y \ll \mu_Y$. As mentioned in Theorem 3.2, it suffices to justify the two conditions in Eq. (37). Here, we present the following weighted measure-preserving property, which is related to the second condition.
Theorem 3.7 (Weighted measure-preserving property). Consider the dual problem (23). Let $\nu_X, \mu_X, \nu_Y, \mu_Y, \Omega_X$ and $\Omega_Y$ be defined as in the primal problem. Suppose that both $h_x, \hat h_y$ defined in Eq. (49) are bijections. Let $(a^*, b^*)$ be a solution of the dual problem. Define measures $d\nu_a = a^* \, d\nu_X$, $d\nu_b = b^* \, d\nu_Y$ with $\nu_X \ll \mu_X$ and $\nu_Y \ll \mu_Y$. Then $s$ in Definition 3.5 is a measure-preserving map between $(\Omega_X, \nu_a)$ and $(\Omega_Y, \nu_b)$, that is,
$\int_{\Omega_Y} g(y)\, d\nu_b(y) = \int_{\Omega_X} g(s(x))\, d\nu_a(x)$ for all $g \in C(\Omega_Y)$.
Proof. This proof is based on the approach used in [13], [17]. Recall that the dual problem minimizes the cost (53). On the other hand, if we take $y \in \Omega_Y$ with $a^*(x)^{\alpha} = K(x, y)/b^*(y)^{1-\alpha}$, then Eqs. (54)-(55) apply. By Theorem 3.6, $s$ restricted to $X_\phi \cap \Omega_X$ is a function, so we have $y = s(x)$. Also, as $\tau \to 0$, $y_\tau \to s(x)$, and by l'Hôpital's rule both sides in Eq. (59) converge to the limit $(\alpha^{-1} - 1)\, g(s(x))\, a^*(x)$. Finally, applying the dominated convergence theorem to Eq. (56) gives one inequality; replacing $g$ by $-g$, we deduce that equality holds.
Hence, although in general $s$ is not a measure-preserving mapping with respect to the measures $\nu_X, \nu_Y$, it is measure preserving with respect to the weighted measures $\nu_a, \nu_b$.

Corollary 1. Here is another formulation of the weighted measure-preserving property. Assume that the absolute continuity condition holds. Under the assumptions of Theorem 3.6, for each measurable set $A \subset Y$,
$\nu_a(s^{-1}(A)) = \nu_b(A)$.
Proof. Consider a measurable set $A \subset \mathbb{R}^d$ and its indicator function $g_0(z) = 1$ if $z \in A$, $g_0(z) = 0$ otherwise. Let $g_\epsilon(y)$ be a mollification of $g_0$, with the normalizing constant $c_\epsilon$ adjusted so that the mollifier has unit mass. Then, as $\epsilon \to 0$, we have $g_\epsilon \in C^\infty(\Omega_Y)$ and $g_\epsilon \to g_0$ a.e. with respect to Lebesgue measure. From the above theorem, $\int g_\epsilon \, d\nu_b = \int g_\epsilon(s(x))\, d\nu_a$, and passing to the limit gives the claim for each measurable set $A \subset Y$.
Corollary 2. $s$ is a bijection from $X$ to $Y$ a.e. with respect to $\nu_X$ and $\nu_Y$ respectively.
Proof. Based on the absolute continuity condition, and from the definition of $s^{-1}(Y_\psi)$ and the second statement of Theorem 3.6, $s$ restricted to $X_\phi \cap s^{-1}(Y_\psi)$ is injective. Since $a > 0$ and $\nu_X(X - X_\phi) = 0$, the bijection holds a.e.
Now let us present the construction of an optimal matching scheme $(\nu_X, \nu_Y)$ for the primal problem. We have shown the existence and the uniqueness of $(a, b)$ and $(\phi, \psi)$ in the dual problem. Recall from Theorem 3.2 that $(\nu_X, \nu_Y)$ is optimal if and only if $(\nu_X, \nu_Y, a, b)$ satisfies those two conditions. According to the first condition in Theorem 3.2, the support of $(\nu_X, \nu_Y)$ must lie in the set of matching pairs $\{(x, y) : H(x, y) = \phi(x) + \psi(y)\}$. From Theorem 3.6, this condition leads to the injectivity of $s$ restricted to $X_\phi \cap s^{-1}(Y_\psi)$. Since $(\nu_X, \nu_Y)$ must satisfy the marginal constraints in Eq. (22), we define $(\nu_X, \nu_Y)$ on measurable product sets as in Eq. (69). By Corollary 2, $\int_{A_X \times A_Y} a \, d\nu_X = \int_{A_X \times A_Y} b \, d\nu_Y$ for each measurable product set $A_X \times A_Y$, which verifies the second condition in Theorem 3.2. Therefore, no duality gap exists between the primal problem and the dual problem. Now let us examine the uniqueness of $(\nu_X, \nu_Y)$. From Eq. (69), it suffices to show that the solution of (39) is unique with respect to $\nu_X, \nu_Y$ respectively. Hence, together with the absolute continuity condition $\nu_X \ll \mu_X$, $\nu_Y \ll \mu_Y$, we deduce that both $X_\phi \times Y_\psi$ and the set of matching pairs are uniquely determined with respect to $\nu_X \times \nu_Y$.
In summary, we have the uniqueness result: Theorem 3.8. Suppose that Assumption 1 holds. Then the optimal matching scheme described by (ν_X, ν_Y) in the primal problem is unique.
In general, the non-uniqueness of the optimal matching scheme (ν_X, ν_Y) is caused by two factors: (a) H does not satisfy the bijection condition; (b) ν_X, ν_Y do not satisfy the absolute continuity condition, i.e., ν_X, µ_X are mutually singular on some subset of X and ν_Y, µ_Y are mutually singular on some subset of Y. For case (a), we provide a simple example: let K(x, y) be the constant function 1; then the first equality holds when p = q, and thus every pair of correspondence functions (p, q) satisfying the marginal constraints with p = q has the same cost. For case (b), one example is as follows. Let d = 2, and let the supports be Ω_X = {(x_1, x_1) : 1 ≤ x_1 ≤ 2} and Ω_Y = {(y_1, 3 − y_1) : 1 ≤ y_1 ≤ 2}. Then for any fixed x ∈ Ω_X, K(x, y) = exp(x · y) is constant over y ∈ Ω_Y, which implies that any matching scheme (ν_X, ν_Y) satisfying the marginal constraints is an optimal solution.
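The degeneracy in case (b) can be verified directly: for x = (x_1, x_1) ∈ Ω_X and y = (y_1, 3 − y_1) ∈ Ω_Y,

```latex
x \cdot y = x_1 y_1 + x_1 (3 - y_1) = 3 x_1 ,
\qquad\text{so}\qquad
K(x, y) = e^{x \cdot y} = e^{3 x_1},
```

independent of y_1; the cost is therefore the same for every feasible matching scheme.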
Example 1 (Gaussian kernel function). Assume the absolute continuity condition. Consider the log-similarity function H(x, y) = x · y. Clearly H is convex in both x and y. Also, both D_xH = y and D_yH = x are bijections. Thus, the optimal matching relation s is a bijection almost everywhere. Moreover, Eq. (41) becomes the Legendre-Fenchel transform, and {φ(x), ψ(y)} is a convex conjugate pair.
Consider the problem with the log-similarity function H = 2x · y/σ², marginal densities (P, Q), and correspondence functions (p, q). Clearly, the matching relation s is bijective a.e. This problem is equivalent to the following problem with the Gaussian kernel function K(x, y) = exp(−‖x − y‖²/σ²), marginal densities (P̃, Q̃), and correspondence functions (p̃, q̃), provided that (p̃, q̃) and (P̃, Q̃) are chosen as follows.
Then these two problems have the same cost value. Therefore, the optimal matching relation s in the problem with Gaussian kernel functions is also bijective a.e.
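One way to see the equivalence is the elementary factorization of the Gaussian kernel (the two single-variable factors are then absorbed into the densities and correspondence functions as described above):

```latex
e^{-\|x-y\|^{2}/\sigma^{2}}
= e^{-\|x\|^{2}/\sigma^{2}}\;
  e^{\,2\,x\cdot y/\sigma^{2}}\;
  e^{-\|y\|^{2}/\sigma^{2}} ,
```

so the Gaussian kernel differs from exp(H) with H = 2x · y/σ² only by factors depending on x alone and on y alone.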

4. Algorithms (discrete version).
Here, we describe the numerical algorithm for computing the optimal matching scheme. When the alpha-D problem is applied to match objects, we take the following discretization. Each object is represented by one particle-set, which consists of a set of points at locations {x_i ∈ R^d : i = 1, . . . , m} with weights (masses) {λ_i > 0 : i = 1, . . . , m}. In applications, the weights representing objects depend on the user's particular registration problem. For instance, the weight could refer to area, length, or the cardinality of points in image, shape, or point-set matching problems, respectively.
Here we focus on point-set matching problems. A unit weight is assigned to each point, and the following particle-set matching problem is examined: given two sets of particles located at {x_i ∈ R^d : i = 1, . . . , m} and {y_j ∈ R^d : j = 1, . . . , n} with weights (masses) {λ^+_i : i = 1, . . . , m} and {λ^−_j : j = 1, . . . , n} respectively, we aim to find a pair of optimal correspondence matrices (Λ^+, Λ^−) with nonnegative entries λ^+_{i,j}, λ^−_{i,j}. As mentioned in section 2, α is a scalar in (0, 1), and in applications α = 1/2 is taken. The matching structure is described by the pair of m × n matrices Λ^+ and Λ^−. Each nonzero pair (λ^+_{i,j}, λ^−_{i,j}) indicates one matching link between points x_i and y_j, and the pair (x_i, y_j) is called a matching pair.
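For orientation, the discrete problem has the following shape (a sketch reconstructed from the Hölder argument and marginal constraints used below, not a verbatim restatement of Eq. (77)):

```latex
\max_{\Lambda^{+},\,\Lambda^{-}\,\ge\,0}\;
\sum_{i=1}^{m}\sum_{j=1}^{n}
\bigl(\lambda^{+}_{i,j}\bigr)^{\alpha}
\bigl(\lambda^{-}_{i,j}\bigr)^{1-\alpha} K_{i,j}
\quad\text{s.t.}\quad
\sum_{j=1}^{n}\lambda^{+}_{i,j}=\lambda^{+}_{i},\qquad
\sum_{i=1}^{m}\lambda^{-}_{i,j}=\lambda^{-}_{j}.
```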
The problem in Eq. (77) can be regarded as the discrete counterpart of the primal problem given in Eq. (18): consider P = Σ_{i=1}^m λ^+_i δ_{x_i} and Q = Σ_{j=1}^n λ^−_j δ_{y_j}, where δ_x is the Dirac point mass. Then a pair (p, q) of the form p = Σ_{i,j} λ^+_{i,j} δ_{(x_i,y_j)} and q = Σ_{i,j} λ^−_{i,j} δ_{(x_i,y_j)} is sought to maximize the objective in Eq. (18).
For the sake of simplicity, we perform a normalization. Let the sets of row/column stochastic matrices be as in Eq. (79); then we have an equivalent problem: find an optimal scheme characterized by two matrices Γ^+, Γ^−. The normalization decouples {λ^+_i} and {λ^−_j} from the marginal constraints of the original matching problem. In the following, we design an algorithm to compute the optimal solution (Γ^+, Γ^−) of Eq. (80). Once (Γ^+, Γ^−) is obtained, the optimal solution (Λ^+, Λ^−) of Eq. (77) can be recovered from (Γ^+, Γ^−) and the weights.
Then every limit point (Γ^+, Γ^−) is a stationary point and in fact a maximizer of L.
In this algorithm, each column sum of Γ^−(l) is one, and each row sum of Γ^+(l) is one. To obtain a maximizer, it is crucial that the initial matrix Γ^+(0) be a positive matrix, which guarantees that Γ^+(l), Γ^−(l) are positive matrices for all l. Before we show the convergence of the algorithm, we need to study the problem in Eq. (80) further.
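A minimal numerical sketch of this alternating scheme (variable names, iteration count, and update order are our own choices; the closed-form updates are the ones implied by the Hölder equality condition derived below):

```python
import numpy as np

def block_coordinate_ascent(K, alpha=0.5, iters=200):
    """Alternately update Gamma+ (row-stochastic) and Gamma- (column-stochastic)
    to maximize sum_{i,j} (g+)^alpha (g-)^(1-alpha) K_{i,j}.
    The Holder equality condition gives the closed-form block updates:
      g-_{i,j}  proportional (per column) to  g+_{i,j} * K_{i,j}^(1/alpha),
      g+_{i,j}  proportional (per row)    to  g-_{i,j} * K_{i,j}^(1/(1-alpha))."""
    m, n = K.shape
    Gp = np.full((m, n), 1.0 / n)            # positive row-stochastic initialization
    for _ in range(iters):
        Gm = Gp * K ** (1.0 / alpha)         # column block update
        Gm /= Gm.sum(axis=0, keepdims=True)
        Gp = Gm * K ** (1.0 / (1.0 - alpha)) # row block update
        Gp /= Gp.sum(axis=1, keepdims=True)
    return Gp, Gm
```

For α = 1/2 both updates use K², and with K ≡ 1 the iteration reduces to the repeated row/column normalization of Remark 3 below (the Sinkhorn scaling iteration).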

the set of all interior points in
, and φ be defined as above. Consider the maximization problem for L given in Eq. (80). We have the following properties.

Fixing
Then the optimal matrix Γ^− = {γ^−_{i,j} : i, j} is uniquely determined by the displayed formula, and φ(Γ^+) is unique for each Γ^+ ∈ E^+_1. (1): The cost L can be viewed as the sum, from i = 1 to m, of

Suppose that Γ
For each term, using Hölder's inequality, we obtain an upper bound. Due to the constraint Σ_{j=1}^n γ^+_{i,j} = 1, the maximal value of Σ_{j=1}^n (γ^+_{i,j})^α (γ^−_{i,j})^{1−α} K_{i,j} occurs when the first equality holds; this is the optimality condition for γ^+_{i,j}. For (2): the arguments are similar to case (1). For (3): suppose Γ̄^+ ∉ E^+_1. Without loss of generality, by permuting the index j, we may assume Σ_{i=1}^m γ̄^−_{i,1} = 1, and by permuting the index i, we may assume γ̄^−_{1,1} > 0. Using Hölder's inequality, we then find γ̄^+_{1,1} > 0, which contradicts the assumption γ̄^+_{1,1} = 0. From the above result, Alg. 1 can be viewed as a block coordinate ascent method. The standard convergence result for a block coordinate ascent method (attainment of a global maximizer of a concave function) relies on a crucial assumption: the cost function must be continuously differentiable (see for instance [3], [53]). However, √(xy) in Eq. (4.5) is not continuously differentiable at (x, y) = (0, 0). Thus, a little extra effort is required to deduce the desired convergence result.
Then ∇_1 L(Γ^+) is well-defined and continuous in E^+_1. Proof. Note that for any interior point Γ^+ ∈ E^+_2, Eq. (91) applies. Comparing this result with the partial derivatives of f in Eq. (92) gives the claim. Now we are ready to prove the convergence property. Proof. The proof consists of two parts and is modified from Prop. 2.7.1 in [3]. The arguments from Eq. (96) to Eq. (103) establish that every limit point is a stationary point; this part is essentially the same as the proof of Prop. 2.7.1 in [3]. The remainder of the proof verifies that every limit point is a maximizer.
Thus, L(Γ^+ + rs, φ(Γ^+)) = L(Γ^+, φ(Γ^+)). Since rs ≠ 0, this contradicts the fact that L is uniquely maximized when viewed as a function of the first block component. Hence, Γ^+(k_j + 1) − Γ^+(k_j) converges to zero; in particular, Γ^+(k_j + 1) converges to Γ̄^+. Since Γ^+(k_j + 1) is the maximizer in max_{Γ^+ ∈ D^+} L(Γ^+, φ(Γ^+(k_j))), taking the limit as j tends to infinity in the corresponding inequalities, we obtain the desired optimality.
Dual Problem I: Given K_{i,j} > 0, Dual Problem I is a special case of the dual problem in Eq. (23) with dν_X/dµ_X, dν_Y/dµ_Y being Dirac point masses. The uniqueness of {a_i, b_j} comes directly from Theorem 3.3. The duality relation between this problem and the primal problem in Eq. (80) can be proved by arguments similar to those in the weak duality theorem: by Young's inequality, the cost value in the dual problem is never less than the cost value in the primal problem. Moreover, equality occurs if and only if a set of variables satisfying the optimality condition in Eq. (108) exists. Thus, when the matrix with entries {K_{i,j}} is not the product of a column vector with entries {a^α_i} and a row vector with entries {b^{1−α}_j}, some entries of {γ^+_{i,j}} and {γ^−_{i,j}} must be zero. In particular, when the matrix with entries {K_{i,j}} has full rank, Γ^+ and Γ^− are highly sparse.
Dual Problem II: Thanks to Eq. (108), denote (110) γ̄_{i,j} := a_i γ^+_{i,j} = b_j γ^−_{i,j}. Then Dual Problem I can be converted into the problem of finding the optimal matrix Γ̄ with entries {γ̄_{i,j} : γ̄_{i,j} ≥ 0}. Since (− log x) is a convex function, the feasible domain for Γ̄ is a convex set. Once the optimal matrix Γ̄ is found, by summing over each row and each column, we can determine the original dual variables {a_i, b_j : i, j} and the primal variables. After the introduction of γ̄_{i,j} in Eq. (110), the optimality condition in Eq. (108) reduces to the condition that γ̄_{i,j} = 0 for every (i, j) with K_{i,j} < a^α_i b^{1−α}_j. The remaining γ̄_{i,j} can take any value, provided that they satisfy the marginal constraints. Therefore, in general Γ̄ is not unique, which implies that the original primal variables Γ^+, Γ^− are not uniquely determined.
The next example illustrates a common phenomenon originating from this non-uniqueness: the loop structure, in which matching links form a loop.

Example 2. Let
The matching links are shown in Fig. 1. Note that Γ̄_δ is a solution if and only if K_{i,j} ≤ a^α_i b^{1−α}_j for all i, j, with equality for all i, j with γ̄_{i,j} > 0. Suppose that Γ̄_δ with δ = 0 is one solution of Dual Problem II. Clearly, since the cost function is the same for each matrix Γ̄_δ with −1 < δ < 1, each Γ̄_δ with −1 < δ < 1 is also a feasible solution.
Last, in the coming remark, we point out a connection between our algorithm and the well-known matrix scaling problem proposed by Sinkhorn.
Remark 3 (Matrix scaling). In the special case m = n and K_{i,j} = 1 for all i, j (so that the dual variables satisfy a_i = 1 = b_j for all i, j), Alg. 1 is related to the well-known matrix scaling result of Sinkhorn: any positive square matrix is diagonally equivalent to a unique doubly stochastic matrix, and the diagonal matrices taking part in the equivalence are unique up to scalar factors [45], [6]. From Eq. (108), we have γ^+_{i,j} = γ^−_{i,j} for all i, j. Then Theorem 4.5 says that by repeating row and column normalization we obtain two identical doubly stochastic matrices, which provides another proof of the Sinkhorn matrix scaling result.
4.3. Image morphing. Various distances have been proposed to measure the dissimilarity between two images. Sometimes we are interested in a distance function generated by a Riemannian metric on an image space, namely S. A Riemannian metric is a symmetric, positive definite, bilinear form defined on a manifold, which is then called a Riemannian manifold. Given two images described by densities P, Q in S, the induced distance between P and Q is the minimal length of a path {R_t ∈ S : 0 ≤ t ≤ 1} connecting P and Q, parameterized by t ∈ [0, 1] with R_0 = P, R_1 = Q. This distance T(P, Q) is given by the path integral in Eq. (113). Here {r_i(t) : i = 1, . . . , l} is some particle-set representation of the image R_t, and {g_{i,j} : i, j = 1, . . . , l} is some metric tensor. The path {R_t : 0 ≤ t ≤ 1} attaining the minimum is called the geodesic connecting P and Q with respect to the metric tensor {g_{i,j}}. In this section, we study a special metric tensor related to the weighted Hellinger distance (the alpha divergence with α = 1/2). This metric tensor has the property that the corresponding geodesic {R_t : 0 ≤ t ≤ 1} in S has an explicit expression, as in the case of the MK problem. Adopting this metric tensor, an image morphing can easily be constructed by displaying the interpolated images along the geodesic.
This idea is analogous to the L² MK problem, whose cost can be formulated as a distance on the image space S. More precisely, the optimal cost of the L² MK problem equals the infimum of Eq. (116) over all time-varying densities R and velocity fields v satisfying the continuity equation ∂R/∂t + ∇ · (Rv) = 0 for 0 < t < 1, the initial condition R(0, ·) = dν_X/dµ_X, and the final condition R(1, ·) = dν_Y/dµ_Y (see [2], [21]). Let (R_min, v_min) be a minimizer, and let s_min be the optimal mapping of the L² MK problem. Then the flow f(x, t) corresponding to the minimizing velocity field v_min via f(x, 0) = x, ∂f/∂t = v_min ∘ f is given by the interpolation f(x, t) = x + t(s_min(x) − x). The flow f provides a continuous warping map between ν_X and ν_Y. Based on this result, the task of computing the warping map boils down to solving the MK problem and applying the interpolation.
In this section, we derive the corresponding result for the alpha-D problem with α = 1/2. Let the support of the density R_t be Ω_t for t ∈ [0, 1]. First, we derive the Riemannian metric induced by this distance function T via the Riemann sum expression of Eq. (113) given in Eq. (117). As k → ∞, R_{(i−1)/k} and R_{i/k} are close to each other, i.e., the distance ‖x − y‖ of each matching pair (x, y) between R_{(i−1)/k} and R_{i/k} is close to zero, subject to the constraints in Eq. (120). To obtain the metric tensor in Eq. (113), we proceed with the discrete form of Eq. (119). Let the optimal matching scheme between R_{(i−1)/k} and R_{i/k} be a set of l matching pairs {(x_i, y_i) : i = 1, . . . , l} with masses {(P_i, Q_i) : i = 1, . . . , l} respectively, where the infimum is taken over all possible matching schemes. Substituting Eq. (121) into Eq. (117), letting k → ∞ and writing r_i(t) = (r_{i,1}(t), r_{i,2}(t)), we obtain the form of T(P, Q) in Eq. (122), where the first infimum is taken over all possible matching schemes {(P_i, Q_i) : i = 1, . . . , l} subject to the corresponding marginal constraints, and the second infimum is taken subject to the boundary constraints. To derive an efficient algorithm for the minimizers in Eq. (122), we simplify the integral by fixing one matching scheme and examining the optimality condition for the second infimum in Eq. (122). Since each r_i(t) in the cost function of Eq. (122) is decoupled, we can optimize each r_i(t) individually. The result is listed in the following lemma; due to the identical structure of each optimization problem, the subscript i is dropped. Lemma 4.6. Given that one particle P is located at x ∈ R^d with mass P, and the other particle Q is located at y ∈ R^d with mass Q, consider a particle path r(t) minimizing the distance in Eq. (125) with boundary conditions (126) r(0) = (√P, x/σ), r(1) = (√Q, y/σ), σ > 0.
Then r(t) can be computed explicitly through a coordinate transformation, and the minimal distance in Eq. (125) can be simplified accordingly. Proof. The minimal path r(t) can be derived through the calculus of variations. Here, we present another approach based on the following observation: the Riemannian metric in Eq. (125) has the same metric tensor components as the Euclidean metric on R² in polar coordinates (see, for instance, page 49 in [25]). Hence, the space for each particle path is, under a coordinate transformation, locally isometric to the plane R²; that is, each path becomes a line segment under this transformation. We outline three coordinate systems for the particle path used in the following discussion: (r_1, r_2) ↔ polar coordinates (r_1, θ) ↔ Cartesian coordinates (r̃_1, r̃_2).
and the equality holds only when the vector r_2(t) − r_2(0) is parallel to the vector e_θ for all t, which implies that the geodesic r(t) must lie in the r_1θ plane for t ∈ [0, 1].
Moreover, the geodesic distance between particles P and Q is the Euclidean distance ‖r̃(0) − r̃(1)‖. Another way to explain Eq. (132) is by rewriting Eq. (121) as follows: regard ‖x − y‖/σ as the angle between the two vectors r̃(0), r̃(1). By the Law of Cosines, the distance in Eq. (133) is exactly the squared norm ‖r̃(0) − r̃(1)‖² (see Fig. 2, right). Note that this distance is valid only for ‖x − y‖/σ ≤ π/2, for the following reasons. First, the domain of the angular coordinate in a polar coordinate system is (−π, π). Second, in the case π/2 < ‖x − y‖/σ < π, the distance with the link connected is greater than the distance without the link, (√P)² + (√Q)², so the matching link between the two particles does not exist. Therefore, based on Eq. (132), the squared geodesic distance between P and Q is given by Eq. (128).
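Written out, the squared geodesic distance of Eq. (128) between the two particles is the law-of-cosines expression (the unlinked case agreeing with (√P)² + (√Q)² as noted above):

```latex
T^{2} =
\begin{cases}
P + Q - 2\sqrt{PQ}\,\cos\!\bigl(\|x-y\|/\sigma\bigr),
& \|x-y\|/\sigma \le \pi/2,\\[4pt]
P + Q, & \text{otherwise (no matching link)}.
\end{cases}
```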
The result in Eq. (128) provides an explicit expression for the integral in Eq. (122). Substituting it into Eq. (122), we obtain a formulation in which both the infimum and supremum are taken over all possible matching schemes. Hence, the optimal matching scheme is a maximizer of the discrete form of the cost function. Once the optimal matching scheme is found, we construct a particle path r(t) = (r_1(t), r_2(t)) for each matching pair. The representative point-set for each R_t is then constructed by placing a particle with mass r_1(t) at the position r_2(t) for each matching pair. We summarize the algorithm as follows; a numerical simulation is provided in the next section.
Algorithm 2. Choose a kernel size σ > 0. Given two images represented by two point-sets R_0, R_1 on Ω_0, Ω_1 respectively, our goal is to produce a sequence of images represented by point-sets R_t on Ω_t for 0 < t < 1 along the geodesic path (the geodesic distance is given in Eq. (117)).
The mass/position information of the path between x_i, y_i is given by the formulas above, where θ(t) := arctan(r̃_2(t)/r̃_1(t)). Each particle-set R_t is then described by the resulting masses and positions for i = 1, . . . , l. • To display images, we need intensity values at specific grid points (pixels). Since r_2(t, x_i, y_i) is not necessarily located at those grid points, to display an image corresponding to each point-set R_t we convolve the mass function of the particle-set with a Gaussian kernel function, evaluated at each grid point u ∈ Ω_t. The width of this display kernel is another parameter, empirically chosen larger than the grid size of the displayed image.
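A compact sketch of the path construction in Alg. 2 for one matching pair (function names and the conversion of the path radius to a mass are our own choices; we assume the mass carried at time t is the square of the radial coordinate r_1(t), consistent with the boundary conditions r(0) = (√P, x/σ), and that ‖x − y‖/σ ≤ π/2):

```python
import numpy as np

def pair_path(x, y, P, Q, sigma, t):
    """Geodesic interpolation for one matching pair.
    The geodesic is a straight segment in the Cartesian chart of the
    (radius, angle) polar plane: endpoint A at radius sqrt(P), angle 0;
    endpoint B at radius sqrt(Q), angle ||x - y|| / sigma.
    Returns (mass_t, position_t)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    delta = np.linalg.norm(y - x) / sigma          # angle between the endpoints
    A = np.array([np.sqrt(P), 0.0])
    B = np.array([np.sqrt(Q) * np.cos(delta), np.sqrt(Q) * np.sin(delta)])
    C = (1.0 - t) * A + t * B                      # straight line = geodesic
    r1 = np.linalg.norm(C)                         # radial coordinate r_1(t)
    theta = np.arctan2(C[1], C[0])                 # angular coordinate theta(t)
    direction = (y - x) / max(np.linalg.norm(y - x), 1e-12)
    pos = x + sigma * theta * direction            # position along the segment x -> y
    return r1 ** 2, pos
```

At t = 0 this returns (P, x) and at t = 1 it returns (Q, y); rendering an image then amounts to summing a Gaussian bump of the returned mass at each returned position, as described above.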
Remark 4. The special role of square root functions in this Riemannian metric was studied in [56]. Younes gave several distances (e.g., Eqs. (21), (22) there) between two shapes (curves) based on a 1/2-divergence weighted by cosines of angle differences. For instance, one geodesic distance is given by Eq. (22) in [56]. The notation is briefly explained here: consider two curves C_1, C_2 in the xy plane with arc-lengths l_1, l_2 and angle functions θ_1, θ_2 (between the tangent and the axis y = 0). Parameterize the two curves by t and g(t) respectively, where g is some increasing diffeomorphism of [0, 1] chosen so that the distance is minimized. The correspondence between the two curves is given by t ↔ g(t), t ∈ [0, 1], or equivalently can be constructed through the parameterizations h_1(t) and h_2(t) on the curves C_1, C_2.
In the case l_1 = 1 = l_2, the derivatives h′_1, h′_2 can be regarded as the densities of the distribution functions h_1, h_2, and the elastic distance can then be regarded as a 1/2-divergence weighted by cosine functions. Since this distance is not convex in g, the optimal correspondence is usually computed in the product space [0, l_1] × [0, l_2] by dynamic programming. Although this distance can measure the distance between two closed curves, as mentioned on page 583 of [56], the curves along the geodesic are not closed in general. Later, Younes et al. proposed another Riemannian metric based on Jordan angles in Grassmann manifolds [36], under which the curves constructed along the geodesic path are closed [57]. Naturally, the cosine-weighted alpha-D problem in Eq. (135) can also be applied to the curve matching problem. In some sense, Eq. (135) can be viewed as a generalization of Younes' elastic distance in Eq. (141) from R¹ to R^d.

5. Numerical results. The algorithms in this paper have complexity O(N²), where N is the total number of points. Thus we do not suggest using the alpha-D model as an area-based method, in which N becomes the number of image pixels. In this section, we provide experimental results in which the alpha-D model (α = 1/2) is employed as a feature-based method and N is between 100 and 1000. First, we provide a few numerical simulations that reveal several properties of the alpha-D model. Then we show two matching results on real photos based on Eq. (143). Empirically, matching results vary with the kernel scale σ in K, especially at extreme values: when σ² is too small, the cost in Eq. (143) has many local optima and the algorithm tends to get stuck at one of them; when σ² is too large, the estimated transformation T is not robust, i.e., very sensitive to noise. For simplicity, in our experiments the kernel scale σ is chosen between 0.1 and 0.4, with images scaled to fit in a unit square.
[Experiment I:] In Fig. 3, the first sub-figure (from left to right) shows image 1, consisting of 20 × 20 pixels. The second sub-figure shows image 2, also consisting of 20 × 20 pixels. Each pixel with the lighter marker · has mass 0.5, and the pixels with the darker marker • have mass 1. The third sub-figure shows the matching result based on the alpha-D model. Clearly, the two blocks are correctly matched. Here the similarity function is K(x, y) = exp(−‖x − y‖²/0.3²).
The fourth sub-figure shows the result of the MK problem. A normalization of the total mass of each image is employed to meet the mass balance condition. Since the upper block in image 1 has more mass than the upper block in image 2, a few pixels (about 4 pixels with marker •) merge into the lower block, as shown in the fourth sub-figure.
[Experiment II:] In Fig. 4, given two particle-sets (the two letters "i" and "j"), we use Alg. 2 to construct five interpolated particle-sets at t = 1/6, 2/6, 3/6, 4/6, and 5/6. The kernel function is exp(−‖x − y‖²/0.3²). Both particle-sets consist of 60 particles (shown as ·), each of mass 1. The first row shows the interpolated particle-sets, in which each dot (·) traces the geodesic path between a matching pair in Ω_t; the dots do not necessarily have equal mass. To visualize both the mass and the spatial position information, we construct interpolated images by convolving each interpolated particle-set (shown in the first row) with exp(−‖x − y‖²/0.02²) (here the value 0.02 is larger than the grid size of the image, 5 × 10⁻³). Notice that there is no "fade-in"/"fade-out" effect in this result under this kernel scale.
[Experiment III:] In applications, the estimated transformation is sometimes expected to possess some smoothness. However, the cost functions of the MK problem and the alpha-D problem have no control over the smoothness of the matching relation. Introducing splines is a common approach to describing non-rigid deformations. Hence, we combine the alpha-D problem with splines to estimate both the correspondence and the non-rigid deformation function. Consider two point-sets {x_i : i = 1, . . . , m}, {y_j : j = 1, . . . , n} in R^d. Given a bounded linear operator L, we estimate correspondence matrices Γ^+, Γ^− with entries {γ^+_{i,j}, γ^−_{i,j}} and a smooth function f by maximizing the energy function subject to the marginal constraints Σ_{i=1}^m γ^−_{i,j} = 1, Σ_{j=1}^n γ^+_{i,j} = 1, where λ > 0 is a parameter. A reproducing kernel Hilbert space with its associated reproducing kernel U is introduced by the bounded linear operator L. The smooth function f in this Hilbert space can be represented as a linear combination of eigenfunctions of the reproducing kernel U; details on the relation between U and L can be found in [51]. Two popular spline families are radial-basis-function (RBF) splines and thin-plate splines.
In this experiment, the alpha-D problem incorporated with splines is employed to register two artificial shapes, which are represented by two particle-sets in Fig. 5.
with vectors α_i ∈ R² to be determined. The regularity term ‖Lf‖² of the RBF spline in Eq. (143) then becomes an explicit quadratic form in the coefficients {α_i}. By formulating the exponential function as in Eq. (30), {α_i} is computed by solving a weighted least squares problem. Numerically, the problem in Eq. (143) can be solved by a block coordinate ascent method; the whole algorithm boils down to updating Γ^+, Γ^−, and {α_i} cyclically. The forward/backward matching results are shown in the third and fourth sub-figures.
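A sketch of the spline component: evaluating a Gaussian RBF deformation of a point-set (a minimal illustration only; the representer form f(x) = x + Σ_i α_i U(x, x_i) with an identity part modeling the undeformed position is our assumption, and the weighted least squares fit of the coefficients is omitted):

```python
import numpy as np

def rbf_deform(points, centers, coeffs, sigma):
    """Evaluate f(x) = x + sum_i coeffs[i] * exp(-||x - centers[i]||^2 / sigma^2)
    at each row of `points`.  `coeffs` holds one d-vector alpha_i per center."""
    P = np.asarray(points, float)          # (n, d) points to warp
    C = np.asarray(centers, float)         # (k, d) RBF centers
    A = np.asarray(coeffs, float)          # (k, d) coefficient vectors alpha_i
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # (n, k) squared distances
    U = np.exp(-d2 / sigma ** 2)                          # Gaussian RBF kernel matrix
    return P + U @ A                                      # warped points
```

In the cyclic update described above, one such evaluation would be interleaved with the Γ^+, Γ^− updates, with the coefficients refit after each correspondence update.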
Here we provide two image matching experiments on photos. Fig. 6 shows the matching result between two point-sets extracted from two photos of a building. Each set of feature points is extracted by applying edge detection to the image I: {x : |∇I(x)| > ε} for some positive ε. The affine transformation and the correspondence between point-sets are estimated by maximizing the cost, where A(x) is an affine transformation of x. Similarly, Fig. 7 shows the matching result between two photos of a baby. In this experiment, the RBF spline and the correspondence between point-sets are estimated through Eq. (143). Empirically, the alpha-D model has matching performance similar to the RPM method [10] and the KC method [48]; the main difference is the computational cost. First, compare with the RPM method. Because the correspondence matrix in the RPM method is subject to row-sum and column-sum constraints similar to doubly stochastic constraints, the authors proposed soft-assignment to obtain a doubly-stochastic-like matrix after each iteration of the transformation. When the correspondence matrices are large, soft-assignment may take many iterations to reach an acceptable doubly-stochastic-like matrix. Hence the RPM method generally converges more slowly than the block coordinate ascent method in the alpha-D model. Now compare the alpha-D model with the KC method. Correspondence is not estimated in the KC model, so the computational cost of the KC method is lower, especially in the case of rigid motions. However, in the case of non-rigid deformations, the cost in Eq. (3) cannot be handled by a block coordinate descent method; instead, a gradient descent method with backtracking is required, and the computational cost of the KC method can then become more expensive. Several comparison experiments can be found in [8].
Appendix A. Proof of the closest point property.
Lemma A.1. Consider two point-sets X, Y with the same cardinality n. Suppose that for each i, x_i is the closest point to y_i in the point-set X, and y_i is the closest point to x_i in the point-set Y. Then (Γ^+, Γ^−) being a pair of identity matrices is a maximizer of E_α(Γ^+, Γ^−) in Eq. (7).
The second inequality comes from Hölder's inequality. Thus, (Γ^+, Γ^−) with γ^+_{i,j} = δ_{i,j} = γ^−_{i,j} for all i, j is a maximizer.
Proof. Since K(x, y) ≤ 1, M_α(X, Y) := αE_α(X, X) + (1 − α)E_α(Y, Y) − E_α(X, Y) ≥ 0, where the first inequality comes from the assumption 0 < K ≤ 1 and the second inequality comes from Young's inequality. Hence M_α(X, Y) = 0 implies that γ^+_{i,j}(K(x_i, y_j) − 1) = 0 for all i, j, which implies K(x_i, y_j) = 1 (i.e., x_i = y_j) for each pair (i, j) with γ^+_{i,j} = γ^−_{i,j} > 0. This statement is in fact equivalent to X = Y. In summary, M_α(X, Y) = 0 implies X = Y.
Hence, when u = x̄ − x, Eq. (153) indicates that J is not positive definite at x + t_0(x̄ − x), a contradiction. Hence, we have the desired injectivity. Part 2: this proof was suggested by a reviewer. Since the gradient of each component F_j is constant, F in this case is affine, i.e., F(x) = Jx + b, where b is some constant vector. Hence, F is a bijection.
According to this lemma, we obtain the desired injectivity for the gradients D_xH, D_yH in Theorem 3.6. Fix some x_0, y_0 ∈ R^d. Consider the matrix function H : R^d × R^d → R^{d×d} given by H = {∂²H/∂x_i∂y_j : 1 ≤ i, j ≤ d}. If H(x_0, y) is positive definite for all y ∈ R^d, then D_xH(x_0, y) is an injection in y. Likewise, if H(x, y_0) is positive definite for all x ∈ R^d, then D_yH(x, y_0) is an injection in x.
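For instance, for the log-similarity H(x, y) = x · y of Example 1,

```latex
\Bigl(\frac{\partial^{2} H}{\partial x_i \,\partial y_j}\Bigr)_{1 \le i, j \le d} = I_d ,
```

which is positive definite everywhere, and indeed D_xH(x_0, y) = y is injective in y.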
Appendix E. Proof of the discrete counterpart of the duality. In the proof of no duality gap for the continuous version of the alpha-D model, we assume the absolute continuity condition. Hence, it is necessary to provide another proof to establish the duality for the discrete version of the alpha-D model.

Figure 4. Illustration of image morphing. The third row shows the two given particle-sets P (left) and Q (right). The first row shows the intermediate images at t = 1/6, 2/6, 3/6, 4/6, 5/6 (from left to right). The second row shows the intermediate images convolved with a Gaussian kernel function at t = 1/6, 2/6, 3/6, 4/6, 5/6 (from left to right).