Large Parallax Image Matching Based on Graph Matching

In security video surveillance systems, image stitching technology faces the difficulty of handling large parallax and small overlap areas. This paper reviews traditional stitching technology and argues that image matching is the key technology for solving the large parallax stitching problem. To address the low accuracy of traditional matching methods and their inability to adapt to large parallax non-rigid transformations, a model based on image matching is proposed in this paper for stitching tasks. The model, which combines the similarity of the global structure and the local structure, can not only distinguish feature points at the structural level but also pair them with each other, and the framework can be trained end to end. Training samples drawn from a simulation data set are adopted to train the designed model, which is then tested and verified on test sample sets composed of three kinds of mixed data. Compared with PCA-GM, the model proposed in this paper adapts well to stitching tasks. Its matching accuracy reaches 93.1% and 91.9% on the simulation data set and the small parallax stitching data respectively, which initially solves the image matching problem in the traditional image stitching scene. However, further research is needed, since the matching accuracy on the large parallax data set only reaches 74.5%.


Introduction
With the demand for security in various fields, the installation of video surveillance systems has become more and more essential. However, due to the limited area monitored by a single camera, it is necessary to monitor image scenes under different viewing angles at the same time to obtain more extensive information. Using video stitching technology, videos collected by multiple cameras placed at different viewing angles can be stitched across multiple channels, forming a richer video image. However, due to the constraints of cost and layout, the overlapping area between surveillance cameras is small and the same scene is presented with a large parallax.
In 2013, TJ Chin et al. proposed APAP, which divides the image into grids and aligns them with per-grid homography matrices [1]. It was a milestone for the problem of large parallax stitching. Subsequently, algorithms such as AANAP [2], GSP [3], and SEAGULL [4] all made improvements. However, in the above methods the overlapping area basically has to contain the entire image when the parallax is large.
In the image stitching process, stitching tasks with large parallax and small overlap areas encounter difficulties in the image matching module. Common indirect matching algorithms need to be combined with a feature extraction step, and such methods require post-processing operations. Based on deep graph matching in deep learning, the stitching problem under large parallax is studied in this paper. Under the assumption that feature point extraction has been completed, the matching problem can be converted into the problem of corresponding two point sets. Deep graph matching based on a graph neural network (GNN) can use the graph structure to model the image feature points, which effectively handles the problem of non-rigid transformation. In addition, the graph structure is naturally suited to the study of point-set correspondence problems.

Data Collection and Analysis
Since there are few public data sets available for stitching tasks, a simulation data set had to be established for training in this paper. The VOC2011 image feature point detection data set was adopted, and each original image was cropped twice at different scales [5]. While preserving the integrity of the labels, one image can thus generate two images to be stitched. After that, in order to fit the data characteristics of actual stitching tasks, every image was randomly processed with scaling, rotation, lighting, and noise operations. Similarly, the keypoint coordinates were blurred by a random affine transformation with additional random noise N(0, σ). Finally, 7146 pairs of stitched images were obtained, and every image carried keypoint coordinate label information.
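The keypoint-label blurring described above can be sketched as follows. This is a minimal NumPy sketch: the function name, the range of the random affine perturbation, and the default σ are illustrative assumptions, since the paper does not specify them.

```python
import numpy as np

def perturb_keypoints(kpts, sigma=2.0, max_shift=5.0, rng=None):
    """Apply a small random affine transform plus Gaussian noise N(0, sigma)
    to an (N, 2) array of keypoint coordinates, mimicking the label blurring
    used when building the simulated stitching pairs. The perturbation
    magnitudes are assumptions, not values from the paper."""
    rng = np.random.default_rng(rng)
    # Random affine: near-identity linear part plus a bounded translation.
    A = np.eye(2) + rng.uniform(-0.05, 0.05, size=(2, 2))
    t = rng.uniform(-max_shift, max_shift, size=2)
    warped = kpts @ A.T + t
    # Additive Gaussian noise N(0, sigma) on every coordinate.
    return warped + rng.normal(0.0, sigma, size=kpts.shape)
```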
Two identical 1080p surveillance cameras were adopted to collect stitching test data with large parallax. In addition, it was important to increase the distribution of the data across different scenes and lighting conditions. The samples had to meet the following conditions: first, there must be a certain overlap area between the samples; second, the camera must ensure that the image sequence is acquired within one scene.
According to the parallax, the data can be divided into two types: horizontal parallax of 0-75 degrees and vertical parallax of 0-40 degrees. The 24 pairs of samples whose horizontal parallax was lower than 35 degrees and whose vertical parallax was lower than 20 degrees were regarded as the small parallax data set. The remaining 24 pairs of samples were regarded as the large parallax data set. Meanwhile, the keypoint coordinates were labeled manually.
Finally, the simulation data set was divided into a training set and a testing set at a ratio of 3:1. In addition, the collected data sets with large and small parallax were adopted as testing sets. The training and testing sizes are shown in Table 1. The data sets are shown in Figure 2 and Figure 3, which respectively represent the simulation data set, the small parallax data set, and the large parallax data set.

Proposed Approach
Based on the PCA-GM (permutation loss and cross-graph affinity) model [6], a deep graph matching model suitable for stitching tasks was set up. The model proposed in this paper consists of a feature extraction module, a graph embedding module, and a relation extraction module. In the feature extraction module, a CNN was adopted to extract the image features at the keypoints as the node features input into the graph network. Then, feature aggregation was performed by the graph embedding module, composed of intra-graph matching (Intra-GM) and cross-graph matching (Cross-GM). Meanwhile, the cross-graph affinity matrix was calculated by softmax and passed back to Intra-GM. Finally, the matching relationship matrix was predicted through a sigmoid activation function. The input of the network was the pixels and the keypoint coordinates of the two images. The output was the cross-graph affinity matrix, used to predict the node-to-node correspondence. The entire structure is shown in Figure 1. Compared with PCA-GM, the model proposed in this paper simplifies the graph embedding module, and the combination of the modules has been greatly changed. In addition, the cross-graph affinity matrix was added to the graph structure for iteration. For the asymmetric matching problem in stitching, the softmax activation function was adopted to replace the Sinkhorn algorithm. Although a CNN can implicitly encode position information, this information is provided only by zero-padding. The representation of location information was therefore enhanced through a multimodal approach: in each image, the geometric feature (the keypoint coordinates) and the visual feature extracted by the CNN at each keypoint were concatenated to obtain the node feature. After obtaining a feature vector for every keypoint, the graph structure was constructed by Delaunay triangulation.
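The node-feature construction and Delaunay graph building can be sketched as follows. This is an illustrative NumPy/SciPy sketch under the assumption that keypoint coordinates and per-keypoint CNN descriptors are already available; the function and variable names are not from the paper.

```python
import numpy as np
from scipy.spatial import Delaunay

def build_graph(coords, visual_feats):
    """Concatenate geometric (x, y) coordinates and visual descriptors into
    node features, then connect the keypoints with a Delaunay triangulation
    to obtain the graph's edge set.

    coords:       (N, 2) keypoint coordinates
    visual_feats: (N, d) CNN features sampled at the keypoints (assumed given)
    """
    node_feats = np.concatenate([coords, visual_feats], axis=1)
    tri = Delaunay(coords)
    edges = set()
    for simplex in tri.simplices:            # each triangle contributes 3 edges
        for a, b in ((0, 1), (1, 2), (0, 2)):
            i, j = sorted((int(simplex[a]), int(simplex[b])))
            edges.add((i, j))
    return node_feats, sorted(edges)
```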

Intra-graph Matching.
In Intra-GM, a graph convolutional network (GCN) was adopted as the graph embedding method. The function of the GCN is to effectively aggregate each node with its adjacent nodes [7], introducing convolution calculations to enhance the learning efficiency of the network. Intra-GM transfers feature information between adjacent nodes through the GCN and embeds the information of the graph structure into the node feature vectors so that they reflect the structure of the graph.
The message-passing scheme of Intra-GM is as follows:

m_i^(k) = (1 / |N(i)|) ∑_{j∈N(i)} f_msg(h_j^(k-1))   (2)

h_i^(k) = f_node(h_i^(k-1), m_i^(k))   (3)

In order to avoid bias due to the different numbers of neighbors owned by different nodes, Eq.(2) normalizes the aggregated features of the adjacent nodes by the total number of adjacent nodes, which is common practice in GCNs. Eq.(3) uses the accumulated information to update the state of node i.
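The normalized aggregation of Eq.(2) and the node update of Eq.(3) can be sketched as below. The weight matrices and the ReLU nonlinearity are assumptions, since the paper does not specify the exact message and update functions.

```python
import numpy as np

def intra_gm_layer(H, neighbors, W_msg, W_node):
    """One Intra-GM message-passing step (a sketch; W_msg and W_node stand in
    for the learnable weights of f_msg and f_node).

    H:         (N, d) node features from the previous layer
    neighbors: list of neighbor-index lists; every node is assumed to have
               at least one neighbor (true for a Delaunay graph)
    """
    M = np.zeros_like(H)
    for i, nbrs in enumerate(neighbors):
        # Eq.(2): mean over adjacent nodes avoids bias from varying degree.
        M[i] = (H[list(nbrs)] @ W_msg).mean(axis=0)
    # Eq.(3): fuse the node's previous state with the aggregated message.
    return np.maximum(H @ W_node + M, 0.0)   # ReLU is an assumed nonlinearity
```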

Cross-graph Affinity Matrix.
For the graphs G1 = (V1, E1) and G2 = (V2, E2), the network obtains the feature vectors h_i^(k-1), h_j^(k-1) of the (k-1)-th layer after Intra-GM, with i ∈ V1, j ∈ V2. An affinity matrix M is constructed by calculating the affinity of any two vectors between the two graphs:

M_ij = exp( h_i^⊺ A h_j / τ )   (4)

M_ij stands for the similarity between node i in the first graph and node j in the second graph, i.e., the similarity score between the two graphs. The exponential function ensures a positive output. Since the feature vectors have m dimensions, ∀ i ∈ V1, j ∈ V2 we have M ∈ R^(|V1|×|V2|), and A ∈ R^(m×m) contains the learnable weights of this affinity function. τ is a hyperparameter that adjusts the discriminative ability of Eq.(4): for τ > 0, Eq.(4) becomes more discriminative as τ → 0^+. After calculating the affinity matrix M, the softmax activation function is applied along the row dimension of the matrix to solve for the relationship matrix. When the temperature parameter of the softmax tends to 0, the softmax output becomes a one-hot vector:

S = softmax(M)   (5)

The matching relationship matrix S participates in the update of Cross-GM.
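Eqs.(4)-(5) can be sketched as follows. Note that row-normalizing the exponentiated scores is exactly a row-wise softmax of the raw affinities h_i^⊺ A h_j / τ; the function and variable names are illustrative.

```python
import numpy as np

def cross_affinity(H1, H2, A, tau=1.0):
    """Eq.(4)-(5) sketch: M_ij = exp(h1_i^T A h2_j / tau), then a row-wise
    softmax yields the relationship matrix S.

    H1: (n1, m) node features of graph 1
    H2: (n2, m) node features of graph 2
    A:  (m, m) learnable weights of the affinity function
    """
    M = np.exp((H1 @ A @ H2.T) / tau)        # strictly positive affinities
    S = M / M.sum(axis=1, keepdims=True)     # row-wise softmax over the scores
    return M, S
```

Lowering τ sharpens the row distributions of S toward one-hot vectors, matching the temperature behavior described above.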

Cross-graph Matching.
Cross-GM needs to transfer features between the two graph structures: each node aggregates features from the nodes with similar features in the other graph. Cross-GM, similar to Intra-GM, adopts a GCN for message transmission.
The transfer method of Cross-GM is as follows:

m_i^(k) = ∑_{j∈V2} S_ij f(h_j^(k-1))   (6)

h_i^(k) = f_update( h_i^(k-1) ‖ m_i^(k) )   (7)

Eq.(6) aggregates features from the other graph, where f is taken as the identity mapping and S_ij is the matching relationship matrix. Eq.(7) accumulates the information to update the state of node i in G1, where ‖ denotes the concatenation of the two input feature tensors, followed by a fully-connected layer.
The calculation steps of Cross-GM are as follows. First, input the feature vectors h_i^(k-1), h_j^(k-1), i ∈ V1, j ∈ V2 produced by Intra-GM; the similarity of any two vectors between the two graphs is calculated by Eq.(4) and an affinity matrix M is constructed. The second step is to apply the softmax activation function to the affinity matrix M, as in Eq.(5), to extract the cross-graph matching relationship matrix S. It should be noted that S stands for the predicted correspondence from G2 to G1, while S^⊺ stands for the correspondence from G1 to G2. Finally, this cross-graph matching relationship matrix S is adopted as the update weight between the two graphs: the more similar the features h_i^(k-1), h_j^(k-1) of a node pair are, the higher the weight applied when updating across graphs.
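The Cross-GM update of Eqs.(6)-(7) can be sketched as below; for illustration the fully-connected layer after the concatenation is reduced to a single assumed weight matrix.

```python
import numpy as np

def cross_gm_layer(H1, H2, S, W_update):
    """Cross-GM sketch: aggregate features from the other graph weighted by
    the matching relationship matrix S (Eq.(6), with an identity message
    mapping), then concatenate with the node's own feature and apply a
    fully-connected layer (Eq.(7)).

    H1: (n1, d) features of graph 1      S:        (n1, n2) relationship matrix
    H2: (n2, d) features of graph 2      W_update: (2d, d) assumed FC weights
    """
    M = S @ H2                                          # Eq.(6)
    return np.concatenate([H1, M], axis=1) @ W_update   # Eq.(7)
```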

Relation Extraction.
After the affinity matrix M is calculated in the last layer, the sigmoid activation function is applied to every row of M. After setting a threshold t, all elements greater than t are extracted, and the candidate node relationship pairs can then be found through their subscripts.
If more than one relationship is selected for a vertex, the vertex corresponding to the element with the largest probability value in the candidate set is chosen, which enforces a one-to-one correspondence. Finally, the predicted matching relationship matrix is obtained.
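The thresholding and one-to-one resolution described above can be sketched as follows; the threshold value and names are illustrative.

```python
import numpy as np

def extract_relations(M, t=0.5):
    """Relation-extraction sketch: apply a sigmoid to the final affinity
    matrix, keep entries above threshold t, and resolve multiple candidates
    per source vertex by taking the highest-probability match."""
    P = 1.0 / (1.0 + np.exp(-M))             # element-wise sigmoid
    matches = {}
    for i, row in enumerate(P):
        candidates = np.where(row > t)[0]
        if candidates.size > 0:
            # One-to-one: keep only the most probable candidate for vertex i.
            matches[i] = int(candidates[np.argmax(row[candidates])])
    return matches
```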
Loss Function.
The node-to-node correspondence in the ground truth was adopted as the supervision information for end-to-end training. In addition, a permutation loss based on cross entropy was adopted.
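A cross-entropy-based permutation loss can be sketched as below; this is a generic element-wise binary cross-entropy form, since the paper does not spell out the exact reduction it uses.

```python
import numpy as np

def permutation_loss(S, P_gt, eps=1e-9):
    """Permutation-loss sketch: element-wise binary cross entropy between the
    predicted relationship matrix S (entries in (0, 1)) and the ground-truth
    permutation matrix P_gt (entries in {0, 1}). eps guards against log(0)."""
    return -np.mean(P_gt * np.log(S + eps) + (1 - P_gt) * np.log(1 - S + eps))
```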

Results
Since the feature point coordinates are known, the matching accuracy predicted by the network can be computed from the predicted matching relationship matrix P and the ground-truth permutation matrix P^gt:

accuracy = ∑_{i,j} AND(P_ij, P^gt_ij) / N   (10)

where AND stands for the logical conjunction and N is the total number of keypoint pairs. That is, the matching accuracy is the number of correctly matched keypoint pairs divided by the total number of keypoint pairs.
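Eq.(10) can be sketched as follows (function name illustrative):

```python
import numpy as np

def matching_accuracy(P_pred, P_gt):
    """Eq.(10) sketch: accuracy = (# correctly matched keypoint pairs) /
    (total number of ground-truth pairs), via the logical AND of the
    predicted and ground-truth permutation matrices."""
    correct = np.logical_and(P_pred.astype(bool), P_gt.astype(bool)).sum()
    return correct / P_gt.sum()
```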
Comparative experiments were conducted against PCA-GM under the same test conditions. The simulation data set was adopted in the experiment: 5360 pairs of images (10720 images in total) were used as the training set, and the remaining 1786 pairs were used as the validation set. In addition, the small and large parallax data sets were adopted as test sets. The specific test results are shown in Table 2.
The experimental results show that, under the same test set conditions, the model proposed in this paper is more suitable for image stitching tasks than PCA-GM. The proposed model achieves a matching accuracy of 91.9% on the small parallax test set, indicating that it can achieve good results in traditional stitching tasks. The testing results are shown in Figure 2.
On the large parallax test set, the matching accuracy reaches 74.5%. For large parallax images where the cameras' horizontal deviation exceeds 75 degrees or the vertical deviation exceeds 20 degrees, mismatching problems occur. The model therefore still needs improvement for large parallax image registration. The testing results are shown in Figure 3.

Conclusion
Aiming at the low matching accuracy of traditional registration methods in large parallax image stitching, a new deep graph matching model based on PCA-GM is proposed in this paper. The network is composed of a feature extraction module, a graph embedding module, and a relation extraction module. The results show that, under the same test conditions, the model proposed in this paper not only possesses higher matching accuracy than PCA-GM but also adapts well to the stitching task. However, the matching accuracy on images with large parallax still needs to be improved.