D-MSCD: Mean-Standard Deviation Curve Descriptor Based on Deep Learning

Curve feature description is an important problem in image matching, and in past years it has been studied mainly with handcrafted methods. To overcome the low discrimination and weak robustness of curve feature description under complex conditions, a Mean-Standard Deviation Curve Descriptor based on Deep learning (D-MSCD) is proposed in this paper. Firstly, a large-scale curve feature dataset with about 210,000 labeled curve image patches is constructed for training and testing. After the support areas of the curve in each image are longitudinally compressed into the support areas of points, the mean and standard deviation image patches of each curve are obtained, so that each curve is uniquely represented by a curve image patch. Secondly, a modified L2-Net(DSM), a network architecture with dilated convolution, is constructed to improve the performance of curve descriptors; experimental results on the Brown dataset show that the mean FPR95 value is reduced by 17.48%. Finally, the modified L2-Net(DSM) is trained on the large-scale curve feature dataset to obtain the D-MSCD model. It achieves the best matching performance under every image change, and the average matching performance on the Oxford dataset is improved by 13.09%. Experimental results demonstrate that the proposed D-MSCD is more effective than the traditional handcrafted curve descriptors.


I. INTRODUCTION
Feature description is a key technology in the field of computer vision, with considerable applications in image retrieval [1], scene recognition [2], [3], and 3D reconstruction [4], [5]. Curve feature description is an important step in image feature matching, and the performance of descriptors has a direct impact on the matching results. Therefore, the study of robust feature description methods has received much attention from researchers.
The traditional curve feature description methods are handcrafted: they draw on the accumulated experience and design inspiration of researchers, and the desired structural features of the image are formalized through appropriate mathematical tools to obtain the corresponding feature description. Various handcrafted methods have been proposed for curve matching in recent years [6]-[10], but these methods suffer from low distinguishability and weak robustness under complex conditions. Therefore, a new method for curve feature description is needed.
The associate editor coordinating the review of this manuscript and approving it for publication was Emre Koyuncu.
With the continuous success of deep learning in image recognition tasks, research on image feature matching has entered a new data-driven era. In recent years, attempts to use deep learning for image feature description and matching have also shown great promise [11]-[18]. However, no method that uses deep learning to describe curve features has been reported so far. One reason is that deep learning requires a large amount of annotated data; another is that there is no neural network architecture suitable for curve feature training. Therefore, the key to using deep learning for curve feature description is to transform the problem of curve feature description into a deep learning problem.
This paper attempts to use a convolutional neural network to learn curve characteristics. Specifically, we convert the curve feature description problem into the mean and standard deviation of point feature description and propose a method for describing curve features based on deep learning to achieve reliable curve feature matching. Compared with traditional methods, the contributions of this paper are as follows:
1) A large-scale image curve dataset labeled with matching information is constructed for network training and testing.
2) The recent L2-Net(DSM) [17] is improved and trained on this dataset.
3) A feasible curve feature descriptor based on deep learning is proposed, and the experimental results demonstrate that the proposed D-MSCD has better matching performance than the traditional handcrafted descriptors.
The remainder of this article is structured as follows: Section II describes the related work. Section III elaborates on the proposed method in detail. Experimental results are presented in Section IV, and the conclusion is given in Section V.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. RELATED WORK
A. TRADITIONAL CURVE FEATURE DESCRIPTOR
Curve feature description plays an important role in image feature description and has attracted considerable attention from scholars. The most classical feature descriptor is SIFT [6], proposed by Lowe. It is based on the gradient distribution in the detected regions and is invariant to scale, rotation, and viewpoint change. Wang et al. [7] proposed the Mean-Standard Deviation Line Descriptor (MSLD) based on the neighborhood location division idea of SIFT and extended it to curve description to obtain the Mean-Standard Deviation Curve Descriptor (MSCD). The MSCD solved the problem of a unified description of lines of different lengths. However, when the viewing angle changes, image deformation distorts the shape of the region, which degrades the matching ability of the descriptor. Liu et al. [8] divided sub-regions according to the overall intensity order and the local intensity order mapping and proposed the Intensity Order Curve Descriptor (IOCD), which is robust to image rotation, viewpoint change, illumination change, blur, noise, and JPEG compression. However, when the image contains shadows or partial occlusions, the intensity order division produces wrong sub-regions, resulting in incorrect matches. Wang et al. [9] combined the idea of intensity order division with MSCD and proposed the Intensity Order Based Mean-Standard Deviation Descriptor (IOMSD). The principle of this algorithm is simple and stable, but its description performance is not high and it is not suitable for weakly textured images. Liu et al. [10] proposed the Gradient Order Curve Descriptor (GOCD), which is constructed from a global gradient magnitude order for subregion division and a local gradient order feature. However, since the radius of the pixel support region and the stride for computing the gradient magnitude around a feature point are fixed, it is not invariant to large scale changes.
These traditional methods can be widely used for image registration, for matching optical images under different geometric and photometric transformations such as scale, rotation, blur, illumination, and JPEG compression, and for images of textured scenes. However, these methods are limited when noise alters the intensity distribution and illumination.

B. DEEP LEARNING-BASED LOCAL FEATURE DESCRIPTOR
Following the boom in handcrafted descriptors over the past decades, more and more deep learning-based descriptors have appeared. Han et al. [12] proposed MatchNet, which consists of a feature network for extracting feature representations, a bottleneck layer for reducing the feature dimension, and a metric network for measuring the similarity of feature pairs. It shows the great potential of deep learning for local feature description. Balntas et al. [13] proposed to train the CNN using the distance relationship, within each triplet, between the positive samples and the pair of negative samples that is more difficult to distinguish, and achieved good feature matching results with only 2 convolutional layers. Tian et al. [14] proposed L2-Net, which uses a fully convolutional network structure for feature descriptor learning and compares the distance of positive samples with the distances of all negative samples during training. It greatly outperformed previous methods. Mishchuk et al. [15] proposed HardNet, which considers only the relative distance between positive samples and the hardest negative samples in a batch of training data. This further improved on the matching performance of L2-Net. Zhang and Rusinkiewicz [17] proposed a new triplet loss based on HardNet, which replaces the hard margin with a dynamic soft margin and obtains better matching performance. Tian et al. [18] proposed the Second Order Similarity Regularization (SOSR), incorporating second-order similarities into the learning of local descriptors, which significantly improved the matching performance of learned descriptors. These deep learning methods for local image feature description suggest that deep learning can also be used for curve feature description.

III. PROPOSED METHODOLOGY
Unlike traditional handcrafted descriptors, which are mostly driven by intuition or the researcher's expertise, deep learning-based methods are driven by data. We therefore constructed a large-scale image curve dataset labeled with matching information for network training and testing.

A. THE CURVE FEATURE DATASET
1) IMAGE PAIRS
We created about 1,700 sets of image pairs with a size of 640 × 480 pixels through Internet downloading and mobile phone shooting; most of them show buildings and cultural relics, which contain many curves. These images include seven types of changes, namely, Scale, Illumination, Blur, Viewpoint, Rotation, Compression, and Noise. As can be seen in Fig. 1, the first image in each row is the reference image and the rest are target images with different degrees of transformation; each target image forms an image pair with the reference image.
To produce the image pairs under different transformations, MATLAB R2015b and Photoshop CC 2018 are used, and the following operations are performed:

a: SCALE
The reference image of each group is downloaded from the Internet, and the target images are obtained by using Photoshop to crop the reference image to different degrees and then enlarging the crops to the size of the reference image.

b: ILLUMINATION
The reference image of each group is downloaded from the Internet, and the target images are obtained by adjusting the illumination of the reference image with the curve tool in Photoshop.

c: BLUR
The reference image of each group is downloaded from the Internet, and the blurred target images are generated from the reference image with the Matlab function 'fspecial', whose filter type is set to 'average', 'gaussian', and 'disk'.

d: VIEWPOINT
The reference image and the target images in each group are all taken with a mobile phone from different perspectives.

e: ROTATION
The reference image and the target images in each group are all taken with a mobile phone from different perspectives.

f: COMPRESSION
The reference image of each group is downloaded from the Internet, and the target images are obtained by compressing the reference image through a program, in which the compression ratios are set to 75%, 85%, 90%, and 95%, respectively.

g: NOISE
The reference image of each group is downloaded from the Internet, and the target images are generated from the reference image with the Matlab function 'imnoise', whose noise type is set to 'salt & pepper' and 'gaussian'.
We divided the image pairs into a training set and a testing set; the training set contains 1,395 image pairs and the testing set contains 313 image pairs. Fig. 2 shows the number of image pairs in the training set and testing set under different transformations.

2) CURVE FEATURE DATASET
The Canny edge detection operator [19] is used to extract the curves of each image pair; points with curvature greater than 0.8 are filtered out and curves shorter than 20 pixels are eliminated. For each image pair, the IOCD [8] is used to obtain the curve matching result. To improve the accuracy of the matching result, manual culling is used to delete wrong matches, leaving the correctly matched curve pairs in the image pair.
We use the local image patches around the curve to characterize the curve: the neighborhood of the curve is transformed into a square image patch independent of the curve length. For any curve $C$ composed of $Num(C)$ points, the pixels on $C$ are denoted as $P_k$, $k = 1, 2, \ldots, Num(C)$. The image patch $I(P_k)$ along the gradient direction, with a length and width of 64 pixels and centered at $P_k$, is extracted as its local neighborhood. Then, the mean matrix $M(C)$ and standard deviation matrix $S(C)$ of the local neighborhoods of all the pixels are calculated to obtain two patches of the same size as the neighborhood of a pixel. $M(C)$ and $S(C)$ can be calculated as:
$$M(C) = Mean\left(I(P_1), I(P_2), \ldots, I(P_{Num(C)})\right)$$
$$S(C) = Std\left(I(P_1), I(P_2), \ldots, I(P_{Num(C)})\right)$$
where Mean denotes the mean value of the matrices and Std denotes the standard deviation of the matrices. Finally, the mean matrix and standard deviation matrix are concatenated to obtain the curve patch $A(C)$, which represents the curve $C$ uniquely:
$$A(C) = \left[M(C), S(C)\right]$$
Fig. 3 illustrates the construction of curve patches, taking a pair of matching curves as an example. For two matched curves $C$ and $C'$ with lengths of 278 and 261 pixels respectively, the fixed-size matrices $A(C)$ and $A(C')$ represent the curves uniquely.
In this way, a large-scale curve feature dataset with a patch size of 64 × 128 pixels is constructed; it contains 214,296 curve patches labeled with matching information. The number of patches for each type of change is shown in Fig. 6: each category has over 30,000 curve patches, of which about 25,000 are in the training set and about 5,000 are in the testing set.
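The patch construction described above can be illustrated with a minimal sketch. It assumes that the mean and standard deviation are taken pixel by pixel across the neighborhoods of all curve points, that the population standard deviation is used, and that the two matrices are placed side by side; none of these details is confirmed by the text, and the function name `curve_patch` is hypothetical.

```python
from statistics import fmean, pstdev


def curve_patch(point_patches):
    """Build A(C) = [M(C), S(C)] from the per-point patches I(P_k).

    point_patches: list of equally sized 2-D patches (each a list of rows),
    one patch per pixel P_k on the curve.
    Returns a patch with the same height and twice the width.
    """
    height = len(point_patches[0])
    width = len(point_patches[0][0])
    combined = []
    for r in range(height):
        mean_row, std_row = [], []
        for c in range(width):
            # Values of pixel (r, c) across all points of the curve.
            values = [patch[r][c] for patch in point_patches]
            mean_row.append(fmean(values))
            # Assumption: population standard deviation.
            std_row.append(pstdev(values))
        # Assumption: M(C) and S(C) are placed side by side.
        combined.append(mean_row + std_row)
    return combined


# Two toy 2 x 2 "neighborhoods" standing in for 64 x 64 patches.
patches = [
    [[0, 2], [4, 6]],
    [[2, 4], [6, 8]],
]
result = curve_patch(patches)
```

Because the statistics are taken across the points of the curve, the output size depends only on the neighborhood size and not on the number of points, which is what makes curves of different lengths comparable.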

B. NETWORK ARCHITECTURE
The basic architecture of our network-a, shown in Fig. 4, is adopted from L2-Net(DSM) [17], which is built as a seven-layer fully convolutional structure. Compared with L2-Net(DSM), our network has two more convolutional layers, namely the fourth and the seventh convolutional layers. The fourth convolutional layer contains 32 kernels of size 3 × 3 and the seventh convolutional layer contains 64 kernels of size 3 × 3. Dilated convolution can expand the receptive field without the information loss caused by pooling, so that each convolution output covers a larger range of information [20]; we use it in the fourth and seventh layers to learn more features. Batch normalization and ReLU are applied after each convolutional layer except the last. There are no pooling layers, and dropout regularization is used before the last layer. Zero padding is applied to all convolutional layers except the final one, and the convolutional kernel size is 3 for all layers except the last. Each curve patch of size 128 × 64 pixels in our curve feature dataset is divided into two curve patches of size 64 × 64 pixels, which serve as the input of the network. The output of the network is L2 normalized to produce a 128-D descriptor with unit length.
Besides, we studied the improved L2-Net(DSM) in network-b, shown in Fig. 5, which produces a 256-D descriptor with unit length. Compared with network-a, the number of convolutional kernels in each layer is doubled. In the first to third convolutional layers, the number of convolutional kernels is changed from 32 to 64. In the fourth to sixth convolutional layers, the number of convolutional kernels is changed from 64 to 128. In the seventh to eighth convolutional layers, the number of convolutional kernels is changed from 128 to 256. The other network parameters are the same as the network-a shown in Fig. 4.
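To make the effect of dilation on the receptive field concrete, the following sketch computes the receptive field of a small stack of convolutional layers using the standard relation that a kernel of size k with dilation d spans d(k - 1) + 1 pixels. The three-layer stack is a hypothetical example chosen for illustration, not the architecture in Fig. 4 or Fig. 5.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def receptive_field(layers):
    """Receptive field of stacked convolutions.

    layers: list of (kernel_size, stride, dilation) tuples, input to output.
    """
    field = 1  # one output pixel sees one input pixel before any layer
    jump = 1   # distance in input pixels between adjacent outputs
    for kernel_size, stride, dilation in layers:
        field += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return field


# Hypothetical stack: three 3 x 3 layers with stride 1.
plain = receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 1)])
# Same stack, but the middle layer uses a dilation rate of 2.
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 1)])
```

In this toy stack the dilated variant sees a 9-pixel window instead of 7 with the same number of weights and no pooling.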

C. LOSS FUNCTION
The Dynamic Soft Margin (DSM) loss function [17] is used in this paper to obtain the real-valued curve feature descriptor. The "harder" triplets in a mini-batch are more useful for training, so it is necessary to measure how hard a triplet is compared with the other triplets in the same mini-batch, by computing its signed distance to the decision boundary ($d_{pos} - d_{neg}$) and the distribution of these distances. Given a mini-batch of size $N$, the Probability Distribution Function (PDF) of the signed distances is discretized into a histogram. The value $d_{pos} - d_{neg}$ is computed for each triplet to make the aggregated histogram more accurate, and is then linearly allocated to the two neighboring bins in the histogram; the Cumulative Distribution Function (CDF) is obtained by integrating the histogram. The loss is defined as:
$$L = \frac{1}{N} \sum_{i=1}^{N} w_i \cdot \left(d_{pos} - d_{neg}\right)_i$$
where the weight $w_i$ of each triplet is the corresponding value from the CDF.
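The weighting scheme described above can be sketched for a single mini-batch as follows. This is a simplified illustration under stated assumptions: the number of bins, the range [-2, 2] (the possible signed distances between unit-length descriptors), and reading the CDF by linear interpolation are choices made here, and the function names are hypothetical. Any aggregation of the histogram across mini-batches is omitted.

```python
def _bin_position(value, lo, hi, num_bins):
    """Return (left bin index, fraction toward the right neighbor)."""
    step = (hi - lo) / (num_bins - 1)
    pos = (min(max(value, lo), hi) - lo) / step
    left = min(int(pos), num_bins - 2)
    return left, pos - left


def dsm_loss(signed_distances, num_bins=9, lo=-2.0, hi=2.0):
    """CDF-weighted loss over the signed distances d_pos - d_neg."""
    # Histogram (discretized PDF): each value is split linearly
    # between its two neighboring bins.
    hist = [0.0] * num_bins
    for s in signed_distances:
        left, frac = _bin_position(s, lo, hi, num_bins)
        hist[left] += 1.0 - frac
        hist[left + 1] += frac

    # CDF: running sum of the normalized histogram.
    total = sum(hist)
    cdf, running = [], 0.0
    for count in hist:
        running += count / total
        cdf.append(running)

    # Weight of each triplet: CDF value at its signed distance
    # (assumption: linear interpolation between bins).
    weights = []
    for s in signed_distances:
        left, frac = _bin_position(s, lo, hi, num_bins)
        weights.append((1.0 - frac) * cdf[left] + frac * cdf[left + 1])

    loss = sum(w * s for w, s in zip(weights, signed_distances))
    return loss / len(signed_distances), weights


# Three toy triplets, from easy (very negative) to hard (positive).
loss, weights = dsm_loss([-1.5, -0.5, 0.5])
```

The hardest triplet in the batch receives the largest weight, so it contributes most to the loss without a fixed margin having to be chosen.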

IV. EXPERIMENTS
To evaluate the performance of the proposed D-MSCD, we use two evaluation metrics: FPR95 (the false positive rate (FPR) at a true positive rate (TPR) of 95%) and mAP (mean Average Precision) [25]. Specifically, in the experiments for parameter selection, the FPR95 is computed when TPR = 0.95 according to the following equation:
$$FPR95 = 1 - TN$$
where TN denotes the true negative rate. For image matching, the mAP is adopted as the performance indicator. The Average Precision (AP) score of the positive matching category on an image pair is first computed as:
$$AP = \frac{1}{n} \sum_{k=1}^{n} Precision(k)$$
where $n$ represents the total number of retrieved positively matched curve pairs and Precision denotes the ratio of the number of retrieved positively matched curve pairs to the total number of retrieved curve pairs. Then, the mAP can be calculated as:
$$mAP = \frac{1}{m} \sum_{i=1}^{m} AP_i$$
where $m$ is the total number of image pairs in the test set. As the descriptor is learned from the mean and standard deviation curve patches by deep learning, we name the descriptor proposed in this paper D-MSCD, and in the following experiments we name the descriptor learned from Network-a D-MSCD-a and the descriptor learned from Network-b D-MSCD-b.
In the following Hyper-Parameters Selection and Curve Matching sections, each curve patch of size 128 × 64 pixels in the training set is divided into a mean patch and a standard deviation patch of size 64 × 64 pixels. The mean descriptor and the standard deviation descriptor are then obtained by network training, and finally they are combined to generate the D-MSCD. In these two sections, D-MSCD-a is 256-D and D-MSCD-b is 512-D.
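As a rough illustration of the two metrics, the sketch below computes FPR95 from descriptor distances and match labels, and the AP and mAP from ranked match lists. It assumes that pairs are ranked by ascending descriptor distance, that ties are not handled specially, and that precision is read at each retrieved positive; the function names are hypothetical and the numbers are toy values, not results from the paper.

```python
import math


def fpr_at_95_tpr(distances, labels):
    """False positive rate at the threshold where 95% of positives are accepted.

    distances: descriptor distance of each pair (smaller means more similar).
    labels: 1 for a true matching pair, 0 for a non-matching pair.
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    needed = math.ceil(0.95 * num_pos)
    true_pos = false_pos = 0
    for _, label in sorted(zip(distances, labels)):
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        if true_pos >= needed:
            break
    return false_pos / num_neg


def average_precision(ranked_labels):
    """Mean of the precision values read at each retrieved positive."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(ranked_labels_per_pair):
    """Average of the AP scores over all image pairs."""
    scores = [average_precision(r) for r in ranked_labels_per_pair]
    return sum(scores) / len(scores)


fpr95 = fpr_at_95_tpr([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [1, 1, 0, 1, 0, 0])
ap = average_precision([1, 1, 0, 1])
m_ap = mean_average_precision([[1, 1, 0, 1], [1, 1, 1]])
```

A lower FPR95 and a higher mAP both indicate a more discriminative descriptor.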

A. HYPER-PARAMETERS SELECTION
The basic architecture of our modified network and the loss function are based on L2-Net(DSM), so we choose the same hyper-parameters as L2-Net(DSM). Stochastic Gradient Descent (SGD) with momentum and weight decay equal to 0.9 and 0.0001, respectively, is used to optimize the network. Weights are initialized orthogonally with a gain of 0.6, biases are set to 0.01, the learning rate is linearly decayed from 0.1 to 0, and the dilation rate is 2. Training is done with the PyTorch library. Two TITAN RTX GPUs are employed to run the experiments.
In addition, we studied the influence of the batch size on network performance, reporting results for batch sizes of 64, 128, 256, 512, and 1024. We trained the model on Network-b using the training set of the large-scale curve feature dataset constructed in Section III. A2) and tested it on the testing set. Fig. 7 shows the average FPR95 value over the seven types of changes. The performance improves as the mini-batch size increases, but there is little further benefit beyond a batch size of 512, and the performance stabilizes after 14 epochs. To make full use of the GPUs, we set the batch size to 1024 and the number of training epochs to 20 in the following experiments. Table 1 shows the results on the testing set under different transformations.
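The linearly decayed learning rate used during training can be written as a one-line schedule. The sketch below assumes the decay is applied per optimization step over the whole run; whether the original implementation updates per step or per epoch is not stated, and the function name is hypothetical.

```python
def linear_decay_lr(step, total_steps, initial_lr=0.1):
    """Learning rate that falls linearly from initial_lr to 0."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial_lr * (1.0 - progress)


# Learning rate at the start, middle, and end of a 100-step run.
schedule = [linear_decay_lr(s, 100) for s in (0, 50, 100)]
```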

B. NETWORK PERFORMANCE
To verify the performance of our modified network, we conducted tests on the Brown dataset [26]. The Brown dataset consists of three subsets, Liberty, Notredame, and Yosemite, each with about 400k normalized 64 × 64 patches, and the dataset assigns each patch a 3D point ID to identify matching image patches. Each 3D point ID is associated with a list of patches that are assumed to match. Keypoints were detected by the DoG detector and verified by a 3D model. Data augmentation is achieved by randomly flipping the patch and rotating it by 90, 180, or 270 degrees. The patch pair classification benchmark measures the ability of a descriptor to discriminate positive patch pairs from negative ones in the Brown dataset. We adopt the commonly used false positive rate at 95% true positive recall (FPR95) to evaluate how well the descriptor classifies the patch pairs. We train one model on each subset and test it on the other two subsets; for example, we train one model on the Notredame subset and test it on the Liberty and Yosemite subsets. The results are shown in Table 2. Our descriptors show the best performance among the compared descriptors under the same configuration: the mean FPR95 value is reduced by 17.48% and 27.94% compared with SOSNet and L2-Net(DSM), respectively. This supports the superiority of our modified network.

C. CURVE MATCHING
To further evaluate the performance of D-MSCD, we compared its matching performance with that of the traditional handcrafted descriptors IOCD, IOMSD, and GOCD on the Oxford dataset [25] and the Paper dataset (a dataset of image pairs used in the papers on IOCD, IOMSD, and GOCD, shown in Fig. 8; we named it the Paper dataset for convenience), using each descriptor to match the curves in the reference image R and the target image T. First, we obtain the curve patches $M_R = \{M(C_i), i = 1, \ldots, N_1\}$ in the reference image R and $M_T = \{M(C_j), j = 1, \ldots, N_2\}$ in the target image T, where $N_1$ is the number of detected curves in R and $N_2$ is that in T. The curve patches are obtained using the same method as stated in Section III. A2). Next, we feed $M_R$ and $M_T$ into the trained network separately and output the corresponding description matrices of size $N_1 \times 256$ and $N_2 \times 256$ (D-MSCD-a) or $N_1 \times 512$ and $N_2 \times 512$ (D-MSCD-b). Then, the nearest neighbor to next nearest neighbor distance ratio (NNDR) matching criterion, with a threshold of 0.8, is used to obtain the final matching results of the two images. To obtain the ground-truth matching results, all the correctly matched curves contained in the image pairs of the two datasets are manually labeled. The large-scale curve feature dataset we constructed is used for training in the following experiments. The results reveal that the descriptors achieve state-of-the-art performance.
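The NNDR criterion used here can be sketched as follows. The example assumes Euclidean distance between descriptors and uses toy 2-D vectors in place of the 256-D or 512-D descriptors; the function name `nndr_match` is hypothetical, and at least two target descriptors are required.

```python
import math


def nndr_match(ref_descriptors, tgt_descriptors, ratio=0.8):
    """Match reference curves to target curves with the NNDR test.

    A reference descriptor is matched to its nearest target descriptor only
    if the nearest distance is less than `ratio` times the second nearest.
    Returns a list of (reference index, target index) pairs.
    """
    matches = []
    for i, ref in enumerate(ref_descriptors):
        distances = sorted(
            (math.dist(ref, tgt), j) for j, tgt in enumerate(tgt_descriptors)
        )
        (nearest, j_nearest), (second, _) = distances[0], distances[1]
        # Comparing against ratio * second avoids dividing by zero.
        if nearest < ratio * second:
            matches.append((i, j_nearest))
    return matches


reference = [[1.0, 0.0], [0.0, 1.0], [0.45, 0.55]]
target = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
matches = nndr_match(reference, target)
```

The third reference descriptor is equally close to two targets, so the ratio test rejects it as ambiguous rather than forcing a match.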

1) CURVE MATCHING ON OXFORD DATASET
The Oxford dataset is a standard benchmark used to evaluate the performance of image feature algorithms. We evaluate our descriptors on five image sequences, namely, Boat (Rotation), Leuven (Illumination), Bikes (Blur), Graf (Viewpoint), and UBC (Compression). In each image sequence, there are six images sorted in order of increasing distortion with respect to the first image, so each image sequence constitutes five image pairs. The mAP is used to measure the matching performance of the descriptors. Fig. 9 (a) compares the matching performance of the proposed D-MSCD-a and D-MSCD-b with that of IOCD, IOMSD, and GOCD on the Oxford dataset. Both D-MSCD-a and D-MSCD-b achieve the best performance on each image sequence compared with the traditional handcrafted descriptors, and D-MSCD-b has a slight advantage over D-MSCD-a. The average matching performance is improved by 13.09%, 34.48%, and 31.32% compared with IOCD, IOMSD, and GOCD, respectively. The improvement is especially large under blur change and viewpoint change. This performance improvement indicates that the proposed D-MSCD is superior to handcrafted descriptors.
In addition to the matching accuracy, the number of correctly matched curves is also an important factor to measure the performance of the descriptor. Table 3 shows the total number of correctly matched curves obtained by different descriptors on the Oxford dataset. It can be seen that with the same number of detected curves, the D-MSCD has an obvious advantage in the total number of correctly matched curves.

2) CURVE MATCHING ON PAPER DATASET
To further verify the matching performance of our descriptor, we tested the different descriptors on the Paper dataset (shown in Fig. 8) with the same method as above. The dataset includes seven image sequences, namely, Scale change, Illumination change, Blur change, Viewpoint change, Rotation change, JPEG compression change, and Noise change. In each image sequence, there are five pairs of different images. Fig. 9 (b) compares the matching performance of D-MSCD-a and D-MSCD-b with that of IOCD, IOMSD, and GOCD on the Paper dataset; the proposed D-MSCD shows the best performance on each image sequence. Moreover, D-MSCD-b has a slight advantage over D-MSCD-a, and the average matching performance is improved by 5.14%, 12.23%, and 16.94% compared with IOCD, IOMSD, and GOCD, respectively. The improvement is again large under blur change and viewpoint change, which is similar to the behavior on the Oxford dataset. Table 4 shows the total number of correctly matched curves for the different descriptors on the Paper dataset with the same number of detected curves. The proposed D-MSCD shows the highest performance on each image sequence compared with the handcrafted descriptors IOCD, IOMSD, and GOCD.

V. CONCLUSION
Inspired by the great progress that deep learning has achieved in feature point description, we convert the curve feature description problem into a mean value and standard deviation problem of point features and propose a curve feature description method based on deep learning. Specifically, we constructed a large-scale curve image dataset labeled with matching information and improved the L2-Net(DSM); the descriptor is obtained by training the network on the self-built dataset. Experimental results show that the resulting D-MSCD descriptor is superior to the traditional handcrafted descriptors under different image transformations, which demonstrates the great potential of deep learning in curve feature description.
In the future, we will study the effects of different network architectures and loss functions on the learning of curve feature descriptors.
JUNWEI LUO received the Ph.D. degree in computer science from Central South University, Changsha, China. He is currently an Associate Professor with Henan Polytechnic University, Jiaozuo, China. His current research interests include machine learning, bioinformatics, and data mining.