Multi-scale information fusion based on convolution kernel pyramid and dilated convolution for Wushu moving object detection

In complex backgrounds, the accuracy of moving object detection is affected by factors such as illumination change, short-term occlusion and background movement. This paper proposes a new multi-scale information fusion model based on a convolution kernel pyramid and dilated convolution (CKP-DC) for Wushu moving object detection. The proposed model fuses feature information in several ways. First, multi-layer feature maps of different sizes are fused by per-pixel addition. Then the feature maps of different stages are concatenated along the channel dimension to form a fused feature layer, rich in both semantic and detail information, which serves as the prediction layer of the model. A convolution kernel pyramid structure is introduced into the anchor frame mechanism to address the multi-scale problem of object detection. Dilated convolution offsets the parameter increase caused by large convolution kernels, and the number of anchor frames is reduced reasonably. Experimental results show that, compared with state-of-the-art methods, the proposed fusion algorithm has a degree of anti-interference ability and high precision for moving object detection in complex environments.


Introduction
Wushu is a traditional Chinese sport that emphasizes both internal and external training, with fighting and defense as its main content and routines and combat as its forms. Moving object detection refers to processing video sequence images to segment moving objects from a complex background. Detection, monitoring and tracking of objects, along with subsequent processing, are widely used in real-time monitoring and other fields.
The background difference method and the frame difference method are the two main approaches to moving object detection [1][2][3]. The background difference method first establishes a stable background model and then obtains the moving object image by subtracting the background model from each frame to be detected. The inter-frame difference method detects the contour of a moving object through the difference of adjacent frames. It adapts well, is simple to compute and has good real-time performance [4,5]. However, gaps appear in the extracted object contour, so some feature points are ignored and detection accuracy drops considerably [6]. In both methods, illumination changes cause strong interference, hinder the acquisition of the background model and affect the accuracy of object detection [7].
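The two classical methods described above can be illustrated with a minimal NumPy sketch. The threshold value, the running-average background update and the function names are illustrative assumptions, not details taken from the cited works.

```python
import numpy as np

def frame_difference(prev_frame, curr_frame, threshold=25):
    """Binary motion mask from two consecutive grayscale frames."""
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

def background_difference(background, frame, threshold=25):
    """Binary motion mask from a background model and the current frame."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold

def update_background(background, frame, rate=0.05):
    """Running-average background model (one common choice, assumed here)."""
    return (1.0 - rate) * background + rate * frame

# Toy 6x6 scene: an empty background and a frame with a bright 2x2 object.
background = np.zeros((6, 6), dtype=np.uint8)
frame = background.copy()
frame[2:4, 2:4] = 200
object_mask = background_difference(background, frame)

# A uniform illumination shift is flagged everywhere, which is the
# sensitivity to lighting changes noted in the text.
brighter = background + 40
illumination_mask = background_difference(background, brighter)
```

In this toy case only the four object pixels are flagged in the first mask, whereas the uniform brightness shift marks every pixel as moving.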
With the improvement of computing capability, especially the maturity of scientific computing on general-purpose graphics processing units (GPGPU), image processing and computer vision have developed rapidly [8,9]. Owing to the order-of-magnitude improvement in computing power, computer vision based on deep neural networks has made great progress in image classification, segmentation and other tasks [10]. As an important computer vision task, object detection has also been improved by deep learning.
SqueezeDet [11] is a fast object detection framework based on deep learning. It learns from data, automatically extracts the required features, and has natural multi-scale and translational invariance, which has advanced the field of object detection. However, in real scenes, images are strongly affected by lighting, color and other factors, and the resulting change in features causes the model to fail. Extracting invariant features is therefore critical [12]. Because of the pooling operations in a convolutional neural network, resolution decreases and information is lost. At the same time, changes in object distance in real scenes lead to large scale changes, which often reduce detection accuracy. To overcome these problems, this paper proposes a new moving object detection algorithm based on a convolution kernel pyramid and dilated convolution. The new algorithm first uses Skip Connection and Shortcut Connection to connect feature maps of different resolutions, so as to recover the lost information and form new feature maps with rich semantics. Figure 1(a) shows the Skip Connection: two feature maps of the same size are concatenated along the channel dimension to form a new feature map. Figure 1(b) shows the Shortcut Connection: two feature maps of the same size and channel dimension are added element by element ("Eltw sum") to obtain a new feature map. Then, a convolution kernel pyramid structure is introduced into the anchor frame mechanism to resolve the mismatch between anchor frames and feature regions, so that multi-scale objects are detected more accurately. Dilated convolution increases the receptive field of the convolution kernel without increasing the number of parameters, and the number of anchor frames is determined from the generated prediction tensors, which reduces time complexity.
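The difference between the two connection types in figure 1 can be made concrete with a small NumPy sketch. The (channels, height, width) layout and the function names are assumptions chosen for illustration.

```python
import numpy as np

def skip_connection(a, b):
    """Concatenate two feature maps of equal spatial size along the channel axis."""
    if a.shape[1:] != b.shape[1:]:
        raise ValueError("spatial sizes must match")
    return np.concatenate([a, b], axis=0)

def shortcut_connection(a, b):
    """Element-wise sum ("Eltw sum") of two feature maps of identical shape."""
    if a.shape != b.shape:
        raise ValueError("shapes must match exactly")
    return a + b

# Two hypothetical feature maps with 4 channels and 8x8 resolution.
a = np.ones((4, 8, 8))
b = 2 * np.ones((4, 8, 8))

concatenated = skip_connection(a, b)    # channels add up: 4 + 4 = 8
summed = shortcut_connection(a, b)      # channel count stays at 4
```

Concatenation keeps both inputs intact at the cost of more channels, while the element-wise sum keeps the channel count fixed and mixes the two inputs.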

Related works
Moving object detection is a popular research field in computer vision, and current research is mainly based on deep learning. A deep-learning object detection network consists of two basic parts: a feature extraction module and an object detection module [13,14].
When a convolutional neural network (CNN) is used to extract image features, deep feature maps carry rich semantic information about objects and are sensitive to category information, but they lack detail and are therefore mostly used for classification. Shallow feature maps, in contrast, carry rich detail and are sensitive to position, translation and rotation, but lack semantic information. Object detection comprises classification and localization. The former classifies candidate regions and requires semantic information [15]; the latter locates candidate regions and requires details such as position. To improve detection performance, feature information from different depths is often fused to support both classification and localization. Depending on whether features are fused, object detection networks fall into two types. Among networks without feature fusion, one kind predicts on a single-layer feature map, such as the two-stage methods Fast R-CNN [16] and Faster R-CNN [17] and the single-stage methods YOLO (You Only Look Once) [18] and YOLOv2 [19]; the other predicts on multiple feature maps, such as SSD (single-shot multi-box detector) [20] and MS-CNN (multi-scale CNN) [21]. Among networks with feature fusion, one kind predicts on a single fused feature map; for example, HyperNet and Inside-Outside Net (ION) fuse features of different levels by concatenation [22]. The other predicts on multiple fused feature maps: the deconvolutional single-shot detector (DSSD) [23] fuses information by pixel-wise multiplication, while YOLOv3 [24], FPN (Feature Pyramid Network) [25] and Mask R-CNN [26] fuse information by pixel-wise addition.
To take advantage of feature maps at different depths, this paper fuses feature information in two stages. First, several convolutional layers are added after the feature extraction network, and feature information is fused layer by layer from deep to shallow by pixel-wise addition, forming feature maps rich in both semantic and detail information. Second, to further strengthen the fusion, feature maps from different stages of the fused maps obtained in the first step are concatenated along the channel dimension, forming feature maps with even richer semantic and detail information.
Existing solutions to multi-scale object detection fall mainly into three categories. The first uses an image pyramid, extracting features from images of different sizes; examples are the Scale Normalization for Image Pyramids (SNIP) algorithm [27] and the face detection algorithm HR (Hybrid Resolution), both of which achieve good results. The disadvantage is high time complexity. To reduce it, a sparse image pyramid can be adopted, that is, only three input image sizes are used.
The second approach addresses the multi-scale problem by applying an anchor frame mechanism to a single-layer feature map. For example, Faster R-CNN uses a Region Proposal Network (RPN) to extract candidate regions on the deepest feature map. To detect objects of different scales, the RPN predicts nine anchor frames of different sizes and ratios at each anchor point of the feature map. R-FCN (Region-based Fully Convolutional Network) [28] and YOLOv2 also use an anchor frame mechanism for prediction.
The two-stage detection algorithms represented by Faster R-CNN first locate candidate frames and then classify them with a classification network and refine their coordinates by regression, which yields more accurate detections. The process has two steps: 1) a Region Proposal Network (RPN) extracts regions of interest (RoIs) from the image; 2) a multi-task classification and regression network performs fine classification and location regression on the RoIs extracted in step 1 [3].
YOLO and SSD are the main one-stage detection algorithms. To speed up processing while retaining a certain accuracy, they pre-set a group of anchors and directly predict the category and location of objects in the image. The basic idea of one-stage detection is to extract regions of interest and classify them into multiple categories at once, which is similar in nature to the RPN. To improve detection accuracy and generalization, SSD performs detection on multi-scale feature maps, fuses and filters the multi-scale detections, and refines the final results.
Compared with two-stage detection algorithms, a one-stage detection algorithm is an end-to-end network in which gradients propagate well, so the network is relatively easy to train. In addition, a one-stage algorithm has no candidate extraction step and is fast, which suits many real-time scenarios such as autonomous driving and video object detection. SqueezeDet is a detection algorithm similar to YOLO. It first uses a model pretrained on ImageNet to extract high-dimensional feature maps from an image. Compared with YOLO, SqueezeDet uses convolutional layers instead of fully connected layers, which greatly reduces the number of network parameters and further improves generalization and speed. However, because it relies on a single-scale feature map, SqueezeDet performs poorly in autonomous driving, video detection and other rapidly changing scenes.
The third approach makes predictions on a feature pyramid. The SSD detection network predicts objects of different scales on different feature maps, forming the prototype of the feature map pyramid. Both DSSD and FPN predict on a feature map pyramid to deal with the multi-scale problem.
All three ideas are effective ways to solve the multi-scale problem. This paper improves the anchor frame mechanism. In the anchor frame mechanism of the RPN, each anchor point on the prediction feature map corresponds to 9 anchor frames of different sizes and ratios. During prediction, a 1×1 convolution kernel predicts the position and confidence (i.e. the probability that the object in the anchor frame belongs to a certain category) of all of these anchor frames. Anchor frames of different sizes at the same anchor point are thus predicted from the same feature area, so the feature area used by the RPN does not match the area of the corresponding anchor frame. This paper therefore introduces a convolution kernel pyramid structure into the anchor frame mechanism: anchor frames of different sizes are assigned convolution kernels of different sizes, while anchor frames of the same size but different ratios share the same kernel size, which alleviates the mismatch. Because large convolution kernels increase the number of parameters, the model uses dilated convolution to design kernels with receptive fields of different sizes and so limit time complexity [29]. Under the action of these kernels, prediction tensors of different resolutions are generated from the fused feature maps, which are rich in semantic and detail information. The model determines the number of anchor frames from the generated prediction tensors: small objects correspond to small anchor frames, of which there are many, and large objects correspond to large anchor frames, of which there are few, so the number of anchor frames is reduced reasonably.
Context information is very important in object detection. For example, if a person is wearing cat ears, a detection algorithm that sees only the cat ears is very likely to make a false detection; connecting the context avoids this. Dilated convolution can effectively collect multi-scale and context information. Figure 2 shows 3×3 dilated convolution. Red marks the center of the convolution kernel and blue the surrounding points. Compared with traditional convolution, dilated convolution has one extra hyperparameter, the rate. In figure 2, (a), (b) and (c) are dilated convolutions with rate = 1, 2 and 3 respectively. Figure 2(a) is exactly the same as traditional convolution, which can be regarded as a special case of dilated convolution. As figure 2 shows, information around the object can be gathered by setting different rates. Inspired by existing methods, this paper combines dilated convolution, skip connection and shortcut connection to exploit context and multi-scale information and to enhance feature expression.
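How the rate enlarges the area seen by a 3×3 kernel can be checked with a short NumPy sketch. It uses the standard relation that a k×k kernel with rate d covers k + (k - 1)(d - 1) pixels per side; the naive loop implementation (no padding, stride 1) and the function names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def effective_kernel_size(k, rate):
    """Side length covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (rate - 1)

def dilated_conv2d(image, kernel, rate=1):
    """Dilated cross-correlation of a 2-D array (no padding, stride 1)."""
    k = kernel.shape[0]
    e = effective_kernel_size(k, rate)
    out_h = image.shape[0] - e + 1
    out_w = image.shape[1] - e + 1
    out = np.zeros((out_h, out_w))
    # Each kernel tap (i, j) reads the input shifted by (i * rate, j * rate).
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * image[i * rate:i * rate + out_h,
                                        j * rate:j * rate + out_w]
    return out

image = np.arange(81, dtype=float).reshape(9, 9)
kernel = np.ones((3, 3))          # always 9 weights, whatever the rate

covered = [effective_kernel_size(3, r) for r in (1, 2, 3)]
outputs = {r: dilated_conv2d(image, kernel, rate=r) for r in (1, 2, 3)}
```

With rate = 1 the function reduces to an ordinary 3×3 convolution, matching the special case in figure 2(a); larger rates widen the covered area while the kernel still holds nine weights.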

Multi-scale information fusion
The fusion process of the CKP-DC model starts at the deepest of the original SSD prediction layers. First, bilinear interpolation increases the resolution of the feature map. The upsampled map is then fused with the feature map of the preceding layer by pixel-wise addition. The layers are upsampled and fused in this way up to the shallowest of the original SSD prediction layers, forming a feature map that contains both detail and semantic information. This process corresponds to the blue feature maps of the feature information fusion module in figure 3 and the fusion paths shown by the lines. In addition, feature maps of different stages are concatenated to further enhance the semantic and detail information of the prediction feature map; this corresponds to the green feature maps of the feature information fusion module in figure 3 and the fusion paths shown by the lines. The feature maps fused in these two ways are then concatenated along the channel dimension into the final prediction feature map. Because this map preserves detail and semantic information better, it benefits the detection of large objects and also improves the model's ability to detect small objects.
When detection is performed only on the last feature map, whose width and height are 1/16 of those of the original image, a 16×16 object in the original image is mapped to a single 1×1 position on the feature map and the information loss is severe, so detecting small moving objects is very difficult. In a convolutional neural network, down-sampling produces many intermediate feature maps of different resolutions, which naturally carry multi-scale information. Generally, the larger, earlier feature maps contain local details of the object, while the smaller, later feature maps contain richer semantics. This paper therefore uses Skip Connection to combine feature maps of different resolutions and performs detection on the larger feature maps. This has two advantages: (1) the new layer contains both rich semantics and local details and makes good use of the multi-scale information in the network; (2) with a larger feature map, more anchors are selected and sampling becomes denser, so object locations are obtained more accurately. The contextual information around an object also plays a very important role in detection, and dilated convolution with different rates can capture object information over different ranges.
Therefore, dilated convolutions of different rates are first combined in parallel to extract information around the object, as shown in figure 4. Dilated convolutions connected in series rapidly increase the receptive field, and an appropriately large receptive field helps object detection, so this paper also considers dilated convolutions in series. To improve parameter utilization, the number of channels is halved at each layer of the series, and the intermediate feature maps are also taken as outputs of the module so that features are reused, as shown in figure 6. When the input and output are 512-channel feature maps, the series and parallel modules save about 1/3 of the parameters. This paper combines dilated convolution with Skip Connection; the network structure of the algorithm is shown in figure 7. Pooling in a convolutional neural network is often accompanied by loss of information, so shortcut connections are also used to connect smaller feature maps with larger feature maps by "Eltw sum" to mitigate the information lost in pooling [30]. To keep the pre-trained model unchanged, the 1×1 convolution in the shortcut connection module is initialized to 0.
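Why a zero-initialized 1×1 convolution leaves the pre-trained model unchanged can be seen in a few lines of NumPy. The sketch assumes the two branches have already been brought to the same spatial size; the shapes, names and the einsum-based 1×1 convolution are illustrative choices, not the paper's implementation.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def shortcut_block(main, side, weight):
    """Add a 1x1-convolved side branch to the main branch element by element."""
    return main + conv1x1(side, weight)

rng = np.random.default_rng(0)
main = rng.standard_normal((8, 5, 5))   # pre-trained features (hypothetical shape)
side = rng.standard_normal((4, 5, 5))   # features brought in by the shortcut

zero_weight = np.zeros((8, 4))
at_init = shortcut_block(main, side, zero_weight)             # equals main
after_update = shortcut_block(main, side, zero_weight + 0.1)  # side branch now contributes
```

At initialization the block is an identity on the main branch, so fine-tuning starts from the behaviour of the pre-trained network and the shortcut only contributes once its weights move away from zero.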

Moving object detection
To detect objects of different scales on feature maps rich in semantic and detail information, a convolution kernel pyramid structure is adopted. Convolution kernels of different sizes are applied to the feature map to generate prediction tensors of different sizes, which carry the classification confidence and location information of the predicted objects. Convolving the feature map with kernels of different sizes corresponds to covering receptive fields of different sizes on the original image, which facilitates the detection of objects of different sizes. Because large convolution kernels bring many parameters and greatly increase computation, dilated convolution is used in the convolution kernel pyramid to enlarge the receptive field without increasing the number of parameters. The convolution kernel pyramid module based on dilated convolution is shown in figure 8. Prediction tensors are generated by the different convolution kernels. Two convolution operations are performed on each set of output tensors to predict the categories and positions of the bounding boxes respectively, as shown by the light red and light blue rectangles in figure 8. The final feature map used for prediction has a resolution of 38×38. Several groups of convolution kernels of different sizes are designed to cover receptive fields of different sizes on the original image as evenly as possible, so as to better predict objects of different scales. The minimum convolution kernel size is k_min = 3 and the maximum convolution kernel size is k_max = 38. n tensors of different sizes are predicted using n convolution kernels of different sizes (n = 6). To keep the kernel sizes evenly distributed between 3 and 38, the size of the m-th convolution kernel is:
The resolution of each group of output prediction tensors should also be uniformly distributed, so that the numbers of anchor frames of different sizes are effective and reasonable. Therefore, the size of the m-th output prediction tensor is:
In the formula, ⌈·⌉ means rounding up. According to the requirements on the kernel size distribution and the resolution of the output prediction tensors, and through experimental verification, a set of convolution kernel sizes is designed, as shown in table 1. r and d are the actual convolution kernel size and the dilation rate, s is the stride, p is the padding, e is the equivalent ordinary kernel size of the dilated convolution, and o is the resolution of the prediction tensor. e satisfies the following:
In the formula, ⌊·⌋ means rounding down. The experimental results show that this design meets the requirement that the convolution kernels uniformly cover different receptive fields on the original image. Compared with structures that use several dilated convolutions in series, the structure designed here uses several different dilated convolutions in parallel, and there is only one dilated convolution structure in the backbone network, so the gridding effect of dilated convolution is not obvious. In addition, the dilation rates used are small and all different, and their greatest common divisor is no greater than 1, which conforms to the main features of the HDC (Hybrid Dilated Convolution) module [31] and further reduces the gridding effect.
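The even spacing of kernel sizes and the parameter saving from dilation can be illustrated as follows. This is only a sketch under stated assumptions: the linear spacing between k_min and k_max is one reading of "evenly distributed", the equivalent-size relation e = r + (r - 1)(d - 1) is the standard one for dilated convolution, and the 3×3, rate-2, 512-channel example is hypothetical and not taken from table 1.

```python
def kernel_sizes(k_min=3, k_max=38, n=6):
    """Evenly spaced kernel sizes from k_min to k_max (assumed linear spacing)."""
    return [k_min + (k_max - k_min) * m / (n - 1) for m in range(n)]

def effective_size(r, d):
    """Standard equivalent size of an r x r kernel with dilation rate d."""
    return r + (r - 1) * (d - 1)

def weights_per_filter(r, channels):
    """Weights in one r x r filter over `channels` input channels (bias ignored)."""
    return r * r * channels

sizes = kernel_sizes()

# Hypothetical example: a 3x3 kernel with rate 2 covers the same 5x5 area
# as an ordinary 5x5 kernel but keeps the weight count of a 3x3 kernel.
dilated_weights = weights_per_filter(3, 512)
dense_weights = weights_per_filter(effective_size(3, 2), 512)
```

Under this reading the six target sizes are 3, 10, 17, 24, 31 and 38, and the dilated kernel in the example uses 4608 weights per filter where an ordinary kernel with the same coverage would use 12800.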

EAI Endorsed Transactions on Scalable Information Systems
Online First Yuhang Li

Anchor box mechanism
The number of anchor frames in the SSD model depends on the resolution of the feature map on which predictions are made. If the resolution of the feature map is N×N, the input image is divided into an N×N grid, and four or six anchor frames of varying ratios are generated for each grid cell. If this scheme were adopted here, the same number of anchor frames would be generated for every convolution kernel, giving 46208 (38×38×(2×4+4×6)) anchor frames for a 300×300 input. YOLOv2 shows that average detection accuracy decreases when there are too many anchor frames [32]. This paper therefore proposes to reduce the number of anchor frames according to the resolution of the generated prediction tensors. The convolution kernel pyramid and the anchor frame mechanism are combined to deal with the multi-scale problem of object detection. A convolution kernel of a given size corresponds to several anchor frames of the same size but different ratios, and convolution kernels of different sizes correspond to anchor frames of different sizes. Specifically, under the action of convolution kernels of different sizes, the same feature map generates prediction tensors of different resolutions, and the grid into which the original image is divided matches the resolution of each prediction tensor. In this way, small objects correspond to small anchor frames, of which there are many, and large objects correspond to large anchor frames, of which there are few. This both handles objects of different sizes and reduces the number of anchor frames reasonably. With this mechanism, a 300×300 input yields 8576 anchor frames (38×38×4+18×18×6+10×10×6+6×6×6+3×3×6+1×1×4), which is close to the 8732 anchor frames in SSD and is therefore reasonable. This paper determines the size of anchor frames from the convolution kernel size and from experiments. When the convolution kernel size is k and the ratio of an anchor frame is r, the width W and height H of the anchor frame are respectively:
where α is a hyperparameter set according to the actual situation. When the ratio is 1, one more case is added (7). Therefore, the anchor frame size differs for convolution kernels of different sizes, while the anchor frames of a convolution kernel of a given size have the same size but different ratios.
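The anchor bookkeeping above can be sketched in Python. The counting function reproduces the dense 46208 figure and the standard SSD300 total of 8732 (feature maps of 38, 19, 10, 5, 3 and 1 cells per side with 4, 6, 6, 6, 4 and 4 anchors per cell). The width and height rule, W = αk√r and H = αk/√r, is an assumption borrowed from the SSD convention, since the paper's own expression is not reproduced in the text; all names are illustrative.

```python
import math

def count_anchors(levels):
    """Total anchors for a list of (grid resolution, anchors per cell) pairs."""
    return sum(res * res * per_cell for res, per_cell in levels)

def anchor_size(k, ratio, alpha=1.0):
    """Anchor width and height for kernel size k (assumed SSD-style rule)."""
    scale = alpha * k
    return scale * math.sqrt(ratio), scale / math.sqrt(ratio)

# Every kernel predicting on the full 38x38 grid: 38*38*(2*4 + 4*6).
dense_total = count_anchors([(38, 4)] * 2 + [(38, 6)] * 4)

# Standard SSD300 configuration, for comparison with the figure quoted above.
ssd300_total = count_anchors([(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)])

square = anchor_size(k=10, ratio=1)   # same width and height
wide = anchor_size(k=10, ratio=4)     # same area, four times wider than tall
```

Under the assumed rule the anchor area depends only on αk, so anchors that share a kernel size keep the same area and differ only in shape, in line with the description above.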

Model training
The CKP-DC model initializes its parameters from SSD as the pre-trained model. As in SSD, data augmentation is used to improve detection accuracy and robustness. When matching anchor frames to ground-truth labels, each ground-truth label is matched to every anchor frame whose IoU (Intersection over Union) with it is greater than 0.5. When a ground-truth label has no such match, it is matched to the anchor frame with which it has the largest IoU. Among the unmatched anchor frames, the top-ranked ones by predicted confidence are selected as negative samples, so that the ratio of negative to positive samples is 3:1. The loss function is the same as in the SSD model: the sum of the smooth L1 localization loss and the Softmax classification loss.
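The matching rule and the 3:1 negative sampling can be sketched with NumPy. Boxes are assumed to be (x1, y1, x2, y2) corner coordinates, and the per-anchor ranking score (higher means a harder negative) is a stand-in for whatever confidence-based quantity the training code uses; both are assumptions for illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    box = np.asarray(box, dtype=float)
    boxes = np.asarray(boxes, dtype=float)
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def match_anchors(gt_boxes, anchors, threshold=0.5):
    """Index of the matched ground-truth box per anchor, or -1 if unmatched."""
    anchors = np.asarray(anchors, dtype=float)
    labels = -np.ones(len(anchors), dtype=int)
    for g, gt in enumerate(gt_boxes):
        overlaps = iou(gt, anchors)
        labels[overlaps > threshold] = g
        labels[np.argmax(overlaps)] = g   # every ground truth gets at least one anchor
    return labels

def select_hard_negatives(scores, labels, ratio=3):
    """Keep the highest-scoring unmatched anchors, at most ratio x positives."""
    negatives = np.flatnonzero(labels < 0)
    n_pos = int((labels >= 0).sum())
    ranked = negatives[np.argsort(-scores[negatives])]
    return ranked[:ratio * n_pos]

anchors = [[0, 0, 10, 10], [20, 20, 30, 30], [40, 40, 50, 50],
           [60, 60, 70, 70], [80, 80, 90, 90], [100, 100, 110, 110]]
labels = match_anchors([[1, 0, 10, 10]], anchors)
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7])
negatives = select_hard_negatives(scores, labels)
```

In the example one anchor is positive, so three of the five unmatched anchors are kept as negatives, chosen by descending score.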

The proposed method is tested and evaluated on PASCAL VOC [33], the remote sensing dataset UCAS-AOD [34] and real martial arts motion data. The code is implemented in the Caffe deep learning framework, using some components of the SSD and DSSD Caffe open source libraries. All experiments are performed on an HP workstation equipped with a Titan X GPU. SSD is used as the pre-trained model of the proposed method, and the model is fine-tuned on PASCAL VOC, the UCAS-AOD remote sensing dataset and the real martial arts motion data. mAP (mean Average Precision) is used to evaluate the performance of the new method. The proposed method is compared with other advanced deep learning object detection methods in terms of mAP and detection speed. Precision and Recall are defined as:
\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \]
where TP is the number of positive samples judged to be positive, FN the number of positive samples judged to be negative, FP the number of negative samples judged to be positive, and TN the number of negative samples judged to be negative. AP is defined as the average of the maximum precision at 11 recall levels [0, 0.1, 0.2, ..., 1]:
\[ AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} P_{\max}(r) \qquad (10) \]
where P_max(r) is the maximum precision when the recall is r. AP is the evaluation index of a single category, and mAP is the mean AP over multiple object categories.
In the first 7×10^4 iterations, the learning rate is 0.001; in the next 3×10^4 iterations, it is 0.0001. The entire network is then fine-tuned at a learning rate of 0.001 for 2×10^4 iterations and at 0.0001 for a further 2×10^4 iterations. Table 2 compares the number of model parameters with and without dilated convolution. Both base networks are the same, and the parameters of the base network are not counted. The number of parameters is significantly reduced by dilated convolution.
Compared with the two-stage methods, CKP-DC is 9.3% higher than Fast R-CNN, 6.1% higher than Faster R-CNN with its anchor frame mechanism, 3.7% higher than ION, which fuses feature information by channel concatenation, and 1.1% higher than MRCNN. Compared with the single-stage methods, CKP-DC is 1.8% higher than the SSD model and 0.9% higher than the DSSD model with the same input image and base network. CKP-DC also outperforms some improved variants of SSD and DSSD: it is 0.7% and 0.4% higher than MDSSD300 and feature-fused SSD respectively, 0.6% higher than FSSD300 and RSSD300 on average, and 1.6% higher than DSOD300, indicating that the proposed detection model performs better. The best values are shown in bold.
Table 4 shows the detection results of different methods based on ResNet-101 on the PASCAL VOC2007 test set. The results of the compared algorithms are taken from the original literature. CKP-DC is 2.9% higher than Faster R-CNN, 2.2% than SSD, and 0.7% than DSSD. The results show that increasing the depth of the model helps to improve detection accuracy, but accuracy does not improve further once the number of layers reaches a certain level. CKP-DC has a slightly lower accuracy than R-FCN and DSSD. However, according to the comparison of test times in figure 9, R-FCN runs at 9 frames/s and DSSD at 5.5 frames/s, far below the 21 frames/s of the proposed model.
When all data are converted to the LMDB format that Caffe can read, the images and bounding box labels are scaled so that the model's input image size is 300×300 pixels. The SSD, DSSD and CKP-DC models are initialized with parameters trained on the VOC datasets and fine-tuned on the remote sensing training set. The first 6×10^4 iterations are trained with a learning rate of 0.001, and the next 2×10^4 iterations with a learning rate of 0.0001. The object detection results of each approach are shown in table 5. The average precision (AP) of CKP-DC is 2.8% and 1.9% higher than that of SSD and DSSD, respectively.
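The evaluation metrics used in these experiments (precision, recall, 11-point AP and mAP) can be computed as in the sketch below. It assumes the PASCAL VOC 2007 convention in which the maximum precision at level r is taken over all operating points with recall at least r; the function names and the toy precision-recall points are illustrative.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    return tp / (tp + fp), tp / (tp + fn)

def ap_11_point(recall, precision):
    """11-point interpolated AP over recall levels 0, 0.1, ..., 1."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    total = 0.0
    for i in range(11):
        level = i / 10
        reachable = recall >= level
        # Maximum precision among points whose recall reaches this level.
        total += precision[reachable].max() if reachable.any() else 0.0
    return total / 11

def mean_ap(ap_values):
    """mAP as the mean of the per-category AP values."""
    return float(np.mean(ap_values))

p, r = precision_recall(tp=8, fp=2, fn=2)

# Toy curve with two operating points: (recall 0.5, precision 1.0)
# and (recall 1.0, precision 0.5).
ap = ap_11_point([0.5, 1.0], [1.0, 0.5])
```

For the toy curve, the six recall levels up to 0.5 contribute a precision of 1.0 and the remaining five contribute 0.5, so the AP is 8.5/11.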

Experiments on Wushu data sets
We select the jab, push, brace, lift and elbow actions in the Wushu dataset for the experiment [36]. The results are shown in table 6. The detection time of the proposed method is also the shortest, which shows that the proposed method is more effective.

Conclusion
This paper proposes a convolution kernel pyramid and dilated convolution model for Wushu object detection. First, feature information is fused by pixel-wise addition and channel concatenation to form a feature map rich in semantic and detail information, which serves as the prediction feature map and provides rich information for predicting the categories and positions of bounding boxes. Then a convolution kernel pyramid structure is introduced into the anchor box mechanism, overcoming the mismatch between the anchor box area and the corresponding features. Shallow convolutions are also added to detect multi-scale objects more accurately. Owing to the effective information fusion and the convolution kernel pyramid in the anchor frame mechanism, the model achieves faster detection and higher accuracy than current advanced methods, and in particular handles small and overlapping objects better. Future work will focus on practical applications, where the method may bring economic benefits.

Figure 8. Filter pyramid based on dilated convolution

Experiments on PASCAL VOC 2007
DSSD uses the ResNet-101 network as its base network to extract features. To facilitate comparison, a VGG-16 version of the DSSD model is built on the VGG-16 network in the experiment and trained according to the training strategy of the original DSSD paper. For a fair comparison with other advanced algorithms, the CKP-DC model is trained on the joint PASCAL VOC2007 and PASCAL VOC2012 training set, and the results are evaluated on the PASCAL VOC2007 test set. During training, the weights of the original SSD model are fixed first, and only the additional network parts are trained.

Figure 9. Comparison of accuracy and speed on PASCAL VOC2007 test set

Figure 10 shows some detection results of SSD and CKP-DC on the PASCAL VOC2007 test set. Rows 1 and 3 are the results of the SSD model, and rows 2 and 4 are the results of the CKP-DC model. Figure 11 shows some detection results of the two models on the UCAS-AOD remote sensing dataset; figure 11(a) and (b) are the SSD and CKP-DC results respectively. Only detections with a confidence higher than 0.8 are displayed. CKP-DC detects overlapping objects and small objects better than SSD.

Figure 10. Results of SSD and CKP-DC on PASCAL VOC2007 test set

Table 1. Designed dilated convolution filter mechanism

Table 2. Number of model parameters comparison

Table 3. Results of CKP-DC and other advanced algorithms on PASCAL VOC2007 test set

Table 4. Results of other advanced methods of ResNet-101 on PASCAL VOC2007 test set

Table 5. Results of CKP-DC and other methods on remote sensing dataset

Table 6. AP results (%) of CKP-DC and other methods on Wushu dataset