Object detection using voting spaces trained by few samples

Abstract. A method to detect generic objects by training with a few image samples is proposed. A new feature, locally adaptive steering (LAS), is introduced to represent local principal gradient orientation information. A voting space is then constructed in terms of cells that represent query image coordinates and ranges of feature values at the corresponding pixel positions. Cell sizes are trained in the voting spaces to estimate the tolerance of object appearance at each pixel location. Two detection steps then locate instances of the object class in a given target image. In the first step, object patches are recognized by dense voting in the voting spaces. A refined hypotheses step then accurately locates multiple instances of the object class. The novelty of our approach is training the voting spaces from only a few samples of the object, which makes it more efficient than traditional template matching approaches. Our experiments confirm that the proposed method outperforms state-of-the-art approaches in both efficiency and effectiveness.


Introduction
Object detection from a target image has attracted increasing research attention because of a wide range of emerging applications, such as those on smart mobile phones.3-10 An object in a sample image is a composite of many visual patches or parts recognized by sparse feature analysis methods. In the detection process, sparse features are extracted from a testing image and a trained classifier is used to locate objects in it. Unfortunately, in most real applications there are insufficient training samples for robust object detection. Often we have only a few samples of the object of interest, as in passport control at airports, image retrieval from the Web, and object detection from videos or images without preprocessed indexes. In these cases, template matching based on a small number of samples has often been used.
Designing a robust template matching method remains a significant challenge.11 Most template matching approaches use a query image to locate instances of the object by densely sampling local features. Shechtman and Irani provided a single-sample method12 that uses a template image to find instances of the template in a target image (or a video). The similarity between the template and a target patch is computed by a local self-similarity descriptor. In Refs. 13 and 14, one sample, representing human behavior or action, is used to query videos. Based on this training-free idea, Seo and Milanfar proposed the locally adaptive regression kernels (LARK) feature as the descriptor to match the object in a target image using only one template.15 The LARK feature, which is constructed from local kernels, is robust and stable, but it suffers from overfitting and low computational efficiency. In Ref. 16, the authors constructed an inverted location index (ILI) strategy to detect instances of an object class in a target image or video. The ILI structure stores the feature locations of one sample and indexes feature values by location to locate the object in the target image, but it processes only one sample. To improve efficiency and accuracy with a small number of training samples, these methods have to run several times, once on each sample.
Unlike dense features such as LARK, key-point sampled local features, such as the scale invariant feature transform (SIFT)17 and speeded up robust features (SURF),18 perform well when thousands of samples are available to train classifiers. These key-point features live in a high-dimensional feature space, and with thousands of samples they yield good performance for classifiers such as the support vector machine (SVM).3,8 Previous works15,19-21 pointed out that densely sampled local features give better results in classification tasks than key-point sampled local features like SIFT17 and SURF.18 Recently, some interesting research based on few samples has emerged. Pishchulin et al. proposed a person detection model trained from a few samples.22 Their work employs a rendering-based reshaping method to generate thousands of synthetic training samples from only a few persons and views. However, the samples are not well organized, and their method is not applicable to generic object detection. In Ref. 23, a new object detection model named the fan shape model (FSM) is proposed. FSM uses a few samples very efficiently: some of the samples are used to train the tolerance of the object shape and one sample serves as the template. However, FSM is not scalable in the number of samples and is limited to contour matching.
In this paper, we propose a novel approach for generic object detection based on few samples. First, a new feature at each pixel, called the locally adaptive steering (LAS) feature, is designed for a majority voting strategy. The LAS feature at a pixel describes the local gradient information in its neighborhood and consists of the dominant orientation energy, the orthogonal dominant orientation energy, and the dominant orientation angle. Then, for each component of this feature, a cell is constructed at each pixel of the template image whose length is the range of the component's value. The cells of all pixels form a voting space; since the feature has three components, three voting spaces are constructed. We use a few samples to train these voting spaces, which represent the tolerance of appearance of an object class at each pixel location.
Our idea of using the LAS feature is motivated by earlier work on adaptive kernel regression24 and on the LARK feature.15 In Ref. 24, to perform denoising, interpolation, and deblurring efficiently, localized nonlinear filters are derived that adapt themselves to the underlying local structure of the image. The LARK feature describes local structure very well: after densely extracting LARK features from template and target images, matrix cosine similarity measures the similarity between the query image and a patch from the target image. This method is resilient to noise and distortions, but the computation of LARK features is time-consuming with heavy memory usage. Our LAS feature simplifies the computation of LARK and saves memory, while still precisely capturing the local structure of a specific object class.
Our voting strategy is inspired by the Hough transform. Many works25-30 model spatial information at locations of local features or parts relative to the object center by Hough voting. The Hough paradigm starts with feature extraction, and each feature casts votes for possible object positions.4 There are two differences from Hough voting. First, each cell size of our voting spaces is trained from samples, whereas the cell sizes of a Hough voting space are fixed; our trained spaces can therefore tolerate more deformation. Second, our voting strategy is based on template matching, whereas previous Hough voting is based on a codebook trained with thousands of samples. This paper is structured as follows. Section 2 gives an overview of our method. In Sec. 3, we propose the LAS feature. Section 4 describes the concept of a voting space and the training process. Section 5 introduces the procedure of object matching in voting spaces. Section 6 gives the experimental results and compares our method with state-of-the-art methods. This paper extends an early conference paper31 with improved algorithms and comprehensive experimental results.
Overview of Our Approach

An overview of our method is shown in Fig. 1. In the training process, for each dimension of the LAS feature, a voting space is constructed from the image coordinates (x, y) and the corresponding value ranges of that feature component. Each cell in the voting space is formed by a pixel position and the value range of the feature component there, trained from several samples. Since the voting space is three dimensional (3-D), for simplicity we use a 3-D box to represent a cell: the longer the box, the larger the cell length, and different box sizes mean different tolerances of object appearance changes at the corresponding pixels. In Fig. 1, the LAS feature (L, S, θ) is specially designed for object matching in voting spaces, where L, S, and θ represent the dominant orientation energy, the orthogonal dominant orientation energy, and the dominant orientation angle, respectively, in the neighborhood of each location. Thanks to the merit of voting, only a few samples (2 to 10 in this paper) are enough to train the cells.
In the detection process, one sample image is randomly chosen as the query Q, and instances of the object class are located in a target image T, which is always larger than Q. First, patches T_i extracted from T by a sliding window are tested by dense voting in the trained voting spaces. Each component of the LAS feature of T_i and Q is voted in its voting space to obtain a similarity value. If the similarity values of all LAS components are larger than the corresponding thresholds, then T_i is a similar patch of Q. A refined hypotheses step then accurately locates the multiple instances by computing the histogram distance of the feature θ. This refined step processes only the similar patches obtained in the voting step: if the histogram distance of θ between a similar patch and Q is small enough, the similar patch is an object instance.

Locally Adaptive Steering Feature
The basic idea of our LAS is to obtain the locally dominant gradient orientation of an image. For a gray image, we compute the gradient directly. If the image is RGB, the locally dominant gradient orientation is almost the same on each channel; to avoid the cost of converting RGB to gray, we simply use the first channel. The dominant orientation of the local gradient field is the singular vector corresponding to the smallest singular value of the local gradient matrix.24,32 (Interested readers can review the proof of transformation invariance of the singular value decomposition (SVD) in Ref. 24.) For each pixel (i, j), one can form the local gradient field shown in Fig. 2(a): a patch in the gradient map around the pixel (i, j), here of size 3 × 3 pixels. The dominant orientation is the principal gradient orientation in this field. To estimate it, we compute the horizontal and vertical gradients of the image and concatenate them column-wise over the window:

GF(i, j) = [g_x(i-1, j-1), g_y(i-1, j-1); g_x(i-1, j), g_y(i-1, j); ...; g_x(i+1, j+1), g_y(i+1, j+1)], (1)

where g_x(i, j) and g_y(i, j) are, respectively, the gradients in the x and y directions at the pixel (i, j). The principal direction is computed by the SVD GF(i, j) = U(i, j) Λ(i, j) V^T(i, j), where Λ(i, j) = diag(λ1, λ2) is a diagonal 2 × 2 matrix. The eigenvalues λ1 and λ2 represent the gradient energies in the principal and minor directions, respectively. Our LAS feature at each pixel is denoted (L(i, j), S(i, j), θ(i, j)). We define the measure L(i, j) in terms of λ1 to describe the dominant orientation energy, where ξ0 is a tunable threshold that eliminates the effect of noise and q0 is a second tunable parameter. The measure S(i, j) describes the orthogonal direction energy with respect to L(i, j).
The measure θ(i, j) is the rotation angle associated with L(i, j), representing the dominant orientation angle,

θ(i, j) = arctan(v1 / v2),

where [v1, v2]^T is the second column of V(i, j).
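As a concrete sketch, the per-pixel LAS computation can be written as below. The gradient operator, the exact forms of L and S (an energy-ratio form is used here), and the threshold value are our assumptions; the paper's Eqs. (3) to (5) may define them differently.

```python
import numpy as np

def las_feature(image, i, j, xi0=1e-3):
    """Sketch of the LAS feature (L, S, theta) at pixel (i, j).

    Assumptions: central-difference gradients, a 3x3 local gradient
    field, and xi0 as a small regularizer against noise.
    """
    img = image.astype(float)
    gy, gx = np.gradient(img)               # per-pixel gradients
    # Stack [g_x, g_y] for every pixel in the 3x3 window -> 9x2 matrix GF
    win_x = gx[i - 1:i + 2, j - 1:j + 2].ravel()
    win_y = gy[i - 1:i + 2, j - 1:j + 2].ravel()
    GF = np.column_stack([win_x, win_y])
    # SVD: singular values give the gradient energies along the
    # principal and minor directions; V holds the orientations.
    U, s, Vt = np.linalg.svd(GF, full_matrices=False)
    lam1, lam2 = s[0], s[1]
    L = lam1 / (lam2 + xi0)                 # dominant orientation energy (assumed form)
    S = lam2 / (lam1 + xi0)                 # orthogonal direction energy (assumed form)
    v1, v2 = Vt[1, 0], Vt[1, 1]             # second column of V = second row of Vt
    theta = np.arctan2(v1, v2)              # dominant orientation angle
    return L, S, theta
```

On a pure horizontal ramp image, for example, all gradient energy lies in the x direction, so L is large and S is near zero.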
The LAS feature describes the local gradient distribution (see Fig. 2). Some studies15,19-21 have already pointed out that densely sampled local features give better results in classification tasks than key-point sampled local features, such as SIFT17 and SURF.18 These key-point sampled features live in a high-dimensional feature space in which no dense clusters exist.15 Compared to the histogram of gradients (HOG) feature,27 our LAS feature has a smaller memory footprint: each location of the HOG feature is a 32-dimensional histogram, while our LAS feature has just three dimensions. In Ref. 33, the authors also proposed a dominant orientation feature, but that dominant orientation is a set of representative bins of the HOG.33 Our dominant gradient orientation is computed by the SVD of the local gradient values, which carries more local shape information. Compared to the LARK feature,15 our LAS feature uses 27 times less memory, since the LARK feature has 81 dimensions at each pixel location. In Ref. 24, the authors mentioned these three parameters, but no one has used them as features. Next, we train the voting spaces based on this LAS feature to obtain three voting spaces. Why can LAS deal well with only a few image samples? Because our LAS feature contains more local gradient information than other dense features like LARK15 and HOG.27 The three components L(i, j), S(i, j), and θ(i, j) represent the dominant orientation energy, the orthogonal direction energy of the dominant orientation, and the dominant orientation angle, respectively. LARK15 carries only gradient energy information, which cannot reflect the energy variations; HOG27 carries only region gradient intensities in different gradient orientations, which cannot reflect the dominant orientation energy and angle.

Training Voting Spaces
The template image coordinates and the value ranges of each LAS feature component at the corresponding locations form the voting spaces (denoted ΔL, ΔS, and ΔΘ for the three components). To match the template image and a patch from the target (testing) image accurately, the cell lengths are trained to reflect the tolerance of appearances at each pixel location. Several samples (2 to 10 in this paper) are enough to train the cells in each voting space.
Assume the query samples are Q = {Q_1, Q_2, ..., Q_n}, with n the cardinality of Q. We use Eqs. (3) to (5) to compute the LAS feature matrices of the n (n ≥ 2) samples and obtain L = {L_1, L_2, ..., L_n}, S = {S_1, S_2, ..., S_n}, and Θ = {θ_1, θ_2, ..., θ_n}. We want to obtain the tolerance at each location from the matrices L_i, S_i, and θ_i (i = 1, 2, ..., n). Because the LAS feature is built from local gradients, each value in L_i, S_i, and θ_i reflects the local edge orientation. To reflect the variation range of the samples at each location (j, k), we define the cell sizes as the per-pixel ranges

ΔL(j, k) = max_i L_i(j, k) - min_i L_i(j, k),

and ΔS(j, k) and Δθ(j, k) analogously.

Fig. 3 Robustness of the LAS feature in different conditions. The sigma of the Gaussian noise is 4.

In our object matching scheme, there are two steps to search for similar patches in T. The first step is voting in the trained voting spaces (see Fig. 5). To combine the votes from the three voting spaces, we use a joint voting strategy to detect similar patches from T. After the first step, one gets a set of similar patches T' = {T'_1, T'_2, ..., T'_{N_{T'}}} (T' ⊂ T, with N_{T'} the cardinality of T'). A refined hypotheses step follows, in which the histogram distance of the LAS feature between Q and T'_i (i = 1, 2, ..., N_{T'}) measures the integral similarity, which precisely locates the instances of the query Q.
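The cell-size training above can be sketched in a few lines. Taking the per-pixel max-minus-min range is our reading of "variation range"; the paper's exact formula may differ, and the function name is hypothetical.

```python
import numpy as np

def train_cells(feature_maps):
    """Per-pixel cell size as the value range across n training samples.

    feature_maps: a list of n equally sized 2-D arrays, e.g. the maps
    L_1..L_n of one LAS component computed on the aligned samples.
    """
    stack = np.stack(feature_maps)               # shape (n, H, W)
    return stack.max(axis=0) - stack.min(axis=0) # tolerance at each pixel
```

A larger cell at a pixel means that the samples disagree more there, so more appearance variation is tolerated during voting.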

Voting in Trained Spaces
We associate each patch T_i in T with the hypothesis H_i that T_i is an instance of the query. Because the LAS feature has three components, we have three estimated conditional densities p(H_i | L_Q), p(H_i | S_Q), and p(H_i | θ_Q), defined as the results of voting. Specifically,

p(H_i | L_Q) = K(L_Q, L_{T_i}) / ||Q||,

and similarly for S and θ [Eqs. (9) to (11)], where ||Q|| is the number of pixels of the image Q, and K(., .) is a map R^2 × R^2 → Z^+ that counts the votes in the corresponding space.
To compute the function K(., .), we define three difference matrices Δ_L (∈ ΔL), Δ_S (∈ ΔS), and Δ_θ (∈ ΔΘ), e.g., Δ_L = |L_Q - L_{T_i}|, where |.| takes the absolute value of each element of the matrix. In our framework, K(L_Q, L_{T_i}) counts the locations (j, k) at which sgn(ΔL(j, k) - Δ_L(j, k)) = 1, and similarly for S and θ [Eqs. (12) to (14)], where ΔL(j, k), ΔS(j, k), and Δθ(j, k) are the cell matrices trained in the previous section. For the component L of the LAS feature, if ΔL(j, k) ≥ Δ_L(j, k), then sgn(ΔL(j, k) - Δ_L(j, k)) = 1 at the pixel location (j, k), which adds one vote to K(L_Q, L_{T_i}). From Eqs. (12) to (14), the estimated conditional densities in Eqs. (9) to (11) represent, for each LAS component, the ratio of votes to the size of the query image Q. The estimated conditional densities p(H_i | L_Q), p(H_i | S_Q), and p(H_i | θ_Q) between Q and each element of T are computed after voting. How, then, can we discriminate the similar patches from these densities?
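Under our reading of sgn(.) as an in-tolerance test, the vote count K and the resulting density for one LAS component can be sketched as follows; the function name is hypothetical.

```python
import numpy as np

def vote_density(F_Q, F_T, cells):
    """Ratio of pixels whose feature difference falls inside the trained cell.

    F_Q, F_T: one LAS component (e.g. L) of the query and of a target
    patch of the same size; cells: the trained cell matrix for that
    component. A vote is cast wherever |F_Q - F_T| <= cell size.
    """
    diff = np.abs(F_Q - F_T)                 # Delta_L in the paper's notation
    votes = np.count_nonzero(cells >= diff)  # one vote per in-tolerance pixel, K(., .)
    return votes / F_Q.size                  # p(H_i | F_Q), a ratio in [0, 1]
```

The density is 1.0 when every pixel of the patch stays within its trained tolerance, and 0.0 when none does.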
Our answer is to use the samples to train density thresholds between Q and the set Q. In Ref. 15, the authors use a tunable threshold to detect possible objects in the target image and a nonmaxima suppression strategy to locate the objects in a similarity potential map. In our scenario, several samples are sufficient to obtain the thresholds, written τ_L, τ_S, and τ_θ, off-line. These thresholds must satisfy two properties: they reflect the tolerance of the cells, and they change when the query image changes. The thresholds are computed as

τ_L = min_{Q_i ∈ Q} p(H_{Q_i} | L_Q),

and τ_S and τ_θ analogously [Eqs. (15) to (17)], where Q_i ∈ Q. In the previous section, we showed that the voting spaces are trained by the sample set Q, so the tolerance of the cells is reflected in τ_L, τ_S, and τ_θ. When the query image changes, Eqs. (15) to (17) show that τ_L, τ_S, and τ_θ change as well. It is worth noting that the min function is just one of the alternative functions in Eqs. (15) to (17).
One could instead choose the mean, the median, or even the max function. We choose the min function because our experimental samples have no rotation, strong noise, or brightness change; handling more complex cases with other functions needs further research. In our experiments, we use Eqs. (15) to (17) to compute the thresholds.
Next, we use the estimated conditional densities and trained thresholds to obtain the similar patches. Since the LAS feature has three components, we combine them to detect the similar patches T'_j. We define maps F_k(Q, H_i): (0, 1) → {0, 1} (k = 1, 2, 3), where F_k = 1 if the k'th density exceeds its threshold and 0 otherwise, and the combination F(Q, H_i) = F_1 F_2 F_3. In Fig. 4, we show the voting results between Q and the elements of T; the densities in the red bounding box are all larger than the thresholds, so T_j is the patch similar to Q. We compute the combination function F for all T_i (i = 1, 2, ..., N_T) and put the patches whose function value equals 1 into the set T'. Figure 5 gives a graphical illustration of the detection process.
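A minimal sketch of the threshold training and the joint combination map F follows. The min choice mirrors Eqs. (15) to (17); the inline density and the function names are our assumptions.

```python
import numpy as np

def density(F_Q, F_T, cells):
    # Fraction of pixels where |F_Q - F_T| stays within the trained cell.
    return np.count_nonzero(cells >= np.abs(F_Q - F_T)) / F_Q.size

def train_threshold(F_Q, sample_maps, cells):
    # tau = min over training samples of the density against the query,
    # following the min choice of Eqs. (15) to (17).
    return min(density(F_Q, F_s, cells) for F_s in sample_maps)

def is_similar(densities, thresholds):
    # Joint voting: a patch passes only if every LAS component's
    # density clears its trained threshold (F = F_1 * F_2 * F_3).
    return all(d > t for d, t in zip(densities, thresholds))
```

Because each τ is derived from the same samples that trained the cells, a patch must be at least as close to Q as the worst training sample to pass.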

Refined Hypotheses Step
After the density voting step, we obtain the similar patch set T'. The refined step processes only this set T'. The first step, however, is just a local voting method at each pixel location, which is not enough to describe the integral information of the object. The construction of the LAS feature shows that θ is related to the orientation of the local edge, as mentioned in Ref. 24. To use the contour information sufficiently, we compute the histogram distance between θ_Q and θ_{T'_i}. After quantizing θ_Q and θ_{T'_i} into bins of 10 deg, one calculates the histograms h_Q and h_{T'_i}, respectively. The distance between them is defined as

χ²(h_Q, h_{T'_i}) = Σ_{m=1}^{M} [h_Q(m) - h_{T'_i}(m)]² / [h_Q(m) + h_{T'_i}(m)],

where M is the number of bins of the histogram. We also use a few samples to train the threshold of the histogram distance, written τ_h. The more similar two histograms are, the smaller χ² is, so we use the max function to compute τ_h over the samples. If χ²(h_Q, h_{T'_i}) ≤ τ_h, then T'_i is an instance of the query Q. Using the χ² distance is efficient [see Fig. 5(c)] because the histogram distance between Q and T'_i reflects their integral difference.
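The refined hypotheses test can be sketched as below, using the 10-deg quantization from the text. The symmetric χ² form and the angle range are our assumptions; the paper's exact normalization may differ.

```python
import numpy as np

def chi2_theta_distance(theta_Q, theta_T, bin_deg=10):
    """Chi-square distance between 10-degree orientation histograms.

    theta_Q, theta_T: dominant-angle values in degrees (here assumed
    to lie in [0, 180)).
    """
    bins = np.arange(0, 180 + bin_deg, bin_deg)  # 18 bins of 10 deg
    h_q, _ = np.histogram(theta_Q, bins=bins)
    h_t, _ = np.histogram(theta_T, bins=bins)
    h_q = h_q.astype(float); h_t = h_t.astype(float)
    denom = h_q + h_t
    mask = denom > 0                             # skip empty bins
    return np.sum((h_q[mask] - h_t[mask]) ** 2 / denom[mask])
```

Identical orientation maps give a distance of zero; the distance grows as the contour statistics of the patch diverge from those of the query.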
In fact, besides the histogram distance of θ, we could also use the histogram distances of L and S. In experiments, however, we find that they do not enhance the precision of the detection result, and θ works better than L and S because it more precisely describes the contour information of an object.
Previous works3,34-38 have already shown that the histogram is a popular representation for feature description, because it encodes the distribution of spatially unordered image measurements in a region.36 The χ² distance is used to compare two histograms in Ref. 3, so we use this quadratic-χ measurement to discriminate histogram distances.

Experimental Results
The experiments consist of three parts: car detection, face detection, and generic object detection. To handle object variations in scale and rotation in the target image, we use the strategies of Ref. 15, constructing a multiscale pyramid of the target image and generating rotated templates (from Q) in 30-deg steps. Receiver operating characteristic (ROC) curves describe the performance of the object detection methods. Following Ref. 15, Recall and Precision are computed as

Recall = TP / nP,  Precision = TP / (TP + FP),

where TP is the number of true positives, FP the number of false positives, and nP the total number of positives in the test data set, so that 1 - Precision = FP / (TP + FP). For each data set, we present Recall versus 1 - Precision curves.
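The curve coordinates above reduce to two ratios per operating point, as this small helper (the name is ours) shows:

```python
def detection_metrics(tp, fp, n_pos):
    """Recall and 1-Precision as plotted in the ROC-style curves:
    Recall = TP / nP,  1 - Precision = FP / (TP + FP)."""
    recall = tp / n_pos
    one_minus_precision = fp / (tp + fp) if (tp + fp) else 0.0
    return recall, one_minus_precision
```

For example, 8 true positives and 2 false positives against 10 ground-truth objects give a recall of 0.8 at 1 - Precision of 0.2.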

Car Detection
We now show the performance of our method on the University of Illinois at Urbana-Champaign (UIUC) car data set,39 which contains learning and test sets. To demonstrate the improvement of our method, we compare it to several state-of-the-art works15,39-41 (see Fig. 7). Seo and Milanfar15 proposed the LARK feature, which detects instances of the same object class and achieves the best detection accuracy among template matching methods; we refer to this method as LARK. In Ref. 39, the authors used a sparse, part-based representation and an automatic learning method to detect instances of the object class. Wu et al.41 presented a method based on per-pixel figure-ground assignment around a neighborhood of the edgelet on the feature response; it needs to learn an ensemble classifier with a cascade decision strategy from a base classifier pool.41 In Ref. 40, the authors introduced a conditional model for simultaneous part-based detection and segmentation of objects of a given class, which needs a training set of images with segmentation masks for the object of interest. These works,39-41 however, are all training-based methods that need hundreds or thousands of samples.
From Fig. 7, it can be observed that our method is better than the methods of Refs. 15 and 39, while its recall is lower than that of Refs. 40 and 41, which need hundreds or thousands of samples. The precision of our method could be improved further by combining the detection results of querying the object with the training samples one by one, but this is not our main point. Our focus is that detecting the instances of an object by one query with our method is competitive with, or better than, executing a one-query method several times. Compared with the LARK method, because we organize the training samples reasonably, our detection results tolerate more appearance variation of the object. Although we have only a few samples in hand, our detection result is better than that of previous work39 that needs hundreds or thousands of samples.
Comparisons of the equal-error rates (EER)15 are shown in Tables 1 and 2; our proposed method is competitive with or better than the state-of-the-art methods. Here, we compare our method to the state-of-the-art training-based methods3,39-41 and the one-query method (LARK). The EER on the single- and multiscale test sets are shown in Tables 1 and 2, respectively. From Table 1, the EER of our method is higher than that of the methods in Refs. 15 and 39, and lower than that of the methods in Refs. 3, 40, and 41. In Table 2, the EER of our method is higher than that of the methods in Refs. 15 and 39 and lower than that of the methods in Refs. 3 and 40. As our strategy is based on few samples, the prior knowledge of the object class is limited. Even so, the EER of our method reaches 92.15 and 91.34%, respectively, which is competitive with the methods in Refs. 3, 40, and 41. These methods need thousands of samples to train classifiers, whereas our method, whose training process is much simpler, needs only several samples.

Face Detection
In this section, we demonstrate the performance of our method on face detection using the Massachusetts Institute of Technology-Carnegie Mellon University (MIT-CMU) face data set42 and the Caltech face data set.11,16 The several training samples in our face detection experiments are all chosen from the Fundação Educacional Inaciana (FEI) face data set.43 Since we have only a few samples in hand, the comparison here is made only to template matching methods. Among these, the LARK method15 shows good performance, so we take it as our baseline object detector. First, we show the detection results of our strategy on the MIT-CMU face data set. Our test set contains 45 images with 157 frontal faces of various sizes. The query image and training samples are all resized to 60 × 60. The scale of the faces in the data set ranges from 0.4 to 2.0. Some results are shown in Fig. 8: even when the target image is blurry or contains a cartoon human face, our method can localize the faces. In particular, in Fig. 9, we correctly detect 56 of 57 faces, and the precision rate is higher than the results of Refs. 15 and 44.
To make a fair comparison, we use the union of LARK detection results over several images. For example, if there are six training samples, LARK processes them one by one as the query image; for each target image, we record the true positives of the six queries and count the total number of distinct true positives. In this way, the union multisample detection result of LARK can be compared with our method fairly. In Fig. 10, we show the comparison between our method and LARK.15 The curve of our method is the average over five query images with the same training samples.

One can find that our method is superior to LARK. Next, we apply our method to the Caltech face data set,11,16 which contains 435 frontal faces in the file "Faces" at almost the same scale.
On all three data sets, the proposed method achieves higher accuracy and a lower false alarm rate than the union LARK: organizing the several training samples is more efficient than the one-by-one detection strategy. From the above, it can be seen that our detection strategy is consistent and robust across different query images and training samples. This is because our detection method has two steps, the voting step (VS) and the refined hypotheses step (RHS), which measure the object locally and integrally, respectively. Here, we show how these two steps affect the detection results. We compare the detection results of VS + RHS, VS, and RHS on the Caltech data set (see Fig. 12). Each curve is averaged over the same seven query images and three training samples. The combination of both steps achieves a higher precision rate than either step alone, and the voting step alone has higher accuracy than RHS. Similar conclusions hold on the other data sets, which are not shown here.

Generic Object Detection
We have already shown the detection results of our proposed method on the car and face data sets. In this section, we apply our strategy to some general real-world images containing hearts, flowers, and footballs. To the best of our knowledge, no data set exists for object detection based on a few samples, so we downloaded some real-world images from Google as our data set; they are available at http://cvlab.uestc.edu.cn/xupei. There are 34 images of red hearts with 49 positive samples and 40 images of sunflowers with 101 positive samples. In all of these data sets, the scale ranges from 0.2 to 1.8.
Detection examples can be found in Fig. 13. In the real world, the red-heart shape appears in complex settings, so our detection results contain some false alarms. We now compare the efficiency of our proposed scheme and the detection scheme of Ref. 15. Both methods have the same two steps, feature construction and object detection, and we compare the time efficiency of both steps between our strategy and LARK. To formalize the efficiency, t^c_LAS and t^c_LARK are, respectively, defined as the evaluation times of feature construction,

with ρ^d_LAS and ρ^d_LARK defined similarly for the detection step. The comparison of the LAS and LARK features is given in Table 3. In the experiment, we ran 10 tests (each containing 30 images) and recorded the evaluation time of the two steps for our method and LARK, respectively. Table 3 shows that the construction time of the LARK feature is ∼30% more than that of our LAS feature. This is because the LAS feature only needs the gradients and an SVD to obtain its three parameters, whereas for the LARK feature, after the SVD, the local kernel must be computed and then principal component analysis (PCA) is used to reduce the dimension. The superiority of our LAS feature thus lies not only in saving memory but also in cutting down the computing steps per pixel. Table 3 also shows that ρ^d_LAS is ∼20% lower than ρ^d_LARK. Our detection step is based on the idea of voting; in Ref. 15, the matrix cosine similarity for each T_i and Q is computed and then a salience map is constructed, which is very time-consuming. In our method, the time of the training step can be ignored: in the experiments, the training step consumes <7% of the whole running time for detecting objects in a target image. We further compare the performance with some state-of-the-art local features on Shechtman and Irani's test set.12 We compare our LAS feature with the gradient location-orientation histogram (GLOH),45 LARK,15 Shape Context,46 SIFT,17 and Commission Internationale de L'Eclairage (CIE)12 features. The ROC curves are shown in Fig. 14. Compared with previous work, our LAS feature is much better than SIFT, Shape Context, and GLOH in the case of a few samples, and comparable to or even better than CIE and LARK.

Conclusion and Future Work
In this paper, we proposed a generic object detection method based on few samples. We used the local principal gradient orientation information, the LAS feature, and trained voting spaces based on a few samples. Our detection method contains two steps. The first step adopts a combined dense voting in the trained voting spaces to detect similar patches in the target image; through the construction of the voting spaces, our approach is resilient to local deformation of appearance. Then, a refined hypotheses step locates the objects accurately.
Compared with the state-of-the-art methods, our experiments confirm the effectiveness and efficiency of our method. Our LAS feature is more efficient and saves more memory than LARK. Moreover, the strategy proposed in this paper provides a way to detect objects when samples are limited. Previous template matching methods detect objects using the samples one by one, whereas our method organizes several samples to detect objects in a single pass. In the future, we will extend our work to multiple-object detection and further improve the efficiency.

Fig. 1
Fig. 1 The overview of our method. In the trained voting spaces, (x, y) denotes the query image coordinates. Each bin in the spaces corresponds to a pixel cell.
L, S, and θ are obtained from the SVD decomposition of the local gradient field (the matrix stacking the gradients g_x(i, j) and g_y(i, j) over the window around each pixel), which yields invariance to brightness change, contrast change, and white noise, as shown in Fig. 3. The results of Fig. 3 are computed with the LAS feature under the corresponding conditions. Owing to the SVD decomposition of the local gradients, the conditions of Fig. 3 do not substantially change the dominant orientation energy at each pixel. One can find the proof details of the tolerance to white noise, brightness change, and contrast change in Ref. 24.
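As a rough illustration of how L, S, and θ can be extracted at a pixel, the following sketch stacks the gradients of a local window into an N × 2 matrix and decomposes it by SVD. It assumes L and S are the two singular values (dominant and secondary orientation energies) and θ the angle of the first right singular vector; the paper's exact definitions may differ:

```python
import numpy as np

def las_feature(img, i, j, radius=2):
    """Sketch of an LAS-style (L, S, theta) triple at pixel (i, j).

    Gradients over a (2*radius+1)^2 window are stacked into an N x 2
    matrix; SVD gives the dominant/secondary orientation energies and
    the dominant orientation angle.
    """
    gy, gx = np.gradient(img.astype(float))  # gradients along rows, cols
    win = (slice(i - radius, i + radius + 1),
           slice(j - radius, j + radius + 1))
    G = np.column_stack([gx[win].ravel(), gy[win].ravel()])  # N x 2
    _, s, vt = np.linalg.svd(G, full_matrices=False)
    L, S = s[0], s[1]                        # assumed: singular values
    theta = np.arctan2(vt[0, 1], vt[0, 0])   # dominant orientation
    return L, S, theta
```

Because the triple comes from an SVD of the whole gradient window, small perturbations of individual pixels (e.g., white noise) change L, S, and θ only mildly, which is the invariance the paragraph above describes.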

Fig. 2
Fig. 2 (a) Locally adaptive steering (LAS) feature at some pixels. The red dots mark the pixel positions. The ellipse indicates the dominant orientation in the local gradient patch around the corresponding pixel. (b) The components of the LAS feature in an image. L, S, and θ are each shown as a matrix.

Fig. 4
Fig. 4 Voting results comparison between Q and different patches from T. The thresholds τ_L = 0.7812, τ_S = 0.7793, and τ_θ = 0.7704 are computed by Eqs. (15) to (17). The patch T_j is the similar one, and the voted result is bounded by the red box.
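The threshold test of Fig. 4 can be sketched as follows. This is an approximation: we assume a per-component cosine similarity is compared against the trained thresholds, whereas the paper's exact measure is defined by Eqs. (15) to (17):

```python
import numpy as np

def patch_votes(query_feat, target_feat, tau=(0.7812, 0.7793, 0.7704)):
    """Sketch of the per-patch voting test (assumed similarity measure).

    query_feat / target_feat: dicts {'L', 'S', 'theta'} of arrays for a
    query-sized patch.  A patch receives a vote only when the cosine
    similarity of every LAS component exceeds its trained threshold
    tau_L, tau_S, tau_theta.
    """
    def cos_sim(a, b):
        a, b = a.ravel(), b.ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    sims = [cos_sim(query_feat[k], target_feat[k])
            for k in ('L', 'S', 'theta')]
    return all(s >= t for s, t in zip(sims, tau))
```

Requiring all three components to pass, rather than a single pooled score, is what lets a dissimilar patch like T_i in Fig. 4 be rejected even when one component happens to match.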

Fig. 5
Fig. 5 (a) The query, target, and training samples.(b) Voting results in the target image.(c) Refined hypotheses step.
The learning set consists of 550 positive car images and 500 non-car images. The test set consists of two parts: 170 gray-scale images containing 200 side views of cars of size 100 × 40, and 108 gray-scale images containing 139 cars. In Fig. 6, we show some detection examples of the UIUC car on the single-scale test set with the trained parameters τ_L = 0.7042, τ_S = 0.7031, τ_θ = 0.6852, and τ_h = 251.5. The query image and training samples are of size 100 × 40.

Fig. 6 Fig. 7
Fig. 6 The detection results of the UIUC car on the single-scale test set.
draw ROC curves with respect to different query images from the same training samples, and with respect to the same query image on different training samples [see Figs. 11(a) and 11(b)] on the MIT-CMU face data set.
Figure 11 demonstrates that our detection strategy achieves consistent precision for both different training samples and different query images; that is, our method is robust to the choice of training samples and query images. In fact, we obtained the same result on the other data sets used in this paper. To present the result clearly, we give the ROC curves on the MIT-CMU face data set.
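ROC curves such as those in Fig. 11 are obtained by sweeping a decision threshold over the detection confidences; a minimal sketch:

```python
import numpy as np

def roc_curve(scores, labels):
    """Minimal ROC computation used to compare detectors.

    scores: detection confidences; labels: 1 for a true object, 0 otherwise.
    Returns false-positive and true-positive rates as the decision
    threshold sweeps over all scores in descending order.
    """
    order = np.argsort(scores)[::-1]      # descending confidence
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels)               # true positives accepted so far
    fps = np.cumsum(1 - labels)           # false positives accepted so far
    tpr = tps / max(labels.sum(), 1)
    fpr = fps / max((1 - labels).sum(), 1)
    return fpr, tpr
```

Curves from different query images or training samples then overlap when, as in Fig. 11, the detector's ranking of true versus false detections is stable.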

Fig. 11
Fig. 11 ROC curves on the MIT-CMU face data set. In (a), we show ROC curves of different query images from the same training samples. In (b), the ROC curves are drawn with the same query image on different training samples.
t_d^LAS and t_d^LARK are the evaluation times of the detection step. Here, we define ρ_c^LAS and ρ_c^LARK to describe the time efficiency of the LAS and LARK features, respectively, where

Fig. 12 Fig. 13
Fig. 12 Comparison of different steps on the Caltech data set. The green curve represents the results that combine both steps, the red curve uses only the voting step, and the black curve uses only the refined hypotheses step.

Table 1
Detection equal-error rates on the single-scale UIUC car test set.

Table 2
Detection equal-error rates on the multiscale UIUC car test set.

Table 3
Comparison of efficiency between LAS and LARK.

Fig. 14
Comparison of different kinds of features on Shechtman and Irani's test set:12 gradient location-orientation histogram,45 LARK,15 Shape Context,46 SIFT,17 and CIE.