Hybrid Histogram Descriptor: A Fusion Feature Representation for Image Retrieval

Currently, visual sensors are becoming increasingly affordable and fashionable, accelerating the growth of image data. Image retrieval has attracted increasing interest due to space exploration, industrial, and biomedical applications. Nevertheless, designing an effective feature representation is acknowledged as a hard yet fundamental issue. This paper presents a fusion feature representation called a hybrid histogram descriptor (HHD) for image retrieval. The proposed descriptor jointly comprises two histograms: a perceptually uniform histogram, which is extracted by exploiting the color and edge orientation information in perceptually uniform regions; and a motif co-occurrence histogram, which is acquired by calculating the probability of a pair of motif patterns. To evaluate the performance, we benchmarked the proposed descriptor on the RSSCN7, AID, Outex-00013, Outex-00014 and ETHZ-53 datasets. Experimental results suggest that the proposed descriptor is more effective and robust than ten recent fusion-based descriptors under the content-based image retrieval framework. The computational complexity was also analyzed to give an in-depth evaluation. Furthermore, compared with state-of-the-art convolutional neural network (CNN)-based descriptors, the proposed descriptor achieves comparable performance but does not require any training process.


Introduction
In the past decades, affordable visual sensor equipment (e.g., surveillance cameras, smart phones, digital cameras and camcorders) has become widespread in our daily lives. Due to the growing number of images collected from these visual sensors, how to accurately and quickly retrieve the image-of-interest has become a hot topic [1][2][3][4][5][6]. Compared with text-based image retrieval (TBIR), content-based image retrieval (CBIR) is widely considered an effective and efficient technology that not only extracts low-level visual cues (e.g., color, shape and texture) automatically, but also bridges high-level semantic comprehension. To date, a growing number of feature representation descriptors, both independent feature descriptors and fusion-based feature descriptors, have been developed in the CBIR community.
The main contributions of this paper are summarized as follows:
1. We designed the pyramid color quantization model, which is based on the powerful color probability distribution prior in the L*a*b* color space.
2. We constructed the perceptually uniform histogram, which integrates color and edge orientation as a whole by exploiting a color difference operator.
3. We developed the motif co-occurrence histogram, in which the perceptually uniform motif patterns are further discussed and analyzed.
4. We proposed the hybrid histogram descriptor, which comprises the perceptually uniform histogram and the motif co-occurrence histogram.
The remainder of this paper is organized as follows. Preliminaries are introduced in Section 2, and the feature representation is described in Section 3. Experiments and evaluations are presented in Section 4. Section 5 provides conclusions.

The Color Space Selection
The selection of the color space is a crucial step before feature representation. In the past decades, several types of color spaces (e.g., RGB, L*a*b*, HSV, CMYK, YUV and HSI) have been widely used for CBIR. Among them, RGB is recognized as one of the most popular color spaces. It is derived from three colors of light, namely, red (R), green (G) and blue (B) [36]. Nevertheless, its disadvantages are often ignored: (1) the redundancy between blue and green; (2) the missing yellow between red and green; and (3) the non-uniform perception of the human eye. Consequently, the L*a*b* color space, which builds on Hering's opponent-color theory, was defined with three pairs of color channels: the white-black pair of the L* channel (ranging from 0 to 100), the green-red pair of the a* channel (ranging from −128 to +127), and the blue-yellow pair of the b* channel (ranging from −128 to +127) [37]. Compared with RGB, the advantages of the L*a*b* color space are summarized as follows: (1) L*a*b* remedies the redundant and missing information of RGB; (2) it conforms to the perception mechanism of the human eye; and (3) it provides excellent decoupling between intensity (represented by the L* channel) and color (represented by the a* and b* channels) [38]. Therefore, our scheme transforms all images from the RGB to the L*a*b* color space before the feature representation stage. The details of this transformation are defined using the standard RGB to L*a*b* transformations as follows [15,39]:

$$L^* = 116\, f(Y/Y_n) - 16, \qquad a^* = 500\,[\, f(X/X_n) - f(Y/Y_n)\,], \qquad b^* = 200\,[\, f(Y/Y_n) - f(Z/Z_n)\,],$$

with $f(u) = u^{1/3}$ for $u > 0.008856$ and $f(u) = 7.787u + 16/116$ otherwise, where $X$, $Y$ and $Z$ are the tristimulus values obtained from the R, G and B values by a linear transformation, and where $X_n$, $Y_n$ and $Z_n$ are the values of $X$, $Y$ and $Z$ for the illuminant, with $[X_n, Y_n, Z_n] = [0.950450, 1.000000, 1.088754]$ in accordance with illuminant D65 [15].
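The transformation above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `rgb_to_lab` is hypothetical, the input is assumed to hold RGB values scaled to [0, 1] with no gamma correction applied, and the matrix `RGB_TO_XYZ` is an assumption (a commonly used linear RGB-to-XYZ matrix whose row sums reproduce the D65 white point quoted in the text).

```python
import numpy as np

# Assumption: a commonly used linear RGB-to-XYZ matrix; its row sums match
# the D65 white point [0.950450, 1.000000, 1.088754] quoted in the text.
RGB_TO_XYZ = np.array([
    [0.412453, 0.357580, 0.180423],
    [0.212671, 0.715160, 0.072169],
    [0.019334, 0.119193, 0.950227],
])
WHITE_D65 = np.array([0.950450, 1.000000, 1.088754])


def f(u):
    """Piecewise cube-root function of the CIE L*a*b* definition."""
    u = np.asarray(u, dtype=float)
    return np.where(u > 0.008856, np.cbrt(u), 7.787 * u + 16.0 / 116.0)


def rgb_to_lab(rgb):
    """Convert an array of shape (..., 3) with RGB in [0, 1] to L*a*b*."""
    rgb = np.asarray(rgb, dtype=float)
    xyz = rgb @ RGB_TO_XYZ.T                      # linear RGB -> XYZ
    fx, fy, fz = np.moveaxis(f(xyz / WHITE_D65), -1, 0)
    lightness = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)
    b = 200.0 * (fy - fz)
    return np.stack([lightness, a, b], axis=-1)
```

Under these assumptions, a white pixel maps to approximately L* = 100 with a* and b* close to zero, which is a quick sanity check on the white point.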

Probability Distribution Prior in L*a*b* Color Space
In the previous color quantization models [12][13][14][15][17][18], the three color channels are uniformly mapped into fixed intervals. However, during the process of quantization, these models lose some useful color information. Hence, reducing the loss of useful color information is a serious concern. Motivated by this, we explored and summarized the color probability distribution of the a* and b* channels in different image databases. The example of the AID image database [40] is shown in Figure 1a,b. The pixel frequency is concentrated in the center region of the a* and b* channels. To verify the validity of this prior knowledge, we calculated the color probability distribution statistics of the a* and b* channels on hundreds of image databases. The results show that the proposed prior is stable and consistent. Even if an image database is changed, the property of the color probability distribution prior remains fairly consistent. For example, the color probability distribution of the a* and b* channels in the RSSCN7 [41] dataset and its subset (50% of the RSSCN7 dataset) is shown in Figure 1c-f. There is almost no change between RSSCN7 and its subset, except for the pixel frequency.

Pyramid Color Quantization Model
Inspired by the above prior knowledge, we designed a novel pyramid color quantization model (as shown in Figure 2), in which every layer represents a quantization scheme (a group of intervals and their indexes). The original range (−128, +127) of a* or b* is first divided into two equal intervals in Layer 1, and the two intervals are indexed as 0 and 1 from left to right. Then, because the pixels concentrate in the middle, in each of Layers 2-7 the two middle intervals of the layer above are split into four equal intervals, until the two middle intervals cannot be split further in Layer 7. Finally, the remaining intervals are copied unchanged from the layer above. In this manner, we refine and retain the color information in the middle of the a* and b* channels effectively. We define the quantization layers of the a* and b* channels as $Y_{a^*}$ and $Y_{b^*}$, where $Y_{a^*}, Y_{b^*} \in \{1, 2, \ldots, 7\}$, and the indexes are denoted as $\tilde{Y}_{a^*}$ and $\tilde{Y}_{b^*}$, with $\tilde{Y}_{a^*} \in \{0, 1, \ldots, \ddot{Y}_{a^*}\}$ and $\tilde{Y}_{b^*} \in \{0, 1, \ldots, \ddot{Y}_{b^*}\}$, where $\ddot{Y}_{a^*} = 2Y_{a^*} - 1$ and $\ddot{Y}_{b^*} = 2Y_{b^*} - 1$, respectively. In addition, considering the human visual intensity perception mechanism in [5], the L* channel is quantized into three intervals (0, +25), (+26, +75) and (+76, +100). We define the quantization layer of the L* channel as $Y_{L^*}$, where $Y_{L^*} = 1$, and the index is denoted as $\tilde{Y}_{L^*} \in \{0, 1, \ldots, \ddot{Y}_{L^*}\}$, where $\ddot{Y}_{L^*} = 2Y_{L^*}$. In summary, combining the indexes of the L*, a* and b* channels, the color map of an image $f(x, y)$ is defined as $C(x, y)$, and its index is denoted as $\tilde{C} \in \{0, 1, \ldots, \hat{C}\}$, where $\hat{C} = 2Y_{a^*} \times 2Y_{b^*} \times 3 - 1$.
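The layer construction can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, intervals are treated as half-open over [−128, 128), the L* thresholds are placed at 26 and 76, and the order in which the three channel indexes are combined into one color index is an assumption (the text only fixes the index range).

```python
import numpy as np


def pyramid_edges(layer):
    """Boundaries of the 2*layer intervals of one chroma channel."""
    if not 1 <= layer <= 7:
        raise ValueError("layer must be in 1..7")
    edges = [-128.0, 0.0, 128.0]                  # Layer 1: two equal intervals
    for _ in range(layer - 1):
        mid = len(edges) // 2                     # position of the boundary at 0
        left, right = edges[mid - 1], edges[mid + 1]
        # Split each of the two middle intervals in half (two become four).
        edges.insert(mid + 1, (edges[mid] + right) / 2)
        edges.insert(mid, (left + edges[mid]) / 2)
    return np.array(edges)


def quantize_chroma(values, layer):
    """Map a* or b* values to indexes 0 .. 2*layer - 1."""
    interior = pyramid_edges(layer)[1:-1]
    return np.digitize(values, interior)


def quantize_lightness(values):
    """Map L* values to indexes 0, 1, 2 (assumed thresholds at 26 and 76)."""
    return np.digitize(values, [26.0, 76.0])


def color_index(lab, layer_a, layer_b):
    """Combine the three channel indexes into one color index (assumed order)."""
    lab = np.asarray(lab, dtype=float)
    i_l = quantize_lightness(lab[..., 0])
    i_a = quantize_chroma(lab[..., 1], layer_a)
    i_b = quantize_chroma(lab[..., 2], layer_b)
    return (i_l * (2 * layer_a) + i_a) * (2 * layer_b) + i_b
```

With this construction, Layer 3 yields the boundaries −128, −64, −32, 0, 32, 64, 128, and the largest combined index equals 2Ya* × 2Yb* × 3 − 1, in agreement with the index range given in the text.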

Perceptually Uniform Histogram Definition
The Gestalt Psychology Theory elucidates that the human visual perception mechanism tends to group elements into a local region where the elements share a homologous or approximate property [42]. Based on this theoretical foundation, perceptually uniform regions can be described as a certain visual feature space in which the visual elements follow the same rule (e.g., color and edge orientation). For the visual feature space $\tilde{I}$, an element $\xi$ and its neighbors $\xi_g$ within $\tilde{I}$ are denoted as $\tilde{I}(\xi)$ and $\tilde{I}(\xi_g)$. Mathematically, the discrimination function $\varphi(\cdot)$ is formulated as follows:

$$\varphi(\tilde{I}(\xi), \tilde{I}(\xi_g)) = \begin{cases} 1, & \tilde{I}(\xi) = \tilde{I}(\xi_g) \\ 0, & \tilde{I}(\xi) \neq \tilde{I}(\xi_g) \end{cases}, \qquad g \in \{1, 2, \ldots, N\},$$

where $N$ represents the number of neighbors. If $\varphi(\tilde{I}(\xi), \tilde{I}(\xi_g)) = 1$, $\tilde{I}(\xi_g)$ belongs to the perceptually uniform region; if $\varphi(\tilde{I}(\xi), \tilde{I}(\xi_g)) = 0$, $\tilde{I}(\xi_g)$ does not belong to the perceptually uniform region.

Subject to the perceptually uniform region, we construct the perceptually uniform histogram by exploiting the color difference operator [15,43,44] between the color and edge orientation. Herein, given an image $f(x, y)$, the edge orientation is first extracted by using the Prewitt operator, owing to its ability to extract the geometry and boundary information from the observed content. Then, the edge orientation value is quantized uniformly into four bins, a number chosen experimentally, to construct the edge orientation map $O(x, y)$, because it is time consuming and unnecessary to consider all edge orientation values. Finally, the edge orientation map $O(x, y)$ and the color map $C(x, y)$ are divided into overlapping 3 × 3 windows, in which the central pixel is denoted as $(x, y)$ and its eight neighbors are denoted as $(x_g, y_g)$, $g \in \{1, 2, \ldots, 8\}$. The perceptually uniform histogram (PUH) comprises two components, $\mathrm{PUH}_{color}(O(x, y))$ and $\mathrm{PUH}_{ori}(C(x, y))$, both computed from $\Delta f_\psi$, the color difference between the central pixel $(x, y)$ and its eight neighbors $(x_g, y_g)$ in channel $\psi \in \{L^*, a^*, b^*\}$. The feature vector lengths of $\mathrm{PUH}_{color}(O(x, y))$ and $\mathrm{PUH}_{ori}(C(x, y))$ are 4 and $2Y_{a^*} \times 2Y_{b^*} \times 3$, respectively. For an image dataset $D$, the fitness quantization layers of $Y_{a^*}$ and $Y_{b^*}$ are computed depending upon the retrieval accuracy score $\mathrm{Acc}(D \mid Y_{a^*}, Y_{b^*})$. This procedure is expressed as the following maximization problem:

$$(Y_{a^*}, Y_{b^*}) = \arg\max_{Y_{a^*},\, Y_{b^*} \in \{1, 2, \ldots, 7\}} \mathrm{Acc}(D \mid Y_{a^*}, Y_{b^*}).$$

We present the detailed evaluation of different color quantization layers of $Y_{a^*}$ and $Y_{b^*}$ in Section 4.4.
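A minimal sketch of how the two PUH components could be accumulated over 3 × 3 windows is given below. The accumulation rule is an assumption modeled on the color difference operator cited in the text [15,43,44], not a reproduction of the authors' definition: for each interior pixel and each of its eight neighbors, the Euclidean color difference over the L*, a* and b* channels is added to the orientation bin of the central pixel when both pixels share a color index, and to the color bin of the central pixel when both share an orientation index. The function name and its arguments are hypothetical, and the color and orientation maps are assumed to be precomputed integer arrays.

```python
import numpy as np


def perceptually_uniform_histogram(lab, cmap, omap, n_colors, n_ori=4):
    """Accumulate PUH_color (indexed by orientation) and PUH_ori (indexed by color).

    lab  : float array (H, W, 3) with the L*, a*, b* values
    cmap : int array (H, W) of color indexes in 0 .. n_colors - 1
    omap : int array (H, W) of orientation indexes in 0 .. n_ori - 1
    """
    lab = np.asarray(lab, dtype=float)
    cmap = np.asarray(cmap, dtype=np.intp)
    omap = np.asarray(omap, dtype=np.intp)
    h, w = cmap.shape
    puh_color = np.zeros(n_ori)
    puh_ori = np.zeros(n_colors)
    # Central pixels of all 3 x 3 windows (image border excluded).
    c0, o0, f0 = cmap[1:-1, 1:-1], omap[1:-1, 1:-1], lab[1:-1, 1:-1]
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            sl = (slice(1 + dy, h - 1 + dy), slice(1 + dx, w - 1 + dx))
            # Euclidean color difference between center and this neighbor.
            delta = np.sqrt(((f0 - lab[sl]) ** 2).sum(axis=-1))
            same_c = c0 == cmap[sl]   # uniform in color
            same_o = o0 == omap[sl]   # uniform in edge orientation
            puh_color += np.bincount(o0[same_c], weights=delta[same_c],
                                     minlength=n_ori)
            puh_ori += np.bincount(c0[same_o], weights=delta[same_o],
                                   minlength=n_colors)
    return puh_color, puh_ori
```

The returned vectors have lengths `n_ori` and `n_colors`, matching the feature vector lengths 4 and 2Ya* × 2Yb* × 3 stated in the text when `n_colors` is set accordingly.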

Motif Co-Occurrence Histogram
The perceptually uniform histogram only extracts the color and edge orientation information, but the texture information is ignored to some extent. Fortunately, the motif pattern, which depicts the texture information by the pre-defined spatial structure model, can remedy this shortcoming.

Motif Patterns
The motif co-occurrence matrix (MCM) is investigated in [29], where the first six types of motif patterns shown in Figure 3, each starting from the top-left point P1, are generated because they represent a complete set of space-filling curves. However, using merely six motif patterns is insufficient because the perceptually uniform motif patterns (PUMP) are ignored. To depict the consistency of spatial structure information, we propose three perceptually uniform motif patterns, into which all types of perceptually uniform motif patterns are grouped according to the number of equal pixels. Combined with the previous six motif patterns, nine motif patterns are obtained, as shown in Figure 3, in which the red dots represent the number of equal pixels in motif patterns 7, 8 and 9.

Motif Co-Occurrence Histogram Definition
Since the L*a*b* color space provides excellent decoupling between intensity (represented by the L* channel) and color (represented by the a* and b* channels) [38], the L* channel is used to extract the motif co-occurrence histogram. For simplicity, a 5 × 5 mini-numerical map in Figure 4a is adopted to illustrate the proposed method. In our scheme, the map is divided into overlapping 2 × 2 grids, one for each pixel apart from the lower and right boundary pixels, as shown in Figure 4b. Then, each grid is transformed into the motif pattern with the minimum local gradient to obtain the motif map shown in Figure 4c, which is used to calculate the motif co-occurrence histogram shown in Figure 4d. For example, the red circle in Figure 4c marks a pair of motif patterns, indexed as (3, 2), in the 0° direction, corresponding to the red bar "MCH(3, 2) = 1" in the motif co-occurrence histogram in Figure 4d. Mathematically, the probability of co-occurrence of a pair of motif patterns is expressed as follows:

$$\mathrm{MCH}(MP_{e1}, MP_{e2}) = \Pr\{\, M(i, j) = MP_{e1} \wedge M(i, j + 1) = MP_{e2} \,\},$$

where $\Pr$ is the probability of co-occurrence of a pair of motif patterns at position $(i, j)$ and its neighbor $(i, j + 1)$ within the motif map $M(x, y)$. $MP_{e1}$ and $MP_{e2}$ represent the indexes of a pair of motif patterns, where $MP_{e1}, MP_{e2} \in \{1, 2, \ldots, 9\}$. The feature vector length of the motif co-occurrence histogram is 81. We will perform the detailed evaluation of different motif co-occurrence schemes between the motif co-occurrence matrix [29] and the proposed motif co-occurrence histogram in Section 4.5.
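The co-occurrence step can be sketched as follows, assuming the motif map has already been computed and stores motif indexes 1 to 9. The function name is hypothetical. The example in the text ("MCH(3, 2) = 1") reads as a raw count, so the optional normalization to probabilities is an assumption about how the probability is obtained.

```python
import numpy as np


def motif_cooccurrence_histogram(motif_map, n_motifs=9, normalize=True):
    """Histogram of horizontally adjacent motif pairs (0 degree direction).

    motif_map : int array (H, W) with motif indexes in 1 .. n_motifs
    Returns a vector of length n_motifs ** 2 (81 for nine motif patterns).
    """
    m = np.asarray(motif_map, dtype=np.intp)
    first = m[:, :-1].ravel() - 1     # motif at (i, j), shifted to 0-based
    second = m[:, 1:].ravel() - 1     # motif at (i, j + 1), shifted to 0-based
    counts = np.zeros((n_motifs, n_motifs))
    np.add.at(counts, (first, second), 1)
    hist = counts.ravel()
    if normalize and hist.sum() > 0:
        hist = hist / hist.sum()
    return hist
```

For instance, a motif map containing the pair (3, 2) once in a row contributes one count to the bin at position (3 − 1) × 9 + (2 − 1) of the flattened vector.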
Sensors 2018, 18, x FOR PEER REVIEW

Hybrid Histogram Descriptor Definition
It is widely recognized that an image possesses a rich semantic content that goes beyond the description by its metadata [2]. Hence, it is necessary to take a fusion-based feature descriptor into account because it can integrate the merits of the subjective aspects of image semantics. From this point of view, the hybrid histogram descriptor (HHD) is proposed by concatenating the perceptually uniform histogram and the motif co-occurrence histogram, and it is expressed as follows:

$$\mathrm{HHD} = [\mathrm{PUH}, \mathrm{MCH}].$$

We present the detailed evaluation of the proposed descriptors, namely the perceptually uniform histogram, the motif co-occurrence histogram and the hybrid histogram descriptor, in Section 4.6.

Distance Metric
The distance metric serves as an important step to measure the dissimilarity between feature vectors. In the CBIR framework, the query image and the database images are converted into feature vectors in the form of histogram descriptors, which are then passed to the distance measure. In this paper, the Extended Canberra Distance [15,32] is used, and it is defined as follows:

$$T(Q, D) = \sum_{k=1}^{K} \frac{|Q_k - D_k|}{|Q_k + \mu_Q| + |D_k + \mu_D|},$$

where $Q$, $D$, $K$, and $T$ represent the query image, the database image, the feature vector dimension, and the distance metric result, respectively, and where $\mu_Q = \frac{1}{K}\sum_{k=1}^{K} Q_k$ and $\mu_D = \frac{1}{K}\sum_{k=1}^{K} D_k$.
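The distance computation can be sketched in Python as below. This is a minimal sketch assuming the commonly given form of the extended Canberra distance, in which the mean of each feature vector is added inside the denominator; the function name is hypothetical, and terms whose denominator is zero are treated as contributing nothing, which is also an assumption.

```python
import numpy as np


def extended_canberra(q, d):
    """Extended Canberra distance between two feature vectors of equal length."""
    q = np.asarray(q, dtype=float)
    d = np.asarray(d, dtype=float)
    if q.shape != d.shape:
        raise ValueError("feature vectors must have the same shape")
    mu_q, mu_d = q.mean(), d.mean()
    num = np.abs(q - d)
    denom = np.abs(q + mu_q) + np.abs(d + mu_d)
    safe = np.where(denom > 0, denom, 1.0)        # avoid division by zero
    return float(np.sum(np.where(denom > 0, num / safe, 0.0)))
```

In a retrieval loop, the database images would be ranked by ascending distance to the query, for example with `np.argsort` over the distances to all database feature vectors.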

Evaluation Criteria
The final goal of image retrieval is to search for a set of target images in the image database [35]. For a query image $I_Q$ and the retrieved database images $I_D^{\sigma}$, $\sigma \in \{1, 2, \ldots, N_\sigma\}$, the precision (Pre) and recall (Rec) values are given as follows:

$$\mathrm{Pre}(I_Q) = \frac{1}{N_\sigma} \sum_{\sigma=1}^{N_\sigma} \varsigma\big(\vartheta(I_Q), \vartheta(I_D^{\sigma})\big), \qquad \mathrm{Rec}(I_Q) = \frac{1}{N_\tau} \sum_{\sigma=1}^{N_\sigma} \varsigma\big(\vartheta(I_Q), \vartheta(I_D^{\sigma})\big),$$

where $\vartheta(\cdot)$, $N_\sigma$, and $N_\tau$ represent the image category information, the number of retrieved images, and the number of images in each category, respectively. The discrimination function $\varsigma(\cdot)$ is used to determine whether the query image and a database image carry the same category information; it equals 1 if they do and 0 otherwise.

In the experiments, to guarantee accuracy and reproducibility, all images were chosen as the query image. Referring to the parameter setting in [30,32], the number of retrieved images was set to 10. For ETHZ-53 [45], the number of retrieved images was set to 5. Further, for $N$ query images, the average precision rate (APR) and average recall rate (ARR) values are defined as follows:

$$\mathrm{APR} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Pre}(n), \qquad \mathrm{ARR} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Rec}(n),$$

where $n$ is the $n$th query image. Furthermore, considering the order of the retrieved images, the precision-recall curve serves as an auxiliary evaluation criterion that measures the dynamic precision against the threshold recall. In its formulation, $N_\tau$ and $N_\chi$ represent the number of images in each category and the total number of the shown images at the recall of $\chi$, $\chi \in \{1, 2, \ldots, N_\sigma - 1\}$. A higher precision-recall curve indicates a more accurate retrieval performance.
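These criteria can be sketched in Python as follows. The sketch assumes the standard definitions of precision and recall described above; the function names are hypothetical, and whether the query image itself appears in the ranked list is left to the caller, since the text does not specify it.

```python
import numpy as np


def precision_recall(query_label, ranked_labels, n_retrieved, n_in_category):
    """Precision and recall of one query.

    ranked_labels : category labels of the database images, ordered by
                    ascending distance to the query
    n_retrieved   : number of retrieved images (10, or 5 for ETHZ-53)
    n_in_category : number of images in the query's category
    """
    top = np.asarray(ranked_labels)[:n_retrieved]
    hits = int(np.sum(top == query_label))        # retrieved images of the same category
    return hits / n_retrieved, hits / n_in_category


def average_rates(per_query):
    """APR and ARR from a list of (precision, recall) pairs, one per query."""
    precisions, recalls = zip(*per_query)
    return float(np.mean(precisions)), float(np.mean(recalls))
```

Calling `precision_recall` once per query image and passing the resulting pairs to `average_rates` gives the APR and ARR values over all N queries.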

Image Databases
Extensive experiments were conducted on five benchmark databases, including two remote sensing image databases (RSSCN7 and AID), two textural image datasets (Outex-00013 and Outex-00014), and one object image database (ETHZ-53). The details of these datasets are summarized as follows:

RSSCN7 database
The RSSCN7 [41] is a publicly available remote sensing dataset produced by different remote imaging sensors. It consists of seven land-use categories: industrial region, farm land, residential region, parking lot, river lake, forest and grass land. For each category, there are 400 images of size 400 × 400 in JPG format. Some sample images are shown in Figure 5a, in which each row represents one category. Note that there are images with rotation and resolution differences in the same category. Thus, the RSSCN7 dataset can not only verify the effectiveness of the proposed descriptor but also test its robustness to different rotations and resolutions. The RSSCN7 dataset can be downloaded from https://www.dropbox.com/s/j80iv1a0mvhonsa/RSSCN7.zip?dl=0.

AID database
The aerial image dataset (AID) [40] is also a publicly available large-scale remote sensing dataset produced by different remote imaging sensors. It contains 10,000 images in 30 categories, for example, airport, bare land, meadow, beach, park, bridge, forest, railway station, and baseball field. Each category includes between 220 and 420 images of size 600 × 600 in JPG format. Some sample images are shown in Figure 5b, in which each row represents one category. Similar to RSSCN7, there are images with rotation and resolution differences in the same category. The AID dataset can be downloaded from http://www.lmars.whu.edu.cn/xia/AID-project.html.

Outex-00013
The Outex-00013 [46] is a publicly available color texture dataset produced by an Olympus Camedia C-2500 L digital camera. It contains 1360 images in 68 categories, for example, wool, fabric, cardboard, sandpaper, natural stone and paper. Each category includes 20 images, each of size 128 × 128 in BMP format. Some sample images from Outex-00013 are shown in Figure 5c, in which each row represents one category. There are no differences within the same category. The Outex-00013 dataset can be downloaded from http://www.outex.oulu.fi/index.php?page=classification.

Outex-00014
The Outex-00014 [46] is also a publicly available color texture dataset produced by an Olympus Camedia C-2500 L digital camera. It contains 4080 images in 68 categories, for example, wool, fabric, cardboard, sandpaper, natural stone, and paper. Each category includes 60 images, each of size 128 × 128 in BMP format. Some sample images from Outex-00014 are shown in Figure 5d, in which each row represents one category.

ETHZ-53
The ETHZ-53 [45] is a publicly available object dataset collected by a color camera. It contains 265 images of 53 objects, such as cup, shampoo, vegetable, fruit, and car model. Each object includes 5 images, each of size 320 × 240 in PNG format. Some sample images are shown in Figure 5e, in which each row represents one category. Note that each object is captured from 5 different angles. The ETHZ-53 dataset can be downloaded from http://www.vision.ee.ethz.ch/en/datasets/.

As reported in Tables 1 and 2, when $Y_{a^*} = 6$ and $Y_{b^*} = 5$, the HHD achieves the best APR = 79.57% on RSSCN7 and the best APR = 58.13% on AID, respectively. As documented in Tables 3 and 4, when $Y_{a^*} = 6$ and $Y_{b^*} = 2$, the HHD achieves the best APR = 84.21% on Outex-00013 and the best APR = 82.82% on Outex-00014, respectively. As listed in Table 5, when $Y_{a^*} = 5$ and $Y_{b^*} = 6$, the HHD achieves the best APR = 97.89% on ETHZ-53. In addition, we can also see that the simplest color quantization scheme (e.g., $Y_{a^*} = 1$ and $Y_{b^*} = 1$) does not lead to the lowest APR on RSSCN7 and Outex-00013, and the most refined color quantization scheme (e.g., $Y_{a^*} = 7$ and $Y_{b^*} = 7$) does not guarantee the highest APR. This phenomenon demonstrates that it is necessary to adaptively select the fitness quantization layers of $Y_{a^*}$ and $Y_{b^*}$. Depending upon the retrieval accuracy score, the fitness quantization layers of $Y_{a^*}$ and $Y_{b^*}$ will be used in the following experiments.

Table 6 shows the average precision rate (APR) and average recall rate (ARR) values on the RSSCN7, AID, Outex-00013, Outex-00014 and ETHZ-53 datasets by using the motif co-occurrence matrix (MCM) and the motif co-occurrence histogram (MCH), respectively. Bold values highlight the best values. In Table 6, the {APR, ARR} of MCH greatly outperforms that of MCM by {18.14%, 0.45%} on RSSCN7, {15.21%, 0.47%} on AID, {41.75%, 20.87%} on Outex-00013 and {24.63%, 12.32%} on Outex-00014. One possible reason is that MCH additionally takes the three perceptually uniform motif patterns into account. Based on the above results, it can be concluded that MCH is more effective than MCM.

Table 7 shows the average precision rate (APR) and average recall rate (ARR) values on the RSSCN7, AID, Outex-00013, Outex-00014 and ETHZ-53 datasets by using the motif co-occurrence histogram (MCH), the perceptually uniform histogram (PUH) and the hybrid histogram descriptor (HHD). Bold values highlight the best values.

Comparison with Other Fusion-Based Descriptors
To illustrate the effectiveness and robustness of hybrid histogram descriptor (HHD), it is compared with nine fusion-based feature descriptors and the fusion of the perceptually uniform histogram and motif co-occurrence matrix (flagged as "PUH + MCM") on the RSSCN7, AID, Outex-00013, Outex-00014 and ETHZ-53 datasets. All comparative methods are detailed as follows: (1) mdLBP [30]: The 2048-dimensional multichannel adder local binary patterns by combining three LBP maps extracted from the R, G and B channels. (2) maLBP [30]: The 1024-dimensional multichannel decoded local binary patterns by combining three LBP maps extracted from the R, G and B channels.  Table 8 reports the comparisons between the proposed descriptors and the former schemes in terms of average precision rate (APR) and average recall rate (ARR). Bold values highlight the best values. In Table 8, it can be seen that HHD yields the highest APR and ARR compared to all former existing schemes on five datasets.  Figure 5a,b), and various illumination differences on Outex-00014 dataset (see Figure 5d), the robustness of the rotation, resolution and illumination is also well illustrated to some extent.  Figure 6a-j shows the performance comparison between HHD and existing approaches in terms of average precision rate versus number of top matches (APR vs. NTM) and average recall rate versus number of top matches (ARR vs. NTM). To guarantee the accuracy and reproducibility, the number of top matches is set to 100, 200, 20, 20 and 5 on RSSCN7, AID, Outex-00013, Outex-00014 and ETHZ-53, respectively. In Figure 6a,b, HHD achieves an obviously higher performance than all other fusion-based feature descriptors on RSSCN7. Meanwhile, we also note that the APR vs. NTM and ARR vs. NTM curves of mdLBP, maLBP, CDH, MSD, LNDP + LBP, MPEG-CED, Joint Colorhist, OCLBP, IOCLBP and PUH + MCM are close to one another extremely. 
The reason is that only seven land-use categories are very challenging to retrieve the targeted images from RSSCN7. As shown in Figure 6c,d, the APR vs. NTM and ARR vs. NTM curves of HHD achieve an obviously higher curvature than all other descriptors on AID. This phenomenon illustrates that the proposed descriptor can acquire better performance on the large-scale dataset. As expected, as shown in Figure 6e-j, HHD still outperforms all other existing descriptors over Outex-00013, Outex-00014 and ETHZ-53, respectively. Specifically, PUM + MCM and HHD are superior to other descriptors on ETHZ-53 obviously. The main reason is that they not only combine the color and edge information, but also integrate the texture information. Based on the above results, the effectiveness of the proposed descriptor is demonstrated by comparing with other fusion-based methods in terms of APR vs. NTM and ARR vs. NTM. Figure 7a-e shows the performance comparison of the top-10 retrieved images using different methods. The leftmost image in each row of Figure 7a-e is the query image, and the remaining images are a set of retrieved images ordered in ascending order from left to right. For clarity, if a retrieved image owns the same group label as the query, it is flagged as a green frame; otherwise, it is flagged as a red frame. In Figure 7a, there are 7 related images to the query image "River Lake" from RSSCN7 using mdLBP, 8 using maLBP, 8 using CDH, 4 using MSD, 9 using LNDP + LBP, 3 using MPEG-CED, 3 using Joint Colorhist, 8 using OCLBP, 7 using IOCLBP, 4 using PUH + MCM and 10 using HHD. Note that, although the images from "Forest" have a similar color to "River Lake", leading to the error results by most of the existing schemes, HHD can retrieve the targeted images accurately. 
In Figure 7b, for the query image "Baseball Field" from AID, the numbers of targeted images using the mdLBP, maLBP, CDH, MSD, LNDP + LBP, MPEG-CED, Joint Colorhist, OCLBP, IOCLBP, PUH + MCM and HHD descriptors are 7, 7, 9, 6, 5, 9, 5, 8, 9, 9 and 10, respectively. It can be seen that HHD not only gives a better retrieval result than all other descriptors, but also shows robustness to rotation and resolution differences. In Figure 7c, for the query image "Rice" from Outex-00013, the precision values achieved by the mdLBP, maLBP, CDH, MSD, LNDP + LBP, MPEG-CED, Joint Colorhist, PUH + MCM and HHD descriptors are 40%, 40%, 80%, 70%, 30%, 80%, 80%, 90% and 100%, respectively. Although all retrieved images show a similar content appearance, HHD still outperforms all other descriptors. In Figure 7d, for the query image "Carpet" from Outex-00014, the precision values obtained by the mdLBP, maLBP, CDH, MSD, LNDP + LBP, MPEG-CED, Joint Colorhist, OCLBP, IOCLBP, PUH + MCM and HHD descriptors are 40%, 30%, 70%, 10%, 30%, 40%, 30%, 70%, 50%, 50% and 100%, respectively. As shown in Figure 7e, for the query image "Paper Bag" from ETHZ-53, HHD still outperforms all other existing descriptors. From the above results, we can conclude that HHD not only depicts the image semantic information with similar textural structure appearance, but also effectively discriminates the color and texture differences. In summary, the effectiveness of the proposed descriptor is demonstrated by comparison with existing approaches in terms of the top-10 retrieved images.
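The APR and ARR values discussed above can be illustrated with a minimal sketch. It assumes the usual content-based retrieval definitions: precision is the fraction of the top-N retrieved images that share the query's label, and recall is the fraction of the query's category that appears among the top N. The function names, the toy labels and the category size of 400 are hypothetical and are not taken from the paper's implementation.

```python
from typing import List, Sequence, Tuple


def precision_recall_at_n(ranked_labels: Sequence[str], query_label: str,
                          n: int, category_size: int) -> Tuple[float, float]:
    """Precision and recall over the top-n retrieved images for one query.

    ranked_labels: labels of the retrieved images, best match first.
    category_size: number of database images sharing the query's label
                   (assumed to be known from the ground truth).
    """
    hits = sum(1 for label in ranked_labels[:n] if label == query_label)
    return hits / n, hits / category_size


def average_rates(queries: List[Tuple[Sequence[str], str, int]],
                  n: int) -> Tuple[float, float]:
    """APR and ARR: mean precision and mean recall over all queries.

    Each query is a tuple (ranked_labels, query_label, category_size).
    """
    pairs = [precision_recall_at_n(ranked, label, n, size)
             for ranked, label, size in queries]
    apr = sum(p for p, _ in pairs) / len(pairs)
    arr = sum(r for _, r in pairs) / len(pairs)
    return apr, arr


# Toy example: 7 of the top-10 results share the query label, matching the
# count reported for mdLBP on the "River Lake" query. The category size
# of 400 is an assumption made only for this illustration.
ranked = ["RiverLake"] * 7 + ["Forest"] * 3
precision, recall = precision_recall_at_n(ranked, "RiverLake", 10, 400)
print(precision, recall)
```

Sweeping `n` over a range of values and plotting the averaged results would give curves of the APR vs. NTM and ARR vs. NTM kind.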
Figure 7. Results of the top-10 retrieved images by considering different query images: (a) "River Lake"; (b) "Baseball Field"; (c) "Rice"; (d) "Carpet"; and (e) "Paper Bag" using different descriptors (Row 1 using mdLBP, Row 2 using maLBP, Row 3 using CDH, Row 4 using MSD, Row 5 using LNDP + LBP, Row 6 using MPEG-CED, Row 7 using Joint Colorhist, Row 8 using OCLBP, Row 9 using IOCLBP, Row 10 using PUH + MCM and Row 11 using HHD).
Figure 8a-e shows the performance comparison of the proposed HHD with existing approaches over RSSCN7, AID, Outex-00013, Outex-00014 and ETHZ-53 in terms of the precision-recall curve. According to Figure 8a,b, the precision-recall curve of HHD is clearly superior to those of all other fusion-based approaches. According to Figure 8c,d, the precision-recall curves of the other fusion-based approaches are clearly inferior to that of HHD over Outex-00013 and Outex-00014. Moreover, as shown in Figure 8e, both HHD and PUH + MCM are higher than mdLBP, maLBP, CDH, MSD, LNDP + LBP, OCLBP, IOCLBP and Joint Colorhist on ETHZ-53. The reasons can be summarized as follows:
(1) Joint Colorhist, mdLBP, maLBP and LNDP + LBP extract only independent color or texture information.
(2) CDH, MSD and MPEG-CED consider the color and edge orientation information from different channels, while the texture information is ignored.
(3) OCLBP and IOCLBP combine the color and texture information, but the edge orientation information is lost.
(4) Although PUH + MCM integrates the color, edge orientation and texture information as a whole, the perceptually uniform motif patterns are lost.
(5) HHD not only integrates the merits of the color, edge orientation and texture information, but also considers the perceptually uniform motif patterns.
Based on the above results and analyses, the effectiveness of the proposed descriptor is demonstrated by comparison with other fusion-based methods in terms of the precision-recall curve.
Table 9 shows the feature vector length, average retrieval time and memory cost per image of the different descriptors to provide an in-depth evaluation of the computational complexity. All experiments were carried out on a computer with an Intel Core i7-7700K@4.20 GHz CPU, 4 cores active and 16 GB RAM. The feature vector length is given in dimensions (D), the average retrieval time in seconds (S), and the memory cost per image in kilobytes (KB). Similar to PUH + MCM, the entries 445/229 (D) and 3.48/1.79 (KB) represent HHD with 445 dimensions and 3.48 kilobytes performing retrieval over the RSSCN7, AID and ETHZ-53 databases, and HHD with 229 dimensions and 1.79 kilobytes performing retrieval over the Outex-00013 and Outex-00014 databases. For RSSCN7, AID and ETHZ-53, the feature vector length and the memory cost per image of HHD are inferior to those of MSD, CDH, MPEG-CED and PUH + MCM, but superior to those of Joint Colorhist, maLBP, mdLBP, OCLBP, IOCLBP and LNDP + LBP. For Outex-00013 and Outex-00014, the feature vector length and the memory cost per image of HHD are worse than those of MSD, CDH and PUH + MCM, but better than those of MPEG-CED, Joint Colorhist, maLBP, mdLBP, OCLBP, IOCLBP and LNDP + LBP. In terms of the average retrieval time, HHD is slower than MSD, CDH, MPEG-CED and PUH + MCM, yet faster than Joint Colorhist, maLBP, mdLBP, OCLBP, IOCLBP and LNDP + LBP. The main reason is that the RSSCN7, AID and ETHZ-53 databases have more complex contents than the Outex-00013 and Outex-00014 image databases.
Although HHD does not outperform all other fusion-based descriptors in these respects, its usability and practicability under the content-based image retrieval framework are indicated by its adaptive feature vector length, competitive average retrieval time and acceptable memory cost per image.
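The relationship between the reported feature vector lengths and memory costs can be checked with a short calculation. The sketch below assumes that each dimension is stored as an 8-byte (64-bit) floating-point value and that 1 KB equals 1024 bytes; the paper does not state the storage format, but this assumption reproduces the 3.48 KB and 1.79 KB figures quoted for 445 and 229 dimensions. The function name is hypothetical.

```python
BYTES_PER_VALUE = 8  # assumption: one 64-bit float per feature dimension


def memory_cost_kb(dimensions: int,
                   bytes_per_value: int = BYTES_PER_VALUE) -> float:
    """Storage needed for one feature vector, in kilobytes (1 KB = 1024 B)."""
    return dimensions * bytes_per_value / 1024


for dims in (445, 229):
    # 445 -> 3.48 KB and 229 -> 1.79 KB after rounding to two decimals
    print(dims, round(memory_cost_kb(dims), 2))
```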

Comparison with CNN-Based Descriptors
Apart from the fusion-based descriptors, HHD is also compared with emerging deep neural network techniques. Following the experimental setting in [48], we first extracted the last fully connected layer from the pre-trained CNN models (e.g., VGGM1024 and VGGM128). Then, the extracted feature vectors were L2 normalized. Finally, the normalized feature vectors were passed to the distance measure. To guarantee a fair comparison, all images were used as query images, and the number of retrieved images was set to 10 on RSSCN7, AID, Outex-00013 and Outex-00014, and to 5 on ETHZ-53. Figure 9 shows the comparisons between the proposed descriptors and the CNN-based schemes. On the RSSCN7, Outex-00013, Outex-00014 and ETHZ-53 datasets, HHD performs better than the VGGM1024 and VGGM128 descriptors and achieves the highest performance. Notably, PUH + MCM also outperforms the VGGM1024 and VGGM128 descriptors on these four datasets. On the AID dataset, HHD is worse than VGGM1024. This makes sense because the pre-trained CNN models, which are trained on a large-scale image set, are well suited to the large-scale AID dataset. In contrast to the CNN-based descriptors, the advantages of HHD can be summarized as follows:
(1) HHD does not require any training process in the feature representation.
(2) The pre-trained CNN-based models have a high memory cost, which limits their application.
(3) HHD performs better than the CNN-based descriptors on four datasets out of five.
Figure 9. Comparison of the proposed descriptors with the CNN-based schemes over Outex-00013, Outex-00014, RSSCN7, AID and ETHZ-53.
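The CNN baseline pipeline described in this section (feature extraction, L2 normalization, distance measure) can be sketched as follows. This is only an illustration: the feature vectors are tiny made-up lists rather than real VGG-M activations, the Euclidean distance is an assumption because the section does not name the distance measure, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors stay unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance, assumed here as the distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_database(query: Sequence[float],
                  database: Sequence[Sequence[float]],
                  top_k: int) -> List[int]:
    """Indices of the top_k database vectors closest to the query."""
    q = l2_normalize(query)
    distances = [(euclidean(q, l2_normalize(feat)), idx)
                 for idx, feat in enumerate(database)]
    distances.sort()  # smallest distance first, ties broken by index
    return [idx for _, idx in distances[:top_k]]


# Made-up 2-D "features"; a real last-layer CNN feature would be much longer.
features = [[3.0, 4.0], [0.0, 2.0], [4.0, 3.0]]
print(rank_database([6.0, 8.0], features, top_k=2))
```

After normalization only the direction of each vector matters, so the query `[6.0, 8.0]` matches `[3.0, 4.0]` exactly even though their magnitudes differ.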

Conclusions
In this paper, we propose a fusion method called the hybrid histogram descriptor (HHD), which integrates the perceptually uniform histogram and the motif co-occurrence histogram as a whole. The proposed descriptor was evaluated under the content-based image retrieval framework on the RSSCN7, AID, Outex-00013, Outex-00014 and ETHZ-53 datasets. From the experimental results, it can be concluded that the appropriate quantization layers of Ya* and Yb* are determined by the retrieval accuracy score. It is also deduced that the motif co-occurrence histogram (MCH) exhibits significantly higher performance than the motif co-occurrence matrix (MCM). The performance of the proposed descriptor is much improved by fusing the perceptually uniform histogram (PUH) and the motif co-occurrence histogram (MCH). The performance of the proposed descriptor is superior to that of ten fusion-based feature descriptors in terms of the average precision rate (APR), the average recall rate (ARR), the average precision rate versus number of top matches (APR vs. NTM), the average recall rate versus number of top matches (ARR vs. NTM), and the top-10 retrieved images. Meanwhile, the feature vector length, the average retrieval time and the memory cost per image were also analyzed to give an in-depth evaluation of the computational complexity. Moreover, compared with the CNN-based descriptors, the proposed descriptor achieves comparable performance without requiring any training process.
The increased dimension of the proposed descriptor lengthens the retrieval time, which will be addressed in future research, in particular by using Locality-Sensitive Hashing [49]. Meanwhile, user relevance feedback, feature re-weighting and weight optimization will be considered to further improve the accuracy of image retrieval. In addition, we will further investigate the generalization of the proposed method, in particular using RawFooT [50], which includes changes in the illumination conditions.
