Automatic Detection of Transformer Components in Inspection Images Based on Improved Faster R-CNN

Abstract: To detect the categories and positions of various transformer components in inspection images automatically, this paper proposes a transformer component detection model with high detection accuracy, based on the structure of Faster R-CNN. In consideration of the significant difference in component sizes, double feature maps are used to adapt to the size change, by adjusting two weights dynamically according to the object size. Moreover, different from the detection of ordinary objects, there is abundant useful information contained in the relative positions between components. Thus, the relative position features are defined and introduced into the refinement of the detection results. Then, the training process and detection process are designed specifically for the improved model. Finally, an experiment is given to compare the accuracy and efficiency of the improved model with the original Faster R-CNN, along with other object detection models. Results show that the improved model has an obvious advantage in accuracy, and its efficiency is significantly higher than that of manual detection, which suggests that the model is suitable for practical engineering applications.


Introduction
With the popularization of inspection robots and the accumulation of image data in smart substations, the automatic recognition of power equipment states based on inspection images is being used increasingly widely, for example in switch state recognition and insulator breakage identification [1][2][3]. Compared with traditional manual inspection, computer vision technology can effectively improve the inspection frequency and efficiency, and avoid the influence of subjectivity on accuracy [4,5]. As some of the most important equipment in substations, the main transformers have a direct bearing on the safety and reliability of the power system. Thus, defect and fault recognition based on transformer images has important practical value, in which the detection of transformer components is an essential step. Only when the categories and positions of transformer components are detected accurately can recognition algorithms be designed for the possible defects and faults of different components, such as oil leakage of conservators, silica gel discoloration of breathers, and so on.
Traditional detection methods for power equipment and components are mainly rule-based algorithms with low-level features (e.g., Scale Invariant Feature Transform (SIFT) [6], Histograms of Oriented Gradient (HOG) [7]). Although these methods are intuitive and interpretable, they are hard to adapt to complex scenes and usually require abundant manual workloads for tuning when applied to a new scene [8]. In the computer vision field, practice in recent years shows that object detection models based on deep learning can abstract and synthesize the low-level features, which obtains substantially higher accuracy on PASCAL VOC and other canonical data sets [9][10][11][12][13]. However, when the deep learning detection models designed for ordinary objects (e.g., cars, plants) are applied to transformer components directly, the significant differences in component sizes and the useful information in the relative positions between components are not fully exploited.

Model Architecture

In consideration of the scale of the data set in this paper, the ResNet-50 is selected as the convolutional neural network in the feature extraction module [34]. As shown in Figure 1, an image is first transformed into a feature map by the 5-stage ResNet-50, and then the feature map is sent to a region proposal network (RPN) to generate n proposals, each of which has 2 probabilities of containing an object or not, and 4 coordinates encoding the proposal position. Finally, the feature map and n1 selected proposals are sent to region of interest (RoI) pooling, fully-connected (FC) layers and a softmax classifier/bounding-box regressor. After that, each of the n1 proposals has k + 1 probabilities of containing k categories of objects plus a "background" category, and 4k coordinates encoding k adjusted proposal positions of the k object categories. Based on Faster R-CNN, the improved model is proposed as shown in Figure 2. Compared with Figure 1, there are two main improvements of the architecture in Figure 2.
The first one is to add the feature map of Stage 2 to the category and position detection module, which cooperates with the feature map of Stage 5 to adapt to the change of component sizes dynamically. The second one is to introduce relative position features between components to both the proposal generation module and the category and position detection module, and adopt random forests (RF) models to refine the probabilities and coordinates.

Double Feature Maps for Different Component Sizes
In the category and position detection module, double feature maps generated by different stages of ResNet-50 are used for different component sizes. The feature level of feature map 1 is higher, which is beneficial to the feature abstraction of large objects, but it will lead to too much information loss for small objects in the process of convolution [22,23]. On the contrary, feature map 2 has a higher resolution and less information loss, but the feature level is relatively low and the features may not be abstract enough for large objects. To adjust to different object sizes dynamically, for each proposal, two groups of probabilities and coordinates are first generated based on the two feature maps separately, and then weighted according to the proposal size. The weights λ1 and λ2 are calculated as:

λ1 = size_pro/(2·size_ave), λ2 = 1 − size_pro/(2·size_ave), if size_pro ≤ size_ave (1)

λ1 = 1 − size_ave/(2·size_pro), λ2 = size_ave/(2·size_pro), if size_pro > size_ave (2)

where size_pro represents the size of the proposal, and size_ave represents the average size of all the ground-truth boxes (a ground-truth box frames an object correctly) in the training set. According to Equations (1) and (2), the weights change linearly when size_pro < size_ave. Additionally, if the value of λ1 (λ2) is λ when size_pro = size_ave/t (t > 1), then the value of λ2 (λ1) is equal to λ when size_pro = t·size_ave. Once the weights λ1 and λ2 are obtained, each of the probabilities (coordinates) is calculated from the two corresponding probabilities (coordinates) as:

p = λ1·p1 + λ2·p2 (3)

where p1 and p2 represent the two probabilities (coordinates) generated based on feature map 1 and feature map 2 respectively, and p represents the weighted probability (coordinate).
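The weighting scheme can be sketched in Python. The piecewise-linear form below is an assumption reconstructed from the two stated properties (linearity for size_pro < size_ave, and the symmetry in t), not necessarily the paper's verbatim formula:

```python
def branch_weights(size_pro, size_ave):
    """Weights for the two feature-map branches, as a function of proposal size.

    Assumed piecewise-linear form: lambda1 grows linearly with size_pro below
    size_ave, reaches 0.5 at size_pro == size_ave, and mirrors lambda2 so that
    lambda1(size_ave / t) == lambda2(t * size_ave) for any t > 1.
    """
    if size_pro <= size_ave:
        lam1 = size_pro / (2.0 * size_ave)      # linear in size_pro
    else:
        lam1 = 1.0 - size_ave / (2.0 * size_pro)
    return lam1, 1.0 - lam1                     # lambda1 + lambda2 == 1


def weighted_output(p1, p2, size_pro, size_ave):
    """Weighted combination p = lambda1*p1 + lambda2*p2 of the two branches."""
    lam1, lam2 = branch_weights(size_pro, size_ave)
    return lam1 * p1 + lam2 * p2
```

A proposal whose size equals size_ave draws equally on both feature maps; much smaller proposals lean on the high-resolution feature map 2.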

Relative Position Features of Components
In the proposal generation module and the category and position detection module, the relative position features between a proposal and the detected components are used to refine the probabilities and coordinates of the proposal. Assume that there are k categories of components in the detection task, and then there are at most k groups of relative position features for each proposal.
Let the position of a proposal be represented by four parameters: x, y, w and h, which denote the proposal's center coordinates and its width and height. Likewise, let the ground-truth box of a detected component be represented by x*, y*, w* and h*. The four position features of the proposal relative to the detected component are defined as:

x_re = (x − x*)/w*, y_re = (y − y*)/h*, w_re = w/w*, h_re = h/h* (4)–(7)

where the denominators w* and h* are used to avoid the influence of the distance between the component and the camera, since the box sizes change dynamically along with the distance. Thus, the relative position feature vector of each proposal has 4k dimensions, representing the position features relative to the k categories of components. If a component category has not been detected or does not exist in an image, the four feature values corresponding to the category are missing.
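The construction of a 4k-dimensional feature vector with missing entries can be sketched as follows; the exact per-category definitions (offsets normalized by w* and h*, plus size ratios) are an assumption consistent with the normalization described above:

```python
import math


def relative_position_features(proposal, detected, k):
    """Build the 4k-dimensional relative position feature vector of a proposal.

    proposal: (x, y, w, h) of the proposal (center coordinates, width, height).
    detected: dict mapping a category index in 0..k-1 to the (x*, y*, w*, h*)
              box of an already-detected component of that category.
    Categories with no detected component yield NaN features, mirroring the
    "missing value" situation described in the text.
    """
    x, y, w, h = proposal
    feats = []
    for c in range(k):
        if c in detected:
            xs, ys, ws, hs = detected[c]
            # Offsets normalized by the detected box size, plus size ratios
            feats += [(x - xs) / ws, (y - ys) / hs, w / ws, h / hs]
        else:
            feats += [math.nan] * 4   # category not detected -> missing values
    return feats
```

Note that all four features of a category are present or missing together, which matters for the multivariate splits discussed later.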

Random Forests for Refining Probabilities and Coordinates
After extracting the relative position feature vectors, we need to select a machine learning algorithm to refine the probabilities and coordinates. Because usually only some of the components have been detected at the time of refinement, the relative position feature vectors always contain missing values. The machine learning algorithms based on "distance measurement" of samples, such as k-nearest neighbors (k-NN), support vector machines (SVM) and logistic regression, are sensitive to missing values, and their accuracy is likely to decrease if the missing values are filled manually. Conversely, the decision tree algorithm and the naive Bayes algorithm are robust to missing values and do not need manual filling. However, there are relatively strong correlations between the relative position features, which will affect the accuracy of the naive Bayes algorithm [35], so the random forests (RF) model based on the decision tree algorithm is selected for the refinement.
Consider the case in Figure 3, in which there are four training samples and two categories of components (i.e., k = 2), and Category 1 is not contained in Sample 1 while Category 2 is not contained in Sample 4. Intuitively speaking, "distance measurement" based algorithms consider one sample's feature values (i.e., a row of feature values) at a time, while a decision tree evaluates one feature's values (i.e., a column of feature values) at a time. If a feature value is missing in a sample's row, the sample's location in the feature space cannot be figured out, so all the feature values in the row will be unavailable; on the contrary, if a feature value is missing in a feature's column, the feature only loses one value for evaluation, while the other values in the column are still available to evaluate the feature effectively [36]. More in-depth analyses are as follows. For algorithms based on "distance measurement", take logistic regression for example, of which the key issue is to minimize the loss function:

J(θ) = −(1/a)·Σ_{i=1..a} [y_i·ln h_θ(x_i) + (1 − y_i)·ln(1 − h_θ(x_i))] (8)

where a is the number of available training samples, y_i and x_i are the label and the feature vector of the ith available training sample respectively, θ is a parameter vector to be fitted, and h_θ(x) is the logistic function. Because Sample 1 and Sample 4 contain missing values in the feature vector x, they cannot participate in the calculation of the loss function, and the available feature values are only the ones of Sample 2 and Sample 3 in the boxes with solid lines in Figure 3.
In contrast, the key issue of decision tree construction is to select one feature at a tree node to divide the samples and split out new nodes. Thus, each feature is examined in turn separately according to a splitting criterion. Take the feature x_re1 for example, whose division points are chosen through bi-partition [37]. Without loss of generality, consider a division point x*_re1 (i.e., if a sample's x_re1 value is less than x*_re1, the sample will be assigned to the left child node, or else it will be assigned to the right child node) and choose the Gini index [36] as the splitting criterion; then the index used in examining the division point x*_re1 is calculated as:

Gini_index(x*_re1) = (a1/a)·Gini(L1) + (a2/a)·Gini(L2) (9)

where a is the total number of available training samples, a1 and a2 are the numbers of available training samples assigned to the left and right child nodes respectively, L1 and L2 are the sets of labels of available training samples assigned to the left and right child nodes respectively, and Gini(L) is the Gini index of the label set L. For the feature x_re1, there are three available training samples (i.e., a = 3): Sample 2, Sample 3 and Sample 4, and thus x_re1^(2), x_re1^(3) and x_re1^(4) are available feature values. In addition, the unavailable sample (i.e., Sample 1) can be assigned to the left or right child node using techniques like surrogate splits [36]. Similarly, the other features can be examined in the same way, so the available feature values are the ones in the boxes with dashed lines in Figure 3, with no feature values wasted.
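The per-feature examination with missing values can be sketched as follows (a minimal sketch; `None` marks a missing feature value, and unavailable samples are simply excluded from the evaluation rather than routed by surrogate splits):

```python
from collections import Counter


def gini(labels):
    """Gini index of a label multiset: 1 - sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_index(values, labels, point):
    """Weighted Gini of a bi-partition of one feature column at `point`.

    Only samples whose value for this single feature is available (not None)
    take part, mirroring the column-wise examination described in the text.
    """
    avail = [(v, l) for v, l in zip(values, labels) if v is not None]
    left = [l for v, l in avail if v < point]     # assigned to left child
    right = [l for v, l in avail if v >= point]   # assigned to right child
    a = len(avail)
    return (len(left) / a) * gini(left) + (len(right) / a) * gini(right)
```

A lower index means a purer split; candidate division points are compared by this value feature by feature.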
Although examining each feature separately is effective in dealing with missing values, it also leads to a problem: the samples need to be divided too many times when the division boundary is complex. To solve this problem, we establish multivariate decision trees inspired by [38]. At each tree node, instead of using one feature to divide the samples, we use four features of the same category (e.g., x_re1, y_re1, w_re1 and h_re1) at the same time and adopt their linear combination (denoted as l(x_re1, y_re1, w_re1, h_re1), which can be calculated with techniques like the least square method [38]) for division. Similar to Equation (9), denote a division point of l(x_re1, y_re1, w_re1, h_re1) as l*(x_re1, y_re1, w_re1, h_re1); then the index used in examining the division point is calculated as:

Gini_index(l*(x_re1, y_re1, w_re1, h_re1)) = (a1/a)·Gini(L1) + (a2/a)·Gini(L2) (10)

where a, a1, a2, L1, L2 and Gini(L) have the same meanings as those in Equation (9), but the training samples are assigned to the left or right child node according to l*(x_re1, y_re1, w_re1, h_re1) instead of x*_re1. For the four features x_re1, y_re1, w_re1 and h_re1, there are still three available training samples (i.e., a = 3): Sample 2, Sample 3 and Sample 4. Similarly, the four features of Category 2 can be examined in the same way, so the available feature values are the ones in the boxes with dotted lines in Figure 3, with no feature values wasted.
Hypothetically, consider the case in Figure 4. For a univariate decision tree, according to the illustration above, the available feature values are the ones in the boxes with dashed lines. For our multivariate decision tree, because Sample 2 contains missing values on features w_re1 and h_re1, it cannot participate in the calculation of l(x_re1, y_re1, w_re1, h_re1). Thus, Sample 2 is an unavailable training sample for the four features in Category 1, and the available feature values are the ones in the boxes with dotted lines, with the feature values x_re1^(2) and y_re1^(2) wasted. However, the case in Figure 4 will virtually never happen, since the four features of the same category are existent or absent simultaneously. Therefore, our multivariate decision trees will not result in wasted feature values, and meanwhile can consider several features at the same time to adapt to complex division boundaries.
The decision tree algorithm used in the RF models is Classification and Regression Trees (CART) [36]. Decision tree classifiers and regressors are used to refine probabilities and coordinates respectively. In the CART algorithm, a test sample's refined coordinate is predicted to be the average of the training samples' true coordinates in the corresponding leaf node.
However, the box coordinates are continuous in images, so simply taking the average will discretize the coordinates and decrease the accuracy. To predict the coordinates more accurately, in each leaf node, we build a linear regression model with the training samples, and use it to predict the refined coordinates of the test samples. Take the refinement of coordinate x for example. Assume that there are r training samples in a leaf node, and the features of each training sample have 4 + 4k dimensions (4 dimensions of coordinates output by the RPN or bounding-box regressor, and 4k dimensions of relative position features, explained in Section 4.1). Because the relative position features may contain missing values, we extract only the 4 dimensions of coordinates (i.e., x, y, w, h) and construct the multivariable linear regression:

x_refine = b0 + b1·x + b2·y + b3·w + b4·h (11)

where x_refine represents a refined coordinate, and the true coordinate x* of a training sample serves as its label (likewise for y_refine, w_refine and h_refine). The parameters b0 to b4 are fitted from the r training samples with the least square method. Based on the linear regression models in a leaf node, when a test sample with (4 + 4k)-dimensional features is assigned to the leaf node, its refined coordinates can be predicted by Equation (11).
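The leaf-node refinement can be sketched with an ordinary least-squares fit (a minimal NumPy sketch; the text only states "least square method", so the exact fitting procedure is an assumption):

```python
import numpy as np


def fit_leaf_regressor(coords, x_true):
    """Fit x_refine = b0 + b1*x + b2*y + b3*w + b4*h by least squares.

    coords: (r, 4) array of the (x, y, w, h) values of the r training samples
            that fell into one leaf node.
    x_true: (r,) array of their ground-truth x* labels.
    Returns the coefficient vector (b0, b1, b2, b3, b4).
    """
    X = np.hstack([np.ones((coords.shape[0], 1)), coords])  # prepend intercept
    b, *_ = np.linalg.lstsq(X, x_true, rcond=None)
    return b


def predict_refined(b, coord):
    """Refined coordinate of a test sample assigned to this leaf."""
    return float(b[0] + np.dot(b[1:], coord))
```

One regressor per coordinate (x, y, w, h) is fitted in each leaf, replacing the plain leaf-average prediction of CART.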

Training Process of Improved Faster R-CNN
The training process includes two parts. The first one is the training of the neural network, which modifies the four-step alternating training in [11] to suit the double feature maps. The second one is the training of the RF models, especially for various situations of feature value absence.
The training steps of the neural network are presented in Table 1 and Figure 5. In Table 1, all the initialization and reinitialization are carried out with an ImageNet-pre-trained model [39]. In Step 1, ResNet-50 is initialized and the network is trained from Stage 1 to the RPN. In Step 5 (the blue arrow in Figure 5), ResNet-50 is fixed, and the layers from RoI pooling to the softmax/bounding-box regressor in the feature map 2 branch are fine-tuned using the proposals selected from the ones generated by the RPN in Step 4. In Step 6 (the purple arrow), ResNet-50 is fixed, and the layers from RoI pooling to the softmax/bounding-box regressor in the feature map 1 branch are fine-tuned using the same proposals as in Step 5.

After training the neural network, the training steps for the RF models are as follows:

Step 1: Generate several combinations based on all the ground-truth boxes of each training image, in order to consider the various situations in which some ground-truth boxes (components) have not been detected. Assume that there are m ground-truth boxes in an image (m = 1, 2, …, k), each of which may have been detected or not during the detection process; then the total number of combinations generated by all the ground-truth boxes in the image is:

2^m (12)

It should be noted that all the combinations of an image are different from each other.
Step 2: Train the RF classifier in the proposal generation module. Through the trained ResNet-50 and RPN, an image will generate n proposals, each with two probabilities of containing an object or not. With the 2^m ground-truth box combinations, each proposal can generate 2^m relative position feature vectors, so the image will produce n × 2^m vectors of 2 + 4k dimensions (2 dimensions of probabilities and 4k dimensions of relative position features) for training. The training labels are the same as the probability labels in the RPN training.
Step 3: Train the RF regressor in the proposal generation module like Step 2, but replace the two probabilities in the training vectors with the four coordinates generated by RPN. The training labels are the same as the coordinate labels in RPN training. According to the loss function of RPN in [11], the proposals with labels of not containing an object are not used in the RF regressor training.
Step 4: Train the RF classifiers in the category and position detection module. Each of the n1 proposals from the proposal generation module will generate k + 1 weighted probabilities of the k component categories and a "background" category, and each component category needs an RF classifier. Meanwhile, with the 2^m ground-truth box combinations, each component category can generate 2^m relative position feature vectors, so the image will produce n1 × 2^m vectors of 1 + 4k dimensions (1 dimension of weighted probability and 4k dimensions of relative position features) for the training of each component category's RF classifier. The training labels are the same as the probability labels in the softmax training. For the "background" category, there are no corresponding weighted coordinates for calculating the relative position features, and thus we take its weighted probability as its refined probability.
Step 5: Train the RF regressors in the category and position detection module like Step 4, but replace the one weighted probability in the training vectors with the four weighted coordinates. The training labels are the same as the coordinate labels in the bounding-box regressor training.
According to the multi-task loss function in [10], for a proposal, if the probability label of a component category is 0, the proposal will not be used to train the RF regressor of this component category. Thus, similar to bounding-box regressors, the RF regressors just fine-tune the coordinates of the proposals.
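The combination generation in Step 1 (all 2^m detected/undetected subsets of an image's ground-truth boxes) can be sketched as follows; the box representation is arbitrary here:

```python
from itertools import combinations


def ground_truth_combinations(boxes):
    """Enumerate all 2**m detected/undetected subsets of the m ground-truth
    boxes in an image, used to simulate partial detection during RF training.
    """
    m = len(boxes)
    subsets = []
    for r in range(m + 1):
        # All subsets of size r; itertools.combinations preserves order,
        # so every subset appears exactly once.
        subsets += [list(c) for c in combinations(boxes, r)]
    return subsets
```

Each subset stands for one possible state "these components have already been detected, the rest have not", from which one relative position feature vector per proposal is built.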

Detection Process of Improved Faster R-CNN
After the training process, the improved Faster R-CNN model is used to detect transformer components, as shown in Algorithm 1. The statements with asterisks in Algorithm 1 will be explained in detail.

Algorithm 1. Algorithm for transformer component detection
Input: an image, n1*, confidence_threshold*
Output: a set final_boxes* containing k1 final box tuples (k1 = 0, 1, …, k), each of which contains a component category with a probability and 4 coordinates

Input an image to ResNet-50 and generate feature map 1 and feature map 2
Initialize final_boxes to an empty set
While the number of elements in final_boxes < k do
    Input feature map 1 to the proposal generation module* and generate n1 proposals
    Input the n1 proposals, feature map 1 and feature map 2 to the category and position detection module* and generate n1 × k box tuples (ignoring the background category), each of which contains a category with a probability and 4 coordinates, and form a set boxes with the n1 × k box tuples
    Delete from boxes all the box tuples whose categories already exist in final_boxes
    Select the box tuple with the highest probability in boxes as final_box
    If the probability of final_box < confidence_threshold then
        Break
    Else
        Append final_box to final_boxes
    End if
End while
The details of Algorithm 1 are as follows:
1. The parameter n1 decides the number of the proposals selected into the category and position detection module; the parameter confidence_threshold decides the probability threshold of the output final box.
2. Each final box in the set final_boxes frames a detected transformer component in the image, and provides its category and probability.
3. The relative position features of a proposal in the proposal generation module or the category and position detection module are generated from the coordinates of the proposal and the final boxes.
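The iterative loop of Algorithm 1 can be sketched as follows; `propose` and `detect_boxes` are hypothetical stand-ins for the trained proposal generation module and the category and position detection module:

```python
def detect(image, k, n1, confidence_threshold, propose, detect_boxes):
    """Sketch of Algorithm 1's iterative detection loop.

    propose(image, final_boxes, n1) -> n1 proposals (refined with the
        relative position features of the final boxes found so far).
    detect_boxes(image, proposals, final_boxes) -> list of
        (category, probability, coords) box tuples, background ignored.
    """
    final_boxes = []
    while len(final_boxes) < k:
        proposals = propose(image, final_boxes, n1)
        boxes = detect_boxes(image, proposals, final_boxes)
        taken = {b[0] for b in final_boxes}
        boxes = [b for b in boxes if b[0] not in taken]  # drop found categories
        if not boxes:
            break
        final_box = max(boxes, key=lambda b: b[1])       # highest probability
        if final_box[1] < confidence_threshold:
            break                                        # nothing confident left
        final_boxes.append(final_box)
    return final_boxes
```

Each iteration commits the single most confident remaining category, so later iterations can exploit the relative positions of the components already found.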

Data Set, Models and Indices
A total of 2000 main transformer inspection images, taken in 2016–2017 from a 220 kV substation in China, are selected as the data set for our experiment. The resolution of each image is 800 (width) × 600 (height). Parts of the images are shown in Figure A1. According to the type of the transformers in the data set, six categories of components are chosen to be detected: the conservator, the #1–#9 radiator group, the #10–#18 radiator group, the gas relay, the breather of the main body, and the breather of the on-load tap-changer (OLTC). Then, the positions of the six categories of components in the images are labeled with ground-truth boxes. In the experiment, we use 10-fold cross-validation (i.e., train on 1800 images and test on 200 images for 10 times in turn), and put the detected results of all the tested images (which are exactly the 2000 images in the data set) together for evaluation; the parameter settings are shown in Table 2.

According to the experiments in [11], for the neural network of the improved Faster R-CNN model (FRCNN-improved, for short), the number of proposals selected into the category and position detection module (i.e., n1) is set to 2000 and 300 in the training and test process respectively, and the anchors in the RPN are of 4 sizes (8², 32², 128², 512²) and 3 aspect ratios (2:1, 1:1, 1:2). We also use a learning rate of 0.003 and a mini-batch size of 8, and set the number of training steps to 50,000 by parameter optimization. For the RF models, the key parameters are shown in Table 3. To examine the effect of the two main improvements in FRCNN-improved, the three models in Table 4 are used as controls, in which FRCNN-original is exactly the model in [11].
Compared with FRCNN-improved, FRCNN-v1 only adopts a single feature map output by Stage 5 of ResNet-50 in the category and position detection module, and the results produced by the softmax and bounding-box regressor are directly refined by the RF models without a weighting process, while FRCNN-v2 omits the refinements with relative position features in both the proposal generation module and the category and position detection module, and takes the weighted probabilities and coordinates in the category and position detection module as the final results without an iterative refinement process. The parameters of the three control models are set in accordance with the corresponding ones in FRCNN-improved. Meanwhile, two recent object detection models, Cascade R-CNN [28] and YOLOv3 [25], as well as two classical models with relatively high accuracy, SSD [12] and R-FCN [13], are chosen for comparison.

In addition, each image in the data set contains only one transformer, that is, each component category appears at most once in an image. Therefore, in all the detection models, for each category in an image, we only select the detected box with the highest probability as the final box.

A common index in object detection tasks, mean average precision (mAP), is evaluated in the experiment. Although mAP can reflect the accuracy of detection models from a relatively comprehensive perspective, it requires complicated calculation by repeatedly changing the parameter confidence_threshold to obtain several groups of precisions and recalls, and thus its engineering significance is not clear enough. To directly reflect the accuracy of the models in engineering application, confidence_threshold is set to 0.6 by parameter optimization, and the precision and recall of each component category as well as the total precision and recall are evaluated.
Additionally, the average detection time per image is calculated, which is measured on a Nvidia GeForce GTX 750 Ti GPU (NVIDIA, Santa Clara, CA, USA).
To determine whether a box frames a component correctly, the intersection over union (IoU) is adopted [11], which means the area ratio of the intersection of the box and the ground-truth box to the union of them. If IoU > 0.5, the box is judged as framing a component correctly.
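The IoU criterion can be sketched as follows; boxes here use corner coordinates (x_min, y_min, x_max, y_max) for simplicity, rather than the center/width/height parameters used earlier:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes
    given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def frames_correctly(box, ground_truth, threshold=0.5):
    """A detected box is judged correct when its IoU with the
    ground-truth box exceeds the 0.5 threshold."""
    return iou(box, ground_truth) > threshold
```

This is the standard PASCAL VOC matching rule, so precisions and recalls computed with it are comparable across the models in the experiment.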

Detection Results and Discussion
After training, the models are employed to detect the transformer components in the test images, and the detection results are shown in Table 5. As Table 5 shows, compared with the other models in Table 4, FRCNN-improved performs significantly better in mAP and in total precision and recall, which are all over 94%. Thus, the two main improvements in the proposed model are both beneficial to raising the accuracy to a satisfactory level. Because the final boxes need to be generated iteratively in the detection process, the detection time of FRCNN-improved is the longest. However, transformer inspection is a regular task without a strict real-time requirement, and a detection speed of 2.1 s per image is still far higher than the manual efficiency, so the proposed model remains significant for practical applications. Moreover, the accuracy of the four models based on Faster R-CNN is generally higher than that of YOLOv3, SSD and R-FCN, which indicates that Faster R-CNN is more suitable for transformer component detection. Similar to FRCNN-improved and FRCNN-v1, Cascade R-CNN adopts an iterative way to refine the predicted results, but its accuracy is lower, mainly because its refinement is based on statistical results from ordinary object data sets, which is less effective and interpretable than the relative position features.
In addition, from the comparison of the precisions and recalls of specific component categories, it can be found that on the three relatively small components in Table 5 (i.e., the gas relay, the breather of the main body and the breather of the OLTC), FRCNN-improved performs obviously better than FRCNN-v1, and similarly FRCNN-v2 performs better than FRCNN-original, which shows that the models with double feature maps are better at detecting small objects than the corresponding single-feature-map models. Besides, on all the categories, the accuracy of the models with relative position features is higher than that of the corresponding models without relative position features, especially on the components with a similar appearance (i.e., the two categories of radiator groups and the two categories of breathers). To give more detailed explanations for these results, the two main improvements in the proposed model are analyzed in depth. (In Table 5, P represents the precision, R represents the recall, and t represents the detection time per image.)

Double Feature Maps
The differences in component sizes and shooting distances lead to a wide range of object sizes in the images: in the data set, the largest ground-truth box covers 227,454 pixels, while the smallest covers only 48 pixels. Moreover, small objects are more difficult to detect [23], so a low-level feature map is necessary. To verify the advantage of double feature maps in adapting to the change of object sizes, the sizes of all the ground-truth boxes in the data set are divided into five intervals according to the order of magnitude. Then, the recalls of the four Faster R-CNN based models in the different size intervals are evaluated; the results are shown in Figure 6. From the comparison of FRCNN-improved and FRCNN-v1 (or of FRCNN-v2 and FRCNN-original), it can be seen that the recalls of the models with double feature maps are higher than those of the corresponding single-feature-map models, and the advantages are more obvious at smaller sizes. This is also why the models with double feature maps perform better on the three relatively small components in Table 5. The detection results of two images are shown in Figure 7 to give an intuitive illustration, in which the red boxes are the ground-truth boxes, the yellow boxes are the detection results of FRCNN-improved, and the blue boxes are the detection results of FRCNN-v1. In Figure 7a, FRCNN-v1 fails to detect the gas relay; in both images, the positions of the breathers and gas relays detected by FRCNN-improved are more accurate.
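The dynamic weighting of the two feature maps by object size can be illustrated with a minimal sketch. The log-interpolation scheme and the function name below are assumptions for illustration, not the paper's exact formula; the pixel bounds 48 and 227,454 come from the data set statistics above:

```python
import math

def feature_weights(box_area, lo=48, hi=227454):
    # Hypothetical dynamic weighting: small boxes lean on the low-level
    # (high-resolution) feature map, large boxes on the high-level
    # (semantically richer) one.  t interpolates the box area on a log
    # scale between the smallest and largest ground-truth boxes.
    t = (math.log(box_area) - math.log(lo)) / (math.log(hi) - math.log(lo))
    t = min(max(t, 0.0), 1.0)
    return 1.0 - t, t  # (weight of low-level map, weight of high-level map)
```

Under this sketch, a 48-pixel box relies entirely on the low-level map and a 227,454-pixel box entirely on the high-level map, with a smooth trade-off in between.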

Relative Position Features
In the proposal generation module, with the relative position features between proposals and final boxes, the RF classifier can refine each proposal's probability of containing an object (e.g., decrease the probability if a proposal is far from the final boxes), while the RF regressor can refine the proposal coordinates by learning the proportional relations of the component sizes and positions. The detection process of FRCNN-improved is used to illustrate the advantage of relative position features. In the detection process of an image, as the number of iterations increases, more final boxes are generated, and thus more relative position features are available for proposal generation. Because a test image contains at most five categories of components, the number of final boxes used in an iteration ranges from 0 to 4. Under each number of final boxes, the recall of the proposals output by the proposal generation module and the average IoU of the "correct" (i.e., IoU > 0.5) proposals are evaluated, as shown in Figure 8. It can be seen from Figure 8 that as the number of final boxes increases (i.e., the number of relative position features increases), the recall of the proposals improves, which indicates that the probabilities of the proposals are more accurate. Meanwhile, the average IoU is higher with more final boxes, which suggests that the accuracy of the proposal coordinates is also improved.
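A minimal sketch of what the relative position features between a proposal and one final box might look like is given below; the exact feature set used by the RF models may differ, so the four features here are illustrative assumptions:

```python
def relative_position_features(proposal, final_box):
    # Both boxes are (x1, y1, x2, y2).  The features encode where the
    # proposal sits relative to an already-detected final box and how
    # their sizes compare, which is what lets an RF model learn the
    # proportional relations between transformer components.
    def center_wh(b):
        x1, y1, x2, y2 = b
        return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1

    px, py, pw, ph = center_wh(proposal)
    fx, fy, fw, fh = center_wh(final_box)
    return [
        (px - fx) / fw,  # horizontal offset, normalized by final-box width
        (py - fy) / fh,  # vertical offset, normalized by final-box height
        pw / fw,         # width ratio
        ph / fh,         # height ratio
    ]
```

With more final boxes available, one such feature group per final box can be concatenated, which matches the observation that recall and IoU improve as the number of final boxes grows from 0 to 4.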
In the category and position detection module, the refinement principles of the RF models are similar to those of the proposal generation module, but the refinement is aimed at specific categories of components. Intuitive explanations are given in the two images in Figure 9. In both images, we assume that the #1-#9 radiator group and the #10-#18 radiator group have been detected, as the red boxes show. Then, in Figure 9a, we evenly select 4 × 7 boxes (the green boxes) of the same size as the breather of OLTC, and fix each box's probability of containing a breather of OLTC, as predicted by the softmax classifier, at 60.0% (the value of confidence_threshold). After that, each green box's probability and its position features relative to the two radiator groups are input to the RF classifier, which then outputs the refined probabilities of containing a breather of OLTC, as the labels over the green boxes show. The refined probabilities indicate that, even if the neural network cannot distinguish the probabilities of different boxes, the RF classifier can still refine the probability of each box effectively through the relative position information and select the correct final box with the highest probability. In Figure 9b, the position of the breather of OLTC predicted by the bounding-box regressor is shown as the green box. Then, the green box's coordinates and its position features relative to the two radiator groups are input to the RF regressor, which outputs the refined coordinates denoting the final box of the breather of OLTC, as the yellow box shows. Compared with the green box, the yellow box refined by the RF regressor is more accurate in position and size, which suggests that the coordinates can be refined with the aid of relative position features.
Therefore, in Table 5, the components with a similar appearance, which are easily confused when detected by the neural network alone, can be effectively distinguished by the RF models according to the relative position features. For the other components, the relative position features also provide useful information about the proportional relations between components for the refinement, which further improves the detection accuracy.
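The selection experiment of Figure 9a can be sketched as follows, where every candidate box starts from the same softmax probability and `refine` stands in for the trained RF classifier (a hypothetical callable, not the paper's implementation):

```python
def pick_final_box(candidates, base_prob, refine, final_boxes):
    # All candidates carry the same softmax probability (60% in Figure 9a),
    # so the neural network alone cannot rank them.  `refine` maps
    # (probability, candidate box, already-detected final boxes) to a
    # refined probability; the highest refined score selects the final box.
    scored = [(refine(base_prob, box, final_boxes), box) for box in candidates]
    return max(scored, key=lambda sb: sb[0])  # (best refined prob, best box)
```

In the real model, `refine` would be an RF classifier consuming the relative position features; here any callable that scores candidates by their position relative to the detected radiator groups exhibits the same selection behavior.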

Conclusions
To realize the automatic detection of transformer components in inspection images, an improved Faster R-CNN model is proposed in this paper, considering the difference in component sizes and the relative position information between components. The case study shows that the model significantly improves the accuracy of transformer component detection and has evident practicality. It should be noted that none of the features used in the model, including those used by the neural network and the RF models, requires manual construction for specific components. Therefore, the model is applicable to the component detection of various types of transformers and even other power equipment.
Our future work is to realize the automatic state recognition of transformers, the foundation of which is the accurate detection of the categories and positions of various transformer components. Based on the detection results, recognition algorithms need to be designed for different types of defects and faults related to transformers, which is also an important direction for our further research.
