Change Detection: The Framework of Visual Inspection System for Railway Plug Defects

Railway plug defects impact the safety of a railway system. To detect railway plug defects, we establish the framework of a visual inspection system (VIS), which is the first system that can perform railway plug inspection automatically and intelligently. Using the idea of change detection, the framework includes three algorithm modules, which are named the object location, image alignment and similarity measurement modules. After the image acquisition system captures a rail image as the input, the three algorithm modules process the image in order. First, in the object location module, a deep convolutional neural network is used to perform plug location. Second, in the image alignment module, a simple and fast method is designed to align key images using histogram of oriented gradients features. Third, in the similarity measurement module, the $\chi^2$ distance is used to compute the similarity between the two plug regions in an inspection image and in an aligned ground-truth image. The results of the similarity measurement are sorted when all inspection images are processed. Therefore, the inspection images with smaller similarity values are ranked higher and the plugs in the images have larger probabilities of defects. The framework has passed the practice tests, and the visual inspection system using this framework has already been authorized by the China Railway Corporation and will be installed on many inspection trains belonging to local railway corporations.


I. INTRODUCTION
In recent years, high speed railway (HSR) transportation has become more important and the length of the HSR in China is increasing greatly. Hence, the demands of HSR maintenance have also rapidly increased and become urgent. For HSR maintenance, inspection is an important and necessary preliminary task to look for and confirm defects in the tracks, catenaries, tunnels, subgrades and various equipment in or by railway lines. However, traditional inspection, depending on contact measurement techniques and human patrolling detection, is slow, subjective, dangerous, and inefficient. Therefore, automatic and noncontact measurement systems are being proposed to overcome the limitations of human inspection.
Since there are many different kinds of inspection targets, many different kinds of methods are used in automatic and noncontact measurement systems. For example, rail internal cracks can be detected by ultrasonic techniques [1]-[3], eddy current testing methods have been proposed to confirm railroad damage [4], [5], and acoustic emissions and signal processing techniques have also been used in rail inspection [6]. In addition to all of those, visual inspection systems (VISs) are used more widely, including to detect track surface defects [7], [8], fasteners [9]-[11], bolts [12], slabs [13], squats [14], and catenary geometry parameters and units [15]-[18].
The associate editor coordinating the review of this manuscript and approving it for publication was Min Xia.
A plug is an important component that is used to transmit control information signals when a train is moving, and its defects may cause very severe incidents, such as rear-end or head-on collisions. Figure 1 shows the appearance of two plugs. Unlike the inspection of the track infrastructure and catenaries mentioned above, the inspection of railway plugs poses some specific challenges. First, plug shapes vary because their tail cables are not rigid; therefore, normal plugs (samples) have various appearances when being assessed using a machine. Second, since the installation sites of plugs are nonuniform and discrete along railways, locating plugs precisely becomes very difficult. According to our rough statistics, approximately 0.5% of inspection images contain plugs. Third, series of images that may contain plugs captured by the same camera look almost the same because the installation sites and angles of those cameras are fixed. Finally, there are fewer plugs than other inspection targets, such as tracks and catenaries; therefore, the number of abnormal samples of plugs is lower than that of others and traditional machine learning methods cannot be used directly and practically in an automatic plug inspection system. Therefore, unlike track and catenary inspections, plug inspections always depended on human patrolling detection before our VIS was used. To solve those problems with plug inspection, we design a VIS using a change detection framework. Although the concept of change detection is already used in the fields of remote sensing [19], [20], video surveillance [21], and medical diagnosis and treatment [22], the change detection framework proposed in this paper is designed especially for railways. Because the detected objects vary based on the application, it is difficult to compare algorithms using the change detection framework [23].
The proposed framework is applied to the plug inspection problem for the first time, and the VIS we designed is the first automatic and intelligent plug inspection system. Among our early works, paper [24] presented our earliest work on an object location module, a small part of the whole VIS, for plug location. The object location methods used in that work are traditional machine learning algorithms. Then, in paper [25], we designed a kind of deep neural network for the plug location module, and it performed better. However, both papers focus only on the object location module. In this paper, we greatly advance the previous work, and detail the whole inspection system for plug defects, including the hardware for image acquisition and the software for the change detection framework that contains three algorithm modules.
The rest of this paper is organized as follows. Section II introduces the overview of the VIS for plug defects. Section III describes the change detection framework. Section IV presents the field experiment results. Finally, section V provides some discussions and conclusions.

II. SYSTEM OVERVIEW
We design a VIS for railway plug defects. The VIS is composed of a hardware and a software part. The hardware part includes two high-speed digital cameras (named the inboard cameras) installed under a test train, as displayed in figure 2. Those cameras are used to capture series of images that may contain plugs when the train runs. This is named the image acquisition subsystem (IAS) in this paper.
Then, the obtained images are analyzed intelligently using a series of image processing and machine learning algorithms in the software part of the system. Those algorithms constitute the change detection framework in our work. Briefly, the core idea of change detection is the following: how to find plugs that contain appearance changes between two inspections. More changes mean greater probabilities of plug defects in one inspection compared to the ground-truth, which consists of nondefective plugs recorded just after the railroad was constructed or that were confirmed via manual inspection. In other words, the ground-truth means that plugs are undamaged in the inspection. In addition, the ground-truth dataset includes serial images captured by the IAS and their locations. The locations of these images are provided by the vehicle positioning system (VPS) that is standard equipment for the test trains used in China and it is not the theme of this paper. It is noted that the image location is presented in the form of kilometers plus meters and all images in this paper are equipped with location information. Using image processing software, the change detection framework is embedded in our VIS for railway plug defects.

A. HARDWARE: IAS
The IAS is composed of commercially available components, such as cameras and light sources. As shown in figure 2, two Dalsa Spyder 2 line-scan cameras installed under the test train are used to capture the plug images of the left and right tracks. The cameras have a maximum line rate of 65 000 lines/s and the images have a resolution of 1024 pixels. The cameras use the Camera Link protocol, and the images are captured by an IPC (Industrial Personal Computer)-CamLink frame grabber. An illumination setup equipped with a line array of LED light sources is installed under the test train near the line-scan cameras to reduce the effect of natural light. In addition, a wheel encoder is used to trigger the inboard, outboard, and central cameras to synchronize their data acquisition every two meters. The images captured by these cameras are taken synchronously at the same location of the rail, and can be stitched into a complete picture of the tracks and accessories. Equipped with those components, the IAS can capture high-resolution images for the VIS.

B. SOFTWARE: CHANGE DETECTION FRAMEWORK
The VIS in this study includes the IAS and change detection modules for plug inspection. The IAS is described in the previous section. In this section, the change detection modules are introduced.
Concretely speaking, the change detection framework is composed of three algorithm modules, which are named the object location, image alignment and similarity measurement modules, respectively. It is noted that the input images are all preprocessed using histogram equalization and median filtering for enhancement and denoising, respectively. Since the two preprocessing algorithms are classical operations and commonly used in image processing systems, we do not describe them below. The diagram of the framework is shown in figure 3.

1) OBJECT LOCATION
After an image is captured by the IAS, it is processed by the object location module, which locates plugs precisely by providing their rectangular coordinates in the image. The algorithm in this module is an ''object detection'' algorithm from the field of computer vision [26]. Here, we use it to meet the second challenge of plug inspection mentioned in section I. As displayed in the first module of figure 3, the located plug (shown in figure 4(a)) is surrounded by a white rectangle denoted with coordinates in the image plane. In the object location module, an object detection algorithm using a Convolutional Neural Network (CNN) is proposed to locate plugs. The algorithm is presented in section III.A.

2) IMAGE ALIGNMENT
Then, the image alignment module is used to align the images containing plugs in one inspection with the images containing the same plugs in the ground-truth dataset. Image alignment is necessary because of the limited precision of the VPS. The images with the same location denoted by the VPS during two inspections may contain different objects. Similarly, images containing the same objects may be denoted with different locations by the VPS during two inspections. As shown in figure 4, the left image (figure 4(a)) was captured during one inspection and was denoted by the VPS as 130 kilometers plus 219 meters on the railway from Jinshan North to Haining West. The middle image (figure 4(b)) with the red border is the image denoted by the VPS as being from the same location in the ground-truth dataset. Obviously, the two images denoted by the VPS as the same location are different images that contain different objects. The right image (figure 4(c)) with the green border is the aligned image that contains the same objects in the ground-truth dataset, but its location is 130 kilometers plus 226 meters. Hence, to ensure the applicability of the next module, an image alignment algorithm must be executed. The module can align an inspection image to an image of the ground-truth dataset with pixel-level precision because the plugs belonging to a given railway must be periodically inspected according to the maintenance plans and the IAS captures images with almost the same shooting angles and distances. Therefore, although the inspections for the plugs of the same railway are scheduled at different times, similar images are obtained. Although there are many image alignment algorithms in the field of machine learning [27], the algorithm used in this module is designed according to the characteristics of plug images. Figure 4 shows that the plug image is precisely matched to the image in the plug ground-truth dataset. The algorithm is presented in section III.B.

3) SIMILARITY MEASUREMENT
The similarity measurement module is used to compute the similarity between the two plug regions in an inspection image and an aligned ground-truth image. The similarity values are sorted after all inspection images have been processed. The inspection image with a smaller similarity value is ranked higher and the plugs in the image have a larger probability of having a defect. The details of the measurement are in section III.C.

4) GROUND-TRUTH DATASET
The plugs in the images contained in the ground-truth dataset are all perfect. In other words, those images are standard images. How can we build the dataset? There are two ways: first, when a new railway is just finished and its suitability for operation is tested, serial images could be taken and stored in the dataset; and second, when all the plugs in a railway are found to be perfect after manual inspection, the serial images containing those plugs in the railway could be stored in the dataset. Obviously, those serial images are indispensable in the image alignment and similarity measurement modules. As displayed in figure 3(c), some serial images in the ground-truth dataset are joined in succession and the aligned segment is denoted with a green rectangle.
FIGURE 4. The necessity and feasibility of image alignment: (a) the image that was captured during one inspection and was denoted by the VPS as 130 kilometers plus 219 meters; (b) the image with the red rectangle was the image denoted by the VPS as the same location in the ground-truth dataset; and (c) the image with the green rectangle was the aligned image that contains the same objects in the ground-truth dataset, but its location was 130 kilometers plus 226 meters.

III. CHANGE DETECTION

A. OBJECT LOCATION
The object detection algorithm utilized in this module to locate plugs has three steps: (1) 'Region Proposal' (RP) [28] provides some possible rectangle regions that may contain a plug, (2) a CNN is used to extract plug region features, and (3) a support vector machine (SVM) is used as a classifier to judge whether the region features represent a plug.

1) REGION PROPOSAL (RP)
The source image captured by the IAS is shown in figure 5(a). The region of the rail waist structures, as the input image for the object detection algorithm, is shown in figure 5(b). Obviously, the plug looks small and salient in figure 5(b), but the rest of the region looks smooth and monotonous. In other words, if an input image contains a plug, the input image should present a different texture. Because different textures can be distinguished in the frequency domain, the magnitude spectrum of the average image is subtracted from the magnitude spectrum of the input image containing a plug (figure 5(b)), and the result is called the spectrum residual. Figure 5(c) is the average image and its pixel values are the mean pixel values of the 384,956 rail waist structure regions that do not contain plugs in this work. Then, the spectrum residual is processed by the phase-holding inverse fast Fourier transform (IFFT) and the resulting image distinctly shows the salient regions. Those salient regions are proposal regions where plugs may be contained, as displayed in figure 5(d). In the figure, the real region is denoted by an arrow. The algorithm proposed to conduct RP is named the spectrum residual region proposal (SRP) algorithm, and figure 6 shows the flowchart. The outputs of the SRP are some proposal regions. Those proposal regions, which are encircled by rectangular boxes, are then used as the input images to the following CNN.
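The SRP steps can be illustrated with a minimal sketch. It assumes that "phase-holding" means the inverse transform reuses the phase spectrum of the input image, and that salient pixels are those exceeding the mean saliency by `k` standard deviations; both choices, the function names, and the single enclosing box (instead of several proposal regions) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def spectrum_residual_saliency(image, average_image):
    """Saliency map built from the residual between two magnitude spectra."""
    spectrum = np.fft.fft2(image)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    avg_magnitude = np.abs(np.fft.fft2(average_image))
    residual = magnitude - avg_magnitude  # the spectrum residual
    # Assumed meaning of "phase-holding IFFT": recombine the residual
    # magnitude with the phase of the input image before inverting.
    return np.abs(np.fft.ifft2(residual * np.exp(1j * phase))) ** 2

def propose_region(saliency, k=3.0):
    """Return (top, left, bottom, right) around salient pixels, or None.

    The threshold mean + k * std is a hypothetical choice for this sketch.
    """
    mask = saliency > saliency.mean() + k * saliency.std()
    if not mask.any():
        return None
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1
```

On a uniform background with one small bright patch, the residual spectrum is exactly the spectrum of the patch, so the proposed box encloses the patch; real rail images would need a tuned threshold and connected-component analysis to separate several proposals.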

2) PLUG CNN
The CNN is designed for plug detection in this paper, and we name it the plug CNN (pCNN). The function of the pCNN is to extract the features of the input images, and its structure is shown in figure 7. Above all, the terminology of the pCNN in this paper is the same as those in the famous classical paper about the CNN [29]. The pCNN includes four convolution modules, and each module consists of a convolution layer, a nonlinear activation layer, a normalization layer and a pooling layer.
In detail, the input image is resized to 32 × 32 pixels, and 2 pixels of padding are added around the resized image, which is convenient for the convolution operation. Next, in the first convolution layer, named 'conv1', the size of the receptive field, defined as the size of the region in the input that produces the feature [26], is 5 × 5, its dimension is 32 and the step size is 1. As the nonlinear activation layer, the 'relu1' function, which takes the maximum of zero and each convolution result, is used to accelerate the convergence of the pCNN. Then, local response normalization is applied to the results of relu1 to gain better generalization; the operation is named 'norm1'. Then, a pooling layer, named 'pool1', takes the maximum values of the previous step's results with a 3 × 3 receptive field and a step size of 2. In short, conv1, relu1, norm1 and pool1 build up the first convolution module of the pCNN. It is noted that the measurement unit is pixels in the description of the pCNN. Similarly, the second convolution module of the pCNN can be described as follows: one convolution layer named 'conv2', in which the size of the receptive field is 5 × 5, the dimension is 64, there are 2 padding pixels and the step size is 1. Then, the 'relu2' function, which is the same as 'relu1', is used; local response normalization is applied to the results of relu2 and the operation is named 'norm2'. Then, the pooling layer is the same as pool1 and is named 'pool2'. Next, the third convolution module of the pCNN is described as follows: one convolution layer named 'conv3', in which the size of the receptive field is 5 × 5, the dimension is 256, 2 padding pixels are used and the step size is 1. Then, the function 'relu3', which is the same as 'relu1' and 'relu2', is used; local response normalization is applied to the results of relu3 and the operation is named 'norm3'. Then, a pooling layer, named 'pool3', takes the maximum values of the previous results with a 1 × 1 receptive field and a step size of 2. Next, the fourth convolution module of the pCNN is described as follows: one convolution layer named 'conv4', in which the size of the receptive field is 1 × 1, the dimension is 1024 and the step size is 1. Then, the 'relu4' function, which is the same as 'relu1', 'relu2' and 'relu3', is used; local response normalization is applied to the results of relu4 and the operation is named 'norm4'. Next, a fully connected layer named 'fc1' is used to realize the inner product of the previous results. Then, a dropout layer named 'drop1' is used to randomly set some of the results of fc1 to 0, which also improves the generalization of the CNN. The output of the pCNN in this work is the feature of the input image, and the feature dimension is 4096. Finally, a fully connected layer named 'fc2' is used to determine whether the input image contains a plug. This layer is only used in the pCNN training.
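To make the layer description easier to follow, the sketch below traces the spatial size of the feature maps through the four convolution modules. It assumes floor rounding for the output size, no padding in the pooling layers and in conv4, and a single-channel input; none of these are stated in the text. The activation and normalization layers are omitted because they do not change the shape.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer (floor rounding assumed)."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad, output channels); pooling keeps the channel count.
PCNN_LAYERS = [
    ("conv1", 5, 1, 2, 32),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 64),
    ("pool2", 3, 2, 0, None),
    ("conv3", 5, 1, 2, 256),
    ("pool3", 1, 2, 0, None),
    ("conv4", 1, 1, 0, 1024),
]

def trace_shapes(input_size=32, input_channels=1):
    """Return [(layer name, height, width, channels), ...] for a square input."""
    size, channels = input_size, input_channels
    shapes = []
    for name, kernel, stride, pad, out_channels in PCNN_LAYERS:
        size = out_size(size, kernel, stride, pad)
        channels = out_channels or channels
        shapes.append((name, size, size, channels))
    return shapes

shapes = trace_shapes()
flattened = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]  # input width of fc1
```

Under these assumptions conv4 produces a 4 × 4 × 1024 map, so fc1 would map 16,384 values to the 4096-dimensional feature.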
In figure 8(a), the input image containing a plug is used to illustrate the features extracted by the pCNN. The intermediate results are shown in figures 8(b)∼(e). Those figures show that the input image, which is a result of the SRP, is processed by the first to the fourth convolution modules in order and that the resulting 4096-dimension features are sparse, as displayed in figure 8(f).
To train the pCNN, 6000 typical plug images are collected and resized to 32 × 32 as positive samples. In the same way, 6000 typical non-plug images are collected and resized to 32 × 32 as negative samples. Those images

3) SVM
The SVM classification follows the pCNN. The input is the features such as figure 8(f) and the result is 0 or 1, which represents whether the input image contains a plug. In addition, the SVM is a linear classifier [30] realized using the libsvm toolbox [31]. To train the linear SVM, the features of fc1 are input after the pCNN is trained. In SVM training, 10-fold cross-validation is adopted and the regularization factor C is set as 2.
The experiments about the object location module are shown in section IV.A.

B. IMAGE ALIGNMENT
Figure 10 shows the flow of the image alignment process. First, the location (denoted as kilometers plus meters) of an image is provided by the VPS. The image is processed by the previous module (the object location module) to ascertain whether it contains a plug. The image is named $I_s$ (shown in figure 3(a)). Second, a series of images in the ground-truth dataset is retrieved and joined into an image named $I_j$ (shown in figure 3(c)). The locations of those images (also denoted as kilometers plus meters) are all within a 20-meter range of the location of $I_s$. The range of 20 meters can be deemed to be the error of the VPS. In our IAS, ten images cover twenty meters of continuous track without overlap. Therefore, $I_j$, the joined image, is 21 times the height and the same width as $I_s$, that is, ten images on either side of the nominal location plus the image at that location. Third, a rectangular window of the same size as $I_s$ moves pixel by pixel in the joined image $I_j$ from head to end (shown as the red rectangle in figure 3(c)). Each time the window moves by one pixel, a new image named $I_n$ is created, which is the region of $I_j$ under the window. We thus obtain a series of $I_n$ (shown in figure 3(d)) as the window moves. Fourth, the histogram of oriented gradients (HOG) features are extracted from each $I_n$ since $I_n$ has more lines and right angles [32]. In addition, the HOG features of $I_s$ are also computed. HOG is a classical and stable feature descriptor that is used for object detection in the field of computer vision. It presents a normalized histogram that is obtained by computing the histogram of oriented gradients in a local image region. Fifth, the $\chi^2$ distances between $I_s$ and the created images (the series of $I_n$) are computed [33] in the space of the HOG histograms. The image in the series of $I_n$ with the minimum distance to $I_s$, named $I_a$, is the aligned image in the ground-truth dataset.
The $\chi^2$ distance $d(x, y)$ is the distance between two histograms $x = [x_1, \cdots, x_n]$ and $y = [y_1, \cdots, y_n]$, both having $n$ bins. Moreover, both histograms are normalized, i.e., their entries sum up to one. The HOG features used in this work are just normalized histograms, and so we can denote the features of the two images as $x$ and $y$. The distance $d$ is usually defined as
$$d(x, y) = \frac{1}{2} \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}.$$
It is often used in computer vision to compute the distance between two images.
Figure 3 illustrates the plug image alignment algorithm. The input image $I_s$ containing a located plug (denoted as 240 kilometers plus 610 meters) is shown in figure 3(a) and the image with the same location in the ground-truth dataset is shown in figure 3(b). Obviously, those images do not present the same scene. A fragment of the joined image $I_j$ is shown in figure 3(c). Some images in the series of $I_n$ are shown in figure 3(d). The resulting image $I_a$ aligned to the input image $I_s$ is shown in figure 3(e). The location of $I_a$ can be considered to be 240 kilometers plus 614 meters. More experiments about the image alignment module are shown in section IV.B.
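The sliding-window search can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: a single magnitude-weighted orientation histogram over the whole window stands in for a full cell-based HOG descriptor, the χ² distance is taken in the common form with a factor of one half, images are plain lists of pixel rows, and all function names are hypothetical.

```python
import math
import random

def chi2_distance(x, y):
    """Chi-square distance between two normalized histograms (1/2 factor assumed)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def orientation_histogram(rows, bins=9):
    """Normalized histogram of unsigned gradient orientations (simplified HOG stand-in)."""
    hist = [0.0] * bins
    for r in range(1, len(rows) - 1):
        for c in range(1, len(rows[0]) - 1):
            gx = rows[r][c + 1] - rows[r][c - 1]
            gy = rows[r + 1][c] - rows[r - 1][c]
            angle = math.atan2(gy, gx) % math.pi  # fold into [0, pi)
            index = min(int(angle / math.pi * bins), bins - 1)
            hist[index] += math.hypot(gx, gy)
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist

def align(inspection, joined):
    """Slide a window row by row over the joined image; return (best offset, distance)."""
    height = len(inspection)
    target = orientation_histogram(inspection)
    best = None
    for offset in range(len(joined) - height + 1):
        window = joined[offset:offset + height]
        d = chi2_distance(target, orientation_histogram(window))
        if best is None or d < best[1]:
            best = (offset, d)
    return best

# Synthetic check: cut a window out of a random joined image and recover its offset.
rng = random.Random(0)
joined = [[rng.random() for _ in range(16)] for _ in range(60)]
inspection = joined[23:35]
offset, distance = align(inspection, joined)
```

In the synthetic check the window is an exact copy of part of the joined image, so the minimum distance is zero at the true offset; with real inspection images the minimum is small but nonzero.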

C. SIMILARITY MEASUREMENT
From the previous steps, the aligned images in the ground-truth dataset are created and we get pairs of images, namely inspection images containing plugs and their aligned images in the ground-truth dataset. In this module, the similarity measurements for the same plug regions in those pairs of images, as displayed in figure 3, are computed with their local binary pattern (LBP) features [34]. The LBP is a classical and stable operator to extract an image's statistical and structural features. It describes the statistical properties and texture structure in the form of a normalized histogram. That is, in figure 3, the white solid rectangular region in the inspection image (shown in figure 3(a)) and the white dashed rectangular region in the aligned image (shown in figure 3(e)) are the corresponding areas to be measured. In addition, the similarity measurement is also the $\chi^2$ distance in the space of the LBP histograms. An example is shown in figure 11, where the LBP histograms of the images in the ground-truth dataset and those captured during inspection were calculated. The $\chi^2$ distance between (a) and (b) was 0.0248 and that between (c) and (d) was 0.0349. The image with a defective plug has a larger distance value. Obviously, the similarity decreases as the distance becomes larger, and we use the distance in place of the similarity in what follows. Then, the values of the measurement metric, the $\chi^2$ distances, are sorted from the largest to the smallest. A larger value means that it is more likely that the inspection image contains defective plugs. Finally, the inspection images are sorted according to the possibility that they contain a defect. Those images are the results we wanted. The experiments of this module are shown in section IV.C.
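The measurement and ranking step can be sketched as follows. The sketch assumes the basic 8-neighbour LBP with a 256-bin histogram and the χ² distance with a factor of one half; the LBP variant used in the paper is not specified here, and the function names are hypothetical.

```python
def chi2_distance(x, y):
    """Chi-square distance between two normalized histograms (1/2 factor assumed)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def lbp_histogram(rows):
    """Normalized 256-bin histogram of basic 8-neighbour LBP codes."""
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0.0] * 256
    for r in range(1, len(rows) - 1):
        for c in range(1, len(rows[0]) - 1):
            code = 0
            for bit, (dr, dc) in enumerate(neighbours):
                if rows[r + dr][c + dc] >= rows[r][c]:
                    code |= 1 << bit
            hist[code] += 1.0
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist

def rank_by_change(pairs):
    """pairs maps an image id to (inspection region, ground-truth region).

    Returns [(image id, distance), ...] sorted from the largest distance
    (most likely defective) to the smallest.
    """
    scores = {
        image_id: chi2_distance(lbp_histogram(a), lbp_histogram(b))
        for image_id, (a, b) in pairs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A plug region that is unchanged between the two inspections gets a distance of zero and falls to the bottom of the list, while a region whose texture has changed moves toward the top.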

IV. EXPERIMENTS
In this section, we present some experimental results for the change detection framework. The operating environment is a computing workstation that is equipped with dual Intel Xeon E5-2680V4 (14 cores, 2.4 GHz) CPUs and 256 GB of DDR4 RAM.

A. OBJECT LOCATION
We illustrate some typical experimental results for the object location module in figures 12∼13. The parameter settings are the same as those described in section III.A. Figure 12 shows some typical results. In figure 12(a), the image containing no plug was captured in a normal environment and the location result shows that no object was located. In figure 12(b), the image containing no plug looks smudged due to corrosion and scratches and the location result shows that no object was located. In figure 12(c), the image containing a plug was captured in a normal environment and the location result shows that the object enclosed by a rectangle was correctly located. In figure 12(d), the image containing a plug was captured in a complex environment, such as a rail switch, and the location result shows that the object enclosed by a rectangle was correctly located. In figure 12(e), the image contains a half plug and the location result shows that the half plug enclosed by a rectangle was correctly located. In figure 12(f), the image containing a plug was captured in insufficient illumination and it looks smudged due to corrosion and scratches. The location result shows that the object enclosed by a rectangle was correctly located. In figure 12(g), the image contains a plug that was also manually marked by trackwalkers and the location result shows that the object enclosed by a rectangle was correctly located. In figure 12(h), the image containing a plug looks a little motion blurred and the location result shows that the object enclosed by a rectangle was correctly located. Moreover, the precision-recall curve also shows that the object location module in this paper (SRP+pCNN+SVM) gets better results than those in our previous work [24], as represented by the curves denoted as LBP+SVM and Haar+Adaboost, as displayed in figure 13. Our previous article [25] presents more details about the object location module in this paper (SRP+pCNN+SVM). Obviously, deep learning results perform better than traditional machine learning results.
FIGURE 12. (a) the image containing no plug was captured in a normal environment; (b) the image containing no plug looks smudged due to corrosion and scratches; (c) the image containing a plug was captured in a normal environment; (d) the image containing a plug was captured in a complex environment, such as a rail switch; (e) the image contains a half plug; (f) the image containing a plug was captured in insufficient illumination and it looks smudged due to corrosion and scratches; (g) the image contains a plug that was also manually marked by trackwalkers; and (h) the image containing a plug that looks a little motion blurred.

B. IMAGE ALIGNMENT
In this section, we illustrate the experimental results for the image alignment module. As shown in figure 14, the upper-left image (figure 14(a)) was captured during one inspection and it was denoted by the VPS as 239 kilometers plus 545 meters on the railway from Changzhou to Zhenjiang. The upper-middle image (figure 14(b)) was the image denoted by the VPS as the same location in the ground-truth dataset. Obviously, the two images denoted by the VPS as the same location are different images that contain different objects. The upper-right image (figure 14(c)) was the aligned image that contains the same objects in the ground-truth dataset, but its location was 239 kilometers plus 541 meters. Similarly, the bottom-left image (figure 14(d)) was captured during one inspection and it was denoted by the VPS as 127 kilometers plus 710 meters on the railway from Yuanping to Xinzhou. The bottom-middle image (figure 14(e)) was the image denoted by the VPS as the same location in the ground-truth dataset. Obviously, the two images denoted by the VPS as the same location are different images that contain different objects. The bottom-right image (figure 14(f)) was the aligned image that contains the same objects in the ground-truth dataset, but its location was 127 kilometers plus 712 meters. Using this method, we can get 100% alignment accuracy.
FIGURE 14. Two experiments on the image alignment module: (a) the image was captured during one inspection and it was denoted by the VPS as 239 kilometers plus 545 meters; (b) the image denoted by the VPS as the same location in the ground-truth dataset; (c) the image was the aligned image that contains the same objects in the ground-truth dataset, but its location was 239 kilometers plus 541 meters; (d) the image was captured during one inspection and it was denoted by the VPS as 127 kilometers plus 710 meters; (e) the image denoted by the VPS as the same location in the ground-truth dataset; and (f) the image was the aligned image that contains the same objects in the ground-truth dataset, but its location was 127 kilometers plus 712 meters.

C. SIMILARITY MEASUREMENT
In this section, we illustrate some experimental results for the similarity measurement module. Those results are also the final results for the plug inspections and the images were captured along the Shanghai-Hangzhou high-speed railway line. It is noteworthy that those experiments were processed by the VIS installed on the inspection train owned by the China Academy of Railway Sciences. We present the first four results measured by the $\chi^2$ distances, with the distances sorted from the largest to the smallest. Figure 15 shows the plug inspection results. For example, figure 15(a) shows that the plug cable was moved compared with the aligned image in the ground-truth dataset, and this means that the plug may be defective with the maximum possibility. It is also noted that the aligned image in the ground-truth dataset is the left image and the inspection image is the right one in figure 15(a). Compared with the rectangles in the figures above, the rectangles surrounding the plugs in this figure are enlarged to avoid missing some details of those plugs. The same applies to figures 15(b)∼(d). Obviously, those plugs were all touched by something or somebody, and this would very likely cause some defects since plug tail cables could be broken off easily. With those results, professional maintenance engineers can judge whether the plugs need to be maintained. In this way, we also provide some possible defective plug positions when railway maintenance managers make a predictive maintenance plan for the next maintenance period.
To evaluate the performance of the whole system, we chose a ground-truth dataset with 191,530 images, of which 9,276 included a plug. Since it is impossible to obtain defective plugs via destructive tests on real railways, in each session five images were randomly picked from the 9,276 plug images and defective plugs were imitated in them. The precision-recall curve and the Top-1 and Top-20 accuracies of the 10 training and calculation sessions are shown in Figure 16. In the field of machine learning, Top-N accuracy means that a result counts as "correct" if the correct answer is among the N highest-ranked probabilities. The processing time for one image using these algorithm modules was approximately 0.1 second. In most cases, a defective plug must be replaced by a new one as soon as it is found, so there will be no defective plug in that railway section for a very long time afterward. Therefore, the experiment (Fig. 16) with imitated and augmented defective plugs was designed to provide statistical results in line with standard academic practice. As an industry application in practical use, the top-20 results (the first twenty plug images presented by our system), sorted by defect probability from large to small, must include the defective ones if defective plugs exist in the railway section. The top-1 result is not used in practice. In fact, the framework has passed the practice tests of 1,1905.5 km and 150 hours.
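The Top-N criterion can be made concrete with a short sketch. The exact scoring rule behind Figure 16 is not spelled out in the text, so the code below rests on a stated assumption: a session counts as correct when at least one imitated defective image appears among the first N ranked images. The function names are hypothetical.

```python
def top_n_hit(ranked_ids, defective_ids, n):
    """True if any truly defective image is among the first n ranked ids.

    Assumption: one hit is enough for the session to count as correct.
    """
    return any(image_id in defective_ids for image_id in ranked_ids[:n])


def top_n_accuracy(sessions, n):
    """Fraction of sessions with a Top-N hit.

    sessions: list of (ranked_ids, defective_ids) pairs, one per session.
    """
    if not sessions:
        raise ValueError("at least one session is required")
    hits = sum(1 for ranked, defective in sessions if top_n_hit(ranked, defective, n))
    return hits / len(sessions)
```

Under a stricter rule, for example requiring every imitated defect to appear in the first N images, `any` would be replaced by a check over all defective ids.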

V. DISCUSSION AND CONCLUSION
The VIS with the change detection framework is the first visual system that can perform railway plug inspection automatically and intelligently. Regarding the algorithm modules, the precision-recall curve of the object location module is better than those of the other methods that we used before [24], and the image alignment module achieves 100% alignment accuracy in our experiments, even though the inspection images in a series look almost the same except for those containing plugs. In the experiments on the last module, we find that plugs that are very likely to be defective can be selected from a large number of plug images automatically and intelligently. That small number of selected plug images is then reviewed manually, which is easy work in practice. This is also why we do not report a final system accuracy: the small number of recommended images must be reviewed by professional maintenance engineers to confirm whether the plugs are defective. In fact, during practical inspection work, the speed of a test train may be 120 km/h, and the run time is about three hours for every inspection. Therefore, the total number of images is approximately 180,000, since one image covers about two meters of track. If only the object location module were adopted to select plug images, and the states of those plugs were then judged manually by maintenance engineers, approximately 9,000 plug images would have to be reviewed after every inspection. Using our system with the framework, after a large number of experiments, we have determined that reviewing the first twenty plug images, sorted by defect probability from large to small, is enough. Thus, the VIS with the change detection framework can tremendously reduce maintenance engineers' workload compared with traditional manual detection.
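The workload figures in this paragraph follow from simple arithmetic, worked out below. The third line is an assumption added for illustration: it supposes that the plug fraction of the evaluation dataset in Section IV-C (9,276 of 191,530 images) carries over to a typical inspection run.

$$
\begin{aligned}
\text{distance covered} &= 120\ \text{km/h} \times 3\ \text{h} = 360\ \text{km} = 360{,}000\ \text{m},\\
\text{images per inspection} &= \frac{360{,}000\ \text{m}}{2\ \text{m/image}} = 180{,}000,\\
\text{plug images} &\approx 180{,}000 \times \frac{9{,}276}{191{,}530} \approx 8{,}700,\\
\text{review reduction} &\approx \frac{9{,}000}{20} = 450.
\end{aligned}
$$

The estimate of about 8,700 plug images is consistent with the figure of approximately 9,000, and reviewing only the first twenty ranked images cuts the manual workload by a factor of roughly 450.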
The change detection framework can be implemented in various ways according to significantly different applications, which sometimes makes it difficult to compare algorithms directly [23]. Compared to the change detection in other fields, our framework has the following characteristics.
(1) The specific image capture approach. In our system, the image sensors were fixed on a moving train, and so the captured images were spatially continuous and temporally discrete. In addition, there was no deformation of these images because of the fixed relative position of the camera and the tracks. For video surveillance, the image sensors were always fixed, and the acquired images were temporally continuous for a static scene. In the remote sensing field, the images are large scale and have greatly varying resolutions compared to our system. Furthermore, medical images, such as MRI images, show some image deformation due to some individual differences and body movements.
(2) The specific alignment algorithm. In our framework, the alignment is a kind of image location method within an image series. However, for video surveillance, the Gaussian mixture model was used for static scene recovery [21]. Meanwhile, for remote sensing and medical images, some elastic registration algorithms should be used to deal with the deformed images when performing image alignment [20], [22].
Our study is the first practical system for plug inspection, and so we have collected hardly any images containing defective plugs. Therefore, at the present stage, defective plugs are too scarce to train a classifier that can identify defective plugs directly from large numbers of inspection images. However, since the massive construction of HSR urgently needs more efficient maintenance work, the VIS with the change detection framework can be further developed to satisfy practical needs. Considering the safety of the system, we prefer reliable and stable algorithms, such as the HOG and LBP, which have been widely used. Similarly, the deep neural network we used is concise because computation speed and stability must be considered in industrial applications. Although some of the latest networks were tested, such as SENet [35], the concise network, the pCNN, was found to be the optimal structure with respect to system efficiency, maintainability and stability.
In addition, compared with other railway infrastructures, such as fasteners [9], [11], the number of plugs is much smaller, and thus the number of defective plugs is smaller as well. In practice, there are often at most one or two defective plugs in one inspection. This means that the ratio of abnormal plugs to total plugs is very imbalanced, at over 1:4500, and the data are extremely skewed. Therefore, even if defective plugs were collected for many years, the number of defective plugs might not be enough to train the classifier mentioned above. This is the open challenge named "Learning from Imbalanced Data" in the machine learning field [36]. Moreover, because even a few defective plugs may cause a potential safety hazard, a stable and high-performance system is required. Thus, we propose the change detection framework to bypass the problem, and it does work well in practice. In summary, to solve the problem of imbalanced samples in practice, we establish the framework of a visual inspection system for railway plug defects. Using the idea of change detection, the framework includes three algorithm modules, which are named the object location, image alignment and similarity measurement modules. From the viewpoint of application studies, this could be considered the most important "novelty" of the paper, and it has been used to solve actual problems in railway inspections reliably. The framework has passed field tests, and it fits the subject of this journal's special section. In addition, the change detection framework also presents another approach for the inspection of many railway infrastructures, such as balises, rail surface defects, and catenary support devices.
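The imbalance can be quantified from the figures already given, taking approximately 9,000 plug images per inspection and one or two defective plugs among them:

$$
\frac{2}{9{,}000} = \frac{1}{4{,}500} \approx 0.022\%, \qquad \frac{1}{9{,}000} \approx 0.011\%, \qquad 1 - \frac{2}{9{,}000} \approx 99.98\%.
$$

The last value illustrates why plain classification accuracy is uninformative here: a classifier that labels every plug as normal would be about 99.98% accurate while finding no defective plug at all.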
The limitations of this technique are as follows: 1) building a ground-truth database consumes substantial resources, including time, labor costs and memory storage; and 2) if the IAS camera shooting angle changes during an inspection, although this rarely happens, the final results could be worse because the image alignment module cannot work well in this case.
As far as we know, no other research team has yet published a paper on automatic inspection systems for railway plug defects in influential journals or conference proceedings. In this paper, we propose, for the first time, the change detection framework used in our VIS for railway plug inspection. The framework includes the ground-truth dataset and three algorithm modules, which are the object location, image alignment and similarity measurement modules. The experimental results showed that the system can detect defective plugs with high accuracy and can improve the efficiency of railway inspection. The VIS embedded with the framework has already been authorized by the China Railway Corporation, and it will be equipped in many inspection trains belonging to local railway corporations in China.