A semi-automatic motion-constrained Graph Cut algorithm for Pedestrian Detection in thermal surveillance videos

This article presents a semi-automatic algorithm that can detect pedestrians from the background in thermal infrared images. The proposed method is based on the powerful Graph Cut optimisation algorithm which produces exact solutions for binary labelling problems. An additional term is incorporated into the energy formulation to bias the detection framework towards pedestrians. Therefore, the proposed method obtains reliable and robust results through user-selected seeds and the inclusion of motion constraints. An additional advantage is that it enables the algorithm to generalise well across different databases. The effectiveness of our method is demonstrated on four public databases and compared with several methods proposed in the literature and the state-of-the-art. The method obtained an average precision of 98.92% and an average recall of 99.25% across the four databases considered and outperformed methods which made use of the same databases.


INTRODUCTION
Video surveillance technology is rapidly proliferating across public and private spaces. Traditionally, surveillance systems could only be found on buildings owned by the Government and large organisations. Currently, they can be found in a variety of settings such as shops, stadia, airports, schools and private residences. Two main factors are responsible for the ubiquity of video surveillance systems (VSS). The first is increased ease of acquisition and installation of VSS. This is due to the advancements in technology from analogue to digital systems and the significant drop in the cost of acquisition. The second factor is the increasing need for security globally. There is a high demand for persistent surveillance systems which can monitor round the clock all year round. As most VSS use visible-light cameras, the presence or absence of light hinders their ability to monitor persistently. Thermal cameras are viable substitutes because they function in poor lighting and at night. These cameras contain sensors which measure and create images from the thermal infrared energy emitted from objects in the scene (Negied, Hemayed & Fayek, 2015).
The amount of infrared detected determines how bright or how dark an object will appear in the final image. Emissivity is the ratio of infrared energy radiated from an object to that radiated from a perfect emitter under the same conditions. Given that 1 is the emissivity of a perfect emitter, also called a blackbody, pedestrians have a value of 0.98 (Fluke, 2020). Thermal imaging finds extensive application in pedestrian detection and tracking because pedestrians have high emissivity which creates a good enough contrast between them and the background. The challenge to detecting pedestrians in thermal images arises from the fact that, while pedestrians can emit infrared energy almost perfectly, only a fraction of the emissions are detected by the thermal camera. The amount of infrared energy reaching the thermal camera sensors depends on the prevailing weather conditions, the reflectivity of other objects in the scene and even the thermal camera itself. Thus, thermal images have lower resolution and lack the number of details present in visible-light images and the applications of thermal imaging are not as varied as those of visible imaging.
The motivation of this article is to propose a new method to detect pedestrians in thermal images acquired under different conditions. State-of-the-art algorithms for visible images usually do not perform with similar accuracy on thermal images and generally do not perform well across different datasets. This is because image analysis is slightly different when performed on visible and thermal images. Some of the characteristics of thermal images introduce additional challenges and/or nullify some steps in algorithms used for visible-light images. For instance, appearance changes immediately as illumination changes in visible images, whereas in thermal images appearance changes much more slowly because detected radiation increases or decreases gradually. Also, objects in thermal images do not cast shadows. Therefore, applying algorithms such as background subtraction to thermal images does not urgently require the scene-update and shadow-removal steps that visible images need. Furthermore, objects in visible images are commonly differentiated by their colour and displayed in the RGB (Red-Green-Blue) colour space while thermal information is commonly mapped to grayscale. It is important to remember that although RGB images can be converted to grayscale, they still do not present the same information as thermal infrared images even if both images capture the same scene.
Furthermore, many of the methods put forward for pedestrian detection in thermal images require several steps grouped broadly into two: candidate generation and validation. Candidate generation involves extracting likely regions containing pedestrians. Candidate validation involves examining the extracted regions and discriminating between pedestrian and non-pedestrian. Errors tend to accumulate from each of these steps. Thus, different from other methods put forward, the proposed method is a single-model algorithm for pedestrian detection that eliminates the need for separate modules of candidate generation and validation. It integrates the appearance properties of the image with motion patterns such that all the fine-tuning and adjustment happens during energy formulation.
The contribution of this article is a novel Graph Cut energy function, referred to as motion-constrained energy (MCE), which repurposes binary segmentation for pedestrian detection in infrared images. Inspired by the semi-automatic framework of Boykov & Jolly (2001) that integrates the image region and boundary information into a single energy function, the proposed energy function incorporates an additional term to penalise pixels based on motion characteristics to accurately detect pedestrians in thermal images. The formulation in Boykov & Jolly (2001), Eq. (1), presents an energy function E incorporating a region term D(h) and a boundary term S(h), where N is the set of unordered pairs of neighbouring pixels from a standard neighbourhood system (e.g., a 4-, 8- or 26-neighbourhood system) and k balances the contribution of the region and boundary terms to the final segmentation result. D(h) measures how well pixels fit into the object or background models. S(h), also written J(h), is also called the smoothness term and measures the similarity of intensity values between neighbouring pixels. There are two areas where this formulation falls short in thermal images. Firstly, the low resolution and noisy nature of IR images mean that more importance will be given to the region term in many instances using this formulation. This means that a robust model for each class will have to be determined. As mentioned earlier, most models and approximated distributions in the literature do not generalise well across datasets; therefore, it is important to add another element to reduce over-dependence on the region term. Secondly, this formulation produces solutions where regions with intensity values similar to those of the pedestrians are included in the solution irrespective of their location.
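For reference, the general shape of the Boykov & Jolly (2001) energy can be written in this article's symbols as below. This is a sketch under stated assumptions: the per-pixel and per-pair subscripts, the indicator δ and the placement of k on the region term follow the original Boykov & Jolly (2001) formulation, and the exact typesetting of Eq. (1) in this article may differ.

```latex
E(h) = k \, D(h) + S(h), \qquad
D(h) = \sum_{y} D_y(h_y), \qquad
S(h) = \sum_{\{y,z\} \in N} S_{y,z} \, \delta(h_y, h_z),
\qquad
\delta(h_y, h_z) =
\begin{cases}
1 & \text{if } h_y \neq h_z \\
0 & \text{otherwise}
\end{cases}
```

Only neighbouring pairs that receive different labels contribute to the boundary sum, which is why the boundary term acts as a smoothness prior.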
The proposed energy function (MCE) incorporates motion constraints and is defined in Eq. (2), where M(h) is the motion term, T is the set of pixels containing one or more motion pixels and Dcomb is the set of pixels with the highest energies from four directional difference images. The impact of the proposed energy is illustrated in Fig. 1. The energy of Boykov & Jolly (2001) produces topologically unconstrained solutions, as shown in Fig. 1B: all pixels with the same properties as the object of interest are included in the final result. MCE, however, constrains the solution to only the object of interest, as shown in Fig. 1C.
The rest of the article is organised as follows. Section 2 presents the related works. Section 3 presents the proposed framework. Section 4 provides the experimental results. Section 5 presents the conclusion and future work.

RELATED WORKS
The task of detecting pedestrians is necessary for understanding and recognising human activity and behaviour in video surveillance footage. In thermal infrared images, this task is carried out in two major steps. The first step is to detect all regions likely to contain pedestrians. This is called Candidate Generation. The second step is to discriminate from among the extracted regions those belonging to the pedestrians. This is called Candidate Validation.
Many methods put forward for candidate generation in thermal infrared images depend on the contrast between the pedestrian and background. Thresholding methods have, therefore, found extensive use in this domain and fall into two categories: parametric and non-parametric. Thresholding methods produce excellent results when the approximated distribution of the image fits the dataset under consideration. However, this means that they can easily become too dataset-dependent. Also, detection based on appearance alone suffers setbacks when the contrast is not pronounced, when the pedestrians are not of uniform appearance, or when polarity reversal occurs due to a change of weather or the presence of artefacts such as halos.
To reduce dependence on the contrast for pedestrian detection, candidate generation has been carried out by detecting moving regions. Background Subtraction and Optical flow-based methods are commonly used for detecting moving regions, but Background Subtraction is less computationally expensive (Choudhury et al., 2018). Generally, Background Subtraction is carried out by creating a model of the image background and comparing that model with each video frame. A similarity function is employed to determine which pixels are likely to belong to the object of interest. Background Subtraction by Frame differencing detects moving regions and is commonly used in tracking algorithms (Gawande, Hajari & Golhar, 2020). The presence of motion can be obtained from the absolute difference between consecutive image pairs. Jeon et al. (2015) created a background model using pixel difference image and combined edge information with the result of background subtraction to detect the pedestrians. Jeyabharathi & Dejey (2018) made use of frame differencing to extract likely pedestrian regions and reflectional symmetrical patterns to provide geometrical information for accurate background modelling. Motion is one feature that can cut across a wide range of infrared images.
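The frame-differencing idea described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any of the cited methods: the function name `frame_difference_mask`, the threshold value and the toy frames are all assumptions made for the example.

```python
import numpy as np

def frame_difference_mask(prev_frame, next_frame, threshold):
    """Return a boolean mask of pixels whose absolute change exceeds threshold."""
    # Cast to a signed type so that subtracting uint8 frames cannot wrap around.
    diff = np.abs(next_frame.astype(np.int32) - prev_frame.astype(np.int32))
    return diff > threshold

# Toy 4x4 grayscale frames: one bright pixel moves one step to the right.
frame_a = np.zeros((4, 4), dtype=np.uint8)
frame_b = np.zeros((4, 4), dtype=np.uint8)
frame_a[1, 1] = 200
frame_b[1, 2] = 200

# Hypothetical threshold chosen only for this toy example.
motion_mask = frame_difference_mask(frame_a, frame_b, threshold=50)
```

Both the position the pixel left and the position it arrived at are flagged, which is why frame differencing yields moving regions rather than exact object outlines.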
Candidate validation has been performed using unsupervised and supervised approaches. Unsupervised methods make use of known or calculated physical properties of the pedestrians to discriminate between pedestrian and non-pedestrian. Younsi, Diaf & Siarry (2020) proposed a global similarity function that uses the sum of sub-similarity functions to discriminate between human moving objects and non-human moving objects. The drawback of unsupervised methods is that they also tend to be data-dependent. Supervised methods depend on feature extraction and training. Although recent efforts are moving towards the use of Convolutional Neural Networks (CNN) where feature representation is an inherent part of the training framework, feature representation is still a challenge because thermal images have low resolution and fewer details compared with visible images. One recent effort (2019) proposed a CNN-based classifier with three input channels for fine-grained pedestrian detection. The input channels take in the original image, a difference image from the previous frame and a background subtraction mask. In their results, the authors noted that training and testing needed to be carried out on similar datasets for best performance. Chen & Shin (2020) developed an attention-guided autoencoder network that includes a skip-connection block which combines features from the encoder-decoder modules to increase contextual information for robust and distinguishable features in infrared images with low SNR and resolution. YOLOv3 was used by Krišto, Ivasic-Kos & Pobar (2020) and Tumas, Nowosielski & Serackis (2020) for pedestrian detection under different weather conditions. Gao, Zhang & Li (2020) redesigned the visual geometry group (VGG-19) CNN to extract more features from infrared images for better detection results. The rationale for using these methods is that they perform well on visible images and achieve state-of-the-art results.
However, their performance is lower on infrared images for two reasons. First, the models developed by Huda et al. (2020) for testing infrared images were trained on visible images. Second, different thermal cameras output different levels of detail. Therefore, even for models trained on infrared images such as done by Krišto, Ivasic-Kos & Pobar (2020) and Park et al. (2019), the performance of the trained model depends on how similar the test dataset is to the training dataset.
To the best of our knowledge, semi-automatic methods requiring human inputs have not yet found extensive application in the thermal domain. This work is inspired by the methods put forward by Boykov & Jolly (2001) and Viola, Jones & Snow (2003). Graph Cut is a powerful optimization method that guarantees an exact solution for binary labelling problems. Graph Cut's effectiveness is shown in the framework of Boykov & Jolly (2001), which seamlessly combines edge and appearance information into its energy formulation to produce topologically unrestrained solutions (Boykov & Funka-Lea, 2006). This means that all pixels with the same properties are given the same label regardless of their location. Viola, Jones & Snow (2003) proposed a method which eliminates the need for separate modules for pedestrian detection and put forward a detector that integrates appearance and motion patterns such that all the fine-tuning and adjustment happens during training.
Both methods are similar in that they seamlessly combine different attributes to accomplish one goal that would otherwise have required several steps. Also, both methods were tested on visible images, where they achieved state-of-the-art results. However, the framework of Boykov & Jolly (2001) is semi-automatic while that of Viola, Jones & Snow (2003) is supervised.

PROPOSED METHOD
This work considers a Graph-Cut based method for pedestrian detection which combines intensity (region and boundary) information with motion characteristics. The task of pedestrian detection is formulated as a binary labelling problem where the goal is to partition the image into two classes. Formally, the labelling problem is a function that maps observed data to labels. For our purposes, the observed data is the image and the labels are the classes. Let the labels Z assigned to a pixel be given as Z = ('ped', 'bkg'), where 'ped' refers to the ROI and 'bkg' refers to the rest of the scene. The labelling of X over Z is a function h : X → Z. $h_x$ specifies the label assigned to x in X and is taken from Z. To solve the binary labelling problem, Graph Cut performs efficient searches for the optimal labels among the possible set of labels. A graph is constructed over the image and a cut on the graph corresponds to the binary partitioning of the image. An energy function is used to represent the information in the image and the global minimum of the energy corresponds to the optimal partitioning. The overview of the proposed method is presented in Fig. 2.

Graph construction
The first step is to construct a graph G over an image. G = ⟨V, E⟩, where V are the nodes of the graph and E are the edges. V corresponds to the pixels of the image and includes two additional nodes, source s and sink t, called terminals. The edges which connect the pixels to each other are referred to as N-links while the edges which connect the pixels to the two terminals are referred to as T-links. A neighbourhood system N determines the placement of edges between the nodes. A non-negative weight, discussed in "Weight assignment", is assigned to each edge. An illustration of a graph constructed over an image is shown in Fig. 3.

Weight assignment
The non-negative weights for each edge e ∈ E of the graph G are calculated from the region, boundary and motion terms of Eq. (2). The region term, $D_y(h_y)$, reflects the extent to which each pixel fits into the image intensity model of "object" and "background". These weights, $D_y(\text{``object''})$ and $D_y(\text{``background''})$, are computed as negative log-likelihoods as follows.

$$D_y(\text{``object''}) = -\ln \Pr(Y_y \mid \text{``object''})$$
$$D_y(\text{``background''}) = -\ln \Pr(Y_y \mid \text{``background''})$$

The intensity model for $D_y(h_y)$ is built using pixels, called seeds, which definitely belong to the "object" and "background". These seeds are chosen interactively by the user.
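As an illustration of the negative log-likelihood region term, the sketch below estimates the two intensity models from seed pixels with simple histograms. The article does not state how the intensity model is estimated, so the histogram estimator, the function name `region_costs`, the number of bins and the smoothing constant `eps` are all assumptions made for this example.

```python
import numpy as np

def region_costs(image, object_seeds, background_seeds, bins=16, eps=1e-6):
    """Negative log-likelihood of every pixel under seed-based intensity histograms.

    object_seeds and background_seeds are boolean masks marking user-selected seeds.
    """
    edges = np.linspace(0, 256, bins + 1)
    # Histogram bin index (0 .. bins-1) of every pixel intensity.
    pixel_bins = np.clip(np.digitize(image, edges) - 1, 0, bins - 1)

    def neg_log_likelihood(seed_mask):
        hist, _ = np.histogram(image[seed_mask], bins=edges)
        # Smoothed probabilities (they sum to 1) so that empty bins stay finite.
        prob = (hist + eps) / (hist.sum() + eps * bins)
        return -np.log(prob[pixel_bins])

    return neg_log_likelihood(object_seeds), neg_log_likelihood(background_seeds)

# Toy image: warm pixels (about 200) and cold pixels (about 25).
image = np.array([[200, 210, 20], [205, 30, 25]], dtype=np.uint8)
object_seeds = np.zeros(image.shape, dtype=bool)
background_seeds = np.zeros(image.shape, dtype=bool)
object_seeds[0, 0] = True       # a warm pixel marked as "object"
background_seeds[0, 2] = True   # a cold pixel marked as "background"

d_object, d_background = region_costs(image, object_seeds, background_seeds, bins=4)
```

With these seeds, unmarked warm pixels receive a lower cost under the "object" model than under the "background" model, and the reverse holds for cold pixels.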
The boundary term $J_{y,z}(h_y, h_z)$ assigns penalties to discontinuities between neighbouring pixels y and z. Therefore, the edge weights between pixels with dissimilar pixel intensity values will be higher and vice versa. These weights are calculated as follows

$$S_{y,z} = \exp\left(\frac{(y - z)^2}{2\sigma^2}\right) \tag{4}$$

In the above equation, σ has been calculated as the variance of the video frame under consideration.
The motion term $M_y(h_y)$ computes the cost of labelling a pixel as "object" or "background" as determined by the motion constraint Dcomb. Dcomb provides an estimate of the location of each pedestrian in the image obtained by thresholding and combining four images obtained by frame differencing. The direction of motion can be obtained from the absolute differences $DL_f$, $DR_f$, $DU_f$ and $DD_f$ between the image $I_f$ and versions of the consecutive image $I_{f+1}$ shifted to the left, to the right, up and down respectively. The difference image computations are given in Eq. (5). Figure 4 shows how the shifted difference images provide information about the direction of motion. In our experiments, we found that the energy of the image was highest when the image was shifted in the direction of motion and lowest when shifted in the opposite direction. Also, because the surveillance footage is taken from different angles and there are usually several pedestrians going in different directions, we found that the energy for each subject is higher in at least two directions, that is, either in the ↑ or ↓ direction and either in the ← or → direction. Dcomb is, therefore, created by combining the pixels with the highest energies from each directional difference image and is defined in Eq. (6), where Th is used to extract the highest energies from each directional difference image. $M_y(h_y)$ is, thus, defined using g, an arbitrarily large number which ensures that the object or background label is assigned to a pixel if the stated condition for each class of assignment is satisfied.
Table 1 provides the weights for the graph edges. As discussed in "Graph construction", the elements of V for graph G are the image pixels. Each node, corresponding to pixel y, is connected to the source s and sink t terminals using edges {y, s} and {y, t} called T-links. Also, each node is connected to other nodes in its neighbourhood. A four-neighbourhood system, for example, would mean that a pixel is connected to its four neighbours above, below, to the left and to the right of it. The edges {y, z} which connect a node to its neighbours are called N-links. A higher weight on the T-link connecting a node to either s or t implies a higher likelihood of a pixel being labelled as "object" or "background" respectively. Likewise, a higher weight on the N-link between vertices implies a greater dissimilarity between pixels. It should be noted that $D_y(h_y)$ and $M_y(h_y)$ are unary terms acting on each pixel to compute the weight on the T-links of the graph while $J_{y,z}(h_y, h_z)$ is a binary term acting on pixel pairs y and z in a specified neighbourhood N to compute the weight on the N-links.
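The directional difference images and the location estimate Dcomb described in this subsection can be sketched as follows. This is an illustration under stated assumptions rather than the article's Eqs. (5) and (6): the one-pixel shift, the edge-replicating padding, the union of thresholded images used to form Dcomb and the names `shift`, `directional_differences` and `location_estimate` are all choices made for the example.

```python
import numpy as np

def shift(image, dy, dx):
    """Shift image by (dy, dx) pixels; vacated pixels are filled by edge replication."""
    padded = np.pad(image, ((abs(dy), abs(dy)), (abs(dx), abs(dx))), mode="edge")
    h, w = image.shape
    y0 = abs(dy) - dy
    x0 = abs(dx) - dx
    return padded[y0:y0 + h, x0:x0 + w]

def directional_differences(frame_f, frame_f1, step=1):
    """Absolute differences between frame f and shifted copies of frame f+1."""
    a = frame_f.astype(np.int32)
    b = frame_f1.astype(np.int32)
    return {
        "left": np.abs(a - shift(b, 0, -step)),
        "right": np.abs(a - shift(b, 0, step)),
        "up": np.abs(a - shift(b, -step, 0)),
        "down": np.abs(a - shift(b, step, 0)),
    }

def location_estimate(differences, th):
    """Assumed combination rule: union of pixels exceeding th in any direction."""
    first = next(iter(differences.values()))
    d_comb = np.zeros(first.shape, dtype=bool)
    for diff in differences.values():
        d_comb |= diff > th
    return d_comb

# Toy frames: a bright pixel moves one step to the right between f and f+1.
frame_f = np.zeros((5, 5), dtype=np.uint8)
frame_f1 = np.zeros((5, 5), dtype=np.uint8)
frame_f[2, 1] = 200
frame_f1[2, 2] = 200

differences = directional_differences(frame_f, frame_f1)
energies = {name: int(diff.sum()) for name, diff in differences.items()}
d_comb = location_estimate(differences, th=100)  # th is a hypothetical threshold
```

In this toy case the summed energy is zero when frame f+1 is shifted against the motion (to the left), because the shift realigns the two frames, and it is positive for the shift in the direction of motion.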
To obtain T, the image is divided into non-overlapping, equal-sized detection windows such that only windows which contain one or more pixels from Dcomb are considered by D(h) and S(h).
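The window selection that yields T can be sketched as below. The window size, the function name `active_windows` and the handling of images whose size is not a multiple of the window size (border windows are simply truncated here) are assumptions; the article does not specify them.

```python
import numpy as np

def active_windows(d_comb, window):
    """Mask of pixels lying in detection windows that contain at least one Dcomb pixel."""
    h, w = d_comb.shape
    mask = np.zeros((h, w), dtype=bool)
    for top in range(0, h, window):
        for left in range(0, w, window):
            block = d_comb[top:top + window, left:left + window]
            if block.any():
                # Keep the whole window, not only its motion pixels.
                mask[top:top + window, left:left + window] = True
    return mask

# Toy location estimate with a single motion pixel in the top-left window.
d_comb_toy = np.zeros((4, 4), dtype=bool)
d_comb_toy[0, 1] = True
t_mask = active_windows(d_comb_toy, window=2)
```

Only pixels inside the selected windows would then contribute region and boundary costs, which keeps regions with pedestrian-like intensity but no motion out of the solution.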

Energy minimization
Following the graph construction and weight assignment, the energy is minimized using the Boykov-Kolmogorov minimum cut/maximum flow algorithm (Boykov & Kolmogorov, 2004). The aim of this algorithm is to find the cut C that partitions a two-terminal graph into two disjoint sets S and T such that s is in S and t is in T. The optimization problem, to find the minimum among all possible cuts, is solved by finding the maximum flow moving from the source s to the sink t. The cost of the cut C = {S, T} is the sum of the weights on the edges (y, z) where y ∈ S and z ∈ T. The final labelling on the original image is produced by the minimum cut separating the two terminals shown in Fig. 5.
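The relationship between maximum flow and the minimum cut can be illustrated on a three-pixel graph. Note that the sketch below uses the Edmonds-Karp algorithm, a simpler max-flow method swapped in for readability, and not the Boykov-Kolmogorov algorithm used in the article. The edge weights are made-up numbers, not values from Table 1.

```python
from collections import deque

def min_cut(capacity, source, sink):
    """Edmonds-Karp max-flow; returns (cut cost, set of nodes on the source side)."""
    # Residual capacities, with reverse edges initialised to zero where missing.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)

    def bfs_parents():
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0
    while True:
        parents = bfs_parents()
        if sink not in parents:
            # No augmenting path is left: the reachable nodes form the set S.
            return flow, set(parents)
        # Walk back from the sink, find the bottleneck, then push that much flow.
        path, v = [], sink
        while parents[v] is not None:
            path.append((parents[v], v))
            v = parents[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

# Three pixels in a row. T-links go from s and to t; N-links are listed in both directions.
capacity = {
    "s": {"p0": 9, "p1": 6, "p2": 1},
    "p0": {"t": 1, "p1": 3},
    "p1": {"t": 4, "p0": 3, "p2": 1},
    "p2": {"t": 9, "p1": 1},
}
cut_cost, source_side = min_cut(capacity, "s", "t")
labels = {p: ("object" if p in source_side else "background") for p in ("p0", "p1", "p2")}
```

Here the cheapest cut has cost 7 and separates p0 and p1 (source side) from p2 (sink side); every other partition of the three pixels costs more.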

Dataset
The proposed method is tested on the following public databases as previously described in Oluyide, Tapamo & Walingo (2022):
1. The Linköping Thermal InfraRed (LTIR) dataset put forward by Berg, Ahlberg & Felsberg (2015)
2. The LITIV dataset put forward by Torabi, Massé & Bilodeau (2012)
3. The OTCBVS benchmark: Terravic Motion IR database put forward by Miezianko (2005)
4. The OTCBVS benchmark: Ohio State University (OSU) thermal pedestrian database put forward by Davis & Keck (2005)

Performance metrics
The performance of the proposed method is measured using Recall and Precision given in Eqs. (9) and (10).

The results for GC and MCE are presented in Table 2 and Fig. 6. The visual comparison of GC and MCE is shown in Figs. 7-10. The performance of both GC and MCE is lowest on the LTIR database. This could be because LTIR has the most varied scenes of all the datasets. The images were either too bright or too dark and there were cases of slight camera motion and reversed polarity. Nevertheless, LTIR shows the greatest improvement in performance when MCE is used.
The LITIV database has the most uniform appearance but is the most varied in perspective; images were captured from different angles, from the side view to the top view. Most of the images were very dark and the contrast was poor except in images taken from the top view. Significant improvement in performance is also observed when MCE is used. The Terravic database had the best contrast, but the pedestrians were not always moving and, compared to the other databases, it took a long time for the pedestrians to move significantly. Slow or absent movement affects the choice of the interval between the frames that are differenced. Ideally, the next immediate frame should be used but this might depend on the footage.
The OSU database is the oldest and most extensively used because it was created specifically for evaluating pedestrian detection algorithms. The database contains details about the weather condition and comprehensive ground truth. The images were taken over different days and under different weather conditions but from the same scene. As mentioned in the introduction, temporal changes in appearance do not occur in thermal images unless there is a drastic change in weather conditions, and these changes occur much more slowly as detected radiation increases or decreases gradually. Table 3 shows the weather conditions for each video sequence in the database, the total number of pedestrians in the database and the true positive (TP) and false positive (FP) detection results using the proposed algorithm. It can be concluded that the proposed method is quite robust to changes in weather.
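For readers who want to recompute the metrics from detection counts such as those in Table 3, the sketch below uses the standard definitions Precision = TP / (TP + FP) and Recall = TP / (TP + FN). It is assumed that Eqs. (9) and (10) follow these standard definitions. The counts in the example are hypothetical and are not taken from the article's tables.

```python
def precision_recall(tp, fp, fn):
    """Standard precision and recall from detection counts.

    tp: true positives, fp: false positives, fn: false negatives (missed pedestrians).
    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical counts for illustration only.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
```

When only TP, FP and the total number of pedestrians are reported, FN can be obtained as the total number of pedestrians minus TP.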

Comparison with other methods in the literature
The performance of the proposed method is presented in comparison with other methods in the literature (Table 4). Tables 5 and 6 compare the number of True Positive (TP) and False Positive (FP) detections obtained by the proposed method with other methods which use the OSU dataset, including that of the creators of the dataset, Davis & Keck (2005). In Tables 5 and 6, the best result(s) for each sequence from each author is highlighted in bold. It is important to note that Sequence 3 has its polarity reversed; therefore, the pedestrians appear dark. Thus, methods designed to detect bright regions in thermal images do not provide results for Sequence 3. While the proposed method does not always produce the best result for each sequence in Table 5, its average results outperform the other methods. The proposed method is also compared, using Precision and Recall, with methods which apply the state-of-the-art algorithms for object detection in visible images to thermal images. Table 7 presents the results of this comparison. As mentioned in "Related works", the low performance of the state-of-the-art is because the models were either trained on visible images or trained on datasets dissimilar to the test set. However, the proposed method performs well across the different datasets.

Time complexity and execution time
The steps of the proposed method are given in Algorithm 1. The time complexity can be determined as follows. In step 1, the directional difference images are computed using Eq. (5) and each computation takes O(n) time. In step 2, the location estimate image is computed using Eq. (6) and it involves two stages: finding the highest energies in each directional difference image and combining them. Computing the edge weights according to Table 1 is the third step and it involves the use of two matrices: an adjacency matrix for the N-links and an n × 2 matrix for the T-links. For the adjacency matrix, adding a node takes O(n²) time, adding an edge takes O(1) time and finding neighbours takes O(n) time. The overall time for the adjacency matrix is O(n²). In the n × 2 T-links matrix, one column holds the weights for pixels connected to the source terminal and the second column holds the weights for pixels connected to the sink terminal. Computing the weights for each terminal takes O(n) time. Thus, the overall time for step 3 is O(n²). In step 4, the minimization algorithm has a worst-case time complexity of O(mn²|C|), where n is the number of nodes, m is the number of edges and |C| is the cost of the minimum cut. This algorithm outperforms standard minimization algorithms on typical Computer Vision problems even though the complexity of the algorithm is theoretically worse. The reader is referred to the work of Boykov & Kolmogorov (2004) for more details. Therefore, the overall time complexity of the proposed method is O(mn²|C|).
The proposed method was implemented using MATLAB R2018a on an Intel i7-4790 3.60 GHz CPU with 8 GB RAM. The average execution time for each video frame ranged from 6.8 to 11.3 s depending on how fast the user selects seeds.

Limitations of the proposed method
The main limitation which potentially reduces the effectiveness of the proposed method is the presence of extreme camera motion. A bit of camera motion was encountered in the LTIR database, which can account for its lower performance compared to the other three datasets. However, if the camera motion is extreme, it can hamper the effectiveness of the difference images produced using Eq. (5) because stationary objects might be included in the results. Although there are methods to correct camera motion, the additional step implies increased computational cost.

Algorithm 1
1: Compute the directional difference images using Eq. (5)
2: Compute the location estimate map Dcomb using Eq. (6)
3: Compute edge weights according to Table 1
4: Minimise the energy using the Boykov-Kolmogorov min-cut/max-flow algorithm

CONCLUSION
In this article, a motion-constrained Graph Cut framework for pedestrian detection in thermal infrared videos has been presented which integrates appearance information with motion characteristics in a single model. The proposed method has been compared with the framework of Boykov & Jolly (2001) to show the advantages of including an additional constraint and the performance of the detection framework. In addition, the method has been tested on four publicly available datasets and compared with different methods in the literature which make use of the same datasets to showcase the robustness of the framework. As the process of selecting seeds significantly increases the execution time, future work will involve optimising the algorithm to require as little human input as possible.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
The authors received no funding for this work.