Multistage Real-Time Fire Detection Using Convolutional Neural Networks and Long Short-Term Memory Networks

Fire is one of the most commonly occurring disasters and is the main cause of catastrophic personal injury and devastating property damage. An early detection system is necessary to prevent fires from spreading out of control. In this paper, we propose a multistage fire detection method using convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. In the first stage, fire candidates are detected by using their salient features, such as their color, flickering frequency, and brightness. In the second stage, a pretrained CNN model is used to extract the 2D features of flames that are the input for the LSTM network. In the last stage, a softmax classifier is utilized to determine whether the flames represent a true fire or a nonfire moving object. The experimental results show that our proposed method can achieve competitive performance compared with other state-of-the-art methods and is suitable for real-world applications.


I. INTRODUCTION
We frequently hear about fires in the news. Every year, fires cause thousands of human deaths and billions of dollars in property damage. Fire monitoring and protection are always key concerns when managing apartment buildings, warehouses, forests, substations, railways, and tunnels. If fires are not detected early and become out of control, the consequences are often disastrous. Developing a system that is able to automatically detect fire at an early stage is necessary for protecting both human life and property.
The first generation of fire detection systems used sensors, such as ionization detectors, photoelectric sensors and carbon dioxide detectors. Although these systems have had some success in detecting fire, sensor-based systems still have many limitations, especially in large, open areas. Each sensor monitors a narrow region, so a fire cannot be detected immediately if it is far from a sensor. Furthermore, the accuracy of this methodology depends heavily on the sensor density and sensor reliability, which leads to inconvenience in terms of cost and installation. Sensor-based fire detection is widely used; however, missed detections of real fires and a high false detection rate are common. (The associate editor coordinating the review of this manuscript and approving it for publication was Essam A. Rashed.)
Digital cameras, namely, closed-circuit television (CCTV), have been rapidly evolving in the field of security surveillance. Compared to sensor-based systems, security cameras are easy to install and can be used to monitor large, open areas. Currently, CCTV is everywhere, and utilizing a CCTV system for monitoring fire may be an economical and efficient solution. Recently, sensor-based systems have started to be replaced by surveillance cameras and video analysis systems. A large number of image processing algorithms have been proposed for smoke and fire detection via video analysis, and some of them have achieved considerable success. Although computer vision-based systems have been shown to have advantages over sensor-based systems, their performance and results are still far from ideal. (VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License; see https://creativecommons.org/licenses/by/4.0/.)
Machine learning (ML) algorithms automatically build a mathematical model using sample data, also known as training data, to make decisions without being specifically programmed to make those decisions. Machine learning has been developing since the 1950s; however, the results initially achieved were not very impressive. The fundamental difficulties faced during this period were data collection and the limitations of computing resources. Over the last decade, however, as computers have become faster and, especially, as the explosion of the internet has made data collection easier than ever, the development of ML has progressed rapidly. Recently, a new branch of ML called deep learning has emerged as the state of the art. Deep learning algorithms have shown extremely good performance in computer vision applications, including object detection and image classification.
This significant change provides an opportunity to solve many problems that still exist in computer vision, including fire detection.
This paper proposes a method to process fire videos using a multistage CNN-LSTM model. The proposed method consists of a CNN model and an LSTM model that are utilized to extract spatial and temporal features, respectively, to effectively detect fire in videos. The main contributions of the proposed method are as follows:
- We propose a method that integrates fire candidate extraction and CNN-LSTM classification for detecting fire in videos.
- The proposed method can detect fire with high accuracy while maintaining a low false detection rate.
- The proposed method can detect fire at different scales and under different environmental conditions.
- The proposed method is fast; therefore, it can be integrated into real-world applications.
- We have collected a fire video dataset for training and testing the algorithm and have made it public for future research.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the proposed system. Section 4 reports the experimental results. A discussion and future research directions are provided in Section 5.

II. RELATED WORK
In the literature, many approaches have been proposed for early fire detection based on video surveillance. The workflows of these approaches are mainly separated into three steps: fire-colored pixel detection, moving pixel detection, and feature extraction and classification.
From observation, we can easily see that flame color ranges from red-yellow to white depending on the temperature and the burning material. Flame color is a very important component of fire detection in many algorithms, and many proposed approaches rely on modeling fire-colored pixels. These approaches can be roughly classified into three categories: color rule-based methods, polynomial-based color models and Gaussian distribution-based models. The color rules consist of several heuristics based on the RGB (red-green-blue) [1]-[6], YCbCr [7]-[9], YUV [10] or Lab [11] color space. The fire region is manually segmented, and the relationship between the pixel values of the three channels is analyzed to estimate a rule for classifying the fire-colored pixels in an image. Different color distributions of flame pixels can be observed for different types of fires, depending on the burning material; based on these distributions, several heuristic rules combined with decision thresholds are estimated to classify the fire-colored pixels in an image. Although different methods are used to model the color of flames, most of these approaches achieve good results when detecting fire-colored pixels in images. However, in real-world situations, many objects may be similarly colored; therefore, using only color information could lead to false detections, and further steps are needed to eliminate nonfire objects that are fire-colored.
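As a concrete illustration of such rule-based classification, the following sketch applies one commonly used style of RGB heuristic; the specific rule and threshold here are illustrative, not the values of any cited paper:

```python
def is_fire_colored_rgb(r, g, b, r_threshold=190):
    """Heuristic RGB rule: flame pixels tend to satisfy R > G > B
    with a sufficiently strong red channel. The threshold is
    illustrative, not taken from any specific paper."""
    return r > g > b and r >= r_threshold

# A bright orange pixel passes; a gray pixel does not.
print(is_fire_colored_rgb(230, 120, 30))   # True
print(is_fire_colored_rgb(128, 128, 128))  # False
```

In practice, such a rule would be evaluated per pixel over the whole frame, producing a binary fire-color map.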
In addition to color, motion is an important feature for recognizing fire. Background subtraction is commonly used to segment the moving objects in a scene for surveillance applications. Several methods [2]-[4], [6], [11]-[13] treat fire as a moving object under the assumption that the appearance of fire changes the background; therefore, these methods use background subtraction as the first step when segmenting flames. Almost all state-of-the-art papers use either frame differencing or background-based methods to detect nonstationary pixels. The candidate fire pixels are then obtained by combining the color and motion detection results. However, these methods cannot distinguish fire-colored moving objects from fire, so additional steps are required to accurately extract flames from a video sequence.
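A minimal sketch of the background-subtraction step these methods build on, using a running-average background model; the update rate beta and threshold tau are illustrative choices:

```python
import numpy as np

def update_background(background, frame, beta=0.05):
    """Exponential running average: B(t) = (1 - beta) * B(t-1) + beta * I(t)."""
    return (1.0 - beta) * background + beta * frame

def moving_mask(background, frame, tau=25.0):
    """Pixels differing from the background model by more than tau
    are flagged as moving (candidate foreground)."""
    return np.abs(frame - background) > tau

# Static scene in which a 2x2 region suddenly brightens (a "moving" object).
background = np.full((4, 4), 100.0)
frame = background.copy()
frame[1:3, 1:3] = 200.0
mask = moving_mask(background, frame)
print(int(mask.sum()))  # 4 pixels flagged as moving
background = update_background(background, frame)
```

In a real pipeline, the moving mask would be intersected with the fire-color map to obtain candidate fire pixels.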
In the literature, flicker is also a widely used feature for detecting fire: the luminance of fire changes randomly over time. Based on this feature, the author of [12] proposed a 1-D wavelet transform for temporal color variation analysis; wavelet signals can easily reveal the random characteristic of a given signal, which is an intrinsic feature of flame pixels. Similarly, [1] proposed an algorithm that uses the flicker characteristic to detect fire. This algorithm calculates the cumulative time derivative of the luminance: because fire tends to flicker periodically around a region, the flickering regions of the fire obtain the strongest values. The author of [4] analyzed the flame flickering frequency, found it to be approximately 10 Hz, and verified the candidate flame pixels using a hidden Markov model (HMM) to determine whether their frequencies were approximately 10 Hz. Ref. [13] extracted the normalized red skewness and the LH, HL, and HH wavelet coefficient skewness from candidate flame pixels and then applied Bayesian networks to estimate the probability that the current frame contains fire.
Other algorithms extract features from candidate pixels, such as the position and area, wavelet information, boundary and flickering frequency, and the temporal and spatial variation of the intensity. Probabilistic models are estimated based on these features, and heuristic rules are applied to distinguish between true fires and nonfire moving objects. These algorithms may have good detection rates; however, their false detection rates are still too high to meet the requirements of security applications. Probabilistic models have a major shortcoming in that it is difficult to determine the optimal thresholds for classification: a lenient threshold could result in many false positives, while a strict threshold could lead to many missed fires.
Reducing false fire detection is a challenge, and increasingly more studies are attempting to minimize false detections. Recently, some image classification methods using machine learning algorithms have been developed, which may be a good approach for distinguishing true fires from nonfires.
Deep learning is a subset of machine learning [14]-[18] that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, where each concept is defined in relation to simpler concepts and more abstract representations are computed in terms of less abstract ones. A deep learning model learns categories incrementally through its hidden-layer architecture, starting from low-level categories. Deep learning promises more accurate models than traditional machine learning algorithms while requiring little to no feature engineering. Deep learning algorithms based on CNNs have emerged as the state-of-the-art technique in image classification and object recognition. Many experiments show that CNNs are very good at image classification, and applying deep learning algorithms to fire classification may reduce false detections. A recent trend exploits this advantage of CNNs to improve the accuracy of fire detection, and some methods using this strategy have achieved considerable success. Ref. [19] investigated the problem of fire detection in the real world by creating an unbalanced dataset. The authors stated that fire is a rare event in real-world conditions, which leads to the failure of many state-of-the-art CNN models; to address this problem, they proposed a deeper CNN model that achieved promising results in detecting fire in videos. The author of [20] applied a CNN image classifier [14] at the last layer of a cascade classification model to distinguish fire-like moving objects from real fires. The precision of CNN image classification is impressive compared to previous research methods. However, CNNs are only good at modeling an object's 2D texture and ignore the temporal characteristics of fire flames.
In fact, there are many cases in which human eyes are unable to distinguish objects from a single image; instead, we need to observe the dynamic texture of an object over a period of time to reach a conclusion. Ref. [21] used an extreme learning machine classifier after the CNN model to detect fire in video; their method outperforms state-of-the-art deep CNNs in fire detection accuracy while maintaining a fast processing time. In [22], the author tried to detect fire directly using advanced CNN object detection models such as YOLO, Faster R-CNN, R-FCN, and SSD. However, these algorithms are also based on 2D textures and do not consider temporal information. Furthermore, fire flames have an amorphous shape, which makes it difficult to construct training datasets for a deep learning flame detection model. Ref. [12] proposed a 1-D wavelet transform for temporal color variation analysis of fire flames, and [23] proposed a randomness test model to verify the dynamic textures of fire flames. However, the general disadvantage of these probabilistic methodologies is that it is difficult to determine the optimal decision threshold; furthermore, handcrafting analysis features requires highly skilled expertise.
An LSTM network [24], [25] is constructed by looping LSTM units, where each LSTM unit consists of a memory cell and three multiplicative gates (an input gate, an output gate, and a forget gate). Each memory cell (or LSTM unit) uses four neural network layers instead of the single layer used in an RNN unit. The output of one of these layers, to which a sigmoid function is applied, controls how much of the past cell state is kept. Because the output of the sigmoid function is in the range of 0 to 1, the past cell state is "forgotten" if the output of the sigmoid function is zero; otherwise, it accumulates in the cell after multiplication by this controlling factor. Through this mechanism, an LSTM is able to remember information for a long period of time. While CNNs and handcrafted-feature probabilistic models have limitations when modeling dynamic textures, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks have achieved great success when processing sequential multimedia data and have yielded state-of-the-art results in speech recognition, digital signal processing, video processing, and text analysis. In [26], a CNN-LSTM was used to detect the novel coronavirus (COVID-19) in X-ray images: the CNN extracts the potential features, which are passed to the LSTM model to obtain the final results. Their method achieved an accuracy of 99.4% and, notably, a sensitivity of 99.3% for detecting COVID-19. Similarly, the author of [27] proposed a combination of a CNN and an RNN to solve the same problem of detecting COVID-19 from X-ray images. They used a pretrained VGG-19 model as the backbone to extract spatial features and an RNN to extract temporal features. In addition, they used gradient-weighted class activation mapping (Grad-CAM) to visualize the image regions indicative of COVID-19 in X-rays.
To solve foreground segmentation, the author of [28] used a combination of a 3D CNN and LSTM, modeling the foreground-background segmentation problem as an encoder-decoder. The results show that their algorithm achieved competitive performance in terms of the figure of merit compared to state-of-the-art methods. Inspired by the success of LSTM for sequential data analysis, Amin Ullah proposed a model that combines a CNN and LSTM to analyze the dynamic textures of image sequences [24]. This model inherits the advantages of both the CNN and the LSTM network to achieve significant results for real-time action recognition in video: the CNN acts as a feature extractor for each image in a sequence, while the LSTM performs sequence analysis to predict human activities. Since the CNN-LSTM combination works well for human activity recognition, it is also well suited for analyzing the dynamic features of fire flames. In the literature, many studies have applied this model to enhance the accuracy of fire flame detection and have achieved good results. The author of [29] added CNN layers at the front end, followed by LSTM layers and a dense output layer; the CNN layers work as a feature extractor, and the LSTM works as a video classifier for fire and nonfire classification. Similarly, the author of [30] also proposed a CNN-LSTM model for fire image sequence classification. While [29] directly inputs raw images into the CNN input layer, [30] uses optical flow to transform raw images into motion images before they are input to the CNN for feature extraction. Although both of the abovementioned algorithms have reported good results, neither includes a preprocessing step to localize the position of fire flames; therefore, these algorithms are only suited for large fires in which the flames occupy a major part of the image. The prediction accuracy may decrease when the flame is small and occupies only a small area of the scene.

III. MULTISTAGE FIRE FLAME DETECTION
The proposed fire flame detection algorithm is illustrated in Figure 1 and is mainly separated into three stages. In the first stage, candidate fire flames, which are very likely to be real fires, are detected and localized using color and motion features. Our algorithm can detect multiple fire candidates in an image and track them in subsequent frames, ensuring that no real fire is missed in the video. In the next stage, a cropped image of each candidate from the image sequence is input into the CNN model and transformed into visual features through the CNN layers. Finally, the extracted fire flame features of the image sequence are passed to the LSTM model, whose output predicts whether the image sequence contains a real fire or a nonfire moving object.
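The three stages can be sketched as a simple per-video pipeline; every function body below is a hypothetical placeholder standing in for the stages detailed in the following subsections:

```python
def detect_candidates(frame):
    """Stage 1 (placeholder): color + motion analysis returning
    bounding boxes of candidate fire regions."""
    return [(10, 10, 64, 64)]  # hypothetical (x, y, w, h) boxes

def extract_cnn_features(frame):
    """Stage 2 (placeholder): a pretrained CNN maps each cropped
    candidate image to a fixed-length feature vector."""
    return [0.0] * 512

def classify_sequence(feature_sequence):
    """Stage 3 (placeholder): an LSTM + softmax over the feature
    sequence decides fire vs. nonfire."""
    return len(feature_sequence) >= 16  # dummy decision rule

def fire_in_video(frames, seq_len=16):
    """Track the first candidate across frames and classify its
    feature sequence once seq_len frames have been gathered."""
    features = []
    for frame in frames:
        boxes = detect_candidates(frame)
        if not boxes:
            continue
        features.append(extract_cnn_features(frame))  # cropping omitted in sketch
        if len(features) == seq_len:
            return classify_sequence(features)
    return False

print(fire_in_video([None] * 20))  # True once 16 candidate frames are gathered
```

The stub bodies only fix the data flow between stages; the real implementations are described in Sections III-A through III-D.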

A. DETECTING CANDIDATE FIRE REGIONS
In the first step, we seek to localize the candidate fire regions, i.e., the regions where there is a high possibility of apparent fire flames. Only color and motion characteristics are used to detect fire flames at this stage. Color is a distinct feature of fire flames; commonly, flame color varies from yellow to red to white, depending on the fire temperature. To simulate the color sensing properties of the human visual system, RGB color information is usually transformed into a mathematical space that decouples brightness (or luminance) information from color information.
Among these color models, the HSV (Hue -Saturation -Value of intensity) color model is suitable for providing a more human-oriented method for describing colors. Therefore, HSV is very convenient for color analysis.
A useful fire color model that uses HSV information was previously introduced in [2]. We slightly modify this model for our fire pixel classification, in which an image pixel is classified as a fire pixel if it meets the following conditions:

H_T1 ≤ H ≤ H_T2, S ≥ S_T, R ≥ R_T,

where H, R, and S are the hue, the intensity of the red channel, and the saturation of the image pixel, respectively. The thresholds H_T1, H_T2, R_T and S_T are defined through intensive experiments: H_T1 and H_T2 usually range from 0 to 70, S_T is approximately 60, and R_T is approximately 120. Figure 2 displays the results of fire-colored pixel classification; we can see that many background image pixels are incorrectly classified as fire-colored. Therefore, color alone is insufficient to detect fire, and further processing steps are needed to eliminate stationary background pixels that are fire-colored. In addition to color, another useful feature is flame flickering, which can be used to improve fire pixel classification. Because of flickering, the intensity of both flame pixels and the surrounding pixels tends to change randomly over time. Many previous approaches have attempted to detect the fire pixel flicker frequency; however, this frequency is very difficult to measure.
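As a sketch, the rule above can be applied per pixel using the standard-library colorsys conversion; here hue is expressed in degrees and saturation is scaled to 0-255 to match the quoted thresholds, which is our assumption about the intended scales:

```python
import colorsys

H_T1, H_T2 = 0.0, 70.0   # hue window in degrees (red-yellow range)
S_T = 60.0               # saturation threshold (assumed 0-255 scale)
R_T = 120.0              # red-channel threshold

def is_fire_pixel(r, g, b):
    """Apply the HSV fire color rule to one RGB pixel (0-255 channels)."""
    h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hue_deg = h * 360.0
    sat_255 = s * 255.0
    return (H_T1 <= hue_deg <= H_T2) and (sat_255 >= S_T) and (r >= R_T)

print(is_fire_pixel(255, 140, 0))  # bright orange -> True
print(is_fire_pixel(0, 0, 255))    # blue -> False
```

Running this test over every pixel of a frame yields the fire-color map shown in Figure 2.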
The author of [10] developed an effective method for detecting fires according to the frequency of luminance flickering. This algorithm calculates the cumulative time derivative of the luminance. Because of fire's tendency to periodically flicker around a region, the flickering fire regions have the strongest values.
The time derivative of the luminance is zero for stationary regions and nonzero for moving objects. Thus, the time derivative of the video images is able to track a moving object. The sum of the absolute values of the derivatives increases if the object moves periodically around a region. In a fire scene, the flickering of the fire continually increases the pixel values near the fire region. Following this method, we estimate the cumulative fire flicker energy map as follows:

E(t) = |I(t) − I(t−1)|,
A(t) = α · A(t−1) + E(t),

where I(t) and E(t) are the intensity and flicker energy of the image pixels at time t, respectively, and A(t) and α are the cumulative flicker energy and a cumulative factor, respectively. Once both the color map and the flicker energy map are estimated for the fire pixels, we can use per-pixel operations to fuse them into a possible fire pixel map. Figure 4 illustrates the steps involved in possible fire pixel map segmentation. From top to bottom, Figure 4 includes the original image, color map, flicker energy map, fusion map, and refined possible fire pixel map. In the final step, morphological transformations, such as erosion, dilation, opening or closing, are applied to the fused binary map to refine it.
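One plausible NumPy reading of this accumulation; the decay factor alpha and the synthetic frames below are illustrative:

```python
import numpy as np

def flicker_energy_map(frames, alpha=0.9):
    """Accumulate per-pixel flicker energy over a grayscale frame
    sequence: E(t) = |I(t) - I(t-1)|, A(t) = alpha * A(t-1) + E(t)."""
    frames = [np.asarray(f, dtype=np.float64) for f in frames]
    accum = np.zeros_like(frames[0])
    for prev, cur in zip(frames, frames[1:]):
        energy = np.abs(cur - prev)       # E(t): time derivative of luminance
        accum = alpha * accum + energy    # A(t): decayed cumulative energy
    return accum

# One pixel flickers between 100 and 200; the rest of the frame is static.
frames = []
for t in range(10):
    frame = np.full((3, 3), 100.0)
    frame[1, 1] = 200.0 if t % 2 else 100.0
    frames.append(frame)

accum = flicker_energy_map(frames)
print(accum[1, 1] > 0)   # flickering pixel accumulates energy -> True
print(accum[0, 0] == 0)  # static pixel stays at zero -> True
```

The decayed accumulator keeps A(t) large only where the derivative is repeatedly nonzero, which is the flicker behavior described above.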
The candidate fire regions can be segmented from the possible fire pixel maps. Figure 5 shows an example of detecting candidate fire regions, where (a) is the original image; (b) is the fire-colored pixel map; (c) is the flickering energy map; (d) is the fusion map; (e) is the refined map; and (f) is a candidate fire region in the subimage, shown by the green rectangle. Multiple candidate fire regions can be detected concurrently, and regions that are near each other are merged into one region.
Per-pixel analysis is insufficient for removing moving objects that resemble fire; therefore, further processing at higher spatial analysis levels is required. All connected possible fire pixels are clustered into subregions, and each subregion is treated as a candidate fire region, i.e., a moving object. Finally, the detected moving objects are tracked and classified to eliminate nonfire moving objects.
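A pure-Python sketch of this clustering step, labeling 4-connected possible fire pixels into components and returning their bounding boxes; the minimum-area filter is an illustrative refinement:

```python
from collections import deque

def candidate_regions(mask, min_area=2):
    """Group 4-connected True pixels of a binary mask into components
    and return the bounding box (x, y, w, h) of each component with
    at least min_area pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # BFS over one connected component.
            queue = deque([(y, x)])
            seen[y][x] = True
            ys, xs = [], []
            while queue:
                cy, cx = queue.popleft()
                ys.append(cy)
                xs.append(cx)
                for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(ys) >= min_area:
                boxes.append((min(xs), min(ys),
                              max(xs) - min(xs) + 1, max(ys) - min(ys) + 1))
    return boxes

mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
print(candidate_regions(mask))  # [(0, 0, 2, 2)] - the isolated pixel is filtered out
```

Each returned box corresponds to one candidate fire region that is then tracked and classified in the later stages.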

B. CLASSIFYING FIRE CANDIDATES USING CNN-BiLSTM
As analyzed above, previous methods that predict fire from a single image are insufficient; we need to observe an object over a sequence of frames to make a decision. For real fire/nonfire classification, we use a CNN-BiLSTM model, in which CNN layers extract features from the input data and the BiLSTM supports sequence prediction, as shown in Figure 7.
The network architecture selected for feature extraction in this paper is based on the ResNet-18 architecture [15], which represents a good trade-off between model complexity and performance. The network consists of five convolutional stages, as shown in Figure 6, and constructs a hierarchical representation of input images in which deeper layers contain higher-level features built from the lower-level features of earlier layers. To obtain feature representations from the training and test images, activations are taken from the global pooling layer, 'pool5', at the end of the network. The global pooling layer pools the input features over all spatial locations, producing 512 features in total.
The sequence of images from each candidate fire region is passed through the CNN for visual feature extraction. These per-frame features are then fed into a many-to-one multilayer BiLSTM network that temporally fuses the extracted information. Figure 7 describes the architecture of the CNN-BiLSTM model for fire/nonfire classification. We use a multilayer BiLSTM model to boost the network performance. The architecture includes two BiLSTM layers stacked on top of each other: one LSTM moves in the forward direction, and the other moves in the reverse direction. Their combined output is computed from the hidden states of both layers. A softmax classifier is applied to the final state of the LSTM to make the final decision.
The operation of the LSTM network is explained in Figure 8. The left side shows the LSTM unit, and the right side shows the multilayer LSTM. x_t is the image feature extracted by the CNN and is the input to the LSTM unit at time t. First, the LSTM determines which information from the previous cell memory state c_{t-1} can be forgotten at the forget gate f_t. The inputs to f_t are x_t and the previous time step's hidden state h_{t-1}, and its output is a number in the range [0, 1] for each element of c_{t-1}: if the output is 1, all information is kept, and if the output is 0, all information is forgotten. Equation (6) illustrates this step. The next step selects which information i_t can be stored in the cell memory state at the input gate layer; this sigmoid layer determines which information should be updated. A tanh layer then produces a new candidate value ĉ_t to be added to the cell state. Next, the previous cell state c_{t-1} is updated to the new state c_t: the previous cell memory state is multiplied by f_t to clear the information that the unit decided to forget in the previous step, and i_t * ĉ_t is added to produce the new memory cell state. The principles behind these steps are described in Equations (7), (8), and (9).
In the final step, we determine which information is needed at the output; the output value depends on the state of the memory cell. A sigmoid layer is applied to determine which part of the information should be output. The final decision is made by applying a softmax classifier to the final LSTM state to distinguish real fires from nonfire moving objects. These steps are described in Equations (10), (11), and (12).
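The gate computations walked through above can be sketched as one LSTM time step in NumPy; the weight shapes and random initialization are illustrative, while the structure mirrors Equations (6)-(12):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b each stack the parameters of the four
    layers (forget, input, candidate, output) along axis 0."""
    Wf, Wi, Wc, Wo = W
    Uf, Ui, Uc, Uo = U
    bf, bi, bc, bo = b
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)      # forget gate, Eq. (6)
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)      # input gate
    c_hat = np.tanh(Wc @ x_t + Uc @ h_prev + bc)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat                # new cell state
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)      # output gate
    h_t = o_t * np.tanh(c_t)                        # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 4                      # toy sizes; the paper uses 512-d features
W = rng.normal(size=(4, n_hidden, n_in)) * 0.1
U = rng.normal(size=(4, n_hidden, n_hidden)) * 0.1
b = np.zeros((4, n_hidden))
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for _ in range(16):                        # 16 time steps, as in Section III-D
    x = rng.normal(size=n_in)
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (4,)
```

A softmax over a linear projection of the final hidden state h would then give the fire/nonfire probabilities, as in Equations (10)-(12).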
The performance of deep neural networks has been boosted by increasing the number of layers in the network models. Similarly, in our network, we stacked two LSTM layers to improve the system accuracy. Figure 8 shows the architecture of the two-layer LSTM network used in our algorithm. Layer 1 obtains its input x_t from CNN feature extraction, and the input to layer 2 comes from its own previous hidden state h^2_{t-1} and the output of the current time step of layer 1, h^1_t. We also use a bidirectional LSTM model, in which the output at time t depends not only on the previous frames in the sequence but also on the upcoming frames. The bidirectional LSTM used here is quite simple: two LSTMs stacked on top of each other, one moving in the forward direction and the other in the backward direction.

C. TRAINING CNN MODEL
To train the CNN model, we use a transfer learning strategy: we fine-tune a pretrained ResNet-18 model that was trained on the ImageNet dataset, which contains millions of images from 1000 object categories. Although the ImageNet dataset is very large, it contains little data on fire flames and fire-like moving objects, so retraining the CNN model on a new dataset of fire flames and fire-like moving objects is necessary for it to learn their visual features well. Because the original network was well trained on a large dataset, instead of retraining the network from scratch, transfer learning adapts the model trained on a different dataset to a new classification task by continuing to back-propagate through the pretrained weights with a new image dataset. Transfer learning helps improve model accuracy, reduces training time and mitigates overfitting when training models on small datasets. The training dataset includes a total of 30,000 fire images and 30,000 nonfire images in different scenarios. The nonfire object images were collected from various sources, such as the PETA dataset (pedestrian images), the Cars dataset (vehicles), and the PASCAL dataset (backgrounds and other moving objects). We also manually segmented nonfire objects from surveillance videos that we recorded ourselves or downloaded from the internet, and the fire object images were manually segmented from the videos we collected. Here, 60% of the dataset was used for training, 20% for validation and 20% for testing. Figure 10 shows examples of nonfire and fire training images.
We began the fine-tuning process with a learning rate of 0.01 and decreased it by a factor of ten every 2,000 iterations. We used a smaller learning rate for the fine-tuned weights under the assumption that the pretrained CNN weights were already relatively good: we did not want to distort them too quickly or by too much. The optimization process was run for a maximum of 50,000 iterations. The accuracy of the trained fire classification model was 97.5%, the false-negative rate was 3.3%, and the false-positive rate was 2.7%. Figure 9 shows the training process: a learning curve visualizes the training and validation losses and the validation accuracy against the number of iterations. The learning curve shows that the model accuracy improved quickly over the first 2,000 iterations and plateaued at approximately 10,000 iterations, after which the accuracy did not change significantly.
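The step-decay schedule described above can be sketched directly, with the values taken from the text:

```python
def learning_rate(iteration, base_lr=0.01, drop=0.1, step=2000):
    """Start at base_lr and decrease by a factor of ten every
    `step` iterations, as described in Section III-C."""
    return base_lr * (drop ** (iteration // step))

# The rate stays at 0.01 for the first 2,000 iterations, then decays tenfold
# at each 2,000-iteration boundary, e.g. 0.001 at iteration 2,000.
for it in (0, 1999, 2000, 6000):
    print(it, learning_rate(it))
```

A separate, smaller multiplier for the pretrained layers (as the text describes) would simply scale this schedule per parameter group.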

D. TRAINING THE LSTM MODEL
We implemented a bidirectional LSTM network with 16 time steps and 512 hidden units for fire/nonfire sequence image classification. The frame rate of our video dataset was reduced to 10 FPS (frames per second) because the change in flame shape is insignificant between consecutive frames in the original videos. The visual features of a flame candidate over 16 consecutive frames are extracted to form a training segment; each training segment is a 16 x 512 matrix that is used to model the temporal features of the respective flame candidate. We extracted more than 10,000 segments of fire flames and nonfire moving objects from the video dataset to train the LSTM network. These segments are divided into three parts: 60% for training, 20% for validation and 20% for testing. We used the cross-entropy loss function during training, optimized by the stochastic gradient descent (SGD) algorithm. The learning rate was adjusted after each epoch to better learn the model parameters.
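Assembling such training segments from per-frame CNN features can be sketched as a sliding window; the stride is an illustrative choice not specified in the text:

```python
import numpy as np

def make_segments(frame_features, seq_len=16, stride=8):
    """Stack seq_len consecutive per-frame feature vectors into
    (seq_len x feature_dim) training segments for the LSTM."""
    segments = []
    for start in range(0, len(frame_features) - seq_len + 1, stride):
        segments.append(np.stack(frame_features[start:start + seq_len]))
    return segments

# 40 frames of hypothetical 512-d CNN features for one flame candidate.
features = [np.zeros(512) for _ in range(40)]
segments = make_segments(features)
print(len(segments), segments[0].shape)  # 4 (16, 512)
```

Each resulting 16 x 512 matrix corresponds to one training segment as described above.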

IV. EXPERIMENTAL RESULTS
ResNet-18 is used as our backbone network for extracting spatial features because the architecture balances accuracy and network complexity. We conducted an experiment to demonstrate this advantage; the results are shown in Table 1. ResNet-18 achieves an accuracy of 93.8% with 11.4M parameters, while VGG-19 achieves an accuracy of 96.2% at a much higher complexity than ResNet-18. The accuracy of these popular models is calculated on our collected dataset of 60,000 images (30,000 fire images and 30,000 nonfire images).
By performing experiments with different LSTM-variant architectures, as shown in Table 2, we found that BiLSTM yields better results than a single LSTM model; with more than two layers, the performance did not improve considerably while the processing time increased. Therefore, BiLSTM is the optimal architecture for our system. In addition, the proposed method, which has 14,003,394 parameters, achieves an inference time of 25 ms on a setup with an Intel Core i9 CPU, Windows 10, and an NVIDIA GeForce RTX 2070 Super GPU.
We tested our algorithm on a large dataset with different scenarios, including large fires, medium fires, small fires, close fires, distant fires, and nonfire moving objects. We also compared our method with previous methods on our collected dataset, which includes 1032 videos (534 videos with fire and 498 without). Table 3 gives detailed descriptions of a part of the full dataset. Figure 11 shows real fire frames from the videos of the testing dataset, which demonstrates the diversity of the data: fires appear in various shapes and environmental conditions. In the first row, the fires are large and occupy most of the area of the image frames, whereas in the third-row images, the fires are small and cover only a few pixels. Unlike Figure 11, Figure 12 shows fire-like videos, which include fire-like objects such as moving cars, moving humans, and red-colored objects. These samples are used to test the ability of the algorithm to eliminate false fire detections. As Table 4 shows, the methods of [1], [12], and [20] are weak at eliminating false fire detections, which degrades the performance of the overall system.
For evaluation, we also implemented previous algorithms and tested them on the same dataset for comparison. The selected algorithms are the methods of [1], [12], [19], [20], and [29]. In the case of large and medium fires, the visual features of fire flames, such as color and brightness, are prominent, so the detection rate of most algorithms is very good; all tested algorithms achieved a detection rate of 100% for large and medium fires. However, for small or distant fires, where fire features are not dominant compared to other objects in the scene, some algorithms do not perform well. False detection is also a problem in fire detection, as many moving objects have characteristics similar to those of fire flames. In some cases (test20), even human eyes cannot recognize whether an object is a fire from a single image; therefore, algorithms that rely only on spatial features will produce many false detections and miss real fires. The following paragraphs provide more details about our experiments.

The author of [1] detects fire flames based on color and movement attributes and then analyzes the temporal variation in the fire intensity and the spatial color variation to eliminate false detections. This algorithm uses heuristic rules and thresholds to make decisions, and selecting a suitable threshold is not easy: a lenient threshold increases the detection rate but may also increase the false alarm rate, while a strict threshold prevents false detections but may miss many true fires. Furthermore, the algorithm ignores the spatial structures that identify object shapes, so color and simple rule-based temporal analysis alone sometimes cannot distinguish fires from nonfire moving objects. Our experimental results show that this algorithm has a good detection rate but also produces many false detections.
Similarly, [12] uses color and temporal variation to detect fires. The difference between [12] and [1] is that the author of [12] uses a 1-D temporal wavelet transform and a 2-D spatial wavelet transform to eliminate false detections. Our experiment shows that this algorithm suffers from the same problems as the method of [1] and produces many false detections.
Ref. [19] used a CNN model to detect the occurrence of fire in videos. However, a CNN model can only characterize the spatial information of fire regions and lacks the temporal features needed to detect fire correctly. The author of [20] first detects candidate fire regions by using color and flickering features; a cascade classifier combining many weak classifiers was then constructed for robust fire/nonfire classification, and deep-learning image classification using convolutional neural networks (CNNs) was applied to increase the accuracy of the algorithm. The algorithm produces good results in terms of both detection and eliminating false detections; however, it still fails on test20, where the human eye cannot tell from a single image whether there is a fire. In such cases, an object must be monitored over a period of time before a final decision is made, and a CNN can only work with a single image at a time.

To address this limitation of the CNN model in [20], the author of [29] applied a combined CNN-LSTM model to classify sequences of images directly into fire and nonfire categories. However, this algorithm uses the entire image to train the CNN-LSTM model, which causes problems when detecting small fires whose visual features do not dominate the scene: the algorithm sometimes models the dynamic textures of other moving objects rather than the movement of the fire flames in the image sequences. Our experimental results show that this algorithm missed many real but small fires.
By applying multistage fire detection, our proposed algorithm achieved the best results for small fires and for avoiding the false detection of nonfire moving objects. Table 4 presents the qualitative results of our method and the compared methods on a small part of our full dataset (23 videos).
In addition to the qualitative evaluation, we conducted a quantitative evaluation of the methods on the full dataset. The evaluation metrics are precision, recall, and F1 score computed via 10-fold cross-validation; the detailed results are given in Table 5. Our proposed method achieves the highest F1 score, which shows that it detects fire correctly with a low false-positive rate. Other methods, such as those of [1] and [12], achieve high accuracy but low F1 scores due to high false-positive rates. The methods of [19], [20], and [29] achieve low recall because they are not suitable for detecting the small fire objects that occur frequently in our dataset.
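These metrics are computed from counts of true positives, false positives, and false negatives accumulated over the test folds. A minimal sketch follows; the counts in the example are synthetic placeholders, not results from our dataset.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts.

    Precision = TP / (TP + FP), recall = TP / (TP + FN),
    F1 = harmonic mean of precision and recall.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Synthetic example: 500 fires detected, 20 false alarms, 34 missed fires.
p, r, f1 = precision_recall_f1(tp=500, fp=20, fn=34)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Under 10-fold cross-validation, these values are computed per fold and averaged; a high false-positive rate depresses precision and hence the F1 score even when overall accuracy looks good.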
In the performance test, we measured the processing time of every step of the algorithm. Visual CNN feature extraction takes approximately 10 milliseconds, and LSTM sequence image classification takes approximately 10 milliseconds, which makes the method fully suited to real-time applications.
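Per-stage latency can be measured by wrapping each stage with a monotonic timer and averaging over repeated runs. The sketch below is generic; `extract_features` and `classify_sequence` are lightweight placeholders standing in for the actual CNN and LSTM stages.

```python
import time

def timed(fn, *args, repeats=100):
    """Return fn's result and its average wall-clock time in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000 / repeats
    return result, elapsed_ms

# Placeholder stages standing in for CNN feature extraction and
# LSTM sequence classification.
def extract_features(frame):
    return [x * 0.5 for x in frame]

def classify_sequence(features):
    return sum(features) > 0

features, t_cnn = timed(extract_features, list(range(512)))
is_fire, t_lstm = timed(classify_sequence, features)
print(f"CNN: {t_cnn:.3f} ms, LSTM: {t_lstm:.3f} ms")
```

Using `time.perf_counter` (a monotonic, high-resolution clock) and averaging over many runs smooths out scheduling jitter, which matters when the per-stage budget is on the order of 10 ms.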

V. CONCLUSION AND FUTURE WORK
Automatic fire detection is very important, as prompt fire warnings give people a better chance to escape and reduce the damage caused by fires. This paper introduces a video-based fire detection algorithm for use in an early fire-alarm system. The algorithm uses a combination of a CNN and an LSTM network to analyze fire flames in both the spatial and temporal domains. By exploiting the advantages of both CNNs and LSTMs in computer vision tasks, our algorithm achieved a significant improvement over existing methods. The experimental results show that our algorithm is fast, reliable, and fully suited for use in real-time surveillance systems. Furthermore, the advantages of our proposed method can be highlighted as follows:
- The proposed model is threshold-free and therefore suitable for application in various weather conditions.
- The proposed model can achieve high accuracy with a low false alarm rate.
- Unlike other methods that apply fire classifiers to entire images, resulting in missed detections of small fires, our proposed method incorporates a fire candidate extraction stage, so our system can detect fires of various sizes. In addition, our method is fast because it processes small cropped fire images as the input of the CNN-LSTM model.
- We collected the fire video dataset from two main sources: fire videos crawled from the internet and fire videos from real-world deployments of our proposed method in various weather conditions. Our dataset will be made available for research purposes.
Despite the positive results, our proposed method still has several limitations, such as the detection of fires of different colors (blue fire, white fire), unstable basic features for extracting fire candidates, and a lack of fire image data. In future work, we will improve the proposed method as follows:
- We will collect more data during the deployment of the proposed system in various real-world environments, especially for blue and white fire flames.
- We will integrate a fire candidate segmentation model to eliminate the instability of the current basic features.
- We will investigate an end-to-end model training scheme by changing the annotation of fire flames.
- We will investigate how to integrate smoke detection to improve the early fire detection ability.