A deep learning model using geostationary satellite data for forest fire detection with reduced detection latency

ABSTRACT Although remote sensing of active fires is well-researched, their early detection has received less attention. Additionally, simple threshold approaches based on contextual statistical analysis suffer from generalization problems. Therefore, this study proposes a deep learning-based forest fire detection algorithm, with a focus on reducing detection latency, utilizing high-temporal-resolution (10-min interval) Himawari-8 Advanced Himawari Imager data. Random forest (RF) and convolutional neural network (CNN) models were used for model development. The CNN model accurately reflected the contextual approach adopted in previous studies by learning the information between adjacent pixels in an image. This study also investigates the contribution of temporal and spatial information to the two machine learning techniques by combining input features. Temporal and spatial factors contributed to the reduction in detection latency and false alarms, respectively, and forest fires were most effectively detected using both types of information. The overall accuracy, precision, recall, and F1-score were 0.97, 0.89, 0.41, and 0.54, respectively, for the best RF-based scheme and 0.98, 0.91, 0.63, and 0.74, respectively, for the best CNN-based scheme. This indicates the better performance of the CNN model for forest fire detection, which is attributed to its spatial pattern training and data augmentation. The CNN model detected all test forest fires within an average of 12 min, and one case was detected 9 min earlier than the recorded time. Moreover, the proposed model outperformed recent operational satellite-based active fire detection algorithms. Further spatial generality tests showed that the CNN model had reliable generality and was robust under varied environmental conditions. Overall, our results demonstrate the benefits of geostationary satellite-based remote sensing for forest fire monitoring.


Introduction
Forest fires are among the most frequently occurring disasters in the world. A single forest fire can cause extensive damage, including the destruction of forest ecosystems and the emission of air pollutants, recovery from which requires significant time and financial resources (Yarragunta et al. 2020; Bar et al. 2021). Globally, the occurrence of forest fires has increased in recent years owing to elevated temperatures and dry weather conditions (Ruffault et al. 2018; Xie et al. 2018). Rapid suppression of forest fires can prevent their spread and minimize environmental and economic damage. Thus, there is an urgent need for the early detection and monitoring of forest fires (Huh and Lee 2017).
Satellite data are regularly captured over large areas in various spectral channels, including heat-sensitive infrared (IR), and have been widely used to monitor forest fires (Xie et al. 2018; de Luca, Silva, and Modica 2021). The most widely used polar-orbiting satellite-derived fire products are the Moderate Resolution Imaging Spectroradiometer (MODIS) and Visible Infrared Imaging Radiometer Suite (VIIRS) active fire products (MCD14DL, VNP14IMGTDL_NRT, and VJ114IMGTDL_NRT). The MODIS active fire product is generated using a threshold-based algorithm primarily based on the contextual information of the mid-infrared (MIR; 4 μm) channel (Giglio et al. 2003; Giglio, Schroeder, and Justice 2016). High brightness temperature (BT) in the MIR channel has been used as a critical indicator for detecting forest fires. Many forest fire detection studies using polar-orbiting satellite data have frequently used contextual approaches that take advantage of high spatial resolution (30 m-1 km) (Kumar and Roy 2018; Hu, Ban, and Nascetti 2021; Schroeder et al. 2016). However, owing to their relatively low temporal resolution, polar-orbiting satellite sensors have difficulty detecting forest fires in real time (Hall et al. 2019).
Geostationary satellite sensors have a high temporal resolution (on the order of minutes) and are thus ideal for the rapid detection and monitoring of forest fires. Himawari-8, the Geostationary Operational Environmental Satellite series, and the Spinning Enhanced Visible and InfraRed Imager are widely used geostationary satellite sensors for monitoring forest fires worldwide (Hall et al. 2019; Xu and Zhong 2017). Threshold-based anomaly detection using multitemporal images or contextual analysis has frequently been used for detecting fires from geostationary satellite data (Lin et al. 2018; Wickramasinghe et al. 2016). However, only a few studies have considered the temporal and spatial characteristics of forest fires, which are often useful for reducing detection errors (Lin et al. 2019; Xie et al. 2018). Although previous studies using geostationary satellite data reported faster detection of forest fires than MODIS products, they focused on active fire detection rather than on reducing detection latency (Koltunov et al. 2016; Xie et al. 2018; Xu and Guang 2017). Detecting forest fires in their early stages is challenging because of the coarse spatial resolution (e.g. 2 km) of geostationary satellite data (Kato et al. 2021; Xie et al. 2018). Moreover, threshold-based algorithms frequently struggle to identify thresholds that are adequate for multiple forest fires across different environments and times, which is often problematic for generalization (Huh and Lee 2017).
Recent studies have shown that artificial intelligence can overcome the shortcomings of existing threshold-based approaches for satellite-based forest fire detection owing to its ability to find more complex and precise relationships among thresholds (de Almeida Pereira et al. 2021; Jang et al. 2019; Kato et al. 2021). For example, Jang et al. (2019) demonstrated that the random forest approach enabled the detection of small-scale forest fires (e.g. damaged areas of 1-2 ha) using geostationary satellite data, thereby significantly reducing false alarm rates. Convolutional neural networks (CNNs), a deep learning technique specialized for image analysis and spatial pattern recognition, have successfully classified forest fire perimeters (de Almeida Pereira et al. 2021; Gargiulo et al. 2019). However, deep learning-based forest fire detection has typically been conducted using high spatial resolution satellite images, in which even the human eye can observe forest fires (de Almeida Pereira et al. 2021; Rostami et al. 2022). Therefore, a new strategy is required to apply deep learning to geostationary satellite images with relatively low spatial resolution. Recently, Hong et al. (2022) examined the effectiveness of CNNs for Himawari-8 satellite image-based active fire detection; however, they focused on CNN structures and their performance rather than on temporal and spatial transferability. Therefore, for effective and operational deep learning-based fire monitoring based on geostationary satellite images, the latency of forest fire detection, useful input features, and potential limitations should be examined further.
Although no studies have yet used deep learning for the early detection of small-scale forest fires from satellite data, CNNs with time-series information are expected to be capable of detecting forest fires in a timely manner. Therefore, in this study, a novel deep learning-based approach that considers both temporal and spatial information, focusing on reducing detection latency, was proposed for detecting forest fires. The specific objectives were to develop a CNN model for forest fire detection that combines spatial and temporal information optimized for geostationary satellite images, evaluate the proposed model for the early detection of forest fires, characterize false alarm patterns, and examine the generalization of the approach across East Asia.

Study area
The study area was East Asia, specifically the two sites shown in Figure 1, where forest fires frequently occurred. Although in situ forest fire data were officially provided by the Korea Forest Service for South Korea (site 1 in Figure 1), such data were not available for site 2. Therefore, model construction and validation were conducted using data from South Korea (site 1). Fire cases in North Korea and China were used as additional test data to examine the generalization of the proposed models (site 2). Fire cases in China were extracted from previously published papers (Jang et al. 2019; Xie et al. 2018), whereas those in North Korea were extracted from MCD14DL/VNP14IMGTDL and cross-checked with Sentinel-2 satellite images (Figure 1; Table S1).
Forest fires in South Korea usually occur in the dry and warm spring seasons and rarely in summer when precipitation is concentrated (Jang et al. 2019;Kang et al. 2020). Most forest fires in South Korea are caused by human activities, do not last more than one day, and are limited to a damaged area of approximately 1-2 ha. Eastern China, which is largely covered by forests, crops, and grass, has a typical boreal forest landscape that is prone to the spread of forest fires (Xie et al. 2018).

Satellite-based data
The Japan Meteorological Agency launched the geostationary Himawari-8 satellite in October 2014. The Himawari-8 Advanced Himawari Imager (AHI) observes the study area every 10 min and provides 0.5-2 km images for 16 channels in the visible to thermal IR spectral regions. In this study, full-disk images with a spatial resolution of 2 km provided by JAXA P-Tree were used (https://www.eorc.jaxa.jp/ptree/index.html). The channel data used for forest fire detection in this study are summarized in Table S2. Six visible and five IR channels were used for cloud masking (Lim et al. 2018). A channel with a central wavelength of 3.85 μm and four thermal IR channels, which have been frequently used in previous studies, were used to detect forest fires (Jang et al. 2019; Lin et al. 2018; Guang and Zhong 2017). The Himawari Wild Fire Level 2 product was used for comparison with the proposed algorithm; it provides the location and related information of fires at 10 min intervals with a 2 km spatial resolution. The Himawari wildfire detection algorithm uses a threshold-based approach based on the normalized deviation of the BT at 4 μm from the background temperature determined by the BT at 10.8 μm within an 11 × 11 window (https://www.eorc.jaxa.jp/ptree/index.html). MODIS and VIIRS active fire products were also used to compare the results of the proposed models, as they have been widely used for comparison in forest fire detection studies (Lin et al. 2018; Xie et al. 2018; Guang and Zhong 2017). The MODIS (MCD14DL) and VIIRS (VNP14IMGTDL) active fire data were downloaded from the National Aeronautics and Space Administration Fire Information for Resource Management System (https://earthdata.nasa.gov/active-fire-data). The MODIS product is provided four times a day, with a spatial resolution of 1 km, whereas the VIIRS product is provided twice a day, with a 375 m spatial resolution (Louis, Schroeder, and Justice 2016; Waigl et al. 2017).

Figure 1. The study area includes South Korea, North Korea, and the eastern part of China. Site 1 is the main study area used to develop the forest fire detection models. Site 2 was selected to evaluate the transferability of the proposed models using fire cases A1-6. The detailed images for cases A4-6 are shown with Sentinel-2 RGB images. Blue dots indicate fire locations. The background image depicts land cover classes extracted from the Moderate Resolution Imaging Spectroradiometer land cover type product (MCD12Q1).
The MODIS land cover type product (MCD12Q1) for 2018, with a spatial resolution of 500 m, was downloaded from the United States Geological Survey Earthdata portal (https://search.earthdata.nasa.gov/). In total, 17 subclasses, provided by the International Geosphere-Biosphere Programme (IGBP), were reclassified into seven classes as follows: crop, barren, forest, grass, urban, water, and wetland. Land cover data were used in post-processing of the model output.

Forest fire reference data
The Korea Forest Service (https://www.forest.go.kr) provided in situ forest fire data collected through site visits by regional public experts. The data contained the fire start and end times, location, extent of the damaged area, and cause. In this study, forest fires with damaged areas > 0.7 ha (based on Jang et al. (2019)) occurring from October 2015 to December 2019, without cloud contamination, were selected. Finally, 91 forest fire cases were used as the reference data. Seven of these cases were recorded as large forest fires with damaged areas of >100 ha, whereas 16 cases were considered small-scale forest fires of <1 ha.

Methods
In this study, machine-learning-based forest fire detection models were developed using Himawari-8 AHI as the primary input data source to target in situ forest fire/non-fire reference data. Three RF- and two CNN-based schemes were designed using various combinations of input features. The machine learning-based modeling results were post-processed to reduce false alarms, including those in non-forested areas, using land cover ratios calculated from MODIS land cover data. The final output was compared with the MODIS and VIIRS active fire products. Figure 2 shows a schematic flow diagram of the process proposed in this study.

Data preprocessing
Clouds obscure the energy emitted by forest fires; thus, cloud removal is crucial for detecting forest fires using optical satellite data. Every 10 min, the Himawari-8 AHI produces an official cloud product with a 5 km spatial resolution and a quality flag. Owing to its coarse resolution (i.e. 5 km), many studies have used customized cloud masks with a finer resolution of 2 km (Jang et al. 2019; Wang et al. 2020; Guang and Zhong 2017). This study adopted the cloud-masking procedure used in the aerosol optical depth retrieval algorithm for Himawari-8 satellite data (Hyunkwang et al. 2018).
Although forest fires are characterized by high BT values, high BT values in sparsely vegetated areas are the main source of false alarms (Louis, Schroeder, and Justice 2016; Jang et al. 2019). Thus, the 500 m land cover data, with seven classes, were used to calculate the ratio of each class to the area of a 2 km window for post-processing of the machine learning-based modeling results.

Modeling process
CNNs have recently been used in active fire detection owing to their ability to overcome the limitations of fixed threshold-based contextual analysis while retaining the contextual concept. RF was adopted as the control model for comparison with the proposed CNN-based model because it is a widely used machine learning technique that performs well in many classification problems, including fire detection (de Luca, Silva, and Modica 2022; Jang et al. 2021; Stroppiana et al. 2022). With a focus on forest fire monitoring with reduced detection latency, it is useful to investigate the performance and characteristics of these two representative pixel-based and image-based models.

Convolutional neural networks
A CNN extracts useful features from 2-D input data through a series of convolution and pooling operations (Lee et al. 2021; Zhao et al. 2020). The convolution layer calculates the dot product between a series of moving filters of various sizes and local regions of the input as the filters slide over the entire image. By extracting localized features within the window, the pooling layer reduces the image size and simplifies the model to prevent overfitting (Lee et al. 2020; Yoo et al. 2019).
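The convolution and pooling operations described above can be illustrated with a minimal NumPy sketch (not the authors' implementation; the 3 × 3 averaging kernel and 2 × 2 pooling size are illustrative choices):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution: dot product of the filter with each window."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(image, size=2):
    """Non-overlapping max pooling, which shrinks the feature map."""
    h2, w2 = image.shape[0] // size, image.shape[1] // size
    return image[:h2 * size, :w2 * size].reshape(h2, size, w2, size).max(axis=(1, 3))

# A 9 x 9 single-channel patch, mirroring the model's input window size
patch = np.arange(81, dtype=float).reshape(9, 9)
feat = conv2d(patch, np.ones((3, 3)) / 9.0)  # 3 x 3 averaging filter -> 7 x 7 map
pooled = max_pool(feat, 2)                   # 7 x 7 -> 3 x 3
```

Each convolution shrinks a 9 × 9 patch to 7 × 7, and pooling then halves it, which is why only a small number of such layers fit a window of this size.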
The optimal structure of the CNN was determined by experimenting with various combinations of filter sizes, activation functions, numbers of convolution and pooling layers, and batch sizes (e.g. filter sizes of 32, 64, and 128) (Figure 3). The final model was trained for 100 epochs with a batch size of 64; the other parameters are shown in Figure 3. To detect forest fires, a sufficiently large window is required to account for the surrounding area unaffected by active fires. However, cloud contamination increases the likelihood of missing values within large windows. After testing several window sizes ranging from 7 × 7 to 13 × 13, a 9 × 9 window was selected. Consequently, the input data consisted of 9 × 9 window images containing N input features, and the output was a binary class indicating whether the center pixel of the window was a forest fire. The CNN was implemented using the Keras library in Python 3.8.

Random forest machine learning
RF is an ensemble machine-learning technique based on a multitude of decision trees (Breiman 2001). For classification problems, the final decision is made by majority voting across the trees. To ensure independence among the decision trees, RF constructs each tree using random subsets of the input features and samples (Kang et al. 2021). In RF, the number of trees and the maximum depth are critical parameters. The number of trees was set to 500 using grid search optimization, and the maximum depth was set to "none," allowing each tree to grow as deep as possible. The RF model was implemented using the scikit-learn library in Python 3.8.
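A minimal scikit-learn sketch with the reported parameters (500 trees, unrestricted depth); the synthetic features and labels are purely illustrative stand-ins for the pixel-based inputs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
# Hypothetical pixel-based features (e.g. a BT and several BT differences)
X = rng.normal(size=(400, 6))
y = (X[:, 0] > 1.0).astype(int)  # stand-in "fire" label tied to one feature

# Parameters reported in the text: 500 trees, unrestricted depth
rf = RandomForestClassifier(n_estimators=500, max_depth=None, random_state=0)
rf.fit(X, y)
pred = rf.predict(X)
```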

Sampling strategy
For each fire case, forest fire samples were extracted from the in situ reference data from the start to the peak time. Owing to the wide variability of non-fire conditions, it is crucial to obtain representative non-fire samples. Therefore, the following three strategies were used to extract 93,270 non-fire samples from the images containing forest fires.
The first strategy focuses on the statistical distribution of the input feature values (BTs in this study). For each satellite image, non-fire samples covering the entire range of the input feature values were extracted using percentile ranks: every 5th percentile from 0-80% and every 0.05th percentile from 80-100%. Extracting more non-fire samples at high percentile ranks can contribute to reducing false alarms.
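One possible reading of this percentile scheme, sketched with NumPy on synthetic BTs (the exact rank spacing and the data are illustrative assumptions, not the authors' procedure):

```python
import numpy as np

# Percentile ranks: every 5th percentile up to 80%, then every 0.05th
# percentile from 80-100%, densely sampling the warm tail of the distribution
ranks = np.concatenate([np.arange(0, 80, 5), np.linspace(80, 100, 401)])

bt = np.random.default_rng(0).normal(290, 10, size=100_000)  # synthetic BTs (K)
non_fire_bts = np.percentile(bt, ranks)
```

Of the 417 sampled ranks, 401 fall in the top 20% of the distribution, which is where warm non-fire pixels most resemble fires.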
The second strategy considers the spatial distribution of the non-fire samples. As this study used the spatial component of the input features, it was critical to consider not only the BTs themselves, but also their surrounding information. Because the spatial variability of BTs depends on factors such as elevation and land cover composition, non-fire samples were collected regularly throughout the study area.
Third, missing values were filled within the input images that satisfied the first and second strategies. If no missing values were tolerated within a window, few samples would be available when a large window size is used. The number of available samples was therefore increased by allowing a certain percentage of missing values; the tolerated missing value ratio within a window was set to 40% based on empirical tests over a range of ratios (Figure S1). To fill in missing values, various methods, ranging from simple averaging to spatial interpolation, have been used in many studies (Zhang, Li, and Guo 2020). In this study, the window average was used to fill the missing pixels, subject to the missing value ratio within the window.
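This gap-filling rule can be sketched as follows, assuming a 9 × 9 window, NaN-coded missing pixels, and the 40% tolerance described above (the function name is illustrative):

```python
import numpy as np

MAX_MISSING_RATIO = 0.4  # empirically determined tolerance from the text

def fill_window(window):
    """Fill NaNs with the window mean if the missing ratio is tolerable;
    otherwise reject the sample (return None)."""
    missing = np.isnan(window)
    if missing.mean() > MAX_MISSING_RATIO:
        return None
    filled = window.copy()
    filled[missing] = np.nanmean(window)
    return filled

w = np.full((9, 9), 300.0)  # e.g. BTs in kelvin
w[0, :5] = np.nan           # 5 of 81 pixels missing (~6%), well under 40%
out = fill_window(w)
```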
Using these procedures, 91 fire cases (2,157 Himawari-8 AHI images with 7,795 fire samples and 93,270 non-fire samples) were extracted. Fires do not always appear with the same orientation in satellite images; instead, they may develop in the horizontal or vertical direction and have irregular shapes owing to wind. Because the shape of the fire pixels is crucial for CNN model training, the sample images were processed through straightforward data augmentation using rotation and flipping so that fires with various shapes could be detected. Finally, the number of samples increased eight-fold.
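The eight-fold augmentation (four rotations, each optionally flipped) can be sketched as follows; the function name and the toy patch are illustrative:

```python
import numpy as np

def augment(patch):
    """Eight-fold augmentation: 4 rotations x 2 flip states per rotation."""
    variants = []
    for k in range(4):
        rot = np.rot90(patch, k)
        variants.append(rot)
        variants.append(np.fliplr(rot))
    return variants

patch = np.zeros((9, 9))
patch[4, 4:7] = 1.0   # a small elongated "fire" streak
aug = augment(patch)  # eight differently oriented copies
```

The four rotations combined with two flip states generate the eight symmetries of a square patch, matching the eight-fold sample increase reported above.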

Schemes
Three types of input features, spectral, temporal, and spatial, were used in this study (Table 1). First, the spectral input features were the BT at 3.85 μm and the BT differences between 3.85 μm and the longwave IR wavelengths. Unlike the BT at 3.85 μm, longwave IR BTs do not change significantly in response to forest fires; thus, the BT differences were used (Jang et al. 2019; Lin et al. 2018; Guang and Zhong 2017). Second, a temporal input feature was used to capture the abnormal change in BT at 3.85 μm in the forest fire area relative to the pre-fire condition. For each pixel, the temporal input feature was derived as the difference between the BT at the target time and the average BT at the same time of day over the previous 15 days. To exclude erroneous observations, the maximum and minimum values of the 15 days were not included in the average. Last, the spatial input feature assumed that the BT at 3.85 μm in the fire area was greater than that in the surrounding areas. After excluding the maximum and minimum values, the difference between the center pixel and the mean of the surrounding pixels within a 9 × 9 window was calculated.
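A minimal sketch of how the temporal and spatial features could be computed; the helper names and example values are illustrative, not from the paper:

```python
def trimmed_mean(values):
    """Mean after discarding one maximum and one minimum value."""
    v = sorted(values)
    return sum(v[1:-1]) / len(v[1:-1])

def temporal_feature(bt_now, bt_history):
    """BT anomaly vs. the trimmed mean of same-time BTs over prior days."""
    return bt_now - trimmed_mean(bt_history)

def spatial_feature(window):
    """Center-pixel BT minus the trimmed mean of its 9 x 9 surroundings."""
    flat = [v for row in window for v in row]
    center = window[4][4]
    neighbors = flat[:4 * 9 + 4] + flat[4 * 9 + 5:]  # drop the center pixel
    return center - trimmed_mean(neighbors)

history = [300.0] * 13 + [250.0, 340.0]    # one low and one high outlier
t_feat = temporal_feature(330.0, history)  # outliers trimmed -> 330 - 300

window = [[300.0] * 9 for _ in range(9)]
window[4][4] = 330.0
s_feat = spatial_feature(window)           # 330 - 300
```

Trimming the extremes keeps a single cloudy or anomalously hot observation from distorting the background estimate.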
To investigate the contribution of the input features to the model, five schemes were designed using various combinations of input features (Table 1): three RF-based schemes designated RS1-3 and two CNN-based schemes designated CS1-2. RS1 used only spectral input features, RS2 added temporal information to the spectral features, and RS3 added both temporal and spatial information. CS1 used spectral input features, whereas CS2 combined spectral and temporal features. Because a CNN itself learns the spatial patterns of the input images, spatial features were not additionally used in the CNN-based schemes.

Table 1. Description of the input features and schemes. RS1-3 used only pixel-based information as input features, whereas CS1-2 used 2-D images. "O" indicates that the input feature was used in the corresponding scheme.

Post-processing
The outputs of the RF and CNN models were further processed to reduce false alarms by incorporating information regarding the surrounding land cover ratios. Specific land cover ratio thresholds were established to differentiate false alarms from forest fires. False alarm pixels generally had lower forest ratios and higher ratios of the other land cover classes than forest fire pixels (Figure 4). In particular, the crop, built-up, and water classes revealed significant differences between fire and false alarm pixels, which could be used to reduce false alarms. The following four conditions were applied after empirically testing various combinations of thresholds; if a potential fire pixel met any of these conditions, it was considered a non-fire pixel.

Condition 1: Crop ratio within the 9 × 9 window > 0.4
Condition 2: Forest ratio within the 9 × 9 window < 0.25
Condition 3: Built-up ratio within the 9 × 9 window > 0.2
Condition 4: Water ratio within the 9 × 9 window > 0.2
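The four conditions above can be expressed as a simple filter; the dictionary keys and example ratios are illustrative:

```python
def is_false_alarm(ratios):
    """Return True if any of the four land-cover conditions flags the pixel.
    `ratios` maps a class name to its ratio within the 9 x 9 window."""
    return (ratios.get("crop", 0.0) > 0.4         # Condition 1
            or ratios.get("forest", 0.0) < 0.25   # Condition 2
            or ratios.get("built_up", 0.0) > 0.2  # Condition 3
            or ratios.get("water", 0.0) > 0.2)    # Condition 4

# A detection surrounded mostly by forest survives post-processing
keep = not is_false_alarm({"forest": 0.8, "crop": 0.1, "built_up": 0.0, "water": 0.0})
# A detection dominated by cropland is rejected (Condition 1)
reject = is_false_alarm({"forest": 0.3, "crop": 0.5, "built_up": 0.0, "water": 0.0})
```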

Accuracy assessment
A total of 91 forest fire cases (2,157 Himawari-8 AHI images with 7,795 fire samples and 93,270 non-fire samples) were divided into five folds (19, 18, 18, 18, and 18 cases) for cross-validation (CV). One fold was used for testing the models, whereas the other four were used for training; this process was repeated for all five folds. The training fire cases were further divided into 80% for model construction and 20% for validation, the latter being used to optimize the parameters for model construction. To investigate the spatiotemporal transferability of the models, the fire cases were distributed among the folds based on the damaged area and dates. Each fold contained at least one large fire (>50 ha) and one small fire (<1 ha).
The developed models were evaluated using the following commonly used accuracy metrics for classification problems: overall accuracy (OA), recall, precision, and F1-score (Jang et al. 2019; Xie et al. 2018). The OA is an intuitive indicator of the classification accuracy of each class across the entire dataset. Recall measures the model's ability to correctly classify forest fires among the reference forest fires. Precision is the proportion of correctly classified forest fires among the fire decisions made by the model. The F1-score, calculated as the harmonic mean of recall and precision and ranging from 0 to 1, is frequently used to evaluate unbalanced data, as the OA can be misleading when the class sample numbers are unbalanced (Bekkar, Djemaa, and Alitouche 2013).

Figure 5 depicts the evaluation metrics (OA, precision, recall, and F1-score) of the five schemes. All schemes produced high OA values, indicating that they were capable of discriminating between fire and non-fire pixels. RS2 and RS3 had a higher recall than RS1, implying that temporal information aided the detection of forest fires. Compared with RS2, RS3 showed an increase in precision, implying that false alarms were effectively reduced by simple spatial information. RS3 outperformed the other RF-based schemes, demonstrating the importance of considering both temporal and spatial information when detecting forest fires. The high precision of CS1 and CS2 also confirms the effectiveness of considering spatial information using a CNN for forest fire detection. These findings indicate that image-based CNNs outperformed pixel-based RFs in detecting forest fires. Finally, based on the accuracy metrics, CS2, which had the highest F1-score, was the best model (Figure 5).
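For reference, the four metrics can be computed from confusion-matrix counts; the counts below are illustrative (chosen to mimic a heavily imbalanced fire/non-fire test set), not the study's actual results:

```python
def classification_metrics(tp, fp, fn, tn):
    """Overall accuracy, precision, recall, and F1 from confusion counts."""
    oa = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return oa, precision, recall, f1

# With heavy class imbalance, OA stays near 1 even when recall is modest,
# which is why the F1-score is the more informative summary here
oa, p, r, f1 = classification_metrics(tp=63, fp=7, fn=37, tn=9300)
```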

Assessment of forest fire detection accuracy
Among the 91 total cases, 63, 82, and 88 were detected by RS3, CS1, and CS2, respectively (Figure 6). This finding is consistent with the results of the accuracy assessment, i.e. the CNN has a higher recall than the RF. Meanwhile, only 16 cases were detected by the Himawari-8 AHI wildfire product. Despite using the same satellite data, our proposed deep learning-based fire detection algorithm outperformed the threshold-based Himawari-8 product.
The current threshold-based approach has limitations in detecting the small-scale fires that are prevalent in South Korea. Because the approach is designed to effectively recognize global wildfires, it rarely detects small fires, especially those with weak signals (Jang et al. 2019). The high performance of our proposed model can be attributed to its training mostly on small-scale forest fires that have occurred in South Korea. Thus, spatial generality was examined to determine whether this model could detect wildfires in regions other than South Korea (section 4.5).
MODIS and VIIRS detected only 10 and 16 cases, respectively, because of their relatively long revisit cycles. VIIRS detects forest fires more accurately than MODIS, possibly because of its higher spatial resolution (Fu et al. 2020; Schroeder et al. 2014; Waigl et al. 2017). Because MODIS and VIIRS have limited temporal resolutions, they are not suitable for detecting forest fires in real time. Among the 40 fire cases that overlapped with the MODIS/VIIRS observation times, 22, 24, and 26 were detected by RS3, CS1, and CS2, respectively. Consequently, the developed models detected forest fires better than MODIS and VIIRS, despite the lower spatial resolution of the Himawari-8 data.

Figure 4. Land cover ratios calculated within a 9 × 9 window for (a) RF and (b) CNN models, used for post-processing of the forest fire detection results. The box represents the interquartile range, and the middle line represents the median value. The upper whisker indicates the range between the upper quartile and the non-outlier maximum, and vice versa.

Assessment of feasibility of reducing detection latency
RS3, CS1, and CS2 (i.e. the top three schemes) were evaluated in terms of the detection latency for initial detection (Figure 6). Detection latency was categorized as follows: 1) detected prior to the reporting time, 2) detected at the same time as the reporting time, 3) detected within 60 min of the reporting time, and 4) detected more than 60 min after the reporting time. In most cases, CS2 detected fires at the same time as or slightly faster than CS1 (Table S3), detecting most fires within 60 min. Four of the five cases detected after more than 60 min had such long latencies owing to clouds. CS2, which had the shortest detection delay, detected 27 of the 91 cases simultaneously with the reporting time, with an average time to first detection of 12 min (Table S3), excluding four cases that were cloud-masked in the early stage of the fire. This was faster than the 24 min lag time previously reported by Jang et al. (2019). As in previous studies, there was no clear relationship between detection latency and damaged area (Jang et al. 2019). Interestingly, case 59 was detected by CS2 9 min earlier than the time recorded in the reference data. This forest fire was detected by the satellite sensor before it was reported by people early in the morning (06:29 local time). Although the Himawari-8 AHI Wildfire product performed poorly in detecting forest fires (section 4.1), it detected 12 cases simultaneously with the reporting time, which is attributed to the high temporal resolution of the sensor. CS2 also detected small-scale forest fires much faster than MODIS and VIIRS, emphasizing the advantage of geostationary satellite-based models for reducing detection latency.

Fire detection mapping
The three schemes (RS3 and CS1-2) produced almost no false alarms, implying that spatial information effectively eliminated false alarms. The fire detection mapping results were in good agreement with the high precision of RS3 and CS1-2. However, RS3 failed to detect some forest fires that CS1-2 detected, indicating that the CNN can detect forest fires better than the RF. Nevertheless, the models showed a tendency toward overdetection for some forest fires (i.e. larger areas than the actual damaged areas), where a fire alarm was marked at multiple pixels surrounding the fire location owing to high BT values (Figure 7d and 7f). This pixel adjacency problem, in which the signal from forest fires affects the surrounding pixels, is a common issue, especially when off-nadir data are used (Kato et al. 2021; Li et al. 2020). Notably, the study sites were located in a high-scan-angle region.

False alarms and post-processing effects
When the proposed model was applied to the entire study area, the number of target pixels for the test cases was 99,224,157 (i.e. 2,157 images × 46,001 pixels). Very few false alarms occurred for each of the five schemes: 26,480 (0.03%), 27,736 (0.03%), 31,035 (0.03%), 29,709 (0.03%), and 28,825 (0.03%) for RS1, RS2, RS3, CS1, and CS2, respectively. Post-processing successfully rejected 16,448 (62%), 9,637 (35%), 26,349 (85%), 12,213 (41%), and 11,149 (39%) false alarm pixels (rejection ratios in parentheses) for the five schemes (Table 2). Most false alarms were removed under Condition 2 across all schemes. For the schemes with temporal information (RS2-3 and CS1-2), there were fewer false alarms related to Condition 3, indicating that temporal information was effective in removing stationary hotspots, such as built-up areas. Although there were many water-related false alarms for the schemes with spatial information, they were successfully removed by Condition 4. False alarms are frequently associated with clouds, coastlines (near water), and bright built-up areas, all of which have been reported in fire detection studies (Louis, Schroeder, and Justice 2016; Hall et al. 2019; Koltunov et al. 2016). Fortunately, most false alarms caused by land cover interference were effectively removed during post-processing. Figure 8 depicts a common example of post-processing and its results. A total of 29 false alarm pixels were successfully removed from the images. All false alarms occurred in non-forest areas, of which 10 pixels were removed owing to nearby shoreline or clouds, and six pixels were removed by the crop ratio. Even after post-processing, a few false alarms remained inland owing to imperfect cloud masking (Figure 8b).

Spatial transferability of the models
RS3 and CS2 were confirmed as the best schemes among the RF- and CNN-based schemes, respectively, and their performances were examined to evaluate the spatial transferability of the models using Chinese and North Korean fire cases. Typical forest fire detection algorithms rely on the BT of satellite images, rather than on environmental conditions (except for false alarm removal). Thus, they perform poorly in areas with varying environmental conditions (Yuyun et al. 2020; Waigl et al. 2017). As the models were mostly trained over forest areas in temperate climates, RS3 usually missed crop and savanna fires (Table S1), as fire radiation in low-biomass areas was typically lower than that in forests (Yuyun et al. 2020). RS3 detected only 9 of the 18 cases, whereas CS2 detected all cases (Table 3). Although a CNN is robust to changes in the external environment, it can be further improved if external conditions are incorporated into the modeling process. Case A3 was used for additional analysis of the early detection capability of the proposed model. Fire A3 began at 04:00 UTC; however, it only became visible to the satellite sensor at 04:40 UTC because of cloud contamination. RS3 detected the fire at 07:30 UTC, whereas CS2 detected it at 05:00 UTC, approximately 20 min after the area became cloudless. Although precise ignition times could not be obtained for the other cases, CS2 also detected forest fires faster than RS3. These results demonstrated the potential of the CNN model for spatial transferability.

Table 2. Reduction in false alarms using post-processing by condition.

Scheme   Condition 1 (%)   Condition 2 (%)   Condition 3 (%)   Condition 4 (%)   Total (%)
RS1      36                56                16                9                 62
RS2      21                31                5                 8                 35
RS3      5                 80                3                 79                85
CS1      18                32                9                 15                41
CS2      19                31                7                 -                 39

Superior performance of CNN
The superiority of the CNN models in forest fire detection can be attributed to data augmentation and the CNN structure. To demonstrate this, two variants of CS2 were constructed: one without data augmentation (herein, CS2 w/o aug) and one without convolution and pooling layers (herein, CS2 w/o conv). The latter retained only a simple dense layer, which allowed us to investigate the effectiveness of the CNN structure itself and determine the significance of spatial pattern learning. The performance of the CS2 w/o aug scheme was slightly lower than that of the original CS2 (Figure 9). A decrease in recall was particularly notable, indicating that the use of augmentation to learn different fire mask shapes contributed to an increase in the detection rate of forest fires. Moreover, the performance of the CS2 w/o conv scheme was noticeably reduced compared to CS2. With a recall of less than 0.1, the model rarely detected wildfires. This clearly demonstrates the importance of learning spatial patterns and extracting local features through convolution and pooling layers in forest fire detection.
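The role of the convolution and pooling layers can be illustrated with a minimal numerical sketch. The kernel and BT values below are illustrative assumptions, not the trained model's weights: a center-surround filter responds strongly to a 9 × 9 window whose center is hotter than its neighborhood, which is exactly the local pattern a flattened dense-only input discards.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D cross-correlation (no padding)."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(img, k=2):
    """Non-overlapping k x k max pooling (edges trimmed)."""
    h, w = img.shape[0] // k, img.shape[1] // k
    return img[:h * k, :w * k].reshape(h, k, w, k).max(axis=(1, 3))

# Hypothetical 9x9 BT window with a fire-like hot center pixel
window = np.full((9, 9), 290.0)
window[4, 4] = 320.0

# Center-surround kernel: activates where a pixel is hotter than its neighbors
kernel = -np.ones((3, 3)) / 8.0
kernel[1, 1] = 1.0

features = max_pool(conv2d(window, kernel))  # 7x7 conv map -> 3x3 pooled map
print(features.max())  # → 30.0 (strong activation at the hot center)
```

The filter output equals the center-minus-neighborhood BT contrast (320 − 290 = 30 K at the hotspot, 0 elsewhere), so the pooled feature map preserves the anomaly while discarding the flat background; this is the spatial pattern learning that the dense-only CS2 w/o conv variant cannot perform.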
The use of contextual information in the CNN also positively affected early detection. For example, because the RF used pixel-level input features, it had difficulty identifying a forest fire (i.e. case 49) when the input BT values were low. In contrast, the CNN could classify the case as a forest fire because all input features showed a spatial pattern with a higher center value than the neighboring pixels. The RF did not detect this fire earlier because the input features, except for the 3.85 μm BT, had relatively low values at 02:50 UTC (Figure 10).
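This center-versus-neighborhood pattern is essentially what classical contextual algorithms test statistically. A minimal sketch (with an assumed 3-sigma threshold and synthetic BT values, not the study's model) shows why a pixel with a modest absolute BT can still be flagged relative to its 9 × 9 background, whereas a pixel-level feature alone cannot express this:

```python
import numpy as np

def contextual_fire_test(window, k=3.0):
    """Flag the center pixel if its BT exceeds the background mean by k sigma.

    The 3-sigma threshold is an illustrative assumption borrowed from
    classical contextual fire detection, not the study's trained model.
    """
    center = window[4, 4]
    background = np.delete(window.ravel(), 40)  # the 80 neighbors, center excluded
    return bool(center > background.mean() + k * background.std())

# Deterministic synthetic background: BTs from 288 K to 292 K (mean ~290 K)
bg = np.linspace(288.0, 292.0, 81).reshape(9, 9)

cool = bg.copy()                    # no fire: center sits within the background
hot = bg.copy()
hot[4, 4] = 310.0                   # fire: center far above the background

print(contextual_fire_test(cool), contextual_fire_test(hot))  # False True
```

A pixel-level classifier sees only the 310 K (or 290 K) value in isolation; the contextual statistic makes the same value informative or uninformative depending on its surroundings, which is what the CNN learns implicitly from the windowed input.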

Delayed detection
The following are the main reasons for the delayed detection of forest fires (> 10 min): weak signals, heterogeneous land cover types, and cloud contamination. First, weak signals can be attributed to the spatial resolution of Himawari-8 AHI, which is too coarse to capture signals from small-scale fires, particularly those in the early stages. Subpixel-level omission errors have also often been found in MODIS (1 km) data (Yuyun et al. 2020). Second, a heterogeneous land cover distribution is prone to causing false alarms and misdetections, as each land cover type has a unique inherent BT value. For example, sparsely vegetated areas have a relatively higher BT than densely vegetated areas (Huh and Lee 2017). In case 42, the BT at the center was not significantly higher than that at the boundary pixels of the window (Figure 11). The BT values were very low in areas with high forest ratios, whereas they were high in areas with high grass and urban ratios.
Lastly, clouds over fires obstructed early detection; this is an inherent limitation of optical satellite data regardless of the effectiveness of the developed model. Furthermore, incorrect cloud masking could be a significant limiting factor in fire detection (Yuyun et al. 2020; Hall et al. 2019). Case 108, with a long lag time (384 min), illustrated this: had cloud masking errors been excluded, the fire could have been detected within 24 min.

Novelty and limitations
This study proposed the application of deep learning for geostationary satellite-based forest fire detection and had several novel aspects. First, to our knowledge, this is the first study to analyze CNN-based forest fire detection using geostationary satellite data. Temporal information was beneficial for minimizing detection delay, whereas spatial information was critical for reducing false alarms; consequently, using spatiotemporal information in addition to spectral information can significantly improve fire detection performance. Second, we examined in detail why the CNN performed better than the RF. Third, the CNN reduced the detection latency compared with the RF and other satellite-based active fire products. Finally, although the proposed algorithm utilized coarse spatial resolution data (Himawari-8), it detected small-scale forest fires in a more timely manner than higher spatial resolution satellite data (MODIS and VIIRS) owing to its much higher temporal resolution.
However, this study had some limitations. First, the RF and CNN models were not compared using identical input features. Although one spatial input feature was considered in RS3, more useful spatial input features, such as contextual metrics, could improve RS3 performance. Second, although the CNN was effective for rapid detection, it is limited by a coverage issue because the proposed CNN approach worked only when the missing value ratio was less than 40% within the 9 × 9 window. Finally, false alarms due to incorrectly classified land cover may not be eliminated during post-processing.

Conclusion
This study developed machine learning-based forest fire detection algorithms using Himawari-8 AHI images and demonstrated promising results for early detection. The algorithm enables continuous forest fire detection without physical visits to the site, facilitating a rapid response by minimizing lag time. The effects of the input features were analyzed in terms of detection latency and false alarms using five schemes based on the presence of temporal and spatial input features. RS3 had a precision of 0.89, recall of 0.41, and F1-score of 0.54, whereas CS2 had 0.91, 0.63, and 0.74, respectively, indicating the better performance of the CNN for forest fire detection. The CNN model detected all 86 test cases, including five that were detected without latency. Thus, the CNN outperformed the RF in reducing detection latency, with an average initial detection time of 12 min. Furthermore, the CNN detected all 18 additional test cases in the extended study area, whereas the RF detected only nine, suggesting the potential of an image-based algorithm for the rapid detection of forest fires using geostationary satellite data at a spatial resolution of 2 km.