Smoke Image Segmentation Algorithm Suitable for Low-Light Scenes

Abstract: The real-time monitoring and analysis system based on video images has been implemented to detect fire accidents on site. While most segmentation methods can accurately segment smoke areas in bright and clear images, high performance is difficult to obtain on low-light smoke images because of their low brightness and contrast. An image enhancement model cascaded with a semantic segmentation model was proposed to improve the segmentation of low-light smoke images. A modified Cycle-Consistent Generative Adversarial Network (CycleGAN) was used to enhance the low-light images, making smoke features apparent and improving the detection ability of the subsequent segmentation model. The smoke segmentation model was based on Transformers and HRNet, where semantic features at different scales were fused in a dense form. The addition of attention modules of spatial dimension and channel dimension to the feature extraction units established the relationship mappings between pixels and features in the two-dimensional spatial directions, which improved the segmentation ability. Through the Foreground Feature Localization Module (FFLM), the discrimination between foreground and background features was increased, and the ability of the model to distinguish the thinner positions of smoke edges was improved. The enhanced segmentation method achieved a segmentation accuracy of 91.68% on the self-built dataset with synthetic low-light images and an overall detection time of 120.1 ms. This method can successfully meet the fire detection demands in low-light environments at night and lay a foundation for expanding the all-weather application of initial fire detection technology based on image analysis.


Introduction
Fire accidents are one of the major disasters that seriously endanger the safety of people's lives and property in daily life. Prevention and the timely alarm of fire accidents are the top priorities to protect people's safety. Before most fire accidents, a mass of smoke will be produced at the ignition point. Consequently, monitoring smoke can detect fires quickly to avoid excessive fire spread and critical property damage. With the rapid development of computer vision and artificial intelligence, fire detection technology based on image processing and object detection has been widely studied. Smoke detection technology based on video images has been gradually replacing traditional temperature and smoke sensors with its faster response speed, more comprehensive detection range, lower cost of use, and weaker environmental restrictions.
Object detection mainly relies on color information, edge contour, texture features, and motion information of smoke images to extract features [1,2]. Therefore, it is challenging for object detection algorithms to accurately separate smoke areas from backgrounds during feature marking and extraction due to smoke's variable scale, strong diffusion, blurred boundary, and broad color change with concentration.
As the semantic segmentation methods gradually become a hot spot in image segmentation, this pixel-by-pixel classification method can effectively avoid the influence of background on smoke features and overcome the weakness that the object detection algorithm is only suitable for rigid targets. The semantic segmentation methods based on deep learning are more efficient in the segmentation task of smoke images than the traditional ones. During the training process, the deep learning models can select the features which need to be learned according to different distributions of the datasets. Therefore, the precise separation of smoke and non-smoke areas can be realized, and the segmentation method is accurate, fast, and robust [3,4].
Current semantic segmentation methods can obtain high accuracy when applied to the segmentation task of bright and clear smoke images during the daytime. However, their segmentation effects for low-light smoke images have difficulty reaching the practical application levels. Images captured in low-light conditions tend to have longer exposure times, which results in more noise and blur, as well as unclear edges and features, affecting the performance of the segmentation model. In addition, many unattended areas such as warehouses and computer rooms are in low-light environments, making it more important to improve the detection ability of segmentation algorithms for early smoke in low-light or nighttime environments, which is especially significant for fire accident detection and alarm tasks.
Considering these points, we proposed an innovative semantic segmentation method which is specifically designed to work in low-light environments for smoke images. It will provide theoretical and technical support for detecting smoke areas in real low-light scenes. The main contributions of our work can be summarized into three key points:
• We have overcome the challenge of extracting features from low-light smoke by separating the entire detection task and completing the low-light smoke image segmentation task through an image enhancement network cascaded with a semantic segmentation network.
• We have designed a low-light image enhancement network based on unsupervised transfer learning methods. Our modified CycleGAN [5] algorithm has significantly improved the brightness and contrast of the smoke image, making it more suitable for subsequent segmentation tasks.
• We have designed a multi-scale feature extraction network based on a Transformer, which is capable of handling smoke feature extraction tasks in complex scenes. By fusing the semantic features of different resolution branches, the extraction ability of our network on the global features of smoke is enhanced.

Related Works
Many scholars have proposed various methods to enhance the segmentation ability to overcome the difficulties of image segmentation tasks in low-light environments. Ref. [6] proposed an unsupervised nighttime semantic segmentation model called DANIA. DANIA uses image relighting networks with light loss functions to narrow the image intensity distribution gap in different domains. Additionally, it combines the image relighting network and Convolutional Neural Networks (CNNs) to perform semantic segmentation as a generator. DANIA also designs a discriminator that performs adversarial learning to distinguish whether the segmentation prediction is from the source or target domain. While DANIA has successfully achieved state-of-the-art performance on night driving test datasets, it is not applicable for smoke segmentation due to the variable scales, irregular shapes, and texture information, which are greatly affected by brightness and concentration.
To improve the model's ability to learn from night images, Ref. [7] introduced a self-attention mechanism that considers position information based on Deeplab v3+ [8]. Additionally, a lighting adaptation mechanism was added to reduce the differences in the feature maps extracted by the shallow layers of the network. The model also addressed the differences between normal and low-light feature maps by using an illumination reflection weight map, which improves the feature extraction ability of unevenly illuminated positions. However, this method only considers the illumination and position feature information based on low-light images. Although the attention mechanism enhances the feature extraction ability during the end-to-end training process, more improvements are needed to solve the problems of inaccurate low-light feature extraction and difficult recovery of edge details.
Ref. [9] proposed a semantic segmentation network that combines visible and infrared images to improve segmentation accuracy in low-light environments. The network design includes two parallel encoders that extract pixels of both modes separately, followed by a fusion process where each pixel of the infrared images is fused into the visible images. This complementary process combines the location information and pixel associations of the two domain images and effectively extracts features from complex nighttime backgrounds. However, the dual-mode segmentation model requires the support of real datasets containing visible and infrared images, which are challenging to construct and prone to class imbalance. The increase in dataset capacity also leads to longer image acquisition and training times, as well as expanded training complexity.
Furthermore, the segmentation network built in Refs. [7][8][9] mainly utilizes convolutional units. CNNs have advantages in spatial position representation. However, it is difficult for them to capture global feature context information due to the locality of convolution operations [10]. At the same time, CNNs can reduce the amount of computation through pooling operations in the feature extraction process, which will instead lead to the loss of detailed features [11,12]. For smoke targets, the loss of small-scale features can result in poor segmentation of the details of edges. With the application of a Transformer to the visual field, it can capture the long-distance feature dependence of complex spatial transformation domains between feature maps with self-attention and establish the global feature representation of the candidate regions [13]. However, Transformers may ignore the local feature information, which reduces the ability to distinguish between the foreground and the background in images. Therefore, designing a network that integrates the Transformer and CNNs can complete the complementary fusion of global semantic information and local detail features.

Method
In order to address the issue of unclear features in low-light images, we cascade an image enhancement network prior to the segmentation task. Our modified CycleGAN algorithm is designed to enhance low-light images, making them more similar to their daytime counterparts. Then, more accurate smoke segmentation results can be obtained with our semantic segmentation network. Figure 1 depicts the entire flow of our low-light smoke image segmentation algorithm. By cascading the two network models, it is possible to avoid the need for complex fusion methods while effectively reducing the model's complexity and computational scale. Furthermore, our approach allows for the enhancement and segmentation networks to be replaced based on different task scenes and requirements, improving the method's generalization and flexibility. Our highly efficient and adaptable approach makes it a valuable tool for applications in low-light environments.

Image Enhancement Network
To improve the smoke characteristics in low-light environments, we modify the CycleGAN model as the main component of our image enhancement network. The original CycleGAN has difficulty preserving image details during the enhancement process, which results in artifacts, blur, and noise in enhanced images, making the subsequent segmentation task challenging. Inspired by EnlightenGAN [14], we modify the generator network with an encoder-decoder structure to restore the smoke details accurately. We also add a brightness equalization branch to balance the brightness of each part of an image. To ensure that the smoke features are preserved during the enhancement process, we attach a similarity discriminant branch to the discriminator network based on a two-branch form. Figure 2 shows the structure of our low-light smoke image enhancement algorithm. The encoder follows the same structure as EnlightenGAN, while the upsampling in the decoder is completed using PixelShuffle [15]. Through convolution and multichannel recombination, the low-resolution feature map is transformed into a high-resolution feature map. PixelShuffle is used mainly to handle the loss of details during upsampling based on a single feature map. The skip connection between the encoding and decoding stages ensures that original image features are transmitted, allowing some of the details lost in downsampling to be recovered. The brightness equalization branch applies additional weights to darker areas of the images, which makes them brighter, while at the same time suppressing the enhancement of brighter areas to avoid overexposure. The structure of the brightness equalization branch is shown in Figure 3.
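The multichannel recombination performed by PixelShuffle can be illustrated with a minimal NumPy sketch. It assumes the usual channel-first layout and the standard depth-to-space ordering; the function name `pixel_shuffle` and the toy input are illustrative and not taken from the paper.

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r)."""
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    # Split the channel axis into (c_out, r, r), then interleave the two
    # r-axes with the spatial axes: (c, i, j, h, w) -> (c, h, i, w, j).
    x = x.reshape(c_out, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)
    return x.reshape(c_out, h * r, w * r)

# Toy input: 4 channels of size 2 x 2 become 1 channel of size 4 x 4.
low_res = np.arange(4 * 2 * 2, dtype=np.float32).reshape(4, 2, 2)
high_res = pixel_shuffle(low_res, r=2)
print(high_res.shape)
```

Each group of r × r channels supplies the r × r sub-pixels of one output location, so no interpolation is involved and no value of the low-resolution map is discarded.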
To ensure that the brightness equalization branch is sensitive to variable levels of darkness in low-light images, we choose the Parametric Rectified Linear Unit (PReLU) [16] as the activation function. By adjusting the parameters in PReLU adaptively according to different brightness levels in different areas of the images, the equalization branch's sensitivity to different brightness levels is significantly improved. Inspired by EnlightenGAN's dual discriminant branch, our discriminator enhances the global brightness level and corrects image details. However, the generator's skip connection can only partially counteract the loss or change of image features during enhancement. To further recover detailed features, we design a similarity discrimination branch in the discriminator network that works together with the subsequent segmentation model, ensuring that the image enhancement process will not significantly impact the smoke features. The discriminator structure, including the similarity discriminant branch, is illustrated in Figure 4. For the local discriminator, we modify the loss function of the original LS-GAN [17] with the least squares loss. For the global discriminator, to improve the quality of the generated images and reduce the training time, we refine the original loss function with the standard function of the relativistic adversarial network [18] and use a least squares loss with the corresponding regression targets. In these losses, D represents the discriminator network, and $x_r$ and $x_f$ are drawn from the distributions of real and fake images.
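For orientation, the least squares adversarial losses can be sketched as follows. The sketch assumes the forms published for EnlightenGAN (plain least squares targets for the local discriminator, a relativistic average for the global one); the exact modified losses of this work are not recoverable from the text, and all function names are illustrative.

```python
import numpy as np

def local_d_loss(d_real, d_fake):
    # The local discriminator regresses real patches to 1 and fake patches to 0.
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def local_g_loss(d_fake):
    # The generator pushes the discriminator output on fake patches towards 1.
    return np.mean((d_fake - 1.0) ** 2)

def relativistic(d_a, d_b):
    # Relativistic average output: D(a) minus the mean of D(b).
    return d_a - np.mean(d_b)

def global_d_loss(d_real, d_fake):
    return (np.mean((relativistic(d_real, d_fake) - 1.0) ** 2)
            + np.mean(relativistic(d_fake, d_real) ** 2))

def global_g_loss(d_real, d_fake):
    return (np.mean((relativistic(d_fake, d_real) - 1.0) ** 2)
            + np.mean(relativistic(d_real, d_fake) ** 2))

# Raw discriminator outputs for a toy batch (assumed values).
d_real = np.array([1.0, 1.0])
d_fake = np.array([0.0, 0.0])
print(local_d_loss(d_real, d_fake), global_d_loss(d_real, d_fake))
```

The relativistic form scores a real image relative to the average fake one rather than in isolation, which is the property that motivates its use for the global branch.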
With unsupervised training, the transfer effect of domains is controlled by the difference between the quadratically generated and real images. To ensure that the characteristic information of the smoke area remains unchanged, we append the cycle consistency loss to the LS-GAN loss, which helps to prevent the subsequent segmentation process from being adversely affected. The cycle consistency loss is the expected reconstruction error in both directions, $\mathbb{E}_{x}\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y}\left[\lVert G(F(y)) - y \rVert_1\right]$, where G and F represent the generators in two directions, x represents the low-light images, y represents the images in the daytime, and the difference between images is measured by the 1-norm.
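A minimal sketch of the cycle consistency term follows, using the standard CycleGAN formulation with a mean absolute (normalized 1-norm) error. The two lambda generators are toy stand-ins chosen only so that the example runs; they are not the networks of this work.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """Mean absolute reconstruction error in both translation directions."""
    forward = np.mean(np.abs(F(G(x)) - x))   # low-light -> daytime -> low-light
    backward = np.mean(np.abs(G(F(y)) - y))  # daytime -> low-light -> daytime
    return forward + backward

# Hypothetical generators: G brightens (with clipping), F darkens.
G = lambda img: np.clip(img * 2.0, 0.0, 1.0)
F = lambda img: img * 0.5

x = np.full((4, 4), 0.2)   # toy low-light image
y = np.full((4, 4), 0.8)   # toy daytime image
loss_invertible = cycle_consistency_loss(x, y, G, F)

# Clipping in G destroys information for brighter inputs, so the cycle no
# longer reproduces x and the loss becomes positive.
loss_clipped = cycle_consistency_loss(np.full((4, 4), 0.6), y, G, F)
print(loss_invertible, loss_clipped)
```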
As shown in Figure 4, our similarity discrimination branch effectively mitigates the loss and alteration of image specifics caused by downsampling in enhancement networks. Therefore, we add an object similarity loss $L_{seg}$ computed with the segmentation model, where Φ represents the pixel-by-pixel classification results obtained by the subsequent semantic segmentation algorithm for the fake and real images, and H, W, and C are the dimensions of the corresponding feature maps. We use the Mean Squared Error (MSE) to measure the difference between the two segmentation results. $L_{seg}$ improves the recovery of image details. However, it also leads to blurry output images, since quality evaluation indicators such as MSE only consider the difference between pixels at a single point and ignore the correlation between pixels. Therefore, we incorporate a loss based on the Structure Similarity Index Measure (SSIM) between the real images and the quadratically generated images, where $L_{cyc}$ indicates the structural similarity loss added to the cyclic generation discriminant, and $L_{one-way}$ represents the structural similarity difference for G and F added in the one-way generation process. In summary, the loss function of the proposed image enhancement network combines the LS-GAN loss with the three losses above, where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the balance parameters that control the proportion of the different loss functions. Based on the various experiments we conducted, we found that $\lambda_1 = 5$, $\lambda_2 = 10$, and $\lambda_3 = 10$ are the most suitable for our enhancement network. We use two branches to calculate the LS-GAN loss based on the two-branch discriminator, while the remaining three loss functions are calculated solely based on the generator network.
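The two similarity terms can be sketched as follows. The MSE term averages over all H × W × C entries of the two segmentation outputs. The SSIM term is simplified here to whole-image statistics (a single window), which is an assumption made for brevity; practical implementations usually average SSIM over local windows. Function names are illustrative.

```python
import numpy as np

def object_similarity_loss(seg_fake, seg_real):
    """MSE between two (H, W, C) segmentation outputs."""
    return np.mean((seg_fake - seg_real) ** 2)

def global_ssim(a, b, data_range=1.0):
    """SSIM computed from whole-image means, variances and covariance."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def ssim_loss(a, b):
    # Identical images give SSIM = 1 and therefore zero loss.
    return 1.0 - global_ssim(a, b)

rng = np.random.default_rng(0)
img = rng.random((8, 8))
print(ssim_loss(img, img), ssim_loss(img, 1.0 - img))
```

Unlike the MSE term, the SSIM term depends on the covariance between the two images, so it penalizes structural changes that leave the per-pixel error small.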

Semantic Segmentation Network
When it comes to smoke segmentation, it can be quite challenging due to the extensive range of sizes, strong diffusion, and changeable shapes of smoke. Background information can also significantly affect the segmentation results, making it even more difficult to obtain satisfactory results. Therefore, the networks which work well for other targets may not be effective for smoke segmentation tasks. To meet the requirements of video monitoring and analysis systems for detecting and announcing fire accidents, we propose a semantic segmentation network based on HRNet [19] and HRFormer [20]. It allows for the extraction of multi-scale features of smoke, making it possible to segment smoke images enhanced by our image enhancement network accurately. The structure of our semantic segmentation algorithm is shown in Figure 5. Due to the variable scales and shapes of smoke, feature extraction concentrating on one single scale makes it challenging to obtain accurate segmentation results. We modify HRFormer's Transformer block to handle multi-scale changes in the targets effectively. Additionally, using 3 × 3 depth-wise convolution for information interaction between windows in HRFormer is not sufficient, as it barely covers the internal features of the windows. Meanwhile, increasing the sizes of convolution kernels will import more background information. Therefore, we use the Shifted Window-based Multi-head Self-Attention (SW-MSA) [21] of a Swin Transformer (SW-Trans) instead. The entire structure of our feature extraction module is shown in Figure 6. Compared with the classical network structure of HRNet, our shallow and deep module parts utilize different numbers of feature extraction units. The high-resolution branches generate larger feature maps with fewer output dimensions, which allow for better preservation of pixel spatial positioning information. 
On the other hand, the low-resolution branches generate smaller feature maps with a larger number of output dimensions, making them better at extracting abstract semantic features. We gradually increase the number of Transformer blocks in our network from high-resolution to low-resolution branches, which allows us to extract both local details and global semantic information in parallel.
The parameters of our feature extraction modules are shown in Table 1. $M_1$, $M_2$, $M_3$, and $M_4$ represent the numbers of modules in different stages, and $B_1$, $B_2$, $B_3$, and $B_4$ represent the numbers of Transformer blocks in each branch of different modules.
The LSAM structure is shown in Figure 8, and the GCAM structure is shown in Figure 9. LSAM reshapes the tokens into feature maps before extracting structural information through a 3 × 3 convolutional layer. Depending on the stage and branch where the Transformer block is located, H and W take different values. Once the structural information is extracted, global average pooling is performed in the width and height directions to obtain statistical information, where $v^{k}_{h_i} \in \mathbb{R}^{h \times 1 \times c_l}$ and $v^{k}_{w_j} \in \mathbb{R}^{1 \times w \times c_l}$ are one-dimensional spatial attention vectors in the height and width directions. z is the feature map obtained by the convolutional layer, which is followed by Batch Normalization (BN) and a Gaussian Error Linear Unit (GELU) activation. The corresponding index ranges are 0 ≤ i < h, 0 ≤ j < w, and 0 ≤ k < $c_l$.
A feature map of h × w × 1 is obtained within each channel by matrix multiplication of the vectors in the height and width directions. This yields two spatial attention vectors in the height and width directions and a two-dimensional spatial attention feature map with information interaction between the two directions. The three attention weights are then activated with the sigmoid function and multiplied point by point with the original feature map, which gives the spatial position attention map $M \in \mathbb{R}^{h \times w \times c_l}$ of LSAM. The output feature map of LSAM is obtained by adding M to the output $s^{l+1}$ of the SW-Trans branch, where ⊕ represents the point-by-point summation, the products are taken point by point, and Φ represents the reshaping of the feature maps. The attention feature vectors in different directions can capture long-distance feature dependencies in their respective directions. By integrating the spatial position information of the two directions after multiplication, a two-dimensional spatial mapping between the features and the pixels can be established. LSAM can effectively extract the regional features of smoke, which helps suppress background information and noise to a certain extent.
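The directional pooling and recombination in LSAM can be sketched in NumPy as below. The sketch assumes that the feature map z has already passed through the 3 × 3 convolution, BN, and GELU, and it models the per-channel matrix multiplication of an h × 1 vector with a 1 × w vector as a broadcast product. The function name and the channels-last layout are assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def directional_attention(z):
    """z: (h, w, c) feature map; returns the attention map M of the same shape."""
    v_h = z.mean(axis=1, keepdims=True)   # (h, 1, c): average over the width
    v_w = z.mean(axis=0, keepdims=True)   # (1, w, c): average over the height
    # Per-channel outer product: (h, 1, c) * (1, w, c) -> (h, w, c).
    v_hw = v_h * v_w
    # Three sigmoid-activated weights applied point by point to z.
    return z * sigmoid(v_h) * sigmoid(v_w) * sigmoid(v_hw)

z = np.ones((3, 4, 2))
m = directional_attention(z)

# Residual combination with the SW-Trans branch output (toy values).
s_next = np.zeros_like(z)
lsam_out = m + s_next
print(lsam_out.shape)
```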
GCAM performs global average pooling and global max pooling of the LSAM output features in the spatial dimension. By learning the weight distributions of the obtained max pooling features and average pooling features in the channel dimension through a shared fully connected layer, GCAM reduces the dimensionality of the channel features and acquires two feature vectors. After GELU activation, a second fully connected layer raises the dimensionality again and restores the initial number of channels. Finally, the two attention vectors are added and activated by the sigmoid function before being applied to the output features of LSAM to obtain the output of the Transformer block, where ⊗ represents the channel attention weights multiplied by the corresponding feature map, $f_0$ and $f_1$ represent the two shared fully connected layers that perform the dimensionality operations, and AvgPool and MaxPool represent global average pooling and global max pooling. The Transformer block can adaptively assign feature weights on the channel and spatial domains according to the degree of correlation of features in the smoke image. It also allows essential features to be enhanced and invalid information to be suppressed.
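The channel attention described above can be sketched as follows. The weight matrices `w0` and `w1` stand in for the two shared fully connected layers (biases omitted), the reduction ratio is arbitrary, and GELU uses its common tanh approximation; all of these are assumptions for illustration.

```python
import numpy as np

def gelu(t):
    # tanh approximation of GELU
    return 0.5 * t * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (t + 0.044715 * t ** 3)))

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def channel_attention(x, w0, w1):
    """x: (h, w, c); w0: (c, c_reduced) reduces, w1: (c_reduced, c) restores."""
    avg = x.mean(axis=(0, 1))             # global average pooling -> (c,)
    mx = x.max(axis=(0, 1))               # global max pooling -> (c,)
    mlp = lambda v: gelu(v @ w0) @ w1     # the same two layers for both vectors
    weights = sigmoid(mlp(avg) + mlp(mx)) # one weight per channel
    return x * weights                    # broadcast over h and w

rng = np.random.default_rng(0)
x = rng.random((5, 6, 8))
w0 = rng.normal(size=(8, 2))
w1 = rng.normal(size=(2, 8))
out = channel_attention(x, w0, w1)
print(out.shape)
```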
The segmentation task for the enhanced smoke images is challenging due to the blurred appearances. We modify the Pyramid Pooling Module (PPM) [22] and extend it into a dense style using DenseNet [23]. After each module of our backbone network, we implement a fusion module to upsample the low-resolution feature maps. By concatenating the feature maps of different scales and repeating the upsampling process, our network can increase the receptive field of the feature maps and extract more context information effectively. The structure of Dense PPM is shown in Figure 10. In the proposed network, we implement depth-wise separable convolution [24] instead of original convolutional units to reduce the computation amount during the training process, which results in a notable improvement in the network's training speed.
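The saving from depth-wise separable convolution can be made concrete with a parameter count. The channel numbers below are hypothetical and biases are ignored; the sketch only compares a standard k × k convolution with a depth-wise k × k convolution followed by a 1 × 1 point-wise convolution.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    # Standard convolution: one k x k x c_in filter per output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

standard = conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
print(standard, separable, round(standard / separable, 2))
```

For these example sizes the separable variant needs roughly one eighth of the weights, and the multiply-accumulate count per output position shrinks by the same factor.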
To accurately distinguish smoke from the background areas and ensure that the segmentation effect is not disturbed by external factors, we propose a foreground feature localization module called FFLM, which can precisely segment the thin smoke areas in the images through the calculation of correlation between each pixel and the smoke foreground. FFLM also helps to increase the differentiation between smoke areas and background information and prevent confusion. The specific structure of FFLM is shown in Figure 11. The output feature maps $F_{in} \in \mathbb{R}^{h \times w \times c}$, which contain rich and detailed features and semantic information, are processed with average pooling, max pooling, and soft pooling [25] techniques in the channel dimension. The statistics obtained from these techniques are combined through a shared fully connected layer. The sum of average and max pooling is then multiplied by the index weights of soft pooling. With the obtained global weight description passing through a fully connected layer and a sigmoid function, the foreground normalized features of each channel of the output feature maps are presented.
Finally, the foreground normalized features are multiplied with the original pixel features $F_{in}$ to obtain the foreground area features, where F represents the foreground area features, $P_{avg,max} \in \mathbb{R}^{1 \times 1 \times c}$ represents the statistical feature description after the sum of average pooling and max pooling, and $P_{soft}$ represents the description of the statistical features obtained by soft pooling. AvgPool, MaxPool, and SoftPool correspond to the operations of average pooling, max pooling, and soft pooling in the channel dimension. $f_1$ and $f_2$ are fully connected layers, and the product is taken point by point.
The degree of association between each pixel and the foreground areas can be calculated from the foreground area features and the original pixel features, where Σ is the foreground correlation representation and ϕ is the foreground correlation calculation function implemented by convolution, BN, and Rectified Linear Units (ReLUs). The output features enhanced by FFLM are obtained by fusing the foreground correlation features, foreground area features, and original pixel features, where ρ and δ are the fusion functions implemented by convolution, BN, and ReLUs, and $F_{out}$ denotes the enhanced output features. FFLM enhances the features in the foreground area while reducing the impact of the background on smoke feature extraction. Additionally, FFLM can improve the segmentation of thin smoke areas. The smoke segmentation task is a pixel-level dense binary classification, and each pixel needs to be classified as foreground or background. Therefore, we use a binary cross-entropy loss to design the loss function of our segmentation network, $-\frac{1}{N}\sum_{i=1}^{N}\left[q_i \log p_i + (1 - q_i)\log(1 - p_i)\right]$, where N is the number of pixels in the feature map, $p_i$ is the probability that pixel i is predicted as smoke foreground, and $q_i$ is the ground truth of the pixel. When analyzing images with smoke areas, it is important to consider the proportion of the smoke areas in relation to the background regions. If the loss function treats these areas with a consistent weight, the class with the larger proportion will dominate the backward propagation process, resulting in a higher weight during the prediction. Therefore, we introduce weighted coefficients to the two parts of the loss function based on the relative sizes of the smoke areas and background regions. The coefficients allow our model to balance the feature learning process between the two regions and adapt to the unique characteristics of each image.
The modified loss function includes a foreground weight coefficient $\alpha_f$. We add an intermediate layer loss to the foreground feature localization process and weigh it against the loss on the segmentation results to balance the supervision throughout the network training process. Both the intermediate layer loss and the final loss use a foreground-weighted binary cross-entropy loss function, in which the two predicted probabilities are the probabilities that the intermediate layer pixel and the output feature pixel are classified as foreground, and $\alpha_u$ is the balance weight parameter of the union loss. After numerous experiments, we found that our segmentation model achieves its best performance on our smoke dataset with $\alpha_u = 0.25$.
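A foreground-weighted binary cross-entropy and its combination with an intermediate-layer loss can be sketched as below. Two points are assumptions, since the exact expressions are not recoverable from the text: the foreground term is weighted by α_f and the background term by 1 − α_f, and the union loss adds the intermediate loss scaled by α_u to the final loss.

```python
import numpy as np

def weighted_bce(p, q, alpha_f, eps=1e-7):
    """p: predicted foreground probabilities, q: binary ground truth."""
    p = np.clip(p, eps, 1.0 - eps)                        # avoid log(0)
    fg = -alpha_f * q * np.log(p)                         # smoke pixels
    bg = -(1.0 - alpha_f) * (1.0 - q) * np.log(1.0 - p)   # background pixels
    return np.mean(fg + bg)

def union_loss(p_mid, p_out, q, alpha_f, alpha_u=0.25):
    # Assumed combination: final loss plus alpha_u times the intermediate loss.
    return weighted_bce(p_out, q, alpha_f) + alpha_u * weighted_bce(p_mid, q, alpha_f)

q = np.array([1.0, 0.0, 1.0, 0.0])
p = np.full(4, 0.5)
base = weighted_bce(p, q, alpha_f=0.5)
total = union_loss(p, p, q, alpha_f=0.5)
print(base, total)
```

With α_f = 0.5 the weighted loss is exactly half of the plain binary cross-entropy; raising α_f above 0.5 increases the penalty for missed smoke pixels when smoke occupies only a small part of the image.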

Dataset Settings
There is a lack of public image datasets available for the smoke segmentation task, and even fewer public datasets that include smoke images in low-light environments. Therefore, we utilize the smoke video dataset from the State Key Laboratory of Fire Science (SKLFS) [25] and add our own collected images to construct a dataset in the PASCAL VOC format. The training set of the image enhancement network includes 1000 smoke images in the daytime and 1000 synthetic low-light images. The test set includes 200 synthetic low-light smoke images. The semantic segmentation network's training set consists of 4000 smoke images in the daytime, and its test set consists of 200 enhanced synthetic low-light smoke images and 400 images in the daytime. Overall, our datasets include 70 scenes, and some of the images in the daytime are shown in Figure 12. The differences in brightness between low-light images and daytime images are not constant. They increase as brightness increases, meaning that high-brightness pixels in an image are reduced in brightness more than low-brightness ones when transitioning to low-light environments. To simulate these conditions, we use gamma correction [26] methods based on existing synthetic low-light image generation techniques [27][28][29]. We transform the V channel of the images in HSV space and add Gaussian and Poisson noise to simulate the blur and noise captured by the camera in low-light environments. In our image conversion formula, $X_{in}$ is the value of the V channel of an HSV image, and $X_{out}$ is the output synthetic low-light image. $B_G$ is a Gaussian blur function whose standard deviation is set to a random value between 1.5 and 2. S and γ are the correction parameters of the gamma correction and are set to 0.8 and 0.65, respectively. $N_G$ is Gaussian noise, whose kernel size is (5, 5) and whose standard deviation is 1.25. $N_P$ is Poisson noise, whose λ is 1.0.
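The low-light synthesis can be sketched as below. Several points are assumptions because the conversion formula itself is not recoverable from the text: the V channel is taken on the 0 to 255 scale, the terms are composed as X_out = B_G(S · X_in^γ) + N_G + N_P, and the 5 × 5 kernel size is applied to the blur. All function names are illustrative.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    t = np.arange(-radius, radius + 1)
    k = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma, radius=2):
    """Separable Gaussian blur with edge padding; radius=2 gives a 5 x 5 kernel."""
    k = gaussian_kernel1d(sigma, radius)
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def synth_low_light(v, rng, s=0.8, gamma=0.65):
    """v: V channel as a float array in [0, 255] (assumed scale)."""
    dark = s * np.power(v, gamma)                               # gamma correction
    blurred = gaussian_blur(dark, sigma=rng.uniform(1.5, 2.0))  # random blur strength
    noisy = blurred + rng.normal(0.0, 1.25, v.shape) + rng.poisson(1.0, v.shape)
    return np.clip(noisy, 0.0, 255.0)

rng = np.random.default_rng(0)
v = np.full((16, 16), 200.0)
low = synth_low_light(v, rng)
print(low.shape, float(low.mean()))
```

On this scale the power law compresses bright values far more than dark ones, which matches the statement that high-brightness pixels lose more brightness than low-brightness pixels.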
The synthetic low-light images are shown in Figure 13. The detailed features of each part can still be obtained, but it may take some time to distinguish each area. Our network configurations during training and testing are shown in Table 2. The size of images is normalized to 640 × 480. The network batch size is set to 1. The epoch is set to 300. The initial value of the learning rate is 0.0005. The learning rate decays to 0.0001 when the epoch reaches 150.
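The learning-rate setting can be written as a small schedule function. The sketch assumes a single step change at epoch 150; the text does not say whether the decay is a step or gradual, and the function name is illustrative.

```python
def learning_rate(epoch: int, base_lr: float = 5e-4,
                  decayed_lr: float = 1e-4, decay_epoch: int = 150) -> float:
    # Step schedule (assumed): constant until decay_epoch, then the lower rate.
    return base_lr if epoch < decay_epoch else decayed_lr

schedule = [learning_rate(e) for e in range(300)]
print(schedule[0], schedule[149], schedule[150], schedule[299])
```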

Comparison Experiments of Image Enhancement Algorithm
In order to verify the effectiveness and superiority of our enhancement algorithm, we select several methods, namely LIME [30], SRIE [31], RetinexNet [32], EnlightenGAN [14], LLFlow [33], Zero-DCE [34], and CycleGAN [5], for comparative experiments. Signal-to-Noise Ratio (SNR), Peak Signal-to-Noise Ratio (PSNR), and SSIM are used as image quality evaluation indicators. The average evaluation results on our synthetic low-light smoke image dataset are shown in Table 3. Our enhancement network has advantages over the traditional methods SRIE and LIME in various evaluation indicators. Compared with fully supervised algorithms such as RetinexNet, LLFlow, and Zero-DCE, our algorithm also has a certain degree of advantage. In comparison to the unsupervised EnlightenGAN and CycleGAN, although there is not much difference in SNR, our algorithm has clear advantages in the SSIM and NIQE indicators. The results in Table 3 demonstrate that our proposed enhancement algorithm is superior to other low-light image enhancement algorithms in various evaluation indicators.
Figure 14 shows the enhancement results on our synthetic low-light smoke image dataset. The enhancement effects of SRIE and LIME on low-light smoke images are not satisfactory. The enhanced images tend to be overexposed or underexposed, and the boundaries between areas of similar colors appear blurred. Similarly, RetinexNet and Zero-DCE do not produce the desired results. Their enhanced images show severe blur and color distortion, as shown in the fifth row of column (c) and the first row of column (f). CycleGAN can recover the color of images effectively and suppress noise in some uncomplicated scenes. However, when the background information becomes complex, the images enhanced by CycleGAN tend to produce more severe blurring, such as the third and fourth rows of column (g), which can negatively impact subsequent segmentation tasks.
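For reference, SNR and PSNR can be computed as in the sketch below, using their common definitions in decibels. The peak value of 255 assumes 8-bit images, and the function names are illustrative; both functions are undefined for identical images because the error is zero.

```python
import numpy as np

def snr_db(reference, test):
    reference = np.asarray(reference, dtype=np.float64)
    noise = reference - np.asarray(test, dtype=np.float64)
    # Ratio of signal energy to error energy, in decibels.
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def psnr_db(reference, test, peak=255.0):
    diff = np.asarray(reference, dtype=np.float64) - np.asarray(test, dtype=np.float64)
    mse = np.mean(diff ** 2)
    # Ratio of the squared peak value to the mean squared error, in decibels.
    return 10.0 * np.log10(peak ** 2 / mse)

reference = np.full((4, 4), 100.0)   # toy daytime image
enhanced = np.full((4, 4), 90.0)     # toy enhanced image
print(snr_db(reference, enhanced), psnr_db(reference, enhanced))
```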
LLFlow and EnlightenGAN can cause slight chromatic aberration, and their enhanced images have high contrast. LLFlow tones appear warmer, while EnlightenGAN tones are colder than real images. In contrast, our method restores image details better, and its enhanced images contain a lower noise level. Most importantly, the enhanced images obtained by our method have the highest structural similarity, indicating that our enhancement operation has the least impact on subsequent segmentation tasks, which further demonstrates the advantages of our method over other enhancement algorithms in low-light smoke image segmentation tasks.

Ablation Experiments of the Image Enhancement Algorithm
The proposed low-light image enhancement algorithm is subjected to ablation experiments to verify the contribution of each individual part. The ablation targets are the brightness equalization branch and the object similarity discrimination between the generated images and the original real samples during the loop generation. The quantitative results of the ablation experiments are shown in Table 4, and the qualitative results are shown in Figure 15. Net1 is a generator network without a brightness equalization branch or attention module, which only contains the encoder-decoder structure of EnlightenGAN. Its discriminator still uses the two-branch discriminator structure of PatchGAN to fuse local features with global information. Net2 adds a parallel brightness equalization branch to Net1. Net3 is our network for the low-light smoke image enhancement task.
According to the SSIM column of Table 4, constraining the structure of the generated images for the subsequent segmentation network is useful: Net3 achieves an SSIM increase of 0.026 compared to Net2. Figure 15 compares the effects of the different networks on low-light image enhancement. Net2's brightness equalization branch helps to balance the brightness enhancement of the image, avoiding locally excessive or insufficient brightness, that is, overexposure or underexposure. Net3's similarity discrimination branch helps local details to be segmented more finely. The results of the ablation experiments demonstrate the effectiveness of each part of our low-light image enhancement algorithm.

Comparison Experiment of Semantic Segmentation Algorithm
To verify the effectiveness of our segmentation algorithm in multiple scenes in low-light environments, we compare PSPNet [22], HRNet [19], Deeplab v3+ [10], SegNet [35], SW-Trans [21], and HRFormer [20] with our algorithm on the self-built test set. Quantitative comparison results on daytime test images are shown in Table 5. We use Mean Intersection over Union (mIoU) as the primary evaluation criterion. Floating Point Operations (FLOPs), parameter amount (Params), and detection time (T) are considered as auxiliary evaluation indicators. The pooling operation of PSPNet can lead to the loss of local features between layers, which negatively affects its ability to identify the edges of smoke. SegNet eliminates the fully connected layer and uses pooling indexes to replace feature map concatenation, which significantly reduces the number of network operations. Although SegNet achieves a segmentation accuracy of only 88.91%, it is the fastest model in the comparison experiments. Deeplab v3+ uses the Atrous Spatial Pyramid Pooling (ASPP) module to fuse multi-scale features. However, atrous convolution introduces many background contexts while expanding the receptive field. This is unsuitable for smoke, whose texture features are greatly affected by background information. Therefore, the improvement in segmentation accuracy brought by the ASPP module is not apparent. Additionally, the computational cost of Deeplab v3+ increases significantly due to the use of Xception modules instead of ResNet-101, which results in a slower segmentation speed than the other CNN models. HRNet is more effective for smoke segmentation than the other CNN models, achieving an mIoU of 90.83%. This better performance indicates that using multiple resolution branches for parallel feature extraction is effective for smoke.
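The primary criterion, mIoU, can be illustrated with a short sketch. It assumes flattened label maps in which 0 denotes background and 1 denotes smoke, and it averages the IoU over every class that appears in either the prediction or the ground truth. Whether the reported figures average over both classes in exactly this way is not stated here, so the averaging rule and the function name are assumptions.

```python
def mean_iou(pred, truth, num_classes=2):
    """Mean Intersection over Union for flat label sequences.

    Classes that appear in neither pred nor truth are skipped so that
    they do not distort the average.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels labeled c in both maps, and pixels labeled c in either map.
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or truth")
    return sum(ious) / len(ious)


# Toy example: 0 = background, 1 = smoke (hypothetical labels).
truth = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, truth)
```

In the toy example each class has an intersection of 2 pixels and a union of 4 pixels, so both per-class IoUs and the mean are 0.5.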
SW-Trans and HRFormer achieve higher segmentation accuracy than the CNN models, which indicates that, in the smoke segmentation task, better global feature extraction ability brings higher segmentation performance. Introducing Window-based Multi-Head Self-Attention (W-MSA) into the self-attention stage limits the attention calculation to small windows, which effectively controls the computational complexity of the model while introducing a CNN-like local effect. HRFormer uses a 3 × 3 depth-wise convolution for interaction between windows, while SW-Trans shifts the windows through mask and feature shift operations, making its Transformer blocks more computationally intensive than those of HRFormer. However, HRFormer still achieves higher segmentation accuracy than SW-Trans because its medium- and low-resolution branches supplement the high-resolution stream with semantic information at different scales.
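The effect of restricting attention to windows can be made concrete with the complexity estimates commonly quoted for Swin-style Transformers. For an h × w feature map with C channels, global multi-head self-attention costs about 4hwC² + 2(hw)²C operations, while attention within M × M windows costs about 4hwC² + 2M²hwC. The sketch below evaluates both estimates. The formulas are taken from the general literature rather than from this paper, they ignore the softmax and the shift or convolution steps, and the function names are hypothetical.

```python
def msa_ops(h, w, c):
    """Approximate cost of global self-attention on an h x w x c feature map."""
    tokens = h * w
    # Linear projections (4 * tokens * c^2) plus the quadratic attention term.
    return 4 * tokens * c ** 2 + 2 * tokens ** 2 * c


def wmsa_ops(h, w, c, m):
    """Approximate cost of window-based self-attention with m x m windows."""
    if h % m or w % m:
        raise ValueError("window size must divide both feature map sides")
    tokens = h * w
    # The attention term is now linear in the number of tokens.
    return 4 * tokens * c ** 2 + 2 * m ** 2 * tokens * c


# Illustrative sizes only; they are not taken from the paper.
global_cost = msa_ops(56, 56, 96)
window_cost = wmsa_ops(56, 56, 96, 7)
```

For these illustrative sizes the global estimate is about 2.0 × 10⁹ operations and the windowed estimate is about 1.45 × 10⁸, which shows why windowing keeps the cost manageable on high-resolution branches.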
According to Table 5, our segmentation network has demonstrated a remarkable accuracy of 92.93% on our self-built smoke dataset, outperforming all the other algorithms. In order to further enhance the accuracy, we incorporate LSAM and GCAM attention modules into Transformer blocks, allowing us to extract smoke features. Additionally, we enhance the smoke foreground areas to better differentiate texture from the background in thinner areas of the smoke edges. These improvements enable us to achieve a significantly higher segmentation accuracy than HRFormer. Since our segmentation model modified the Transformer blocks and added a foreground enhancement module after the feature extraction, our model's complexity is higher than that of SW-Trans and HRFormer, leading to a decline in speed. Nevertheless, our segmentation network still holds practical value at the application level, particularly in situations where real-time demands are not excessive.
The segmentation results of the classical segmentation algorithms and ours on daytime images are shown in Figure 16. As mentioned earlier, PSPNet has difficulty identifying the background near the edges of smoke, as seen in the excessive smoke areas in the first, second, and third rows. SegNet performs worse, with more non-smoke areas included in the segmentation results and overly smooth outlines in the first, fourth, and fifth rows. Moreover, there are isolated misclassified areas in the second row. Compared with the previous two algorithms, Deeplab v3+ segments smoke better, as reflected in the noticeable improvement in the second and fifth rows. However, Deeplab v3+ still has problems accurately recognizing some detailed texture information. Finally, HRNet achieves better segmentation results than the other CNN models. However, due to its focus on local feature information, it may expand thin areas around smoke into larger regions, such as the left area of the third row.
Due to their exceptional global feature extraction capabilities, SW-Trans and HRFormer have significantly improved segmentation effects for large-scale smoke compared with CNN models. However, in the second row, the foreground target becomes muddled with background information, leading to poor segmentation effects. Some of the bright window areas near the smoke become erroneously classified as smoke targets. Our segmentation algorithm incorporates a foreground feature localization module, which enhances the segmentation accuracy of smoke by highlighting foreground pixels. With this, our algorithm achieves superior segmentation results on the smoke in the second row and other smoke images, with results that come close to the ground truth.
To verify the effectiveness and generalization of our segmentation network and to show that it surpasses existing algorithms in segmentation performance, we conduct a comparison experiment on the public dataset [36] proposed by Bilkent University. We compare our segmentation network with three smoke segmentation algorithms [37][38][39] that perform well on daytime smoke datasets. The results of the comparison are presented in Figure 17. AOSVSSNet [37] introduces a plug-and-play Convolutional Block Attention Module (CBAM) into the U-Net++ [40] network. The module focuses more on the spatial location information of smoke areas, which improves the segmentation of smoke areas with high concentrations, such as the source areas in the second and third rows. However, the CBAM and the improved loss function introduced by AOSVSSNet tend to focus on the global location information of smoke rather than its local characteristics. Therefore, AOSVSSNet is more suitable for optical satellite smoke images than for rarefied smoke and has poor segmentation results for thin smoke areas.
CGRNet [38] designs attention convolution modules based on Gated Recurrent Units (GRUs) to identify the spatial correlation and global context dependence of smoke. Additionally, its Multi-scale Context Contrasted Local (MCCL) module calculates the difference between smoke features at different resolutions to enhance the model's ability to segment small-scale smoke. The segmentation results on the public dataset show that CGRNet outperforms AOSVSSNet on the small targets in the second and third rows and on the thin smoke boundaries in the first and fourth rows. In the fifth row especially, CGRNet does not misclassify the light area. However, in locating the smoke in the second and third rows, AOSVSSNet performs better than CGRNet. Overall, CGRNet shows significant improvements in smoke segmentation.
According to Ref. [39], the W-Net architecture utilizes multiple asymmetric encoder-decoder structures to create a waveform structure. The semantic information of images is mainly contained in the trough position, while the peak position contains local and mesoscale information. The use of skip connections between the peak and trough positions and the decoding layers enhances the accuracy of smoke segmentation. However, since W-Net adopts an upsampling and downsampling path similar to U-Net, the details lost in pooling operations are difficult to recover. W-Net over-segments the smoke areas in the first and fourth rows, which are highly diffuse smoke images, and misjudges the smoke in the second, third, fourth, and fifth rows. Overall, the segmentation effect of W-Net is relatively poor.
As shown in the results of the comparison experiments, our smoke segmentation network is highly effective in identifying the edges of smoke with high concentrations. Moreover, it can accurately distinguish the background information from the smoke characteristics in thin smoke areas. In particular, our algorithm delineates clear smoke boundaries in the first row. For the second and third rows, which involve small-scale and high-concentration smoke targets, our algorithm performs better than Refs. [37][38][39]. Although our network has a small-scale misjudgment in the fourth row, it correctly classifies the light in the fifth row as background. These results demonstrate the generalization ability and practicality of our segmentation algorithm in multiple scenes, as well as its superior performance compared to other smoke segmentation algorithms.

Ablation Experiments of Semantic Segmentation Algorithm
To assess the contribution of each module to the performance improvement, we perform ablation experiments on our synthetic self-built dataset. The ablated modules are the Transformer blocks with W-MSA and SW-MSA, the LSAM and GCAM, the Dense PPM with feature fusion between branches of different resolutions, and the FFLM after the feature extraction network. Table 6 presents the results of the ablation experiment. In Table 6, the HRNet* row represents the original HRNet with the modified LSAM and GCAM, the same Dense PPM after each stage, and FFLM after segmentation. Net1 is a modified version of HRNet with fewer convolutional modules in each branch. Net2 replaces the convolutional units in Net1 with Transformer blocks. Net3 adds LSAM and GCAM to the Transformer blocks in Net2. Net4 is based on Net3 but uses Dense PPM as its feature fusion method. Finally, Net5 is our proposed segmentation network, which combines all the modifications made in the previous rows. A check mark in the table indicates that the module is selected.
Comparing HRNet in Table 5 (mIoU of 90.83%, detection time of 54.46 ms) with Net1 in Table 6 indicates that reducing the number of feature extraction units in each branch can decrease the complexity and calculation scale while still improving the accuracy. The comparisons between Net2 and Net1 and between HRNet* and Net5 indicate that the Transformer blocks containing W-MSA and SW-MSA can reduce the computational scale and improve the segmentation performance of the network. These results also demonstrate that the better the global feature extraction ability of the model, the better its ability to segment smoke.
Based on the comparison of Net3 and Net2, adding LSAM and GCAM results in an increase of 13.6 G FLOPs and a 0.62% improvement in segmentation accuracy, which indicates that GCAM and LSAM can reduce feature loss and suppress feature information irrelevant to smoke. Furthermore, the integration of Dense PPM contributes to a more efficient incorporation of smoke features at different resolutions, resulting in a 0.45% improvement in segmentation accuracy from Net3 to Net4. In addition, FFLM can attenuate the influence of background information on the extraction of smoke features, particularly in the thin positions of smoke where texture and background information are easily confused. The data in Table 6 also support the effectiveness of FFLM, as Net5 shows a 0.67% improvement in segmentation accuracy compared to Net4.
Our modifications to HRNet can improve the segmentation accuracy of smoke without causing a significant increase in network computation. Our network can still meet the actual detection requirements. The comparison results of the ablation experiment are shown in Figure 18. As shown in Figure 18, the smoke segmentation performance of Net5 is superior to that of HRNet*. Specifically, Net5 achieves more accurate segmentation results with fewer smoke edge miscalculations, as evidenced by the results of the third and fourth rows. Moreover, compared to Net2, Net3 exhibits better segmentation performance in regions with high smoke concentrations, such as the lower areas of the first and fifth rows. By incorporating FFLM, Net5 can more effectively differentiate between the thin edges of smoke and background information, as demonstrated in the first, third, and fourth rows. Our ablation experiment confirms the efficacy of our modifications for improving smoke segmentation.

Enhancement Segmentation Experiments
To accurately segment low-light smoke, our segmentation network needs to remain highly precise on enhanced smoke images. We test our semantic segmentation algorithm on the enhanced smoke images acquired in Section 4.2. The results are presented in Table 7, where the T/ms column represents the overall detection time after we cascade the image enhancement network and the smoke segmentation network. Partial segmentation results are shown in Figure 19.
Table 7. Comparison of experimental results of our cascaded enhancement segmentation network.
The above results show that our algorithm performs better than the other networks in the low-light smoke image segmentation task. For the image in the first row, PSPNet and SegNet misjudge the tree area in the upper right corner and do not accurately delineate the smoke boundary, which is confused with the background information. Similarly, for the fourth-row image, SegNet and PSPNet produce connected smoke regions and are unable to distinguish between the background area and smoke. HRNet and Deeplab v3+ segment too little or too much smoke for the highly diffuse smoke in the fifth row, indicating a low ability to distinguish between texture features and background information around edges. Our algorithm achieves superior segmentation results for the smoke images in the second, third, and sixth rows compared to SW-Trans and HRFormer.
In summary, our method can accurately segment smoke areas in daytime images and provide exceptional segmentation performance for low-light images enhanced by our enhancement network. While our approach has slightly increased detection time due to the complexity of the model, it still meets low-light smoke segmentation and alarm requirements in scenes that do not require high real-time performance.
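The overall detection time of a cascade is the enhancement time plus the segmentation time. A minimal sketch of how such a measurement might be organized is given below. The `toy_enhance` and `toy_segment` functions are hypothetical stand-ins (simple brightening and thresholding on a flat pixel list), not the enhancement or segmentation networks described in this paper, and `run_cascade` is an assumed helper name.

```python
import time


def run_cascade(image, enhance, segment):
    """Run enhancement then segmentation and report per-stage times in ms."""
    t0 = time.perf_counter()
    enhanced = enhance(image)
    t1 = time.perf_counter()
    mask = segment(enhanced)
    t2 = time.perf_counter()
    timings = {
        "enhance_ms": (t1 - t0) * 1e3,
        "segment_ms": (t2 - t1) * 1e3,
        "total_ms": (t2 - t0) * 1e3,
    }
    return mask, timings


def toy_enhance(pixels):
    """Stand-in enhancer: double each pixel value, clipped to 255."""
    return [min(255, 2 * p) for p in pixels]


def toy_segment(pixels):
    """Stand-in segmenter: label pixels at or above 128 as smoke (1)."""
    return [1 if p >= 128 else 0 for p in pixels]


mask, timings = run_cascade([10, 70, 200], toy_enhance, toy_segment)
```

Reporting the two stage times separately makes it easier to see which stage dominates the total when real-time requirements are tight.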
To evaluate the efficacy of our approach in real low-light environments, we perform enhancement and segmentation of low-light smoke images in real nighttime environments. The results are shown in Figure 20.
Because low-light images in real-world scenes differ from synthetic low-light ones, the characteristics of non-smoke objects are unclear. As a result, our enhancement network recovers smoke features more readily, while non-smoke areas remain dark and unidentifiable, as in the original images. The results indicate that our enhancement network can effectively restore the smoke features in low-light images, and our segmentation model can complete the segmentation task on the enhanced images. Therefore, our proposed method is effective and practical in restoring smoke features and segmenting smoke areas in low-light images.

Conclusions
We propose a low-light smoke image segmentation method utilizing a cascaded image enhancement algorithm with a semantic segmentation algorithm. The method shows excellent performance in accurately segmenting smoke images in low-light environments. To address the challenge of unclear smoke features in low-light environments, we propose a low-light smoke image enhancement network based on CycleGAN. Furthermore, we propose a multi-scale smoke semantic segmentation network based on HRNet and HRFormer to segment smoke areas in enhanced images accurately. Through a series of experiments, our method's ability to effectively complete the segmentation task of smoke in low-light environments is proven.
In addition, the real images collected at nighttime have unclear color information, making many details difficult to recover. Therefore, improving the suppression ability of irrelevant information in the semantic segmentation algorithm will be the direction of our research in the future.

Conflicts of Interest:
The authors declare no conflicts of interest.