Overcoming Adverse Conditions in Rescue Scenarios: A Deep Learning and Image Processing Approach

Abstract: This paper presents a Deep Learning (DL) and Image-Processing (IP) pipeline that addresses exposure recovery in challenging lighting conditions to enhance First Responders' (FRs) Situational Awareness (SA) during rescue operations. The method aims to improve the quality of images captured by FRs, particularly in overexposed and underexposed environments, while providing a response time suitable for rescue scenarios. The paper describes the technical details of the pipeline, including exposure correction, segmentation, and fusion techniques. Our results demonstrate that the pipeline effectively recovers details in challenging lighting conditions, improves object detection, and is efficient in high-stress, fast-paced rescue situations.


Introduction
Situational Awareness (SA) is critical for First Responders (FRs) during rescue operations. FRs must have a clear understanding of the location of individuals in need, any potential risks, and any other factors essential to performing their duties properly and safely. The use of Artificial Intelligence (AI) and advances in object detection technology can greatly enhance FRs' SA by identifying and highlighting key elements, reducing their cognitive burden, and improving their perception. However, these improvements are effective only in well-lit environments, whereas in real-world scenarios FRs often face challenging conditions, such as smoke, dust, or limited visibility, which can impair both human and AI perception. To overcome these challenges, it is necessary to equip FRs with tools and resources capable of improving their SA in all situations and conditions.
One critical aspect of enhancing FRs' perceptual ability is to improve their visibility under adverse conditions, such as overexposed and underexposed scenarios. Traditional approaches to image restoration and exposure correction rely on image histogram adjustments. Histogram-based methods operate on the statistical distribution of pixel intensities in an image. These methods modify the histogram to make the image visually appealing and better expose the features in the scene [1][2][3][4]. Other techniques, related to Retinex theory [5], have also found wide application, both within and outside of learning-based methodologies [6][7][8][9][10][11]. According to Retinex theory, the perceived colour of an object depends on both the spectral reflectance of the object and the spectral distribution of the incident light. Retinex-based exposure correction algorithms recover the original reflectance of the scene by removing the effects of the illumination. Further research has shown the effectiveness of High Dynamic Range (HDR) based techniques for exposure correction. The most common technique for creating HDR images is to exploit the information of a stack of bracketed exposure Low Dynamic Range (LDR) images [12][13][14][15]. Other techniques reconstruct an HDR image from a single LDR image [16][17][18][19][20].
Despite significant advances in exposure correction techniques, it is essential to note that existing methods are not optimised for rescue scenarios and may not meet the specific needs of FRs. To address this gap, we present a novel pipeline that integrates Deep Learning (DL) and Image-Processing (IP) techniques specifically tailored to improve the operational abilities of rescuers. Our method is designed to meet three main requirements: detailed feature recovery, enhanced object detection, and efficiency. Detailed feature recovery refers to the model's ability to enhance a scene by revealing hidden features, whereas improved object detection capabilities refer to the model's capacity to enhance other DL techniques by acting as an intelligence amplification (IA) layer. Furthermore, the pipeline has been designed to handle the high-stress, fast-paced nature of rescue situations, where accurate and quick information is crucial.
As a proof of concept, the pipeline has been tested on a data set created in the framework of the Horizon 2020 project, "first RESponder-Centered support toolkit for operating in adverse and infrastrUcture-less EnviRonments" (RESCUER), during the Earthquake Pilot performed in Weeze, and organised by I.S.A.R Germany (International Search and Rescue organisation).
The paper is structured as follows. In Section 2, we describe the technical details of the pipeline, including exposure correction, segmentation, and fusion techniques, indicating the DL and IP features involved in the process. We also introduce the data set used, and how we collected the training data. In Section 3, we present different metrics of model performance validation over the data set constructed in the Weeze Earthquake Pilot. In Section 4, we discuss the results obtained in Section 3, highlighting the strengths and potential applications of the proposed approach. Finally, in Section 5, we summarise the key findings of our study and their implications, as well as highlight limitations and potential areas for future research.

Materials and Methods
In rescue operations, the safety of both rescuers and victims relies on having accurate and up-to-date information about the environment. The unpredictable and rapidly changing conditions of a disaster scene make it essential for rescuers to understand the physical and geographical features of the area, as well as any human-made structures or infrastructures present. However, lighting deficiencies and scene degradation can present significant challenges for rescuers in identifying key features, such as pathways, buildings, and other landmarks. To address these challenges, we propose a pipeline that integrates both DL and IP techniques specifically designed to handle extreme lighting conditions. In the following sections, the figures illustrating the pipeline use images from TM-DIED (The Most Difficult Image Enhancement Dataset) [21], which covers various lighting scenarios featuring diverse intensity shifts between regions that are underexposed, overexposed, and correctly exposed.

Pipeline Design
The pipeline, represented in Figure 1, comprises three main modules, each playing a specific role in improving the quality of the final image. We designed the first two, Exposure Correction (EC) and Exposure Segmentation (ES), to work in parallel, which shortens the time needed to reach the final module. EC aims to adjust the exposure of acquired images to an optimal level, using a DL method that automatically adapts to different lighting conditions. ES isolates Regions Of Interest (ROIs) based on their exposure levels by applying IP techniques such as global thresholding and edge detection. This allows the system to process each ROI independently, ensuring that the final image contains only the information of interest. Finally, the last module, Image Fusion (IF), combines the EC and ES results to produce the final outcome. IF takes as input the information of all ROIs and corrections collected from the previous stages and fuses them to create a single image with an optimal exposure level.
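The parallel arrangement described above can be illustrated with a minimal sketch. The three stage functions are hypothetical placeholders supplied by the caller, not the implementation used in this work, and the use of a thread pool is an assumption made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_pipeline(
    image: Any,
    exposure_correction: Callable[[Any], Any],      # hypothetical EC stage
    exposure_segmentation: Callable[[Any], Any],    # hypothetical ES stage
    image_fusion: Callable[[Any, Any, Any], Any],   # hypothetical IF stage
) -> Any:
    """Run EC and ES concurrently on the same input, then fuse their outputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ec_future = pool.submit(exposure_correction, image)
        es_future = pool.submit(exposure_segmentation, image)
        corrected = ec_future.result()   # blocks until EC has finished
        masks = es_future.result()       # blocks until ES has finished
    # IF only starts once both parallel modules have produced their results.
    return image_fusion(image, corrected, masks)
```

With this layout, the waiting time before fusion is governed by the slower of the two parallel modules rather than by their sum.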

Exposure Correction (EC)
The EC module adjusts the brightness and contrast of an image to achieve an optimal level of visibility and detail. In recent years, researchers have proposed several DL methods for exposure adjustments and demonstrated their effectiveness in several applications, including surveillance, robotics, and photography [22][23][24][25].
In the following section, we describe the DL framework for exposure adjustment in rescue scenarios. We subdivide it into two parallel branches, namely the Under-Exposure Branch (UE_b) and the Over-Exposure Branch (OE_b), which perform the under- and overexposure corrections, respectively. Both share the same unsupervised DL model designed for low-light image enhancement. In UE_b, we apply the model directly to the original images to correct and recover their dark areas. For OE_b, instead, we introduce a preprocessing step on the original images to treat the overexposure correction as if it were an underexposure recovery problem. After the application of the model, at the end of OE_b, a post-processing step is necessary to restore the original appearance of the images. Further details on the processing procedures are given at the end of this section. The high-level architecture of the EC module is outlined in Figure 2.
Further details about the main components of EC are provided below.
Low-Light DL model: As the DL model, we used SCI (Self-Calibrated Illumination) [8]. The model is designed to be fast and flexible, making it well suited to real-time situations with unpredictable lighting conditions, such as those encountered in rescue scenarios. To train the model, we selected low-light images from several publicly available datasets, including the LOL dataset [6], the Ex-Dark dataset [26], and MIT-Adobe FiveK [27]. The heterogeneous nature of these sources was crucial to ensure that the training data set was diverse and representative of a wide range of low-light scenarios.
The resulting training dataset comprises 2500 images captured under different indoor and outdoor lighting conditions, ranging from high-quality images with good resolution and minimal noise to very noisy ones with low visibility. The dataset features various scenes and subjects, including landscapes, indoor scenes, and objects, thus providing a varied set of low-light scenarios. Diversity was further increased through data augmentation techniques such as rotation, flipping, and cropping, which create additional images and improve the model's ability to generalise to different scenarios.
OE_b processing: To perform the overexposure correction, we first pre-process the input in the CIELAB colour space, since it separates colour information (encoded in the a* and b* channels) from lightness information (encoded in the L* channel). Manipulating the L* channel independently makes it possible to reverse brightness without affecting the colour appearance of the original images. In addition, CIELAB is designed to be perceptually uniform, which means that equal changes in lightness should appear equally perceptible. Therefore, inverting the L* channel results in negated images in which the relative differences between the lightness values of different colours remain consistent. The pre-processed image is then fed to the model in OE_b (see Figure 3). Once the model has been applied, we post-process the image by inverting the L* channel again to recover its original lighting distribution.
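The pre- and post-processing around the low-light model can be sketched as follows. This is a minimal illustration under stated assumptions: the image is already in CIELAB with L* stored as floats in [0, 100] (conversion to and from RGB is left to the caller), and `enhance_low_light` is a hypothetical callable standing in for the trained model, assumed here to accept and return arrays in that same layout.

```python
import numpy as np


def invert_lightness(lab: np.ndarray) -> np.ndarray:
    """Return a copy of a CIELAB image with L* mapped to 100 - L*.

    The a* and b* channels (indices 1 and 2) are left untouched.
    Assumes L* is stored in [0, 100] in channel 0.
    """
    out = lab.astype(np.float64, copy=True)
    out[..., 0] = 100.0 - out[..., 0]
    return out


def correct_overexposure(lab: np.ndarray, enhance_low_light) -> np.ndarray:
    """Treat overexposure as an underexposure problem on the negated image."""
    negated = invert_lightness(lab)          # bright regions become dark
    enhanced = enhance_low_light(negated)    # hypothetical low-light model call
    return invert_lightness(enhanced)        # restore the original polarity
```

Because the inversion is its own inverse, applying it before and after a model that leaves the image unchanged returns the input exactly, which is a convenient sanity check for the two processing steps.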

Exposure Segmentation (ES)
EC outputs complementary results, which need to be merged based on the exposure conditions of the original image. The ES module identifies the overexposed or underexposed regions of the original image, thus allowing the construction of a final image with a more balanced exposure and a wider range of details. Specifically, the module outputs three binary images (M_OE, M_UE, and M_CE), which refer to the Over-, Under-, and Correctly-Exposed Masks, respectively (see Figure 4).
We created the segmentation masks using lightness, saturation, and contrast information, as detailed in Equations (1)-(5). The first two components are retrieved from the HSV (Hue, Saturation, Value) color space, which separates the chromatic (H and S) from the lightness information (V). The Contrast (C) information is obtained by taking the absolute value of the Laplacian of the grayscale version of the images and then applying a threshold to the resulting Laplacian image. This thresholding step helps suppress low-amplitude noise and retain only the most prominent features, resulting in the thresholded version of C, denoted as T_C.
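As an illustration of the contrast component, the sketch below thresholds the absolute Laplacian of a grayscale image. The 4-neighbour discrete Laplacian, the edge-replicating border, and the default threshold of 0.1 are assumptions chosen for the example; the threshold actually used for T_C is not stated in this text.

```python
import numpy as np


def contrast_mask(gray: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Binary mask of pixels whose absolute Laplacian exceeds `threshold`.

    `gray` is a 2-D grayscale image; `threshold` is a placeholder value.
    """
    # Replicate border pixels so the output has the same shape as the input.
    g = np.pad(gray.astype(np.float64), 1, mode="edge")
    # 4-neighbour discrete Laplacian: up + down + left + right - 4 * centre.
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return np.abs(lap) > threshold
```

Flat regions produce a zero Laplacian and are suppressed, whereas isolated intensity changes and their immediate neighbours are retained.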
The three visual attributes that contribute to mask generation are determined by the properties of the regions being segmented. Overexposed regions occur when too much light enters the camera, resulting in washed-out or overly bright images that appear white and blown out. Conversely, underexposure happens when too little light enters the camera, resulting in dark images with less vibrant colours.
Masks were obtained using thresholding relationships on these attributes (Equations (1)-(5)), where M_OE, M_UE, and M_CE are the Over-Exposed, Under-Exposed, and Correctly-Exposed Masks, and V and S are the Value and Saturation channels of the HSV version of the image, scaled to [0, 1]. To include edge and texture information, each mask M_* was additionally combined with the binary contrast mask T_C.
The selection of threshold values for the over- and underexposed masks was guided by both empirical observation and domain knowledge, the latter being derived from a literature review of typical brightness and saturation ranges for such regions. This knowledge allowed us to narrow the range of values to test empirically and helped us select threshold values that were more likely to capture the relevant characteristics of the image [28][29][30][31][32].
We experimentally tested various threshold values to determine the optimal ones and visually evaluated the resulting masks to assess their accuracy in identifying over- and underexposed regions. Based on this evaluation, we found that a threshold value of 0.9 for brightness and 0.1 for saturation was effective in detecting overexposed regions, while a threshold value of 0.1 for brightness was effective in identifying underexposed regions. Examples of masks generated using different thresholds are presented in Appendix A. Figure 5 highlights some results of the ES module.
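The reported thresholds can be turned into a short sketch of the mask computation. Several details are assumptions rather than statements of the method: the inequalities are taken as strict, M_CE is taken as the complement of the other two masks, and the additional combination with T_C is omitted.

```python
import numpy as np


def exposure_masks(v: np.ndarray, s: np.ndarray,
                   v_high: float = 0.9, s_low: float = 0.1,
                   v_low: float = 0.1):
    """Return (M_OE, M_UE, M_CE) as boolean arrays.

    `v` and `s` are the HSV Value and Saturation channels scaled to [0, 1].
    """
    m_oe = (v > v_high) & (s < s_low)   # very bright and washed out
    m_ue = v < v_low                    # very dark
    m_ce = ~(m_oe | m_ue)               # assumption: everything else
    return m_oe, m_ue, m_ce
```

Since `v_high` is larger than `v_low`, no pixel can satisfy both conditions, so under these assumptions the three masks are mutually exclusive and cover every pixel.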

Image Fusion (IF)
The fusion formula, Equation (6), ensures that the final image is a combination of the correctly exposed regions of the original image and the regions corrected by the UE_b and OE_b branches, as defined by their respective masks. This property is guaranteed by the fact that the three masks (M_CE, M_OE, and M_UE) are mutually exclusive and together form a matrix of ones, except for the edges.
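A form of the fusion rule that is consistent with this description can be written as below. It is a reconstruction under assumptions, not a quotation of the original equation: Y denotes the fused image, X the original image, X_UE and X_OE the outputs of the two correction branches, and the symbol \odot is assumed to denote element-wise multiplication.

```latex
Y = M_{CE} \odot X + M_{UE} \odot X_{UE} + M_{OE} \odot X_{OE}
```

Because the masks are mutually exclusive, each pixel of Y is taken from exactly one of the three images wherever the masks sum to one.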
Although Equation (6) ensures that the resulting image covers all the desired regions, it may produce abrupt transitions and visible artefacts. Therefore, to produce a final image with smooth transitions between regions and a natural-looking appearance, we utilise Laplacian Pyramid Blending [34], which involves the following steps:
Pyramid Creation: Construct the Laplacian pyramids of X, X_UE, and X_OE, as well as the Gaussian pyramids of M_CE, M_UE, and M_OE.
Layer Fusion: Apply Equation (6) layer-wise to the pyramids generated in the previous step to obtain the pyramid of blended images (Pyr_b).
Reconstruction: Reconstruct the final image from Pyr_b as follows. At each level, expand the current lower-resolution layer to match the size of the next level in the pyramid, and add the expanded layer to the corresponding layer in the pyramid to form a higher-resolution blended image. Repeat this process until the final level is reached, yielding the final blended image at the original resolution.
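The three steps above can be sketched as follows. This is a simplified illustration, not the implementation used in this work: 2x2 block averaging and pixel repetition stand in for the Gaussian filtering of the standard method, inputs are assumed to be single-channel arrays of identical shape, and height and width are assumed divisible by 2**(levels - 1).

```python
import numpy as np


def down(img):
    """Halve resolution by averaging 2x2 blocks (stand-in for blur + subsample)."""
    h, w = img.shape[:2]
    return img.reshape(h // 2, 2, w // 2, 2, *img.shape[2:]).mean(axis=(1, 3))


def up(img):
    """Double resolution by pixel repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)


def gaussian_pyramid(img, levels):
    pyr = [img.astype(np.float64)]
    for _ in range(levels - 1):
        pyr.append(down(pyr[-1]))
    return pyr


def laplacian_pyramid(img, levels):
    g = gaussian_pyramid(img, levels)
    # Each layer stores the detail lost when moving to the next coarser level.
    return [g[i] - up(g[i + 1]) for i in range(levels - 1)] + [g[-1]]


def blend(images, masks, levels=3):
    """Pyramid creation, layer-wise fusion, and reconstruction."""
    lap = [laplacian_pyramid(x, levels) for x in images]
    gau = [gaussian_pyramid(m, levels) for m in masks]
    # Layer fusion: mask-weighted sum of the image layers at every level.
    fused = [sum(gp[k] * lp[k] for gp, lp in zip(gau, lap))
             for k in range(levels)]
    # Reconstruction: expand the coarse result and add the finer layer.
    out = fused[-1]
    for layer in reversed(fused[:-1]):
        out = up(out) + layer
    return out
```

A call such as `blend([X, X_UE, X_OE], [M_CE, M_UE, M_OE])` pairs each image with its mask. When the masks sum to one at every pixel and all three inputs are identical, the output equals the input, which gives a quick check that the fusion introduces no artefacts of its own.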

Results
In this section, we report on the results of the pipeline and show that it meets the requirements for rescue scenarios. To do so, we evaluate its performance in terms of details recovery, object detection improvement, and efficiency.
As stated in Section 1, we measure the effectiveness of our method on the data set collected at the Weeze earthquake pilot. Our dataset comprises 282 images that depict earthquake scenarios, taken both indoors and outdoors. The images are available in both RAW and JPG formats and have a resolution of 3840 × 5750 pixels. They were captured in a variety of exposure settings, mostly featuring buildings and humans in the scenes.

Details Recovery
A crucial aspect of the pipeline is enhancing images through heightened detail, but the subjective nature of image quality poses a challenge in demonstrating improvements. To address this issue, we analysed several aspects of our results.
Reference-based Image Quality Analysis: In this section, we evaluate the quality of the images produced by the pipeline by comparing them with reference images (R). R is obtained by using Automatic Exposure Bracketing (AEB) [35], a technique that captures the same scene several times at different exposure levels.
To quantify the similarity between the reference images and the pipeline's outputs, we utilise two widely used image quality metrics: Mean Squared Error (MSE) and Structural Similarity Index (SSIM) [36]. Both metrics are computed for both the original and the output images with respect to the reference images. MSE computes the average of the squared differences between the pixel intensities of two images, while SSIM compares the structural information and texture of the images to provide a measure of similarity. By deriving both metrics for the original images and the pipeline's outputs, we determine whether the pipeline outputs are more similar to the reference images than the original ones. The average MSE and SSIM for both sets of images are presented in Table 1. In addition to reporting the average MSE and SSIM for the original and output images, we analysed the distribution of these metrics across all images used in our evaluation. In Appendix B, we use box plots to visualise the distribution of the MSE and SSIM values for both sets of images (Figure A4); we also display a sample of the images along with their corresponding MSE and SSIM values in Tables A1 and A2.
Image Characteristics Evaluation: The image characteristics of both original images and pipeline results were assessed using a set of metrics, including Texture (T), Entropy (E), Object Count (OC), Segmentation Masks (SM), and Hue Similarity Index (HSI).
T (Texture): Measures the visual pattern or structure of the images. We compute it as the variance of randomly selected image windows.
E (Shannon Entropy): Measures the degree of disorder in the images. A higher value indicates that the image has more information content, whereas a lower value indicates a lower degree of uncertainty. Comparing the Shannon entropy [37] of X and Y quantifies the changes in information content resulting from the correction process.
OC (Object Count): Represents the number of objects in an image. We obtain the metric by counting the connected components of the images, where a connected component is defined as a set of pixels that are connected through a path of neighbouring pixels of similar intensities.
Table 2 displays the values of the three metrics mentioned above, calculated for both the original and corrected images. To reduce any noise that may have been introduced during the exposure correction process, we applied an average smoothing filter to the images. Taking this step ensures that the metrics reflect the amount of detail recovered from the corrected images.
SM (Segmentation Masks): Segmentation masks are used to identify different regions or objects within an image. In particular, we used the mask M_CE to identify well-exposed areas of the images before and after correction, as shown in Figure 6.
HSI (Hue Similarity Index): Measures the similarity between the hue values of the original and corrected images. We calculated HSI as the Pearson correlation coefficient [38] on the Hue channel of the HSV version of the original and corrected images. Figure 7 shows the box plot of the HSI values for the original and corrected hue channels. Our results demonstrate a linear relationship between the hue channels of the original and corrected images, indicating that our pipeline effectively preserves colour consistency.
Visual Inspection: To demonstrate the effectiveness of our procedure, we compare the results (Y) obtained by applying the pipeline to the initial images (X) with the reference images (R), as shown in Figure 8.
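The reference-based metrics (MSE, SSIM) and three of the image-characteristic metrics (T, E, HSI) described in this section can be sketched as below. Several choices are assumptions made for illustration: SSIM is computed once over the whole image rather than averaged over local windows as in the standard formulation, the texture score averages the variances of the sampled windows, and the window size, number of samples, and seed are placeholder values. Object Count is not covered.

```python
import numpy as np


def mse(a, b):
    """Mean of the squared pixel-wise differences."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.mean((a - b) ** 2))


def global_ssim(a, b, data_range=255.0):
    """SSIM evaluated on global image statistics (single-window simplification)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2))
                 / ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)))


def shannon_entropy(gray_u8):
    """Entropy in bits of the 256-bin histogram of an 8-bit grayscale image."""
    counts = np.bincount(np.asarray(gray_u8, np.uint8).ravel(), minlength=256)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())


def texture(gray, window=8, samples=100, seed=0):
    """Mean intensity variance over randomly placed square windows."""
    gray = np.asarray(gray, float)
    rng = np.random.default_rng(seed)
    h, w = gray.shape
    ys = rng.integers(0, h - window + 1, samples)
    xs = rng.integers(0, w - window + 1, samples)
    return float(np.mean([gray[y:y + window, x:x + window].var()
                          for y, x in zip(ys, xs)]))


def hue_similarity(hue_a, hue_b):
    """Pearson correlation coefficient between two hue channels."""
    return float(np.corrcoef(np.ravel(hue_a), np.ravel(hue_b))[0, 1])
```

Each metric would be evaluated once on the original image X and once on the corrected image Y (against the reference R for MSE and SSIM), so that the two values can be compared directly.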

Object Detection Improvement
In rescue scenarios, accurately locating and identifying individuals is critical. However, when a scene is not properly exposed, certain features of its objects can be obscured or washed out, making it difficult for the detector to identify them. To test and evaluate the pipeline's effectiveness in identifying people, we used the YOLOv7 [39] object detector, which is a state-of-the-art deep learning model for object detection. Specifically, we used the pre-trained weights of YOLOv7 on the Microsoft COCO: Common Objects in Context data set [40], which contains a large number of annotated images of objects belonging to 80 different categories, including people. We applied the detector to both the original and corrected images, to assess its performance in identifying people under different lighting conditions.
The detector performances were evaluated using: Precision (P), Recall (R), F1-score (F1), and Average Precision (AP). Precision computes the proportion of true positives (correctly detected people) among the total number of people detected. Recall is a measure of the proportion of true positives among the total number of actual people. F1-score is the harmonic mean of precision and recall. Average Precision (AP) evaluates the accuracy of the detector based on its precision and recall and it is calculated as the area under the Precision-Recall (PR) curve, which plots the precision values against the corresponding recall values at different detection thresholds.
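The four measures can be made concrete with a short sketch. The function names and the representation of detections as (confidence, is_true_positive) pairs are hypothetical; the AP shown is the plain, non-interpolated area under the PR curve, whereas evaluation toolkits often use interpolated variants, so values may differ slightly from those reported by a detector's own tooling.

```python
from typing import Iterable, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def average_precision(detections: Iterable[Tuple[float, bool]],
                      n_ground_truth: int) -> float:
    """Non-interpolated area under the PR curve.

    Detections are ranked by confidence; each true positive raises recall
    by 1 / n_ground_truth and contributes the precision at its rank.
    """
    if n_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = 0
    ap = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            tp += 1
            ap += (tp / rank) / n_ground_truth
    return ap
```

For instance, with two people in the scene and three detections of which the first and third are correct, the precision at the two recall steps is 1 and 2/3, giving an AP of about 0.83.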
The performances are summarised through the Precision-Recall (PR) and F1-Confidence curves shown in Figure 9a,b. Additionally, the Average Precision (AP) is reported in Table 3 to provide a comprehensive evaluation of our model's performance. In Figure 10, we also provide a visual comparison of the predicted bounding boxes.

Efficiency
Fast corrections are of the utmost importance in real-time operational scenarios. Therefore, one of the requirements of the pipeline is a response time suitable for rescue scenarios. To evaluate the pipeline efficiency, we first calculate the time needed to arrive at the IF module (1st stage) and then the time required for the fusion (2nd stage). Since the branches of the 1st stage run in parallel, its duration is the longest running time among them (see Figure 11). In Table 4, we report the running time (s) for images of different shapes (Width (W), Height (H), Channels (C)), averaged over 10,000 experiments conducted on an NVIDIA GeForce GTX 1650 using the CUDA toolkit version 11.7.
Figure 9. (b) F1-Confidence curve. Panels (a,b) show that a greater Area Under the Curve (AUC) corresponds to the corrected images.
Figure 11. Stages overview. The figure represents the stages of the total computing time.
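The way the two stages combine into a total response time can be sketched as follows. The helper names and the branch labels in the usage note are hypothetical, and any timings passed in are illustrative rather than measured values; when timing GPU work, the device would also need to be synchronised before the clock is read.

```python
import time
from typing import Callable, Dict


def timed(fn: Callable[[], object]) -> float:
    """Wall-clock duration of a single call to `fn`, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def total_latency(branch_seconds: Dict[str, float],
                  fusion_seconds: float) -> float:
    """1st stage costs as much as its slowest parallel branch; fusion is added on top."""
    return max(branch_seconds.values()) + fusion_seconds
```

For example, `total_latency({"UE_b": 0.03, "OE_b": 0.05, "ES": 0.01}, 0.02)` yields roughly 0.07 s, because the 1st stage is bounded by the slowest of the parallel branches.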

Discussion
The results of our pipeline indicate that it effectively meets the requirements for rescue scenarios. Our methodology was able to improve image detail, enhance object detection, and provide fast processing.
As described in Section 3.1 (Details Recovery), we used several metrics to evaluate the image characteristics of the pipeline's outputs compared to the original images. The results showed that the pipeline was able to improve the complexity of the image's texture, increase the information content, and augment the number of objects detected. Reference-based image quality analysis also demonstrated that the pipeline was able to produce images with higher quality compared to the original images, as evidenced by the reduced MSE and increased SSIM. Visual inspection further supported these results, providing a clear side-by-side comparison of the original, reference, and corrected images.
One of the main objectives of this model was to improve the lighting conditions of the images in order to boost the performance of other vision DL models, and this goal has been achieved, as detailed in Section 3.2 (Object Detection Improvement). That section highlighted how the pipeline was able to improve the performance of the YOLOv7 object detector in the detection of people. The results indicated that the corrected images allowed the model to identify individuals with higher Precision, Recall, F1-score, and Average Precision compared to the original images. The Precision-Recall and F1-Confidence curves, as well as the visual comparison of the predicted bounding boxes, provided clear evidence of the improvement in people detection.
Section 3.3 (Efficiency) indicates that the pipeline was able to process and provide information quickly enough to meet real-time requirements, which is essential in rescue scenarios. The results showed that the pipeline was able to reach the IF module (1st stage) and complete the fusion stage in a reasonable time, even for images with different shapes.

Conclusions
In this study, we propose a pipeline of Deep Learning (DL) and Image-Processing (IP) techniques to improve the Situational Awareness (SA) of First Responders (FRs) in adverse lighting conditions. The pipeline consists of three modules: Exposure Correction (EC), Exposure Segmentation (ES), and Image Fusion (IF). EC aims to adjust the exposure of the acquired images to an optimal level, ES separates the images into regions of interest based on their exposure levels, and IF combines both the EC and ES results to produce the final image. The proposed method offers a solution to the challenges faced by FRs to enhance their operational capacity and allow them to make more informed decisions in high-pressure and fast-paced rescue scenarios.
While our method has shown improvements in image detail recovery, object detection performance, and fast processing, there is still room for improvement to address its limitations and expand its applicability. In particular, while our ES module strikes a balance between image enhancement and fast inference, we recognize that there is potential to explore adaptive thresholds or learning-based segmentation to improve its effectiveness in complex situations. Additionally, generating exposure segmentation masks with high precision is essential for ensuring the accuracy and reliability of our model.
To further enhance the performance and applicability of our method, we plan to modify the pipeline by removing its dependency on the ES module. Specifically, we aim to use the EC module to generate multiple images with different exposure levels and merge them to create an image with a wider dynamic range. We also intend to explore the integration of our pipeline into augmented reality visualisation tools such as the HoloLens, which could offer new possibilities for real-world applications.
Through the Modane pilot, we plan to collaborate with professional first responders, including firefighters and medical rescuers, to gather feedback and refine the pipeline based on input from actual end-users.
In conclusion, the proposed pipeline provides a solution to the challenges faced by FRs under adverse lighting conditions, and the results indicate its potential for further development and implementation in real-world scenarios.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Here we provide an overview of the experimental setup and methodology used to generate the masks discussed in Section 2.1.2. The aim of these experiments was to determine optimal threshold values to detect over- and underexposed regions in images, based on a combination of empirical observation and domain knowledge. To begin, we conducted a thorough literature review of typical brightness and saturation ranges for over- and underexposed regions in images, which allowed us to narrow down the range of values to test empirically. We then systematically tested various threshold values for brightness and saturation, visually evaluating the resulting masks to determine their accuracy in identifying over- and underexposed regions. Figures A1-A3 represent the experiments.

Appendix B
Here, we present supplementary information regarding the outcomes of our methodology. Specifically, we include box plots that illustrate the mean squared error (MSE) and structural similarity index measure (SSIM) of both the original and corrected images in comparison to the reference images, as depicted in Figure A4. Additionally, we offer two tables containing the MSE and SSIM values of a randomly selected subset of 20 images each. Table A1 represents a sample of MSE for the original and corrected images compared to the reference images. The MSE values demonstrate the accuracy of the images before and after correction, with lower values indicating better results.
Similarly, Table A2 shows a sample of the SSIM values for both the original and corrected images with respect to the reference images. SSIM values measure the structural similarity between images and are used to evaluate the quality of corrected images. Higher SSIM values indicate better similarity between images.