A multimodal deep learning approach for gravel road condition evaluation through image and audio integration

This study investigates the combination of audio and image data to classify road conditions, particularly focusing on loose gravel scenarios. The dataset underwent binary categorisation, comprising audio segments capturing gravel sounds and corresponding images. Early feature fusion, utilising a pre-trained Very Deep Convolutional Networks 19 (VGG19) and Principal component analysis (PCA), improved the accuracy of the Random Forest classifier, surpassing other models in accuracy, precision, recall, and F1-score. Late fusion, involving decision-level processing with logical disjunction and conjunction gates (AND and OR) in combination with individual classifiers for images and audio based on Densely Connected Convolutional Networks 121 (DenseNet121), demonstrated notable performance, especially with the OR gate, achieving 97 % accuracy. The late fusion method enhances adaptability by compensating for limitations in one modality with information from the other. Adapting maintenance based on identified road conditions minimises unnecessary environmental impact. This method can help to identify loose gravel on gravel roads, substantially improving road safety and implementing a precise maintenance strategy through a data-driven approach.


Introduction
Loose gravel on gravel roads significantly challenges road safety and maintenance efforts. Loose gravel can lead to reduced traction, vehicle skidding, and increased dust emissions, potentially causing hazardous conditions for drivers and pedestrians alike. Accurate and timely detection of loose gravel is paramount for traffic agencies to initiate maintenance measures promptly and ensure the safety of road users.
Traditional methods of loose gravel detection on gravel roads have relied on manual inspections, often limited in scope and subject to human error. In recent years, machine learning and multimodal sensor fusion advancements have provided opportunities to revolutionise gravel road condition assessment, offering a more data-driven and precise approach to detecting loose gravel.
In [1,2], the possibility of objectively classifying the loose gravel conditions using audio and images independently was investigated, and the results were promising. This paper introduces an approach to detecting loose gravel on gravel roads, utilising the fusion of spectrograms from audio recordings and images captured from the road surface. This multimodal fusion aims to significantly enhance the accuracy and reliability of loose gravel detection, aligning closely with the standards set forth by road traffic agencies worldwide.
The proposed methodology harnesses the synergistic nature of audio and image data, recognising that each modality brings unique insights into loose gravel detection. Audio spectrograms capture acoustic signatures, such as gravel impacts, surface disturbances, and vehicle-induced vibrations, offering valuable acoustic signals indicative of loose gravel presence. Meanwhile, images provide high-resolution visual information about the road surface, enabling the detection of loose gravel patches, displacement, and surface irregularities.
This paper examines two fusion methods specifically designed for detecting loose gravel: feature-level fusion and decision-level fusion. Feature-level fusion involves combining features from two different sources, in this case images and audio from gravel roads. On the other hand, decision-level fusion occurs at a later stage, combining decisions from models trained separately on images and audio. Fusion techniques offer a notable advantage in classification by enhancing accuracy and robustness. This advantage stems from their ability to effectively utilise complementary information from different sources or modalities, addressing the limitations of individual methods. Integrating data modalities or decision outputs improves accuracy, reliability, and adaptability, particularly in handling complex classification tasks [3,4].
The suggested framework for loose gravel detection aims to provide an objective method aligning with the Swedish Road Transportation Agency (Trafikverket) standards. Automating the loose gravel assessment would improve safety conditions on gravel roads and decrease maintenance response times.

Literature review
Many studies have used multimodal approaches for classification tasks, and multimodal methodologies have been shown to benefit such tasks [3][4][5]. By seamlessly integrating information from diverse modalities, these approaches exhibit enhanced performance compared to their unimodal counterparts. The fusion of different modalities provides a robust and redundant framework, ensuring the system's resilience, even in noisy or incomplete data. Multimodal models excel in handling ambiguity and demonstrate improved generalisation, making them adaptable across various scenarios and datasets. Their versatility extends across domains, such as computer vision, natural language processing, and healthcare [6][7][8].
Moreover, these methodologies mirror the human-like perception that integrates multiple sensory inputs, aligning with the holistic nature of human cognition. Multimodal models prove valuable in scenarios where data may be incomplete or missing in one modality, and they facilitate transfer learning, enabling the transfer of knowledge between modalities or tasks. In essence, the advantages of multimodal methodologies lie in their ability to harness the strengths of different modalities synergistically, resulting in more robust, versatile, and practical solutions for classification tasks.
Considering the maintenance of gravel roads in Sweden by Trafikverket, the current assessment methods involve subjective evaluations based on guidelines, incorporating factors such as crossfall, irregularities, loose gravel, and dust [9]. These are rated subjectively and, in some cases, involve manual measurements using specialised equipment. However, due to the high costs associated with alternative objective methods, such as laser scanners, they are typically not employed, prioritising the minimisation of gravel road maintenance expenses [10]. Fig. 1 illustrates loose gravel conditions and their grades, depicting Road Type 1 as well-maintained and Road Type 4 as severely deteriorated, following the grading system in [9]. This visual reference provides a clear insight into the spectrum of gravel road conditions under assessment.
A dust classification algorithm was developed by [11] for the Gravel Roads Management System, using smartphone images to classify dust amounts on gravel roads accurately. The algorithm was validated against dustometer measurements. The results showed that the algorithm is a cost-effective and accurate alternative, offering potential assistance to local agencies in maintenance planning regarding dust evaluation on gravel roads. The study explores challenges with gravel pavement, noting its lower construction costs but inferior performance to asphalt. Dust emission, deformation, and deepening ripples impact vehicle vibrations, fuel consumption, and driving comfort. The research proposes a methodology for gravel pavement evaluation, measuring profiles and analysing the international roughness index (IRI). Findings stress the importance of timely maintenance. The study's objectives include adapting road roughness indicators for gravel pavement and evaluating dynamic responses, with specific speed ranges (30-45 km/h and 90 km/h) indicating the need for careful prediction of safe driving speeds.
In a recent study by [12], a semi-automated approach utilising UAV-captured images from a one-kilometre road segment was introduced to identify and extract parameters of unpaved road surfaces, such as potholes and rutting. This method addresses the crucial necessity for efficient road condition surveys. The research was conducted in the Ofirikrom Municipality, Ghana, showcasing the correlation between UAV imagery and conventional field methods, suggesting the potential for cost-effective road maintenance monitoring. Although the study was confined to a limited road length, it suggests future endeavours for a fully automated methodology to enhance road condition assessment further.
The literature review highlights a predominant emphasis on overall road roughness in existing research, revealing a relatively lesser focus on identifying distinct distress types on gravel roads. Notably, there is a research gap in the automation of loose aggregate assessment [13][14][15][16]. There is potential in exploring avenues that involve integrating data from various sources, including sound and images, offering promise for the automated assessment of loose gravel on these roads.

Methodology
This section describes the study's methodology. It begins with an overview of multimodal fusion, then covers the data collection approach and the preprocessing steps. Subsequently, the two distinct techniques utilised in this study, feature-level (early) fusion and decision-level (late) fusion, are introduced.

Overview of multimodal fusion techniques
This study incorporates multimodal fusion techniques. The subsequent section offers a brief technical introduction to general fusion methods.
Multimodal fusion techniques are methodologies used to combine information or data from multiple sources or modalities to enhance the understanding or performance of a system or application. These techniques are valuable for improving models used in affect recognition tasks, which are analysed based on data from various sources such as audio, visual, physiology, and more. Multimodal fusion holds significant merit in the realm of classification tasks. Fusing information from multiple modalities can enrich the feature space, enhance the discriminative power of models, and provide a more comprehensive understanding of complex phenomena. The literature discusses three joint fusion strategies: feature-level, decision-level (or score-level), and model-level fusion [17]. These are discussed below:
• Feature-level fusion: This strategy combines features extracted from different modalities by creating a single feature vector encompassing information from all modalities. This approach mimics how humans process information, where features from various sources, such as audio and visual cues, are integrated before making predictions. Feature-level fusion often requires large training datasets because it captures more information than a single modality alone. Additionally, the modalities should have corresponding data for this strategy to work effectively. One major advantage is that predictions can still be made even if data from one modality are missing [18].
• Decision-level (Score-level) fusion: In this strategy, each modality is used independently to make predictions, and then the scores or results from each modality are combined. A drawback of this approach is that if data from one modality are missing, the full potential of that modality cannot be realised. Fusion can be as simple as a majority vote for classification tasks, but more sophisticated versions may be introduced, e.g., incorporating learning weights. For regression tasks, a linear regressor can be trained using the predictions from each modality, and its weights can be used for the fusion.
• Model-level (Hybrid-level) fusion: This strategy combines the strengths of both feature-level and decision-level fusion strategies. For instance, a model-level fusion might involve performing feature-level fusion for certain modalities and then combining those predictions with scores from other modalities that were processed independently. This approach offers flexibility and can adapt to the specific requirements of the task [19]. An example of model-level fusion is the method proposed by [20], which combines the results of feature-level fusion with scores from independently processed modalities. This hybrid approach aims to harness the benefits of integrating features from some modalities, while still considering the unique information provided by others. This fusion technique can improve performance in affect recognition tasks, especially when dealing with complex and diverse data from multiple sources [21].
Fig. 1. Images of loose gravel conditions with their respective grades. Road Type 1 illustrates a well-maintained road, while Road Type 4 depicts a severely deteriorated gravel road [9].
N. Saeed et al.

Data collection
The data collection involved using two HERO7 GoPro cameras manufactured by GoPro Inc. based in San Mateo, CA, USA. One camera was positioned inside a vehicle to capture audio and video data, while a second camera was mounted on the car's bonnet to obtain recordings with an improved view of the roads. The recordings were made during the summer seasons of 2020 and 2022 along gravel roads in Dalarna, Sweden. The car maintained a constant speed of 50 km/h during these recordings under dry and sunny weather conditions. It is important to note that certain portions of the recorded videos were excluded from the dataset. These excluded segments contained activities such as travelling to the selected road, turning the car around, driving at varying speeds, and conversations between the data collectors. These marked segments did not represent the gravel road conditions the study aimed to analyse.
The dataset consisted of a total of 15 videos, with a combined duration of 1 h, 13 min, and 54 s (01:13:54). The purpose of this data collection was to investigate the gravel road conditions, utilising the audio and video recordings obtained from the GoPro cameras. For more detailed information about the camera and vehicle specifications, refer to the publication by Saeed et al. [2].

Preprocessing
Audio and image data were extracted from recorded videos, resulting in separate datasets for both modalities. Preprocessing procedures were subsequently applied to each dataset. Roboflow's Annotation Tool was instrumental in highlighting the gravel roads by creating bounding boxes and isolating the road sections. This approach was applied so that images with only the crucial aspects of the gravel roads are obtained, while discarding unrelated elements, such as the sky and vegetation. As a result, a new dataset containing solely the road information was generated with the assistance of Roboflow.
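The road-isolation step can be pictured as cropping each frame to its annotated bounding box. The following is a minimal sketch of that idea only, not the Roboflow workflow used in the study; the function name `crop_to_box`, the pixel box format `(x_min, y_min, x_max, y_max)`, and the frame size are assumptions made for illustration.

```python
import numpy as np

def crop_to_box(image, box):
    """Return the part of `image` inside `box` = (x_min, y_min, x_max, y_max), in pixels."""
    height, width = image.shape[:2]
    x_min, y_min, x_max, y_max = box
    # Clamp the box to the image bounds so slightly oversized annotations still work.
    x_min, x_max = max(0, x_min), min(width, x_max)
    y_min, y_max = max(0, y_min), min(height, y_max)
    if x_min >= x_max or y_min >= y_max:
        raise ValueError("bounding box does not overlap the image")
    # NumPy images are indexed as [row (y), column (x), channel].
    return image[y_min:y_max, x_min:x_max]

# Hypothetical 1080x1920 RGB frame with a road box covering the lower part of the image.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
road = crop_to_box(frame, (300, 500, 1600, 1080))
print(road.shape)  # (580, 1300, 3)
```

Cropping before feature extraction keeps sky and vegetation pixels out of the classifier's input, which is the stated purpose of the annotation step.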
Roboflow is a specialised platform tailored for developers and researchers dealing with visual data. It offers a comprehensive set of tools and services for tasks such as image annotation, dataset management, data preparation, and even model deployment in computer vision and image processing [22]. Fig. 2(a) illustrates Roboflow's capability to detect roads, and Fig. 2(b) shows the segmented gravel road with vegetation excluded. This process was undertaken to ensure that during subsequent classification, the algorithms focus on learning features extracted specifically from road conditions.
The audio files were converted into spectrograms, as shown in Fig. 3. During this conversion, the audio signals were broken down into smaller segments, predominantly employing the Short-Time Fourier Transform (STFT) technique [23]. These temporal segments were subsequently translated into image representations, featuring time on one axis and frequency on the other. This transformation resulted in the creation of spectrogram images, effectively rendering the audio data in a visual format conducive to integration with the existing image-based processing pipeline.
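To make the audio-to-spectrogram conversion concrete, the sketch below computes a magnitude spectrogram with a plain NumPy STFT. It is an illustration under stated assumptions rather than the study's pipeline: the frame length, hop size, Hann window, decibel scaling, and the synthetic 1 kHz test tone are all choices made here and are not reported in the source.

```python
import numpy as np

def stft_spectrogram(signal, sample_rate, frame_len=1024, hop=512):
    """Return (freqs_hz, times_s, magnitude_db) for a 1-D signal.

    frame_len and hop are illustrative defaults, not values from the study.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and taper each one with the window.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)
    # Small constant avoids log(0) for silent bins.
    magnitude_db = 20.0 * np.log10(np.abs(spectrum) + 1e-10)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    return freqs, times, magnitude_db.T  # rows = frequency, columns = time

# Synthetic stand-in for an audio clip: one second of a 1 kHz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
freqs, times, spec = stft_spectrogram(tone, sample_rate)
print(spec.shape)  # (513, 30): 513 frequency bins by 30 time frames
```

The resulting 2-D array has frequency on one axis and time on the other, so it can be saved as an image and passed to the same convolutional feature extractor as the road images.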

Dataset
Following the preprocessing of both the image and audio datasets, each dataset was categorised through labelling into two classes: 1&2 and 3&4, aligning with Trafikverket's classification, where 1 represents good road conditions, and 4 indicates the worst road conditions. Classes 1&2 had a combined count of 487 instances, while Classes 3&4 had a sum of 398 for both images and audio. The size of each dataset in total was 885. The class labelling adheres to the guidelines outlined in the Trafikverket Road Maintenance Gravel Road assessment manual [9]. Considering the limited size of the available dataset, we have combined Classes 1 and 2, as well as 3 and 4. These can be considered as roads in good condition and roads in poor condition, respectively. In the case of audio labelling, each audio clip received its label based on its extraction from a corresponding video segment. For example, if the video segment indicated Road Types 1&2, the audio extracted from that section was labelled Classes 1&2. Table 1 presents the details of the dataset.
Within this study, we have implemented both feature-level fusion and decision-level fusion. The following discussion elaborates on the particulars of the processes utilised in this study.

Feature-level fusion
In this study, features were extracted from road images and audio spectrograms using VGG19, a pre-trained convolutional neural network architecture. VGG19 is recognised for its effectiveness in image classification and achieves feature extraction by guiding input data through its hierarchical layers. It progressively captures intricate patterns and details [24]. The extracted features from road images and spectrograms are later combined through concatenation, creating a unified representation. This comprehensive and integrated representation is valuable for enhancing subsequent stages of analysis.
After feature extraction and concatenation, feature reduction was applied using Principal Component Analysis (PCA), and the optimal number of components was determined using the elbow method. PCA transforms original features into orthogonal principal components, capturing maximum data variance. By projecting data onto a lower-dimensional subspace, PCA effectively reduces dimensionality while retaining crucial variance [25]. The elbow method identifies the "elbow point", where additional components cease to significantly increase explained variance [26].
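The concatenation, PCA, and elbow steps can be sketched as follows. This is a minimal NumPy illustration under explicit assumptions: the feature matrices are synthetic stand-ins for VGG19 outputs (their sizes are invented), PCA is computed through an SVD of the centred data, and the elbow is located with a simple "farthest from the chord" heuristic on the cumulative explained-variance curve. The source does not state which elbow criterion was used.

```python
import numpy as np

def pca_fit(features):
    """Return (mean, components, explained_variance_ratio) via SVD of centred data."""
    mean = features.mean(axis=0)
    centered = features - mean
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    return mean, components, variance / variance.sum()

def elbow_components(ratios):
    """Heuristic elbow: the point on the normalised cumulative-variance curve
    that lies farthest above the straight line joining its two end points."""
    cumulative = np.cumsum(ratios)
    x = np.linspace(0.0, 1.0, len(cumulative))
    y = (cumulative - cumulative[0]) / (cumulative[-1] - cumulative[0])
    return int(np.argmax(y - x)) + 1  # +1 converts an index into a component count

# Synthetic stand-ins for per-sample VGG19 features (hypothetical sizes):
# both modalities are driven by the same three latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 3))
image_features = latent @ rng.normal(size=(3, 32)) + 0.01 * rng.normal(size=(60, 32))
audio_features = latent @ rng.normal(size=(3, 32)) + 0.01 * rng.normal(size=(60, 32))

# Feature-level fusion: concatenate the two feature vectors of each sample.
fused = np.concatenate([image_features, audio_features], axis=1)

mean, components, ratios = pca_fit(fused)
k = elbow_components(ratios)
reduced = (fused - mean) @ components[:k].T  # project onto the first k components
print(fused.shape, k, reduced.shape)
```

The reduced matrix, one row per sample, is what a downstream classifier would be trained on in an early-fusion setup of this kind.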
After feature extraction, concatenation, and reduction, machine learning algorithms were trained on this feature set, specifically Random Forest, Multi-layer Perceptron (MLP), and XGBoost classifiers. Finally, the classification results were obtained. Fig. 4 illustrates how feature-level fusion works in this study, using gravel road images and audio spectrograms as inputs to produce a classification decision as the output. It provides a visual guide through the entire process.
These classifiers are widely recognised for their efficacy across diverse domains [19,27,28,29]. The Random Forest classifier operates as an ensemble learning approach, uniting numerous decision trees to generate precise and resilient predictions. This involves training individual trees on distinct data subsets and amalgamating their outcomes for the final predictions. Conversely, the Multi-layer Perceptron (MLP) is an artificial neural network tailored for intricate pattern recognition tasks. Comprising multiple layers of interconnected nodes, it undertakes data processing and transformation, each layer contributing to the network's adeptness in capturing complex data relationships. The gradient boosting algorithm XGBoost incrementally constructs a sequence of weak learners, often decision trees. Each new learner addresses the errors of its predecessors, fostering potent predictive capabilities [30]. This iterative strategy empowers XGBoost to manage intricate datasets proficiently. Each of these classifiers boasts unique merits, and their selection hinges on the specific attributes of the given problem [31][32][33].
Fig. 4. Methodology used in this study for Feature-level Fusion.

Decision-level fusion
The second fusion method employed in this study is decision-level fusion, which is applied in two variations using OR and AND rules. Fig. 5 gives a broad view of the fusion methods employed in this study, both feature-level fusion and decision-level fusion, and highlights the key components of these approaches. Existing studies consistently demonstrate improved classification performance with decision fusion [34][35][36], emphasising the need for diverse techniques due to varied classifier outcomes [37].

Decision fusion with an OR rule
The technique commonly known as majority voting, logical disjunction, or voting with a logical OR, is widely used in various studies. This includes applications such as person recognition using imperfect face images alongside supporting gait images, as well as in spam detection through videos and images. The use of decision fusion in this approach consistently leads to better performance in classification results [4,38]. In majority voting, the final prediction is based on the majority decision of the individual models. If most models predict a positive outcome (Class 1), the fused prediction will be positive. Otherwise, if the majority predicts a negative outcome (Class 0), the fused prediction will be negative. Consider two binary classifiers, C1 and C2, where each classifier makes a binary decision (0 or 1). The final decision $D_{\text{final}}$ in the OR gate scenario is 1 if at least one of the classifier decisions, $D_{C1}$ or $D_{C2}$, is 1:
$$D_{\text{final}} = D_{C1} \lor D_{C2}$$
Here, ∨ represents the logical OR operation.

Decision fusion with an AND rule
This method is often called unanimous voting, voting with a logical AND, or logical conjunction. In unanimous voting, the final prediction is positive only if all individual models predict a positive outcome [39]. If any one model predicts a negative outcome, the fused prediction will be negative. This approach ensures that all models agree before making a positive prediction. For two binary classifiers C1 and C2 with decisions $D_{C1}$ and $D_{C2}$, the final decision is
$$D_{\text{final}} = D_{C1} \land D_{C2}$$
Here, ∧ represents the logical AND operation [4]. Both majority voting (OR rule) and unanimous voting (AND rule) are variations of voting-based ensemble methods commonly used to combine predictions from multiple models. The specific logical operations (OR and AND) determine how the predictions are aggregated to arrive at the final decision. These methods harness the collective knowledge of diverse models and enhance overall predictive performance [40].
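The two decision rules are easy to express in code. The sketch below applies an element-wise OR and AND to the binary outputs of two classifiers and compares the fused decisions with ground truth. The prediction vectors and labels are invented toy values, and the choice of which road class is encoded as 1 is an assumption; the source does not report the encoding.

```python
import numpy as np

def fuse_or(d_c1, d_c2):
    """OR rule: the fused decision is 1 if at least one classifier predicts 1."""
    return np.logical_or(d_c1, d_c2).astype(int)

def fuse_and(d_c1, d_c2):
    """AND rule: the fused decision is 1 only if both classifiers predict 1."""
    return np.logical_and(d_c1, d_c2).astype(int)

def accuracy(pred, truth):
    return float(np.mean(pred == truth))

# Toy decisions from a hypothetical image classifier and audio classifier.
image_pred = np.array([1, 0, 1, 0, 1, 0])
audio_pred = np.array([1, 1, 0, 0, 1, 0])
truth = np.array([1, 1, 1, 0, 0, 0])

or_pred = fuse_or(image_pred, audio_pred)    # [1 1 1 0 1 0]
and_pred = fuse_and(image_pred, audio_pred)  # [1 0 0 0 1 0]

print(accuracy(image_pred, truth), accuracy(audio_pred, truth))  # 4/6 each
print(accuracy(or_pred, truth), accuracy(and_pred, truth))       # 5/6 and 3/6
```

In this toy case each classifier misses a different positive sample, so the OR rule recovers both and improves accuracy, while the AND rule keeps only the samples on which the classifiers agree. Which rule helps in practice depends on how the errors of the two modalities overlap.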

Results and discussion
In this study, data extraction from video recordings encompassed both audio and image components. The audio segment specifically captured the auditory cues of gravel impacting the undersides of vehicles, serving as a significant source of information regarding road conditions. The dataset, categorised into binary Classes 1&2 and 3&4, aims to discern road conditions, especially in loose gravel scenarios, aligning with Trafikverket's standards, where Class 1 signifies well-maintained roads and Class 4 indicates poor conditions. Classes 1&2 are combined to denote good road conditions, while Classes 3&4 signify areas needing maintenance.
Audio and image data were fused to investigate the potential enhancement of the classifier's accuracy. Initially, early feature fusion was employed, involving the extraction of features from both audio spectrograms and images using the pre-trained convolutional neural network VGG19. These distinct features from both modalities were concatenated to create a unified feature space. Subsequently, Principal Component Analysis (PCA) was applied for dimensionality reduction.
The Random Forest classifier, Multi-layer Perceptron, and XGBoost classifier were then trained using 80 % of the dataset and validated on the remaining 20 %. The experimental outcomes, as presented in Table 2, highlight the Random Forest classifier's superior performance across metrics such as accuracy (0.9018), precision (0.9011), recall (0.9018), and F1-score (0.9014) compared to other models.
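For readers who want to reproduce this kind of table, the sketch below computes accuracy together with support-weighted precision, recall, and F1-score from a vector of predictions. Weighted averaging is an assumption made here: the source does not state the averaging scheme, although support-weighted recall always equals accuracy, which would be consistent with the identical accuracy and recall values reported for Random Forest. The labels in the example are toy values, not study data.

```python
import numpy as np

def weighted_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1), averaging per-class scores
    weighted by each class's share of the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    precision = recall = f1 = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = np.mean(y_true == c)  # class support as a fraction of all samples
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, float(precision), float(recall), float(f1)

# Toy labels: 0 and 1 stand in for the two road-condition groups.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)
print(acc, prec, rec, f1)
```

Reporting the averaging scheme alongside the scores makes results from different classifiers and fusion strategies directly comparable.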
A late fusion methodology, also known as decision-level fusion, as discussed previously in the methodology section, was explored to improve the results further. Two decision-level gates, namely AND and OR gates, were tested. Individual classifiers based on DenseNet121 were trained separately on each modality, i.e., images and audio. Subsequently, these classifiers were tested on the designated test dataset, resulting in individual accuracies of 0.95 for images and 0.92 for audio, respectively, as seen in Table 3. The fusion of their decisions was achieved through AND and OR gates. Notably, the OR gate demonstrated superior performance with an accuracy of 0.97. This accuracy was derived by comparing the test results with the ground truth labels. The superiority of late fusion results is evident, and late fusion methods also demonstrate increased adaptability to diverse input conditions. These methods excel in scenarios where one modality may be afflicted by noise or incompleteness, as the other modality can effectively compensate for these limitations. Each modality typically possesses unique strengths and weaknesses; for instance, images might excel in capturing visual details, while audio can contribute additional contextual information. Late fusion, as an approach, enables the fusion of data from both modalities (images and audio), thereby augmenting the overall system's robustness through the utilisation of complementary information from distinct sources.

Conclusion
This study introduces a novel methodology that employs both audio and image data to detect loose gravel conditions on gravel roads. Audio clips from Road Classes 1&2 and 3&4, capturing varying degrees of gravel hitting the bottom of the car, were labelled into these two groups. The labelling process entailed a thorough examination of videos, extracting relevant segments, and categorising roads based on predefined classes. These classes adhere to the labelling system of Trafikverket, ranging from Class 1 to 4, where Class 1 represents a good road condition, and Class 4 indicates the worst condition. However, due to limitations in data volume, we combined Classes 1 and 2 into one category and Classes 3 and 4 into another. Subsequently, the audio segments were transformed into spectrograms. Using the Roboflow annotation tool, roads from images were isolated to ensure that the classifier learned features relevant to road conditions, while disregarding irrelevant elements such as vegetation and sky. The fusion technique involved combining decisions from two classifiers trained on gravel images and corresponding audio segments to enhance road classification. Both feature-level early fusion and decision-level late fusion techniques were evaluated, the latter incorporating OR and AND gates.
The decision-level approach using the OR gate exhibited superior accuracy in the classification process. The collected data from Swedish roads can be utilised to assess gravel road conditions in Sweden and similar terrains. Applications developed through this method can be deployed on cost-effective devices, such as smartphones for capturing data from gravel roads, and the classification results can be mapped on real maps, displaying the road profile. This could assist drivers in planning their trips and gaining knowledge of road conditions in advance. These applications empower road assessment agencies to conduct timely and unbiased evaluations of gravel conditions, particularly concerning loose gravel. With additional data, the study's scope could be broadened to classify gravel roads into four classes. The methodology is adaptable to other gravel road defects, contributing to a comprehensive system that provides insights into road status, including defects such as potholes, dust, and corrugations. The study advocates for data-driven decision-making by road maintenance agencies, streamlining prioritisation and resource allocation based on identified road conditions. The utilisation of audio and image data, particularly from smartphones, allows for remote monitoring, reducing the need for physical inspections and enhancing efficiency. Adapting maintenance strategies to the identified road conditions has the potential to minimise the unnecessary environmental impact associated with extensive road repair activities. The study's reliance on easily accessible devices, such as smartphones, creates opportunities for community engagement in data collection, involving residents as contributors to road maintenance efforts. The study's findings and methodology could serve as a catalyst for further research in the integration of audio and image data for assessing infrastructure, fostering continuous advancements in road maintenance technology.

Fig. 2. Image (a) depicts gravel road detection, while image (b) exclusively displays the extracted roads from the image, omitting vegetation and sky.

Fig. 5. illustrates the application of two fusion techniques in this study: (a) feature-based early fusion and (b) decision-level fusion.

Table 1
Dataset summary: images and audio class distribution.

Table 2
Classification results of various machine learning algorithms using the early feature fusion method.

Table 3
Classification results by decision-level fusion method.