Scenario-based collision detection using machine learning for highly automated driving systems

Highly Automated Driving (HAD) systems implement new features to improve the performance, safety and comfort of partially or fully automated vehicles. The identification of safety parameters with respect to complex systems and the driving environment is a fundamental aspect that requires great attention. Therefore, much research has been conducted in the field of collision detection in the development of automated vehicles. However, the development of HAD systems faces the challenge of ensuring zero accidents. For this reason, collision detection as hazard identification in the safety-related concept phase is one of the key research points for HAD systems. In this paper, a systematic approach to detect potential collisions for scenario-based hazard analysis of HAD systems is presented, using a Multilayer Perceptron (MLP) as the Machine Learning (ML) technique. Moreover, the proposed approach assists in reducing the number of observed scenarios for hazard analysis and risk assessment. Additionally, two simulation-based scenario datasets are examined in the ML model to identify potential hazard scenarios. The results of this study show that an MLP can support collision detection in the safety-related concept phase. Furthermore, this paper contributes arguments and evidence for ML techniques in HAD systems safety by selecting relevant use cases.


Introduction
Highly Automated Driving (HAD) systems are focusing on approaching a zero-accident rate. The key aspect of road vehicle safety is the HAD system's capability for decision making, prediction and perception, and for acting or reacting according to the environment or situation. Recent technologies applied in HAD systems have had great success in implementing new features for convenience and human safety, but traffic crashes show that the technologies are not yet mature (Devies, 2016; Levin & Carrie, 2018; Stewart, 2018). According to data from the National Highway Traffic Safety Administration (NHTSA), around 33,654 fatalities are caused by traffic crashes in the United States (Li, 2020). To fulfil the desire for accident-free driving, HAD systems must be able to anticipate critical events or situations and act accordingly. Therefore, new technologies such as Machine Learning (ML) methods are gaining attention in the development of HAD systems and are being used to improve vehicle features and functions. Recent research on HAD systems and/or automated vehicles includes ML for detection (Almutairi & Muneer, 2022), prediction (Theissler et al., 2021) and avoidance functions (Strömgren, 2018), including verification and validation (Borg et al., 2018; Elrofa et al., 2018). However, a systematic approach from the safety-related concept phase to the development phase for HAD systems is not described in detail. The safety-related phases (concept, development) define different stages of the safety life cycle specified in Functional Safety (FuSa) (ISO26262, 2018). Furthermore, the ML-related adjacent points in terms of safety argumentation are also not clearly demonstrated. ML techniques allow the expansion of the areas of investigation and the improvement of system performance. The development and improvement of complex HAD systems such as driver assistance systems, safety services and automatic accident notification systems can be supported by ML techniques. In HAD systems, however, prediction, decision and perception are particularly difficult because of the extensive operational design domain. Consequently, scenario-based analysis is broadly accepted and implemented for HAD systems. It is important to keep in mind that the completeness of the scenario data is still an open research question. Parameter-based scenario testing is required to discover the unknown critical areas, which also leads to the exploitation of scenarios at different levels (Khatun et al., 2021b). In safety-critical events, any error in the systems can cause harm to humans, so the safety of such systems must be ensured.
CONTACT Marzana Khatun marzana.khatun@hs-kempten.de
Although ML is a potentially auspicious method for HAD systems, it is controversial in terms of safety assurance. Hence, more research on scenario-based hazard analysis is needed to integrate ML techniques into hazard situation identification and risk assessment. Based on this motivation, this paper focuses on demonstrating a systematic approach for an HAD system function, namely collision detection, in a use case. Moreover, the proposed Multilayer Perceptron (MLP) is evaluated with various datasets to understand the ML model and support safety argumentation. In recent years, several categories of deep learning methods have been applied for collision detection, such as Artificial Neural Network (ANN), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) (Almutairi & Muneer, 2022; Strömgren, 2018). Nevertheless, a suitable deep learning method needs to be chosen depending on the type of data (image, tabular, text, video, audio). Muzammel et al. have proposed a CNN for blind spot collision detection using camera image data (Muzammel et al., 2022). Sharkawy has presented robot collision detection, including an RNN for position detection using time series (Sharkawy, 2022). Besides, deep neural networks are applied in vessel traffic service (Kim & Lee, 2018), collision avoidance in aircraft (Julian et al., 2018) and other areas (Strömgren, 2018; Wang & Siau, 2019). Nonetheless, deep neural networks have not been applied in Hazard Analysis and Risk Assessment (HARA) for scenario-based hazard situation identification in HAD systems.
In this study, a set of vehicle safety-related parameters (e.g. vehicle speed, distance between two vehicles, lane change duration) is applied to train the ML model that supports the identification of hazard situations. Identifying the hazard situation is one of the vital parts of scenario-based HARA. It can be argued that images contain more data about the surroundings, which could improve the identification of the hazardous situation. However, scenario-based HARA focuses on HAD systems that implement a vehicle-level function, and it is performed at the very beginning of the safety-related concept phase. Therefore, datasets with vehicle safety-related parameters are used in this study. Hence, an MLP is applied in this experimentation, as it is more relevant than a CNN or RNN given the training datasets for collision detection.
The novelty of the ML model in this paper lies in presenting a framework for integrating ML into scenario-based HARA for HAD systems at the concept phase, in order to identify hazardous situations and support scenario reduction for risk assessment. The application of MLP supports the expansion of the operational domain of scenario-based HARA, considering the parameter-based datasets studied in this work. This study helps to recognize that the quality of the datasets has a significant impact not only on hazard identification but also on the ML model outcomes.
The challenge in HAD systems hazard analysis, especially when using ML techniques, is to ensure that all potential hazardous situations have been taken into account in HARA. Another challenge is the size of the datasets and the exponentially increasing number of scenarios in HARA that need to be considered to train the ML model. Therefore, the presented ML model examines not only an input dataset based on publicly available data, defined as knowledge-based scenarios, but also a simulation-based dataset, indicated as data-driven scenarios, to realize the changes in the ML model's outcomes. The contribution of this paper focuses on HARA of HAD systems that incorporate ML techniques and is listed as follows:
• Proposing a systematic approach of scenario-based hazard analysis for HAD systems that includes an ML technique for collision detection
The outline of this paper is as follows. Section 2 provides a background of the scenario-based analysis and testing approaches together with machine learning. Next, Section 3 addresses the proposed ML model. Later, the experimental setup for the ML model, including the model parameter assumptions, is explained and the obtained results are presented and discussed in Section 4. Finally, Section 5 summarizes the conclusion with future work and constraints.

Scenario definitions and classifications
Scenario-based approaches in automated driving systems have attracted attention and are applied from the safety-related concept phase, as scenario-based HARA, through to scenario-based testing. A scenario describes the influencing factors that the driver or driving systems need to take into account while driving. To investigate the functions of partially and fully automated vehicles, the Operational Design Domain (ODD) is one of the relevant aspects within which the ego-vehicle should drive safely (Zhang et al., 2021). According to Society of Automotive Engineers (SAE) J3016, the ODD is defined as the operating conditions under which a particular automated driving system or a function thereof is intended to operate, including, but not limited to, environmental, geographic, and time-of-day constraints and/or the required presence or absence of certain traffic or roadway features (SAEJ3016, 2021). According to ISO 21448, a scenario is a description of the development over time between several scenes in a sequence of scenes. Note: Each scenario starts with an initial scene. Actions and events, as well as goals and values, may be specified to characterize this temporal development within a scenario. In contrast to a scene, a scenario extends over a certain period of time (ISO21448, 2022).
For simplicity, Menzel divides scenarios into three categories: functional scenario, logical scenario and concrete scenario (Menzel et al., 2018). Functional scenarios are described as use cases with linguistic notations, logical scenarios represent a state space description with parameter ranges, and concrete scenarios assign concrete values to each specific parameter. Menzel defines them as follows: functional scenarios are described in a linguistic way so that experts can talk about scenarios in the beginning of the development process. Logical scenarios specify parameters for the scenarios and define parameter ranges. Concrete scenarios specify a concrete value for each parameter and are thus the basis for reproducible test cases (Menzel et al., 2019). Road type, weather conditions, traffic situations and road vehicle parameters are among the many aspects that are considered for describing scenarios for automated vehicles, which are classified as Level 3 (conditional driving automation), Level 4 (high driving automation) and Level 5 (full driving automation) according to the SAE (SAEJ3016, 2021). The systems used in vehicles of Level 3 or higher are considered HAD systems in this paper.
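The relationship between the three scenario levels can be illustrated with a minimal Python sketch. The class name `LogicalScenario`, the helper `concretize`, the parameter names and all numeric ranges are hypothetical and chosen only for illustration; they are not taken from the paper or from the cited projects.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class LogicalScenario:
    """State-space description: every parameter has a (lower, upper) range."""
    name: str
    ranges: dict  # parameter name -> (lower bound, upper bound)


def concretize(logical, steps):
    """Derive concrete scenarios by sampling each range at `steps` evenly spaced values."""
    names = list(logical.ranges)
    grids = []
    for name in names:
        lo, hi = logical.ranges[name]
        grids.append([lo + (hi - lo) * i / (steps - 1) for i in range(steps)])
    # Every combination of grid values is one concrete scenario.
    return [dict(zip(names, combo)) for combo in product(*grids)]


# Functional scenario: a purely linguistic description.
functional = "Ego-vehicle changes from the right lane to the left lane on a highway."

# Logical scenario: illustrative ranges only (assumed values).
logical = LogicalScenario(
    name="lane_change",
    ranges={"ego_speed_kmh": (80.0, 130.0), "gap_m": (10.0, 60.0)},
)

# Concrete scenarios: one fixed value per parameter.
concrete = concretize(logical, steps=3)
print(len(concrete))  # 9 (a 3 x 3 grid)
print(concrete[0])
```

The number of concrete scenarios grows as the product of the grid sizes, which is one way to see why the number of scenarios increases so quickly with each added parameter.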

Scenario-based testing
Scenario-based testing is a novel technique that has been investigated in recent research projects such as ENABLE-S3 (Valls et al., 2020), PEGASUS (Winner et al., 2019) and VVM (Krebs-Radic & Körtke, 2022). Scenario-based testing in automated driving systems is an efficient approach not only to identify critical situations, but also to support safety assessment and safety argumentation, including evidence, for automated driving systems. According to Neurohr, scenario-based testing involves deriving relevant test cases from a manageable set of scenario classes (Neurohr et al., 2020). In addition, Neurohr has also mentioned that automotive testing is defined as the process of planning, preparing, and operating or exercising an item or element to verify that it meets specified requirements, to detect anomalies and to provide confidence in its behaviour (Neurohr et al., 2020).
Scenario-based test cases are used for the assessment (de Gelder et al., 2022). Thus, a scenario database with relevant concrete scenarios is created. Both the logical scenario and the concrete scenario are quantitative descriptions of action(s)/event(s) with relevant properties or parameters in road traffic. Scenarios related to vehicle functions and considering critical situations are the focus of scenario-based testing (Erz et al., 2022). Erz et al. have introduced an ontology that brings together ODD, scenario-based testing and the architecture of HAD systems with their relationships (Erz et al., 2022). Scenarios can be described in a modular approach using tools such as SceML, CarMaker, OpenScenario and VDT. In this work, CarMaker is used for modelling and simulating the test cases.

Machine learning
The application of machine learning is playing an increasing role in advanced driving systems in automobiles (Salay et al., 2017; Spanfelner et al., 2012). ML techniques are widely used in various sectors such as finance (to analyse data, predict market prices, find solutions based on situations, solve complex problems and understand business values), medicine and robotics (Ahmed et al., 2022; Ahsan & Siddique, 2022; Eder et al., 2022; Nikitin et al., 2022; Sen et al., 2021; Vayena & Blasimme, 2022; Xiao et al., 2022). In addition, ML has numerous applications in business, education, defense, self-driving or autonomous vehicles, cybersecurity and many other fields (Bhavsar et al., 2017; Peres et al., 2019; Theissler et al., 2021; Wang & Siau, 2019; Yao & Feng, 2018). Theissler et al. focus on the ML subfields relevant to predictive maintenance of vehicles, categorize the tasks involved and pinpoint the unavailability of public real-world datasets as a challenge (Theissler et al., 2021). Machine learning relies on models and algorithms chosen with respect to the type of data and features. The area of ML model development is fragile and feature specific. Different learning algorithms such as CNN and RNN/LSTM are compared for object detection-based data (Knupp, 2017). Additionally, Choi et al. have presented car crash detection based on camera sensor data (Choi et al., 2021). However, the use of deep neural networks for HARA at the concept phase of HAD systems has not been a focus. In general, RNNs are frequently used for processing time series data (e.g. sensor data) and sequential data (e.g. sound) (Heo et al., 2019). Furthermore, CNN is widely used in collision detection for human-robot interaction and other sensor-related functions (Adewopo et al., 2022; Anvaripour & Saif, 2019; Garcia et al., 2002). Huang et al. demonstrated the use of deep learning techniques in highway crash detection focusing on roadside radar and other sensors (Huang et al., 2020). However, for scenario-based HARA, only top-level vehicle functions are required, regardless of the type of sensors. Hence, a set of logical scenarios is investigated using MLP to demonstrate the proposed approach for hazard analysis in this study.
Machine learning in HAD systems can be applied from HARA to testing to support the verification and validation of these types of vehicles (Borg et al., 2018; Koduri et al., 2018; Koopman & Sholingar, 2016; Salay et al., 2018). Major automotive manufacturers have published a white paper on automated driving, with an appendix dedicated to the use of neural networks in safety-critical scenarios, using object recognition as an example but without covering extended areas in detail (Cima et al., 2020; ISO/AWI/TS5083, n.d.; Wood et al., 2019). Hence, further research and investigation are needed to support ML in automated driving systems by providing evidence of the capability and relevancy of ML techniques. Therefore, this work focuses on the analysis of hazard identification based on lane change collision detection on the highway as a part of HARA of HAD systems. Additionally, it aids in reducing the number of scenarios for risk assessment based on the aspects of severity, exposure and controllability.

Concept
Machine learning is applied for realizing and solving practical problems reliably and efficiently. Scenario-based analysis has indicated an explosion of test scenarios for verification and validation of HAD systems. However, scenario modelling and simulation investigation are time-consuming and labour-intensive. Consequently, ML techniques can help to overcome this drawback of scenario-based analysis, namely the explosion of hazardous scenarios and SOTIF-related scenarios (ISO21448, 2022). The conceptual approach is presented in Figure 1, which shows systematically how an ML model can be integrated into HARA for HAD systems.
The results presented in Section 4 give an indication of how ML can be used in HARA. The proposed concept shows the incorporation of ML in scenario-based HARA to identify potential hazard situations. To demonstrate the systematic approach, the MLP model for scenario-based HARA development was investigated.

Description of the use case
The use case typically describes the functional scope, the desired behavior and the functional system boundaries, including scene and scenario. A use case can consider single, double or multiple functional scenarios. Use cases are therefore scenarios described at a more abstract level (ISO21448, 2022). The recent PEGASUS project (Winner et al., 2019) was considered as the basis to identify use cases. In this paper, lane changing on the highway is used as a use case for HAD systems. The use case is defined such that the ego-vehicle changes from the right to the left lane when there is another road user (vehicle) in the left lane. The ego-vehicle must take into account the corresponding parameters of the other vehicle when performing the lane change.
To express the use case, the scenario construction process of ISO/PAS 21448 is followed (ISO21448, 2022). The use case is described as: climate = sunny, time of day = daytime, shape of road = straight highway, road conditions = dry, ego-vehicle operation = vehicle is performing lane change, other vehicles = oncoming on left side and one in front of ego-vehicle, pedestrian = none, objects off-roadway = none.
A set of logical scenarios is modelled in CarMaker and simulated to detect the collision between the ego-vehicle (V2) and other vehicles (V1 and V3). According to the use case, the scenarios as a temporal sequence of actions/events are:

Scenario datasets
Knowledge-based scenario generation has already been used in HAD systems, as studied in the PEGASUS project (Winner et al., 2019). Menzel mentioned ontologies as a knowledge-based analysis of the functional scenario generation process using the 5-layer model (Hülsen et al., 2011; Menzel, 2020). According to Ponn, ontologies are a formal representation of knowledge and its relationships that originate in the semantic web, and knowledge-based scenarios are defined by functional scenarios and logical scenarios (Ponn et al., 2019). Additionally, Bagschik proposed a knowledge-based approach for generating scenarios, as expert knowledge can support the development of engineering techniques and safety analysis (Bagschik et al., 2018).
Based on these premises, this paper considers the functional scenario related to the defined use case and a set of logical scenarios from the research projects PEGASUS and ENABLE-S3 as the knowledge-based scenario dataset (dataset 1) (Valls et al., 2020; Winner et al., 2019). In addition, a brief review process has been performed for the scenario-based HARA to define and collect a list of hazardous situations, extended by FuSa and SOTIF aspects. As an extension of the logical scenario for dataset 1, reduced parameter boundaries are estimated by applying Monte Carlo simulation as described in Khatun et al. (2021a). Dataset 1 has been created based on the parameter boundaries obtained from this Monte Carlo simulation. Dataset 1 typically represents source data and describes the scenarios with related hazard analysis information. Consequently, data-driven scenarios focused on simulation-based scenarios are addressed in this study as dataset 2. Data-driven scenarios are based on logical scenarios. The discretization steps of the assumed safety-related parameters are optimized at the logical scenario level by an observation-oriented (testing and/or study based) approach, as described in Khatun et al. (2021b). Furthermore, the sensitivity analysis of the parameters is inspected for logical scenarios as explained in Khatun et al. (2022) and considered in the preparation of the data-driven scenarios as dataset 2.
Emilio claimed that technologies in autonomous systems are driven toward model-based and data-driven methods, and demonstrated a knowledge-based process for improving situational awareness (Miguelañez et al., 2011). For logical scenarios and concrete scenarios, the data-driven method can be used due to the parameter spacing and complexity of the application scenario. For verification and validation of HAD vehicle systems, the use of innovative and systematic data-driven methods is encouraged (Zofka et al., 2015). Data-driven approaches are used to model manoeuvre trajectories for safety validation of highly automated vehicles (Krajewski et al., 2018). The systematic approach applied in this paper for creating test case scenarios is as follows:
(1) Knowledge-based scenario data collection: a publicly available functional scenario and the relevant logical scenario are considered to model a realistic use case.
(2) Probability density of the safety parameters: since logical scenarios consider a wide range of parameter boundaries, the Monte Carlo technique is used to reduce the parameter range. To further investigate and verify the Monte Carlo simulation, the logical scenario is modelled and simulated in CarMaker. This approach helps to provide safety proofs and build confidence in the results.
(3) Variation of the parameters: to generate the most efficient test cases that can be used in a driving simulator, the variation of the parameters has been optimized. During the parameter variation, fine parameter spaces are applied where they appear to be relevant for collision detection.
Furthermore, the scenario reduction and optimization aspects are considered as by-products of the proposed approach, as shown in references (Khatun et al., 2021a, 2021b, 2022). Replacing and/or adding parameters significantly increases the exploration of the scenario and parameter space. Since the logical scenarios are parameterised, the boundary of the parameters can be estimated considering the collision detection criteria. The test samples generated in the pre-testing (preliminary test) consider a larger parameter space than the refined logical scenario (Khatun et al., 2022). For the simulation-based generation of scenario data, functional scenarios are first collected and analysed for the relevant HAD safety feature. Next, initial parameters are created based on knowledge and publicly available data, and lastly, the scenarios are modelled in tools such as CarMaker. The parameter sets are shown in Figure 3 (left side). Moreover, a set of safety-related parameters is considered and varied in a given range based on expert judgments and available resources, and Monte Carlo simulation is run as pre-testing of the logical scenarios until the parameter range is estimated. Monte Carlo simulation is performed to determine the reduced range of the parameter boundary, and the logical scenario is refined. Then, the parameters in each logical scenario are discretized and a set of concrete scenarios is created, as shown graphically in Figure 3 (right side). To determine efficient input datasets, the limiting ranges of the parameters are estimated by Monte Carlo. A parameter-based analysis is then performed on the reduced boundary range and input datasets are generated (Khatun et al., 2021a, 2021b, 2022).
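The idea of reducing parameter boundaries by Monte Carlo sampling can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure of the cited works: the function `toy_collision` is a hypothetical collision criterion (in the study, collisions are determined by simulation in CarMaker), and the parameter names, the wide ranges and the assumed 80 km/h speed of the other vehicle are invented for the example.

```python
import random


def toy_collision(ego_speed_kmh, gap_m, lane_change_s):
    """Hypothetical collision criterion (NOT from the paper): the ego-vehicle
    closes the gap to a slower vehicle, assumed to drive at 80 km/h,
    before the lane change is finished."""
    closing_speed_ms = max(ego_speed_kmh - 80.0, 0.0) / 3.6
    return closing_speed_ms * lane_change_s >= gap_m


def monte_carlo_bounds(ranges, n_samples, seed=0):
    """Sample uniformly inside `ranges` and return, per parameter, the
    (min, max) of the values that led to a collision."""
    rng = random.Random(seed)
    hits = []
    for _ in range(n_samples):
        sample = {k: rng.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
        if toy_collision(**sample):
            hits.append(sample)
    if not hits:
        return {}
    return {k: (min(h[k] for h in hits), max(h[k] for h in hits)) for k in ranges}


# Wide initial ranges of a logical scenario (assumed values).
wide = {
    "ego_speed_kmh": (60.0, 180.0),
    "gap_m": (5.0, 100.0),
    "lane_change_s": (2.0, 8.0),
}

reduced = monte_carlo_bounds(wide, n_samples=5000)
for name, (lo, hi) in reduced.items():
    print(f"{name}: [{lo:.1f}, {hi:.1f}]")
```

In this toy setting the lower bound of `ego_speed_kmh` moves above 80 km/h, because slower ego-vehicles never close the gap; the reduced ranges could then be discretized into concrete scenarios.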
An input dataset is randomly shuffled; sub-sampling by k-fold is then used to generate the training dataset (D_train), validation dataset (D_validation) and test dataset (D_test). Therefore, D_train, D_validation and D_test are subsets of the input dataset. Note that the ML model is evaluated with D_test, which is never used during training or validation and is not a part of D_train or D_validation. So, the ML model is trained on D_train, validated on D_validation and finally tested on D_test. In k-fold cross-validation, the available input dataset is partitioned into k disjoint subsets of approximately equal size. The model is trained using k − 1 of these subsets. Then, the model is applied to the remaining subset, which is denoted as D_test, and the performance is measured. This procedure is repeated until each of the k subsets has served as D_test. The average of the k performance measurements on the k D_test subsets is the cross-validated performance (Berrar, 2019). The ML model is estimated with several subsets using the k-fold method. Each subset consists of D_train, D_validation and D_test. Finally, the model applies the same function to D_test only for model evaluation. In k-fold sub-sampling, the single hold-out method is repeated k times. Hence, k sets of D_train, D_validation and D_test are generated.
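The splitting procedure can be sketched in plain Python as below. The function name `k_fold_splits` and the choice to carve the validation part out of the remaining k − 1 folds with a fixed `val_fraction` are assumptions made for illustration; the paper does not state how the validation subset is drawn within each fold.

```python
import random


def k_fold_splits(samples, k, val_fraction=0.2, seed=0):
    """Shuffle `samples`, partition them into k disjoint folds and yield one
    (d_train, d_validation, d_test) triple per fold."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k disjoint subsets of near-equal size
    for i in range(k):
        d_test = folds[i]  # held out, never seen in training or validation
        rest = [s for j, fold in enumerate(folds) if j != i for s in fold]
        n_val = int(len(rest) * val_fraction)  # assumed split of the remaining data
        d_validation, d_train = rest[:n_val], rest[n_val:]
        yield d_train, d_validation, d_test


splits = list(k_fold_splits(range(100), k=10))
d_train, d_validation, d_test = splits[0]
print(len(d_train), len(d_validation), len(d_test))  # 72 18 10
```

With k = 10 and 100 samples, each fold holds out 10 samples for testing, and every sample serves as test data exactly once across the ten folds.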

Machine learning model
In the proposed ML model, a set of input data is given, but not the patterns and relationships between them, because descriptive accident scenarios are considered for the functional scenario as described in Sections 2.1 and 3.2. Therefore, an MLP is more apt for application in HARA than other deep neural networks such as CNN and RNN. Several MLPs with three hidden layers have been examined during the training of the ML model (see Figure 4). A neural network built with the keras API on the TensorFlow backend has been used for the ML model. The ML model is trained for collision detection based on the input dataset. Each dataset is split into three parts: training data (approx. 70%), validation data (approx. 20%) and test data (approx. 10%). Two types of input datasets, as described in Section 3.3, have been used for model training to optimize and verify the ML model performance in terms of accuracy and loss. One input dataset uses Monte Carlo as the optimization technique, and the other uses a combination of Monte Carlo and parameter-based analysis. A deep neural network consists of hidden layers of units, also known as dense layers, between the input and output layers. In a neural network, the layer normalization can be expressed as (Ba et al., 2016):

$$a_i^{l} = {\omega_i^{l}}^{\top} h^{l}, \qquad h_i^{l+1} = f\left(a_i^{l} + b_i^{l}\right)$$

where $i$ denotes the $i$th hidden layer unit and $a^{l}$ denotes the vector of summed inputs to the neurons in the $l$th hidden layer of the neural network. $\omega_i^{l}$ is the incoming weights to the $i$th hidden layer unit, $h^{l}$ is the bottom-up inputs and $b_i^{l}$ is the scalar bias parameter. $f$ is an element-wise non-linear function applied to (weights × inputs + bias), that is, to the summed neuron input and the bias parameter of each hidden layer unit. The overall layer normalization statistics for the hidden units of a layer are as follows (Ba et al., 2016):

$$\mu^{l} = \frac{1}{H}\sum_{i=1}^{H} a_i^{l}, \qquad \sigma^{l} = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i^{l} - \mu^{l}\right)^{2}}$$

where $H$ denotes the number of hidden units in the layer. Here, three hidden layers are used for the ML model. $\mu$ and $\sigma$ are the normalization terms, which are the same for all hidden units in a layer. The proposed ML model uses a sequential base model for training. The sequential-based model allows a stack of nodes (neurons) to be built in series (Ba et al., 2016). The activation of a neuron is described by the model function $h_\theta(x)$, where $x$ is the input value and $\theta$ is the weight, with the bias represented by $\theta_0$ and $x_0 = 1$. However, to verify the capability of the model, several dense-layer configurations with a fixed number of hidden layers are investigated during the experiments. For example, one of the dense-layer configurations with which the proposed ML model has been trained is the following: one input layer (6 units), one output layer (1 unit) and three hidden layers (8 units, 16 units and 8 units), as shown in Figure 4 for two model experiments.
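To make the layer computation and the normalization statistics concrete, the following pure-Python sketch runs one forward pass through a 6-8-16-8-1 network. It is a minimal illustration, not the authors' keras implementation: the ReLU and sigmoid activations, the random weight initialization, the unit gain and the function names are all assumptions.

```python
import math
import random


def layer_norm(a, eps=1e-5):
    """Normalize the summed inputs `a` of one layer with that layer's mean and
    standard deviation, which are shared by all units in the layer."""
    mu = sum(a) / len(a)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / len(a))
    return [(x - mu) / (sigma + eps) for x in a]


def relu(x):
    return max(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(h, weights, biases, activation):
    """One dense layer: a_i = w_i . h, then f(normalized a_i + b_i)."""
    a = [sum(w * x for w, x in zip(row, h)) for row in weights]
    if len(a) > 1:  # a single output unit has no spread to normalize
        a = layer_norm(a)
    return [activation(x + b) for x, b in zip(a, biases)]


def init_mlp(sizes, seed=0):
    """Random weights in [-1, 1] and zero biases (assumed initialization)."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)], [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def forward(params, x):
    h = x
    for idx, (w, b) in enumerate(params):
        is_output = idx == len(params) - 1
        h = dense(h, w, b, sigmoid if is_output else relu)
    return h[0]


# 6 inputs, hidden layers of 8, 16 and 8 units, 1 output unit.
params = init_mlp([6, 8, 16, 8, 1])
score = forward(params, [0.5, -0.2, 0.1, 0.9, -0.7, 0.3])
print(f"collision score: {score:.3f}")
```

The output lies between 0 and 1 and can be read as a collision score; with untrained random weights it carries no meaning beyond demonstrating the data flow.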
The performance (detection) of the model is estimated by a loss function. The loss function is defined as the Mean Squared Error (MSE), and the target is to reduce its value to an acceptable range; the loss is fed back to adjust the weights of the model. The MSE is expressed as (Gupta, 2021):

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^{2}$$

where $N$ is the number of samples, $Y_i$ is the actual data value and $\hat{Y}_i$ is the predicted data value. Adaptive Moment Estimation (Adam) is applied as the optimizer of the ML model (Kingma & Ba, 2017).
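As a small worked example of the MSE loss, the sketch below computes it for four samples. The labels and predictions are made-up numbers, not results from the study.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


labels = [1, 0, 0, 1]               # 1 = collision, 0 = no collision (illustrative)
predictions = [0.9, 0.2, 0.1, 0.6]  # model outputs (illustrative)

# (0.01 + 0.04 + 0.01 + 0.16) / 4, i.e. approximately 0.055
print(mean_squared_error(labels, predictions))
```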
For optimal results, the accuracy of the model has been investigated by means of the k-fold cross-validation procedure. The k-fold cross-validation used only training data and calculated the true performance of the models (Arlot & Celisse, 2010). A 5-fold and a 10-fold cross-validation have been applied to both input datasets (dataset 1 and dataset 2). Thus, two sets of D_test are investigated with these folds, and the performance of the ML model is evaluated as detailed in Table 2. The keras evaluation function is re-verified by comparing its results with an independent self-designed verification function. This step is done to increase confidence in the keras evaluation function and can be one step of a Tool Confidence Level (TCL) determination. TCL is described in further detail in ISO 26262-6 (ISO26262, 2018).
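The paper does not show its self-designed verification function, so the following is only a hypothetical sketch of what such an independent cross-check could look like. The names `independent_accuracy` and `cross_check`, the 0.5 decision threshold and the tolerance are assumptions; `framework_accuracy` stands for whatever accuracy value the framework's own evaluation reports.

```python
def independent_accuracy(y_true, y_score, threshold=0.5):
    """Recompute accuracy from raw model outputs without using the framework."""
    predicted = [1 if s >= threshold else 0 for s in y_score]
    correct = sum(int(p == t) for p, t in zip(predicted, y_true))
    return correct / len(y_true)


def cross_check(framework_accuracy, y_true, y_score, tol=1e-6):
    """Return (agrees, own_accuracy), where `agrees` tells whether the
    framework-reported accuracy matches the independent computation."""
    own = independent_accuracy(y_true, y_score)
    return abs(own - framework_accuracy) <= tol, own


y_true = [1, 0, 1, 1, 0]             # illustrative labels
y_score = [0.8, 0.3, 0.4, 0.9, 0.1]  # illustrative model outputs

agrees, own = cross_check(0.8, y_true, y_score)
print(agrees, own)  # True 0.8
```

A disagreement between the two values would point either to a mismatch in the thresholding convention or to a problem in the evaluation pipeline, which is the kind of evidence a tool confidence argument needs.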

Experimental setup and assumptions
To evaluate the ML model, the input parameters (input datasets) contain only variables that are collected from publicly available data and research projects. For the use case, six variables are used as scenario input, and the outcome indicates collision detection. To assess the proposed ML model's performance, k-fold cross-validation is applied to both input datasets, with the number of folds k = 5 and k = 10. For model training, 2000 epochs with a batch size of 64 are used. A list of the ML model parameters is presented in Table 1. Note that several experiments, together with knowledge from experts and the literature, were used to set the batch size, number of epochs, learning rate and optimizer.
The number of epochs used to train the model is determined by finding the maximum accuracy and lowest loss on the validation data (approximately 300 epochs, see Figure 6). The learning rate is optimized by observing the total sum squared error determined during the training phase. To identify the optimal learning rate, a set of simple trials has been performed in which the learning rate is varied from 0.00005 to 0.1. After each experiment, the learning rate is increased or decreased by a delta value to find the best-suited value. By observing the loss of the experiments, the under-learning (learning rate too low) and over-learning (learning rate too high) domains can be omitted. For this study, the learning rate is optimized as 0.001.
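The reported setup (six inputs, hidden layers of 8, 16 and 8 units, one output, MSE loss, Adam with a learning rate of 0.001, batch size 64, 2000 epochs) could be expressed with the keras API roughly as follows. This is a sketch under assumptions, not the authors' code: the ReLU and sigmoid activations and the helper names `build_mlp` and `train` are not stated in the paper, and TensorFlow is imported lazily so that the configuration can be inspected without it installed.

```python
HYPERPARAMS = {
    "input_dim": 6,             # six scenario input variables
    "hidden_units": (8, 16, 8), # three hidden layers
    "epochs": 2000,
    "batch_size": 64,
    "learning_rate": 0.001,
    "folds": (5, 10),
}


def build_mlp(hp=HYPERPARAMS):
    """Build a sequential MLP; the activations are assumptions, not from the paper."""
    from tensorflow import keras  # requires TensorFlow to be installed

    model = keras.Sequential()
    model.add(keras.Input(shape=(hp["input_dim"],)))
    for units in hp["hidden_units"]:
        model.add(keras.layers.Dense(units, activation="relu"))
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=hp["learning_rate"]),
        loss="mse",
        metrics=["accuracy"],
    )
    return model


def train(model, x_train, y_train, x_val, y_val, hp=HYPERPARAMS):
    """Fit on D_train and monitor D_validation; returns the keras History object."""
    return model.fit(
        x_train,
        y_train,
        validation_data=(x_val, y_val),
        epochs=hp["epochs"],
        batch_size=hp["batch_size"],
        verbose=0,
    )
```

In a k-fold run, `build_mlp` would be called once per fold so that every fold starts from freshly initialized weights.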

Results and discussion
The results focus on the performance of the ML model in terms of collision detection. The input datasets and outcomes are quantitative data. The structure of the ML model is based on three main pillars:
• First, the input datasets.
• Second, the ML model parameters selection or assumption.
• Third, the evaluation of the model outcomes including validation.
ML model parameters are described in Section 4.1. The ML model experiments have been coded in Python using the keras framework with the TensorFlow backend. To evaluate the ML model, the input dataset is split into three parts as described in Section 3.3. For each of the k folds, the ML model has been trained and validated (D_train and D_validation), and the model's learning performance is represented as accuracy (in %) and loss values. Figure 7 shows the accuracy and loss values for dataset 2 for k-fold 5 and k-fold 10 over the training epochs. The accuracy and loss scores of k-fold 5 for dataset 2 are 83.78% and 0.114. The accuracy and loss scores of k-fold 10 for dataset 2 are 89.189% and 0.112. Hence, the ML model performance is compared based on the k-fold validation. To test the actual ML model performance, the test dataset (D_test) is given to the ML model and the ML model outcomes are checked against the expected results. The process has been performed on each of the k-folds (5 and 10), as presented in Figure 7 for dataset 2.
To evaluate the ML model, the 'prediction' results from the model are compared with the k-fold cross-validation results. In addition to the k-fold cross-validation approach, a simple self-designed verification function was used to assess the ML model performance.
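The verification function itself is not specified in this paper. As an illustration only, one plausible form compares thresholded model outputs with the expected collision labels of the test scenarios. The function name `verify_predictions`, the 0.5 threshold, and the returned fields are assumptions and do not describe the authors' implementation.

```python
def verify_predictions(probabilities, expected, threshold=0.5):
    """Compare predicted collision probabilities with expected labels (1 = collision).

    Returns the accuracy together with the two error types that matter for
    hazard analysis: missed collisions and false alarms.
    """
    if len(probabilities) != len(expected):
        raise ValueError("probabilities and expected must have the same length")
    if not expected:
        raise ValueError("at least one scenario is required")

    predicted = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    missed = sum(1 for p, e in zip(predicted, expected) if e == 1 and p == 0)
    false_alarms = sum(1 for p, e in zip(predicted, expected) if e == 0 and p == 1)
    return {
        "accuracy": correct / len(expected),
        "missed_collisions": missed,
        "false_alarms": false_alarms,
    }


if __name__ == "__main__":
    # Made-up outputs and labels, used only to show the call.
    print(verify_predictions([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 1]))
```

Reporting missed collisions separately from false alarms is a design choice of this sketch: in a hazard analysis, a scenario wrongly classified as collision-free is usually the more critical error.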
A systematic approach to collecting, generating and optimizing the input data aids in understanding the behavior of the proposed model, as two different types of data are used for training. The reason for investigating two types of input datasets is to understand and evaluate the outcomes of the ML model. To improve the model's collision detection, consistent and realistic data must be provided. The collision detection results show that the input dataset optimized through parameter-based analysis (dataset 2) yields higher accuracy than the Monte Carlo simulation-based dataset (dataset 1). It is noted that the ML model is trained and tested on both datasets using the same

Conclusion
This paper demonstrates a systematic approach that uses a machine learning method to detect collisions in the concept phase during scenario-based analysis. The concept phase typically starts with the scenario investigation. The input dataset for the ML model is based on a scenario-based HARA from an ongoing research project. A simulation-based input dataset is generated to train the ML model. Afterwards, the input parameters are optimized at the logical scenario level. The performance of the ML model is studied using two sets of input data in order to optimize the accuracy of the trained model. Hence, the ML model is integrated into the hazard analysis to support scenario-based HARA. Further safety argumentation with evidence for the ML method can be developed throughout the HAD system life cycle by integrating ML techniques at the sub-system level. This study focuses on collision detection, and a set of experiments is executed with the machine learning model to investigate and ensure the performance of the applied model. Additionally, the quality of the input dataset plays a fundamental role in ML techniques. The results of the ML model experiments show that if the input (learning) dataset is realistic and sensibly optimized, the ML model can detect true or actual collisions.
Furthermore, this investigation provides a procedure for arguing safety with evidence, as demonstrated with the provided use case. Moreover, this study can be considered a basis for other safety steps, such as establishing pre-collision strategies, identifying collision events, and developing post-collision strategies.
The first major limitation of this study is that the model is tested on simulation-based datasets. The second constraint is that the knowledge-based scenario dataset is based on functional scenarios and logical scenarios, and the discretization steps in the logical scenarios are reduced for the data-driven scenario dataset. This means that the completeness of the dataset cannot be argued. Therefore, future work will include an extended ODD with related use cases. Lastly, the trustworthiness and effectiveness of the machine learning model and the quality of the dataset can be evaluated by running a set of test case scenarios in vehicle simulators.

Figure 1. Concept Model for using ML for HARA.
• The ego-vehicle (V2) turns towards the left to change the lane.
• The other vehicles (V1 and V3) are driving straight without acceleration. V1 is driving on the left lane and V3 is driving on the right lane.
• The ego-vehicle (V2) is driving straight on the left lane.
• The ego-vehicle (V2) turns towards the right to change the lane.
A graphical representation of the highway use case is presented in Figure 2 only for the right-to-left lane change of the ego-vehicle, including the flow from the accident type to the functional scenario as a 2D diagram and on to the logical scenario that is modelled in CarMaker. Accident type 63 is considered as a lane change from the right lane to the left lane (Unfalltypen-Katalog, 2016). The lane change scenario function has been extended by the ego-vehicle (V2) returning to the previous lane (right lane). A 2D diagram is used to demonstrate the functional scenario and serves as a basic use case reference for modelling the logical scenario in CarMaker (Khatun et al., 2021a). The ego-vehicle model and the other road vehicles use the kinematic bicycle mode as the default from the CarMaker tool (CarMaker, n.d.). The initialization parameters that are required for modelling the logical scenario are: (i) initial speed of the ego vehicle, (ii) speed of the other vehicles, (iii) distance between the vehicles, (iv) lane change time for the ego vehicle, (v) highway road type, (vi) start position of the ego vehicle and the other vehicles and (vii) road length. Additionally, a set of parameters is varied over a certain range: ego-vehicle speed, time to perform the lane change, lane change distance, and velocity and start position of vehicle V1.

Figure 2. Representation of Use Case for Scene 1.
A 5-fold cross validation and a 10-fold cross validation are applied together with two sets of D test to evaluate the performance of the ML model. The two D test sets come from two different input datasets, which were prepared with two different optimization approaches and aim to ensure the quality of the input dataset. The machine learning-based collision detection begins with the development of the features of the ML model (or learning model). A set of model parameters is optimized based on experience and publicly available research work. The machine learning concept, including the parameters for predicting or detecting the collision, is presented in Figure 5. However, the input dataset used in the model plays a vital role in optimizing the results of the ML model, as illustrated in Figure 5. The input dataset is divided into three sub-datasets: the training dataset (D train), the validation dataset (D validation) and the test dataset (D test). It is worth mentioning that optimization techniques such as Monte Carlo and parameter-based analysis are used to prepare the input dataset for the ML model. The ML model is tested with two types of input datasets to assess the model performance. Additionally, the keras

Figure 6. Accuracy and loss of training dataset and validation dataset.

Figure 7. Accuracy and loss of test dataset.

Table 1. Machine learning parameter setting.

Table 2. Summary of experimental results: deep neural network-MLP in HARA use case.
a Experiments are listed in the table considering two k-folds, dense layer and input dataset.
b Three hidden layers are used in the neural network. Example: the neural network for a model experiment (exp) is shown in Figure 4.
c All input datasets are examined with respect to k-fold and input dataset (see 3.3).
test setup. The ML model results indicate that the number of scenarios can be reduced and that the ML model can be used efficiently by applying an optimized parameter-based dataset (see the summary of experiments in Table 2). Table 2 provides a glimpse of the experiments considering the ML model parameters and shows that experiment 'exp_a16' delivers the most suitable outcomes among the experiments.