Safety of the Intended Functionality Validation for Automated Driving Systems by Using Perception Performance Insufficiencies Injection

: System perception of the environment becomes more important as the level of automation increases, especially at the higher levels of automation (L3+) of Automated Driving Systems. As a consequence, scenario-based validation becomes more important in the overall validation process of a vehicle. Testing all scenarios with potential triggering conditions that may lead to hazardous vehicle behaviour is not a realistic approach, as the number of such scenarios tends to be unmanage-able. Therefore, another approach has to be provided to deal with this problem. In this paper, we present our approach, which uses the injection of perception performance insufficiencies instead of directly testing the potential triggering conditions. Finally, a use case is described that illustrates the implementation of the proposed approach.


Introduction
Automated Driving Systems present a new challenge in the field of safety argumentation due to the complexity of validation, because this requires covering not only malfunctions but also scenario conditions and complex algorithms, greatly increasing the effort involved in obtaining quantitative evidence that ensures the safety of systems.The validation of ADAS/AD functions is shifting from component-based validation to a more scenario-based validation.Unlike component-based validation, which ensures that all components are working properly (i.e., no faults or malfunctioning), scenario-based validation adds the focus in cases when everything works as intended but different situations and components of the scenario could create a situation that may lead to hazardous behaviour.

Safety Validation
New regulations [1] require the provision of evidence from the validation process to obtain authorisation for driving on public roads and, more importantly, to avoid accidents [2][3][4][5] that occurred in the past.To obtain a better picture of safety and validation processes in the domain of autonomous vehicles, the authors [6,7] give an overview of the current situation, describing the main requirements and concepts.Standards such as UL4600 [8] also provide a list of all the required evidence to ensure the validation of a system.Other standards are being defined to explain this new safety domain in which scenarios are much more relevant.The Safety Of The Intended Functionality (SOTIF) is defined in the standard ISO21448:2022 [9] and covers the validation of hazards that are not initiated by a malfunction in the system but by misuse and technical shortcomings.The standard introduces the concept of potential triggering conditions, which are the conditions from a scenario that could cause the system to exhibit hazardous behaviour.As part of a SOTIF validation, potential triggering conditions must be covered, but the huge number of possible scenario situations makes testing all possible potential triggering conditions an unmanageable task.In order to fully understand the work described here, some concepts and terminology need to be clarified.A triggering condition is defined in [9] as a specific condition of a scenario that starts a reaction in the system contributing to hazardous behaviour.Potential could be included as a prefix when it is not yet validated, but experts see evidence that it could turn out to be a triggering condition in the end.A functional insufficiency is defined as an insufficiency of specification or performance insufficiency.The scope of the work is only focused on performance insufficiencies; therefore, the insufficiencies of specification, which are the initiators of unintentional misuse, are not considered in this publication.A performance insufficiency is defined as a limitation of technical capability contributing to hazardous behaviour when activated by one or more triggering conditions.An output insufficiency is an insufficiency on a functional level and, like the other insufficiency, can be activated by one or more functional insufficiencies or triggering conditions.Hazardous behaviour is defined as the behaviour of a system that is not within the specified acceptance criteria.The acceptance criteria could be defined on the basis of different Key Performance Indicators (KPIs) or, in the case of safety, Safety Performance Indicators (SPIs).An overview of the currently available metrics to define these criteria is given in [10].Additionally, the authors in [11] present a survey of the current standards related to safety in the automated driving domain, including the definition of common perception failures and relevant metrics to evaluate perception systems.Other authors [12] also give an overview of the current standards but focus on SOTIF.This research work shows the relationship between SOTIF and other standards as well as its implementation in the verification and validation process.Another safety standards overview is provided in [13] with a focus on object-based environment perception.The relationship between these concepts is shown in Figure 1, which illustrates how a triggering condition may start in a potential triggering condition (heavy fog) and lead to hazardous behaviour (unintended braking).Another example could occur if a vehicle leaves a tunnel.If an ADS only relies on a camera as a perception sensor, this camera has some moments with extremely high-contrast images that may impact the behaviour of the ADS.In an attempt to cover the topic of triggering conditions, the authors in [14] present a systematisation and identification of triggering conditions, providing a categorisation to better handle them, which also gives us a better understanding of the concept in this context.This topic was also covered in our previous work [15], where the process of testing the triggering conditions was explained, but the realisation of the impossibility of covering all triggering conditions made us change the direction of our research to validate an ADS by using perception performance insufficiencies instead.
Accounting for the scenario side, one of the main goals of the SOTIF is to minimise the scenarios that could be hazardous.Unlike ASIL methodology from the Functional Safety Standard (ISO26262) [16], a SOTIF validation does not provide a classification according to a specific metric.In SOTIF, the validation of the ADS should improve in each iteration due to minimising the already known hazardous scenarios or discovering new hazardous scenarios, which change with every iteration.Figure 2 shows how the scenarios are divided into four main areas: from safe and known scenarios to the worst-case scenario, unsafe and unknown scenarios.As depicted in this figure, each validation iteration must reduce the number of unsafe unknown scenarios, firstly validating the already known unsafe and safe scenarios and secondly discovering new scenarios that could be unsafe and unknown.Unfortunately, the area of hazardous unknown scenarios cannot be completely accounted for, as such scenarios can always occur.The rest have to be treated as a residual risk of ADS.This approach is not only focused on the performance insufficiencies injection but also on finding a way to discover and identify edge scenarios.A further classification of scenarios [17] is also given according to their occurrence or hazardous level.For example, corner scenarios are scenarios that occur in rare conditions with normal operational parameters (e.g., low sun angle, ice-covered road, etc.).On the other hand, an edge case is also a scenario that occurs in rare situations but with the presence of extreme values.Thus, not all corner cases are edge cases and vice versa.A nominal scenario means a traffic scenario containing situations that reflect regular and non-critical driving manoeuvres according to [18], which also defines a critical scenario as one that needs an emergency manoeuvre to avoid harm or react to a system failure.
A key concept arises in scenario-based validation, where the description of the scenario as well as the domain in which the system works properly has to be defined.In this context, the Operational Design Domain (ODD) defines all scenario situations in which an ADS is designed to work safely.ISO34502 [19] provides the principles to define an ODD, where a fine-grained description of the scenario plays a crucial role in this task.Many taxonomies [20,21] together with the previously cited [9,19] have tried to reduce the gap presented by this issue, describing various aspects from different levels of weather conditions (wind, snow, etc.) to road topologies.The authors in [22,23] describe how to maximise ODD coverage in the scenario validation process.The Pegasus project [24] proposed the six-layer scenario model, which categorises each scenario into layers according to the kind of actors and their functionality in each case.Furthermore, ASAM OpenODD [25] provides the necessary syntax to include a defined ODD in the software simulations and be able to carry out the validation.The syntax also allows different definitions to be reused, shared and combined, providing greater flexibility and collaboration between partners in the development of ADS as required.

Sensor Models
Although there are many Hardware-In-Loop (HIL) platforms [26,27] for collecting data to validate Automated Driving Systems, two main questions arise when using this approach.The first one is starting to think how much real-world data would be enough to validate the ADS [28]; the second is the impossibility of collecting data for all types of possible scenarios, such as different types and levels of weather conditions [29,30].Therefore, virtual validation is the more feasible way to validate an ADS.Typically, an Automated Driving System is built on three main blocks: sense, plan, and act.Figure 3 shows each block and the relationship between them.The sense block carries out the perception of the environment using sensors such as cameras, lidar or radar to perceive what is happening in the current environment.In our approach, the sense block is also split into two different sub-blocks: the perception and algo blocks.The sense-perception block models the observed reality (e.g., the environment) based on sensor inputs.For example, the point cloud generated by a lidar sensor is based on the perceived environment.On the other hand, the sense-algorithm block is responsible for extracting information from the input of the perception block (e.g., an object list from the generated point cloud).The plan block is in charge of the decision making according to the input from the perception sensors and the defined functionality, triggering the necessary actions according to the situation.Finally, the act block executes the actions decided upon, for example, braking or turning.This model is also referred to by different names such as Sense-Decide-Act or Perception-Decision-Actuation, but the meaning of each block remains the same.There are many research works focused on the study of the behaviours of diverse sensor technologies in different harsh environments.The authors in [31] present the current state of sensor models for virtual validation, including an explanation of the different types of sensor model fidelities.In this context, many research works have focused on the behaviour of different sensor technologies and on developing error models that model sensor performance in harsh environments.For example, the authors in [32][33][34] provide an in-depth analysis of the performance of camera and lidar technologies in adverse weather conditions such as foggy environments.A fog error model for a point cloud generated by lidar sensors was developed in [35].The same authors have continued the work by modelling lidar performance insufficiencies in snowfall conditions [36].Another example can be found in the following work [37], where the author develops a library to edit the point cloud generated by a lidar, including effects such as cropping or added reflection.This research is also developed in the context of security, as these additions can be used as a spoofing attack by external actors.Machine learning has also been used to develop sensor error models, as shown by [38].The authors explain their process for incorporating rain into the image frames produced by a camera.The aforementioned research works and tools could be used as a component of our approach to include perception performance insufficiencies into the system to be validated.One of the issues of using high-fidelity models is the increased computing demand of the simulation; the proposed approach in [39] mitigates this problem, albeit by using low-fidelity models, that show similar behaviour with lower computing costs.Andrea Piazzoni et al. in [40][41][42] conduct extensive research regarding perception error models.Similar to this paper, the authors remark on the importance of considering perception errors in the virtual validation process and its absence in current techniques.They proposed a perception error model and the guidelines to be included in the simulation pipeline.Moreover, the model is implemented using different sensor configurations, showing how each configuration is related to system safety.Perception errors are also utilised for virtual validation by using adversarial attacks [43] in the approach proposed in [44], concluding that these attacks, although seemingly harmless, have an impact on the final behaviour of the system.Another approach of RGB-camera perception error models is proposed in [45] to estimate rare failure probabilities used to learn high-likelihood failure trajectory distributions.

Risk Evaluation for Autonomous Vehicles
With regard to quantitative risk assessment, there have been recent approaches in this area.The first related work is proposed in [46], which includes a well-described list of deficiencies in the standard and possible corrections.It also gives a brief idea of how they could use a statistical approach for SOTIF validation.The authors in [47] propose an approach in which they can give quantitative values to each category of the HARA analysis (exposure, controllability and severity) from ISO26262 and calculate the risk of the ADS.Based on these values and statistical approaches, they can define the probability of risk of some extracted scenarios with and without triggering conditions.Unlike our approach, which is focused on full virtual validation, this approach focuses on using real-world data for validation.Another approach to calculating the risk is given in [48], where the authors use a fault tree analysis and HARA to provide a quantitative metric of an ADS.The approach given in [49] uses one-side binomial and Poisson distribution, but the authors recognise that the given analysis is greatly simplified to a model of a specific ADS.Moreover, no false positive or perception triggering conditions are included.In addition to the benefits of facilitating modular design, this approach makes it possible to demonstrate that sufficient safety conditions are met at the component level, using data sets of reduced size and therefore cost compared with those required for validation by vehiclelevel road tests.As a disadvantage, the very specific scenario complicates the inclusion of the methodology beyond the described scenario.This paper [50] proposes another statistical validation method that uses reinforcement learning to identify the scenarios that lead to a system outcome outside the acceptance criteria.This approach reduces the number of necessary scenario simulations needed to validate the collision avoidance system from the publication.A perception validation methodology using failure rate probabilities is given in [51].Similar to our approach, it uses the Responsibility-Sensitive Safety (RSS) [52] area as a main area to focus on in validation, but the approach only takes into account the perception component and not the impact in the complete system.Another research work in this field is [53].The authors in this publication develop a system to monitor, quantify and mitigate SOTIF risks.The methodology is validated through an HIL platform that uses AI algorithms in the perception system.

Structure of the Article
The structure of the publication is as follows: Section 2 explains how perception performance insufficiencies are implemented in our approach, providing a classification of these performance insufficiencies according to the impact on the system and the perception technology.The next section describes the risk evaluation, where performance insufficiency injection is used to provide a quantitative metric that can be used to evaluate the system.Section 4 illustrates two use cases in which the proposed approach is put into practice.In the first use case, the risk evaluation for a system with a visibility insufficiency is described, showing how this insufficiency impacts the output of the function and the associated risk.Additionally, the second use case shows the impact and the associated risk of an accuracy insufficiency in the same system.Finally, Section 5 summarises the methodology proposed in this publication and outlines the next steps of this research.

Perception Performance Insufficiencies Injection
As previously stated, validating all potential triggering conditions of the system is an unmanageable task.Therefore, this work focuses on the performance insufficiencies rather than the reasons behind them.The approach is centred on the impact on the ADS: for example, validation of an ADS when a scenario includes a tunnel.The validation strategy could involve generating test cases that involve all possible types of tunnels, also accounting for sizes and materials.Based on this validation strategy, including all definitions of tunnels and scenario conditions creates an almost infinite number of test cases, making the validation infeasible.Therefore, our approach goes directly to the impact that triggering conditions could generate in the system: extremely high-contrast images for a short period of time in a camera-based ADS.This strategy validates not only the specific triggering condition but also the unknown scenarios that could affect it in a similar way.This contributes to reaching one of the main goals of SOTIF, which is to minimise the unknown scenarios that are both hazardous and non-hazardous.
The perception component of an ADS includes different sensors and technologies that can vary the perceived reality.In our approach, performance insufficiencies are given at a high level of abstraction.Therefore, when a performance insufficiency is injected, it is related to the raw data that the sensor provides as input to the system.For example, if a lidar performance insufficiency is implemented, it uses point cloud messages for the injection, or image frames if a camera is used.The architecture of the approach is shown in Figure 4, which shows that the injection is added to the raw data of the sensor before it is included in the ADS system.Although in our approach only perception performance insufficiencies are considered, the effect of the injection could appear in any block of the ADS of Figure 3: sense, decision and actuation.To implement this approach, a classification of performance insufficiencies is defined first, with a main category defining the general impact of the insufficiency.Then, each of these categories serves as a parent of the performance insufficiencies defined by a specific technology as well as the insufficiencies modelled for specific triggering conditions.Thus, the performance insufficiencies in our approach are classified as follows: • Generic Performance Insufficiency (GPI): This refers to a general performance insufficiency that is not related to any specific sensor technology but rather the impact on sensor perception.It is used as a general category for performance insufficiencies.Table 1 shows an excerpt of some identified performance insufficiencies, also describing their impact on the sensor.• Technology Performance Insufficiency (TPI): In these insufficiencies, the defined generic performance insufficiencies are modelled for a specific technology.For example, the reduction of field of view performance insufficiency from the GPI table could be defined for the lidar technology as cropping in the point cloud message provided by the lidar sensor function.Thus, if the visibility of the sensor is limited to a specified distance, the injector will remove the points farther than this distance.Table 2 shows an excerpt of the performance insufficiencies for the lidar technology and how they are modelled in the system.

•
Triggering Condition Performance Insufficiency (TCPI): This is a performance insufficiency that was modelled for a specific triggering condition and technology, such as the lidar snowfall modelling from [36] or camera rain models from [38].This category also includes the defined taxonomies from the standards (SAE [54], BSI [21], SOTIF [9], etc.) that could be set as triggering conditions in the validation process.For example, visibility in a heavy snow scenario is limited to 500 m according to the SAE [54].
Note that these performance insufficiencies are not system-independent; therefore, they have to be included in all available sensors simultaneously.In this context, if a triggering condition is validated for ADS, then this includes a radar, camera, and lidar sensor; all performance insufficiency injections for all sensors must be included at the same time and at the same fidelity level to avoid inaccurate results.
A test case function ( f TC ) is defined without any performance insufficiency injection, resulting in values that fall within the acceptance criteria as f TC () ∈ ϵ acceptance−criteria .The same function, including the performance insufficiency (PI) injector, is defined in Equation ( 1) for all insufficiency levels (S).
If many performance insufficiencies (N) are validated at the same time, the previous equation could be expressed as Equation (2), where each added performance insufficiency (PI j ) is included independently of from each other and with different levels of intensity (S) (e.g., in cases when limited visibility and illuminance is validated at the same time in the system).
As discussed in [55], a triggering condition could be defined by one or many performance insufficiencies.For example, a heavy fog triggering condition could not be only parameterised by a visibility performance insufficiency as defined in [21], but also by illuminance and accuracy insufficiencies.Therefore, these three insufficiencies should be injected into the system at the same time.Consequently, a Triggering Condition Performance Insufficiency (TCPI) is defined as one or many performance insufficiencies that are injected at the same time and insufficiency level.This can then be formalised as Consequently, the equation to inject one or many TCPIs (M) into the test case is defined as follows: Table 1.Generic performance insufficiencies excerpt list.

PI-01
Reduction of Field of View (FoV) The visual range of the sensor is reduced from the nominal sensor performance.

PI-02
Light disturbance An external light source affects the sensor perception.

PI-03 Misalignment
The position of the sensor was changed from the calibrated sensor position.

PI-04 Reduction of resolution
Sensor resolution is reduced according to the nominal performance provided by the manufacturer.

PI-05
Reduction of accuracy Sensor accuracy decreases according to the nominal performance.

PI-06 Reduction of luminous intensity
The luminous intensity of the sensor is reduced according to the technical specifications.

PI-07 Slower processing time
Sensor processing time is slower than the maximum processing time in nominal conditions.

Risk Quantification
It is essential to perform a quantitative evaluation to ensure objective validation of the function.Thus, the next stage of our methodology involves quantifying the risk for validating the ADS, which enables a comparison of the results with newer iterations of the ADS to introduce improvements.Figure 5 demonstrates the correlation between risk and cause and effect in accordance with the SOTIF standard.ISO26262 [16] defines exposure, controllability and severity as part of the Hazard Analysis and Risk Assessment (HARA) methodology.The HARA methodology assigns a safety level to each defined hazard called the Automotive Safety Integrity Level (ASIL).The calculated ASIL of a hazard is based on three main variables: • Severity (S): the level of injury to the driver and passengers.• Controllability (C): if the hazard could be controlled by the driver.

•
Exposure (E): how often the hazard occurred during the driving time.Each variable is assigned a value ranging from 0 (mildest) to 4 (worst).The sum of these three variables (e.g., S0 + C2 + E1) is used to determine the ASIL level.ASIL A is assigned to the lowest risk, while ASIL D is assigned to the highest risk.In contrast, ISO21448 [9] defines risk as the product of controllability and severity.Thus, the primary objective is to reduce controllability and severity to prevent harm or achieve the safety goal, considering a previously defined residual risk.In ideal scenarios, controllability and severity should be zero (C = 0 or S = 0) to achieve optimal outcomes.Risk = PF * P(PI) * P(I) This approach calculates the risk based on Equation ( 5).This equation is based on two different probabilities.P(PI) represents the likelihood that a performance insufficiency may have an impact on the nominal performance of the system.This means that the results once the performance insufficiency is injected are outside a defined acceptable window.Nominal performance is calculated based on the values from Montecarlo simulations of the same scenario without the injection of any performance insufficiency.The tolerable window is calculated based on the standard deviation of this simulation and a given factor.As Figure 5 shows, if P(PI) is greater than zero, this implies that the ADS is susceptible to the injected insufficiency and it is also uncontrolled by ADS and, therefore, it is relevant for the system.On the other hand, P(I) is considered the probability of injury.It is assumed that both probabilities are independent for the sake of simplicity; however, dependent probabilities will be considered in further research to achieve more accurate quantification results.The probability of P(I) is determined using the methodology explained in [47,56].The authors employ the model developed by Kusano and Gabler [57] to calculate the likelihood of injury.This model assesses the probability of injury as being greater when the injury level is equal to or greater than level 2 according to the Maximum Abbreviated Injury Scale (MAIS) [58].MAIS level 2 is defined by moderate injuries with a low probability of death (1-2%).This value is considered in our approach as severity greater than zero (S > 0) and, therefore, takes the risk into consideration.Otherwise, the probability of severity is zero as well as the risk.In this case, although the performance insufficiency still affects the system, SOTIF modifications may be applied in the ADS to enhance system reliability.The model [59] utilised to calculate the probability of injury is defined as follows: A Plausibility Factor (PF) was also added to the risk calculation to adjust the different levels of performance insufficiency injection.This factor decreases as the injection level becomes more extreme.For instance, if the system experiences reduced visibility, it is more plausible that visibility is not drastically reduced than that sensor visibility is almost lost.The plausibility coefficient has the same value as the probability of occurrence when a triggering condition is injected.However, it is not the same when a performance insufficiency is injected.Currently, the value of this factor is determined through expert judgement; however, further research will be conducted to calculate this factor accurately in the future.Finally, the equation that defines the risk for a performance insufficiency is given in Equation ( 7) for all levels of injection (S).
Consequently, the risk evaluation of an ADS is obtained by summing all calculated risks for the considered performance insufficiencies (N), as shown in Equation 8.
The risk calculated by this methodology provides a quantitative metric that must be evaluated by the stakeholder to determine if it is within the acceptance criteria for its SOTIF evaluation.The As Low As Reasonably Practicable (ALARP) [60] principle or similar should then be applied to reduce the risk to the lowest possible level.This also aligns with one of the main goals of SOTIF, according to which each validation iteration improves the system's reliability and safety.

Use Case
This section presents two use cases to illustrate the proposed approach outlined in previous sections.In the first use case, the ADS is subjected to a limited visibility performance insufficiency by including a generic performance insufficiency with varying levels of intensity.In the second use case, the lack of accuracy is included in the system as a performance insufficiency, injecting different reflection levels.The environment simulator used was the open-source CARLA Simulator [61] in which all elements of the map were removed, leaving only the road and the vehicles (ego and target vehicles).The object detection of the ADS is based on the cluster detection from the Autoware [62] software stack.The ADS uses a lidar sensor located at the top of the vehicle as a unique perception sensor.The vehicle controller of the ADS was developed in ROS [63]. Figure 6 depicts the architecture for these use cases.It is assumed that the performance insufficiency injection does not introduce a significant delay into the system and that it does not affect the results.The scenario used is a deceleration scenario, where the ego vehicle (green) reduces speed or stops to avoid a collision with the target vehicle (red) located in front.The waypoint is straight without turning in any direction.The initial speed of the vehicles is zero with the maximum speed reached by the ego vehicle being 80 km/h.The deceleration scenario is shown in Figure 7.The top left picture shows the CARLA simulation; the top right shows the visualisation from the vehicles.A logical view is given below these pictures.The nominal performance of the ADS is shown in Figure 8, where the distance travelled is plotted on the x-axis over time.Black lines show the performance of the ADS of each simulation with the mean value of the simulations also displayed with an overlapping cyan line.The execution time tolerable window is shown with vertical blue lines.Similarly, green lines are used to limit the tolerance window of the ADS distance travelled.The tolerable window is calculated based on the mean value from the nominal performance simulations in which the standard deviation is multiplied by a factor set to the upper and lower limits.In this use case, one hundred simulations (B = 100) were performed to obtain the probabilistic values.Whether execution time or distance travelled is outside the defined tolerance window, this is considered hazardous behaviour.

Performance Insufficiencies Injection
The initial aim is to translate the generic performance insufficiency into a technological performance insufficiency.Since the perception sensor in this ADS is a lidar, the reduction of the field of view has been modelled as the cropping of the point cloud generated from the lidar sensor.In order to increase the performance of the simulations and reduce the test cases, only the Responsibility-Sensitive Safety (RSS) [52] area (A RSS ) is considered in the tests.This decision was made because, based on the model, only this area is relevant for the safety of the vehicle.In our approach, the velocity of the target vehicle was set to zero in the original RSS equation to be more conservative.
The meaning and the given value for each parameter of the formula are outlined below: • D RSS : Minimum distance to ensure that there is no crash with the obstacle.The minimum distance is only used for the longitude value; for the latitude value, the standard width for a highway is applied instead, 3.75 m [64].The RSS distance, calculated using the given values, is 81.09 m (D RSS ).Vehicle length is assumed to be five metres.The area (A RSS ) considered for analysis is depicted in Figure 9 and can be described as follows: The reduced visibility levels for the limited visibility performance insufficiency injected into the system are 80, 60, 45, 30, 20 and 15 m. Figure 10 shows the results of these injections.In this use case, level 0 (80 m), level 1 (60 m) and level 2 (45 m) do not have any impact on the output of the ADS.From level 3 (30 m) on, the injections do have an impact on the system, resulting in collisions.At these levels, the difference between nominal performance (cyan line) and the output with the injection is clear, with collisions shown as vertical red lines.Collisions occur earlier in each injection, since detection of the target is delayed due to visibility insufficiency, leading to delayed braking and, finally, a collision.Table 3 shows the probabilities of hazardous behaviour and collision of each injection level.In this case, all hazardous behaviours lead to a collision, although this behaviour does not always occur.As shown in the charts, the impact of the performance insufficiency starts to be relevant at level 3.At this level, two thirds of all simulations are outside of the tolerance windows and lead to a collision.Then, the injections have a full impact on the outcome of the ADS.These results follow the cause and effect model shown in Figure 1, where the visibility reduction injected at the technical level leads to an output insufficiency at the functional level.There is a late detection from the sensor block that generates a lag on the actuation block, followed by a late braking at the vehicle level, which ends in hazardous behaviour.As previously described in Section 3, the risk depends on the Plausibility Factor (PF), the probability of performance insufficiency (P(PI)), and the probability of injury (P(I)).The Plausibility Factor was calculated based on the values from an exponential distribution with a given lambda value (λ = 1) and a random variable (X) set by expertise judgement for each level.The plausibility values for this use case for all levels are shown in Table 4. Based on the risk calculation given in Section 3, the following equation shows the evaluated risk for the lowest level of the reduced visibility performance level.The Plausibility Factor for this injection is the maximum, PF vis 80 = 1.0000, because it could occur with a high probability in many scenarios: if this happens, its impact must not be minimised.The probability of performance insufficiency is zero, which is expected since the reduced visibility is close to the nominal field of view of the sensor.The probability of injury is also zero, since there are no collisions.Consequently, the risk for this performance insufficiency level is zero, as shown in Equation (11).
Risk vis 80 = PF vis 80 * P(PI vis 180 ) * P(I vis 80 ) = 1.00 * 0.00 * 0.00 = 0.00 (11) Unlike the risk evaluation previously calculated, the risk for level 3 of injection (Risk vis 30 ) is not zero.In this case, the given Plausibility Factor is more restrictive, as this kind of performance insufficiency level does not occur as regularly.The probability of this injection level is not zero, and it has an impact on two thirds of the simulations.The probability of injury is not zero, since there are collisions that could cause moderate injuries to drivers.
Risk vis 30 = 0.04979 * 0.66 * 1.22966 × 10 −2 (12) As expected, the values of the probability of injury increase when stricter levels are injected because there is less time for the vehicle to brake, and thus the crash velocity is higher for each level.On the other hand, the overall risk for each level is not always higher than the previous level because of the given Plausibility Factor.Finally, the quantitative risk evaluation for this reduced visibility performance insufficiency based on the results from Table 5   This quantitative risk evaluation provides a reference point for the minimisation of risk in subsequent iterations of the SOTIF validation.It is noted that these results could be used to validate the function for specific triggering conditions defined in the standards.For instance, the SAE standard [20] classifies fog into six levels based on system visibility.

Accuracy Reduction
A reduction in the accuracy of the perception component is included in the system based on the classification from Table 1 in this use case.The primary objective of this use case is to demonstrate how reflections at different density levels can affect the system's object detection and resulting behaviour.

Performance Insufficiencies Injection
Table 2 shows how this performance insufficiency is modelled for lidar-specific technology in which random points are injected into the point cloud from the message.Similar to the first use case, only the area (A RSS ) from the RSS is considered because it is the relevant safety area for the test case.Different levels of point density are injected in the system, where density is calculated based on the number of points in the A RSS based on sensor resolution and the number of injected points for this area (injection_density = number_injected_points/number_A RSS _points).
Figure 11 shows two levels of injection from the vehicle perspective.The left picture shows level 2 of injection, where we observe the false negatives produced by the random reflections in the point cloud message.The right picture shows a higher level of injection, where the amount of false detections makes the ego vehicle stop completely.The impact of each injection level is shown in Figure 12.Levels 0 and 1 do not have any relevant impact on the outcome of the function, but from level 3, the impact of the outcome is remarkable.Level 2 makes the ADS still follow the path but with successive stops due to the false negatives, while at levels 3 and 4, the ego vehicle remains stopped.Noteworthy is level 5, where the huge amount of reflections causes the target vehicle not to be considered as an object, eventually leading to a collision.Unlike the previous use case, in this case, hazardous behaviour does not always generate a collision, as Table 6 depicts.Table 6 displays the results of the injections.It is noteworthy that although this injection has an impact on the system at many levels, the included reflections do not affect the system at levels 0 to 2. However, levels 3 and 4 include false positives due to the injected reflections, which activate the braking system and prevent the vehicle from following the defined path.On the other hand, all hazardous behaviours lead to a collision at the most restrictive level of injection.

Quantitative Risk Evaluation
After calculating the probability of performance insufficiency, the risk can be evaluated.As with the previous use case, the Plausibility Factor was determined using an exponential distribution and values based on expert judgement.Table 7 displays the risk values for each level with only the final level posing a risk in the ADS due to the occurrence of collisions.It is important to note that there are many injections where the performance insufficiency has an impact on the behaviour of the system, but there are no collisions.Despite this, SOTIF measures should be implemented to minimise the probability of hazardous behaviours.For example, including a wide range of perception sensor technologies could be beneficial as each sensor technology has its advantages and disadvantages in specific environmental situations, which could help mitigate the impact of certain scenarios.The equation below shows the calculated risk for the reduction of accuracy, where only the last injection has an impact on the final risk quantification for this performance insufficiency; even though most of the performance insufficiency injections have an impact on the system output, only the last injection has an impact on the risk quantification.The aim is to indicate that both risk and SOTIF modifications are relevant in the validation process.

Conclusions and Future Work
This document describes a methodology for validating perception performance insufficiencies in automotive driving systems.Due to the impossibility of validating all possible triggering conditions of a scenario, the evaluation focuses on the impact of performance insufficiencies in the perception component of the system and its effect on the output of the entire system in order to determine whether it may lead to hazardous behaviour and, finally, possibly cause harm.In the document, a classification of the performance insufficiencies is given, showing the impact of the defined insufficiency in the system.Then, a model for each perception insufficiency is used to inject the insufficiency into the system and determine whether the specified performance insufficiency does have an impact on the ADS output.Based on the results from the injections, a Plausibility Factor and a probability of injury are calculated; when the injection leads to a collision, a quantitative risk could be calculated.Since the risk is based on the severity of injuries, if there are no collisions or the injuries caused by a collision are light, the risk is set to zero.In these situations, SOTIF measurement should be considered to minimise the probability of performance insufficiency.The calculated risk provides us with a quantitative metric that serves as a reference for improvement in further validation iterations.Finally, this text describes two use cases that show how the proposed methodology can be applied.The first use case validates a limited visibility performance insufficiency in which most of the hazardous behaviours lead to risk due to collisions.The second use case validates a reduction in accuracy, where most levels of injection do not result in a collision, indicating no risk, but they still have an impact on the system.Therefore, SOTIF measures should still be applied to improve the ADS against this type of performance insufficiency.
This research has raised several questions that require further investigation.Although the risk evaluation provides a quantitative metric, it is not related to any specific measure, such as the number of hours driven or kilometres travelled.Future research should address this issue to better link the obtained metric with real-world measurements.Another important question that needs to be addressed is how to determine when a performance insufficiency has been fully validated, including accurate models for each performance insufficiency.Additionally, a more effective approach to identifying edge cases in performance insufficiency injection testing should be implemented in the future.

Figure 1 .
Figure 1.Cause and effect model between potential functional insufficiencies and triggering conditions.

Figure 2 .
Figure 2. Evolution of the scenario categories through the SOTIF validation iterations.

Figure 4 .
Figure 4. Architecture of the injection approach.

Figure 5 .
Figure 5. Risk quantification from our approach and compared with ISO21448.

Figure 9 .
Figure 9. Considered RSS area in the test cases.

Figure 10 .
Figure 10.Results of the visibility performance insufficiency injections.

Figure 12 .
Figure 12. Results of the accuracy performance insufficiency use case.

Table 2 .
Lidar technology performance insufficiency excerpt list.

Table 3 .
Simulation results for each visibility performance insufficiency level.

Table 4 .
Plausibility factor for the visibility performance insufficiency.

Table 5 .
Risk evaluation for each visibility performance insufficiency level.

Table 6 .
Simulation results for each performance insufficiency level.

Table 7 .
Risk evaluation for each accuracy performance insufficiency level.