Testing and verification of neural-network-based safety-critical control software: A systematic literature review

Context: Neural Network (NN) algorithms have been successfully adopted in a number of Safety-Critical Cyber-Physical Systems (SCCPSs). Testing and Verification (T&V) of NN-based control software in safety-critical domains is gaining interest and attention from both software engineering and safety engineering researchers and practitioners.


Introduction
Cyber-Physical Systems (CPSs) are systems involving networks of embedded systems and strong human-machine interactions [1]. Safety-critical CPSs (SCCPSs) are a type of CPS subject to severe non-functional constraints (e.g., safety and dependability). The failure of SCCPSs could result in loss of life or significant damage (e.g., property and environmental damage). Typical applications of SCCPSs are in nuclear systems, aircraft flight control systems, automotive systems, smart grids, and healthcare systems.
In the last few years, advances in Neural Networks (NNs) have boosted the development and deployment of SCCPSs. A key concern, however, is how to ensure that an NN-based SCCPS will behave correctly and consistently when system failures or malicious attacks occur.
Increasing interest in the migration of Industrial Control Systems (ICSs) towards SCCPSs has encouraged research in the area of safety analysis of SCCPSs. Kriaa et al. [10] surveyed existing approaches for an integrated safety and security analysis of ICSs, covering both the design stage and the operational stage of the system lifecycle. Some approaches (such as [11,12]) aim to combine safety and security techniques into a single methodology; others (such as [13,14]) try to align safety and security techniques. These approaches are either generic, considering both safety and security at a very high level, or model-based, building upon a formal or semi-formal representation of the system's functions.
Many studies have focused on the T&V of NNs in the past decade, and several review articles [15-18] on this topic have been published. Studies [15,19] reviewed methods for verification and validation of NNs in aerospace systems, while studies [17,18] are limited to automotive applications. None of these review articles applied the Systematic Literature Review (SLR) [20] approach.
Recently there has been growing concern about Artificial Intelligence (AI) safety. State-of-the-art advancements in the T&V of NN-based SCCSs are increasingly important; hence, there is a need for a thorough understanding of present studies to incentivize further discussion. This study aims to summarize current research on T&V methods for NN-based control software in SCCPSs. We systematically identified and reviewed 83 papers focusing on the T&V of NN-based SCCSs and synthesized the data extracted from those papers to answer three research questions.
• RQ1 What are the profiles of the studies focusing on testing and verifying NN-based SCCSs?
• RQ2 What approaches and associated tools have been proposed to test and verify NN-based SCCSs?
• RQ3 What are the limitations of current studies with respect to testing and verifying NN-based SCCSs?
To the best of our knowledge, our study is the first SLR on testing and verifying NN-based control software in SCCPSs. The answers to these research questions can help researchers identify the research gaps in this area, and help industrial practitioners choose proper verification and certification methods.
The main contributions of this work are:
• We made a classification of T&V approaches from both academia and industry for NN-based SCCSs.
• We identified and proposed challenges for advancing the state of the art in T&V for NN-based SCCSs.
The remainder of this paper is organized as follows: In Section 2, we define terminology related to NN-based SCCPSs and summarize related work from academia and industry. Section 3 describes the SLR process and the review protocol. The results of the research questions are reported in Section 4. Section 5 discusses the industry practice of T&V of NN-based SCCSs and the threats to the validity of our study. Section 6 concludes the study.

Background
In this section, we first introduce terminology related to CPSs and modern NNs and show how NN algorithms have been used in SCCPSs. Then, we present the current state of practice of T&V of SCCSs.

Cyber-physical systems
As defined in Rajkumar et al. [1], "cyber-physical systems (CPSs) are physical and engineered systems whose operations are monitored, coordinated, controlled and integrated by a computing and communication core." Several other systems, such as the Internet of Things (IoT) and ICSs, have features very similar to CPSs, since they are all systems used to monitor and control the physical world with embedded sensor and actuator networks. In general, CPSs are perceived as the new generation of embedded control systems, which can involve IoT and ICSs [21,22].
In this SLR, we adopted the CPS conceptual model of Griffor et al. [23] as a high-level abstraction of CPSs to describe the different perspectives of CPSs and the potential interactions of devices and systems in a system of systems (SoS), as shown in Fig. 1. At the unit level, a CPS includes at least one or several controllers, many actuators, and sensors. A CPS can also be a system consisting of one or more cyber-physical devices. From the SoS perspective, a CPS is composed of multiple systems that include multiple devices. In general, a CPS must contain a decision flow (from controller to actuators), an information flow (from sensors to controller), and an action flow (actuators impacting the state of the physical world).
In the context of SCCPS, safety and performance are dependent on the system (to be more specific, the controller of the system) making the right decision according to the measurement of the sensors, and operating the actuators to take the right action at the right time. Thus, verification of the process of decision-making is vital for a SCCPS.

Modern neural networks
The concept of the "neural network" was first proposed in 1943 by Warren McCullough and Walter Pitts [24], and in 1957 Frank Rosenblatt designed the first trainable neural network, "the Perceptron" [25]. A perceptron is a simple binary classification algorithm with only one layer and an output decision of "0" or "1". By the 1980s, neural nets with more than one layer had been proposed to solve more complex problems, i.e., the multilayer perceptron (MLP). In this SLR, we regard multilayer NNs that emerged after the 1980s as modern NNs.
Artificial Neural Network (ANN) is the general name for computing systems designed to mimic how the human brain processes information [26]. An ANN is composed of a collection of interconnected computation nodes (namely "artificial neurons"), which are organized in layers. Depending on the direction of the signal flow, an ANN can have a feed-forward or feedback architecture. Fig. 2 shows a simplified feed-forward ANN architecture with multiple hidden layers. Each artificial neuron has weighted inputs, an activation function, and one output. The weights of the interconnections are adjusted based on the learning rules. There are three main models of learning rules: unsupervised learning, supervised learning, and reinforcement learning; the choice of learning rule corresponds to the particular learning task. Common activation functions include the sigmoid, the hyperbolic tangent, the radial basis function (RBF), and piece-wise linear transfer functions such as the Rectified Linear Unit (ReLU) [27]. In short, an ANN is defined by three factors: the interconnection structure between layers, the activation function type, and the procedure for updating the weights.
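To make these three factors concrete, the following minimal sketch (plain Python; all weights, biases, and inputs are hypothetical values chosen purely for illustration) shows a forward pass through a small feed-forward network with a ReLU hidden layer and a sigmoid output:

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases, act):
    # One fully connected layer: each neuron computes act(w . x + b).
    return [act(sum(w_i * x_i for w_i, x_i in zip(w, inputs)) + b)
            for w, b in zip(weights, biases)]

# Hypothetical 2-input -> 3-hidden -> 1-output network.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.6, -0.4, 0.2]]
b2 = [0.05]

x = [1.0, 2.0]
hidden = layer(x, W1, b1, relu)          # feed-forward: input -> hidden
output = layer(hidden, W2, b2, sigmoid)  # hidden -> output, value in (0, 1)
print(output)
```

Changing any of the three factors (the interconnection structure `W1`/`W2`, the activation functions, or the weight-update procedure used during training) yields a different network.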
Multi-Layer Perceptron (MLP [28]) represents a class of feed-forward ANN. An MLP consists of an input layer, one or several hidden layers, and an output layer. Each neuron in one layer of an MLP is fully connected to every node in the following layer. An MLP employs the back-propagation technique (a form of supervised learning) for training.
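A minimal back-propagation sketch for a tiny 2-2-1 sigmoid MLP follows; the initial weights, the single training example, and the squared-error loss are illustrative assumptions, not taken from any reviewed study:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(W1, b1, W2, b2, x, y, lr):
    """One back-propagation update for a 2-2-1 sigmoid MLP
    with squared-error loss; gradients derived by the chain rule."""
    # Forward pass.
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [sigmoid(z) for z in z1]
    z2 = sum(w * hi for w, hi in zip(W2, h)) + b2
    o = sigmoid(z2)
    # Backward pass: delta terms for output and hidden layers.
    d_out = (o - y) * o * (1 - o)
    d_hid = [d_out * W2[j] * h[j] * (1 - h[j]) for j in range(len(h))]
    # Gradient-descent weight updates.
    W2 = [W2[j] - lr * d_out * h[j] for j in range(len(W2))]
    b2 -= lr * d_out
    W1 = [[W1[j][i] - lr * d_hid[j] * x[i] for i in range(len(x))]
          for j in range(len(W1))]
    b1 = [b1[j] - lr * d_hid[j] for j in range(len(b1))]
    return W1, b1, W2, b2, 0.5 * (o - y) ** 2

# Hypothetical initial weights; repeated steps drive the loss down.
W1, b1, W2, b2 = [[0.3, -0.1], [0.2, 0.4]], [0.0, 0.0], [0.5, -0.3], 0.0
losses = []
for _ in range(50):
    W1, b1, W2, b2, loss = train_step(W1, b1, W2, b2, [1.0, 0.5], 1.0, lr=1.0)
    losses.append(loss)
print(losses[0], losses[-1])  # the loss decreases over the steps
```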
Convolutional Neural Network (CNN [29]) is a special type of multi-layer NN with one or more convolutional layers. A convolutional layer includes "several feature maps with different weight vectors. A sequential implementation of a feature map would scan the input image with a single unit that has a local receptive field, and store the states of this unit at corresponding locations in the feature map. This operation is equivalent to a convolution, followed by an additive bias and squashing function, hence the name convolutional network" [29]. CNNs are superior for processing two-dimensional data (particularly camera images) because the convolution operations are capable of detecting features in images. CNNs are now widely applied to develop partially and fully autonomous vehicles.

Deep Neural Networks (DNNs [30]) are ANNs with multiple hidden layers between the input and output layers. DNNs (e.g., an MLP with more than three layers or a CNN) differ from shallow NNs (e.g., a three-layer MLP) in the number of layers, the activation functions that can be employed, and the arrangement of the hidden layers. Compared to shallow NNs, DNNs can be trained to greater depth to find patterns with high performance even for complex nonlinear relationships.
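The feature-map scanning described in the CNN quotation above can be sketched as a plain-Python valid convolution (strictly, cross-correlation, as in most CNN implementations); the image and the vertical-edge kernel below are hypothetical:

```python
def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel (local receptive field)
    over the image and record one response per location (the feature map)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(image[i + u][j + v] * kernel[u][v]
                    for u in range(kh) for v in range(kw))
            row.append(s)
        feature_map.append(row)
    return feature_map

# 4x4 image with a vertical edge between columns 1 and 2.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
# Hypothetical 3x3 vertical-edge kernel (weights shared across locations).
k = [[-1, 0, 1],
     [-1, 0, 1],
     [-1, 0, 1]]
print(conv2d(img, k))  # every position responds strongly to the edge
```

The same small kernel is reused at every location, which is why convolutional layers can detect a feature (here, a vertical edge) anywhere in the image with few weights.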
An NN could be trained offline or online. An NN trained offline means it only learns during development. After training, the weights of the NN will be fixed and the NN will act deterministically. Therefore static verification methods could be possible. In contrast, online training will allow the NN to keep learning and evolving during operation, which requires run-time verification methods. In some applications, such as the Intelligent Flight Control System developed by NASA [15] , both offline and online training strategies are employed to meet the system requirements.
NNs are fundamentally different from algorithmic programs, but a formal development methodology can still be derived for an NN system. The development process of an NN system can include six phases. Falcini et al. [32] proposed a W-model that integrates data development with standard software development, highlighting the data-driven nature of DNN development. Falcini et al. [32] also summarized that a DNN's functional behavior depends on both its architecture and its learning outcome through training.

The trends of using NN algorithm in SCCPSs
From the automated range finders of the 1940s (developed by Norbert Wiener for anti-aircraft guns) [164] to today's self-driving cars, AI, and NN algorithms in particular, has been widely applied in both civilian (e.g., autonomous cars) and military (e.g., military drones) domains. Boosted by advances in AI, state-of-the-art CPSs can plan and execute increasingly complex operations with less human interaction. Here we present applications of NNs in the following four representative SCCPS domains.

Autonomous cars
For automobiles, the Society of Automotive Engineers (SAE) has proposed six levels of autonomous driving [33]. A level 0 vehicle has no autonomous capabilities, and the human driver is responsible for all aspects of the driving task. For a level 5 vehicle, the driving tasks are managed solely by the autonomous driving system. When developing autonomous vehicles targeting a high level of autonomy, one industry trend is to use DNNs to implement vehicle control algorithms. The deep-learning-based approach enables vehicles to learn meaningful road features from raw input data automatically and then output driving actions. This so-called end-to-end learning approach can be applied to resolve complex real-world driving tasks. When using deep-learning-based approaches, the first step is to use large training data sets (images or other sensor data) to train a DNN. A simulator is then used to evaluate the performance of the trained network. After that, the DNN-based autonomous vehicle is able to "execute recognition, prediction, and planning" driving tasks in diverse conditions [10]. Nowadays, CNNs are the most widely adopted deep-learning model for fully autonomous vehicles [5-8]. NVIDIA introduced an AI supercomputer for autonomy [34]. The development flow using NVIDIA DRIVE PX includes four stages: 1) data acquisition to train the DNN, 2) deployment of the output of a DNN in a car, 3) autonomous application development, and 4) testing in-vehicle or with simulation.
One essential characteristic of deep-learning-based autonomy is that the decision-making part of the vehicle is almost a black box. This means that in most cases, we as human drivers must trust the decisions made by the deep-learning algorithms without knowing exactly why and how the decisions are made.

Industrial control systems
Industrial Control System (ICS) is the general term for control systems such as Supervisory Control and Data Acquisition (SCADA) systems. ICSs make decisions based on a specific control law (such as a lookup table or a non-linear mathematical model) formulated by human designers. In contrast to the classical design procedure for control laws, reinforcement-learning-based approaches learn the control law directly from the interaction between the controller and the process, and then incrementally improve the control behavior. Such approaches, together with NNs, were already used in process control two decades ago [35]. Given the recent progress in AI and the success of DNNs in making complex decisions, there are high expectations for the application of DNNs in ICSs. For instance, DNNs and reinforcement learning can be combined to develop continuous control [36]. Spielberg et al. [37] extended the work of Lillicrap et al. [36] to design control policies for process control. Even though the approach proposed in Spielberg et al. [37] was only tested on linear systems, it shows a practical path toward applying DNNs in non-linear ICSs.

Smart grid systems
The smart grid is designed as the next generation of the electric power system, dependent on information and communications technology (ICT). There are numerous research initiatives on automated smart grid applications, such as FLISR (a smart grid multi-agent automation architecture based on decentralized intelligent decision-making nodes) [38]. NNs have been considered for solving many pattern recognition and optimization problems, such as fault diagnosis [39], control and estimation of flux and speed [2], and economical electricity distribution to consumers. The MLP is one of the most commonly used topologies in power electronics and motor drives [2].

Healthcare
Medical devices are another emerging area where researchers and industry practitioners are seeking to integrate AI technologies to improve accuracy and automation. ANNs and other machine learning approaches have been proposed in recent decades to improve the control algorithms for diabetes treatment [40,41]. In 2017, an AI-powered device for automated and continuous delivery of basal insulin (the MiniMed 670G system [42]) was approved by the U.S. Food and Drug Administration. In the same year, it was reported that GE Healthcare had integrated the NVIDIA AI platform into their computerized tomography scanners to improve speed and accuracy in the detection of liver and kidney lesions [43]. Using deep learning solutions such as CNNs in the medical computing field has proven effective, since CNNs have excellent performance in object recognition and localization in medical images [44].

Testing and verification of safety-critical control software
IEC 61508 and ISO 26262 are two standards highly relevant to the T&V of SCCSs. IEC 61508 is an international standard concerning the functional safety of electrical/electronic/programmable electronic safety-related systems. It defines four safety integrity levels (SILs) for safety-critical systems [45]; the higher the SIL a SCCPS requires, the more time and effort its verification needs. In IEC 61508, formal methods are highly recommended techniques for verifying high-SIL systems, because they can be used to construct the specification and provide a mathematical proof that the system matches formal requirements, which is a strong commitment to the correctness of a system.
ISO 26262, titled Road vehicles - Functional safety, is an international standard for the functional safety of electrical and/or electronic systems in production automobiles [46]. Besides using classical safety analysis methods such as Fault Tree Analysis (FTA) and Failure Mode and Effects Analysis (FMEA), ISO 26262 explicitly states that the production of a safety case is mandated to assure system safety. It defines a safety case as "an argument that the safety requirements for an item are complete and satisfied by evidence compiled from work products of the safety activities during development" [46].
Developing suitable approaches that can verify the behavior and misbehavior of a SCCPS is always challenging, and the architecture of NNs (especially DNNs) makes it even harder to decipher how algorithmic decisions are made. The current version of IEC 61508 is not applicable to the verification of NN-based SCCSs because it does not recommend AI technologies. The latest version of ISO 26262 and its extension ISO/PAS 21448, also known as Safety Of The Intended Functionality (SOTIF) [47], will likely provide a way to handle the development of autonomous vehicles. However, SOTIF only provides guidelines for SAE Level 0-2 autonomous vehicles [48], so it is not ready for the verification of NN-based autonomous vehicles.
In practice, to reduce test and validation costs, high-fidelity simulation is a commonly used approach in the automotive domain. The purpose of using a simulator is to predict the behavior of an autonomous car in a mimicked environment. NVIDIA and Apollo have distributed high-fidelity simulation platforms for testing autonomous vehicles. CARLA [49] and Udacity's Self-Driving Car Simulator [50] are two popular open-source simulators for autonomous driving research and testing.

Research method
We conducted our SLR by following the SLR guidelines in Kitchenham and Charters [20] and consulting other relevant guidelines in Petersen et al. [51], Shahin et al. [52], and Nguyen et al. [53]. Our review protocol consisted of four parts: 1) search strategy, 2) inclusion and exclusion criteria, 3) selection process, and 4) data extraction and synthesis.

Search strategy
Based on the guidelines in Kitchenham and Charters [20], we used the Population, Intervention, Outcome, Context (PIOC) criteria to formulate search terms. In this SLR:
• The population is an application area (e.g., general CPS) or a specific CPS (e.g., self-driving car).
• The intervention is the methodology, tools, and technology that address system/component testing or verification.
• The outcome is the improved safety or functional safety of CPSs.
• The context is the NN-based SCCPSs in which the T&V take place.

Fig. 3 shows the search terms formulated based on the PIOC criteria. We first used these terms to run a series of trial searches and verify the relevance of the resulting papers. We then revised the search string to form the final search terms, which were composed of synonyms and related terms.
We executed automated searches in six digital libraries: Scopus, IEEE Xplore, Compendex EI, ACM Digital Library, SpringerLink, and Web of Science (ISI). Table 1 presents our inclusion and exclusion criteria. We set three inclusion criteria to restrict the application domain, context, and outcome type. We excluded papers that were not peer-reviewed, such as keynotes, books, and dissertations, and papers not written in English. It should be clarified that, unlike most other SLR studies, we did not directly exclude short papers (fewer than six pages), work-in-progress papers, and pre-print papers: this research area is far from mature, so many initial ideas and in-progress papers are still valuable to review.

Table 1
Inclusion and exclusion criteria.

Inclusion criteria
I1 The paper must have a context in SCCPSs, either in general or in a specific application domain
I2 The paper must be aimed at testing/verification approaches for NN-based SCCSs
I3 The paper must be aimed at modern neural networks

Exclusion criteria
E1 Papers not peer-reviewed
E2 Not written in English
E3 Full-text is not available
E4 Not relevant to modern neural networks

Selection process
We used the inclusion and exclusion criteria to filter the papers in the following steps. We covered papers from January 2011 to November 2018. Fig. 4 shows the whole search and filtering process.
Stage 1: Ran the search string on the six digital libraries and retrieved 1046 papers. After removing duplicates, 950 papers remained.
Stage 2: Excluded studies by reading titles and keywords; papers that could not be excluded this way were kept for further investigation. At the end of this stage, we had 254 papers.

Stage 3: Further filtered the papers by reading abstracts and found 105 potential papers with high relevance to the research goal of our SLR.
Stage 4: Read the introduction and conclusion of each paper to decide on selection, recording the reason for each exclusion. We excluded papers that were irrelevant or whose full texts were not available. Furthermore, we critically examined the quality of the primary studies to exclude those lacking sufficient information. We ended up with 27 papers.
Stage 5: Read the full text of the studies selected in Stage 4 and applied snowballing by scanning their references. Snowballing can proceed in two directions: backward (scanning the references of a selected paper to find relevant papers published before it) and forward (checking whether any relevant later paper cites the selected paper). In our SLR, we mainly adopted backward snowballing to include additional papers. To limit the scope of the snowballing, we covered only references published between 2011 and 2018. Snowballing yielded 56 new relevant papers.
Finally, we selected 83 papers as primary studies for detailed analysis. We listed all of the selected studies in Appendix A. The first author conducted the selection process with face-to-face discussions with the second author. The second author performed a cross-check of each step and read all the final selected papers to confirm the selection of the papers.

Data extraction and synthesis
Data Extraction: We extracted two kinds of information from the selected papers. To answer RQ1, we extracted information for statistical analysis, e.g., publication year and research type. To answer RQ2 and RQ3, we collected information to identify key features (such as research goal, technique and tools, major contribution and limitation) of T&V approaches.
Synthesis: We used descriptive statistics to analyze the data for answering RQ1. To answer RQ2 and RQ3, we analyzed the data using the qualitative analysis method by following the five steps of thematic analysis [54] : 1) extracting data, 2) coding data, 3) translating codes into themes, 4) creating a model of higher-order themes, and 5) assessing the trustworthiness of the synthesis.

RQ1. What are the profiles of the studies focusing on testing and verifying NN-based SCCSs?
Study distributions: Fig. 5 shows the distribution of the selected papers by publication year and type of work. 68 papers (81.9%) have been published since 2016, indicating that researchers are paying increasing attention to the T&V of NN-based SCCSs. Conference papers were the most common publication type with 48 papers (57.8%), followed by pre-prints (25 papers, 30.1%), workshop papers (6 papers, 7.2%), and journal articles (4 papers, 4.8%).
We also investigated the geographic distribution of the reviewed studies, which allowed us to identify the countries leading the research in this domain. We considered a study to be conducted in a country if the affiliation of at least one author is in that country. Moreover, the involvement of industry is an indicator of industry interest in this domain. We classified a reviewed paper as industry if at least one author came from industry or the study used a real-world industrial system to test/verify the proposed approach; a paper was categorized as academia if all authors came from academia. Researchers based in the USA have been involved in the most primary studies on testing or verification of NN-based SCCSs, with 56 publications, followed by researchers based in Germany and the UK with 10 and 9 publications, respectively. It is worth noting that 47 of 83 (56.6%) publications have industry involvement.
Research types: We classified the selected papers based on the criteria proposed by Petersen et al. [51] (see Table 2). According to Table 2, the research type of a paper is governed by rules (i.e., R1-R6). Each rule is a combination of several conditions. The six research types (i.e., evaluation research, solution proposal, validation research, philosophical papers, opinion papers, and experience papers) correspond to R1-R6, respectively. For example, both evaluation research (corresponding to R1) and validation research (corresponding to R4) must present an empirical evaluation. The difference is that validation research is not conducted in practice (e.g., experimental or simulation-based approaches), whereas evaluation studies are conducted in a real-world context. A solution proposal has to propose a new solution that may or may not be used in practice. We found that evaluation and validation research constitute the majority of the selected papers, corresponding to 31.3% (26 papers) and 61.4% (51 papers), respectively. The low percentage of solution proposals (6 papers) was not surprising because the majority of the reviewed papers presented and demonstrated their T&V approaches through academic and industrial case studies, simulation, and controlled experiments. The other three research types (i.e., philosophical, opinion, and experience papers) do not appear among the selected studies because we only included papers aimed at testing/verification approaches (see inclusion criterion I2).

Application domains:
We analyzed the application domains of the selected studies to provide useful information for researchers and practitioners interested in the domain-specific aspects of the approaches. The results are shown in Table 3. We found that considerable effort is being put into using NN algorithms to implement control logic for general purposes (59 papers, 71.1%), automotive CPSs such as autonomous vehicles (13 papers, 15.7%), and autonomous aerial systems such as airborne collision avoidance systems for unmanned aircraft (5 papers, 6%).

RQ2. What approaches and associated tools have been proposed to test and verify NN-based SCCSs?
As 4 of the 83 papers focused mainly on high-level ideas and concepts without presenting detailed approaches or tools, we did not include them in answering RQ2. For the remaining 79 of 83 (95.2%) papers, we applied the thematic analysis approach [54] and identified five high-order themes and several sub-themes. Some papers contain more than one theme; to balance the accuracy and simplicity of the categorization, we assigned each study to only one category based on its major contribution. Table 4 presents the themes, sub-themes, and corresponding papers. Fig. 6 compares the differing interests of academia and industry in the five identified themes.

CA1: Assuring robustness of NNs
One high-order theme of the studies is assuring the robustness of NNs. The robustness of an NN is its ability to cope with erroneous inputs, which can be adversarial examples (i.e., inputs to which small perturbations are intentionally added to mislead the classification of an NN) or benign but wrong input data. Methods under this theme can be further classified into four sub-themes.
Studies focusing on understanding the characteristics and impacts of adversarial examples. Some studies tried to identify the characteristics and impacts of adversarial examples. The study [56] found characteristics of adversarial examples, such as their linear nature. The study [58] measured the impact of adversarial examples by counting their frequencies and severities. Nguyen et al. [55] found that a CNN trained on ImageNet [134] is vulnerable to adversarial examples generated through Evolutionary Algorithms (EAs) or gradient ascent. A few other studies, such as [57,59-61], tried to understand the characteristics of robust NNs. Cisse et al. [59] introduced a particular form of DNN, namely Parseval Networks, that is intrinsically robust to adversarial noise. Gu et al. [61] concluded that some training strategies, for example, training with adversarial examples or imposing a contractive penalty layer by layer, are robust to certain structures of adversarial examples (e.g., inputs corrupted by Gaussian additive noise or blurring). High-confidence adversarial examples (i.e., adversarial instances that are extremely easy to classify into the wrong category) were used to evaluate the robustness of state-of-the-art NNs in Carlini and Wagner [60] and of the robot-vision system in Melis et al. [57].
Studies focusing on methods to detect adversarial examples. Detecting adversarial examples already inserted into the training or testing data set is the primary target of [62,64-67]. Wicker et al. [62,66] formulated adversarial example detection as a two-player stochastic game and used Monte Carlo Tree Search to identify adversarial examples. Reuben [64] applied density estimates and Bayesian uncertainty estimates to detect adversarial samples. Xu et al. [65] proposed a feature-squeezing framework to detect adversarial examples generated by seven state-of-the-art methods. According to [65], an advantage of feature squeezing is that it does not change the underlying model; therefore, it can easily be integrated with other defense methods. Metzen et al. [67] embedded DNNs with a subnetwork (called a "detector") to detect adversarial perturbations. DeepSafe, presented in Gopinath et al. [63], identifies safe regions of the input space in which the network is robust to adversarial perturbations. Papernot et al. [69] revisited defensive distillation and proposed a more effective way to defend against three recently discovered attack strategies, i.e., the Fast Gradient Method (FGM) [56], the Jacobian Saliency Map Approach (JSMA) [135], and the AdaDelta optimization strategy (AdaDelta) [60].
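To make the Fast Gradient Method concrete, the sketch below applies an FGM-style perturbation to a toy logistic-regression model (a stand-in for an NN, with hand-picked weights and a closed-form gradient); everything here is an illustrative assumption, not code from the cited studies:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgm(x, w, b, y, eps):
    """FGM sketch: shift each input feature by eps in the direction of the
    sign of the loss gradient w.r.t. the input.
    Model: p = sigmoid(w . x + b), cross-entropy loss; dL/dx_i = (p - y) * w_i."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad_x = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad_x)]

# Hypothetical trained weights and a correctly classified input (label y = 1).
w, b = [2.0, -1.0], 0.0
x = [1.0, 0.5]
p_clean = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
x_adv = fgm(x, w, b, y=1, eps=0.6)
p_adv = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
print(p_clean, p_adv)  # the perturbed input is pushed across the boundary
```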
Studies focusing on increasing the robustness of NNs by using adversarial examples. In studies [70,71], the authors proposed methods that leverage adversarial training (i.e., generating a large number of adversarial examples and then training the NN not to be fooled by them) to increase the robustness of NNs.
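A minimal sketch of the adversarial-training loop described above, using a toy logistic-regression model and FGM-style perturbations (all data, rates, and sizes are illustrative assumptions, not taken from [70,71]):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgm(w, b, x, y, eps):
    # FGM-style perturbation against the current model.
    p = predict(w, b, x)
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

def adversarial_training(w, b, data, eps=0.3, lr=0.5, epochs=200):
    """For each clean example, also train on its FGM perturbation,
    pushing the model to classify both correctly."""
    for _ in range(epochs):
        for x, y in data:
            for xt in (x, fgm(w, b, x, y, eps)):
                err = predict(w, b, xt) - y          # dL/dz for cross-entropy
                w = [wi - lr * err * xi for wi, xi in zip(w, xt)]
                b -= lr * err
    return w, b

# Hypothetical toy data: the class is decided by the first feature.
data = [([1.0, 0.2], 1), ([0.9, -0.1], 1), ([-1.0, 0.1], 0), ([-0.8, 0.3], 0)]
w, b = adversarial_training([0.0, 0.0], 0.0, data)
# After training, even eps-perturbed inputs should be classified correctly.
robust = all(predict(w, b, fgm(w, b, x, y, 0.3)) > 0.5 if y == 1
             else predict(w, b, fgm(w, b, x, y, 0.3)) < 0.5
             for x, y in data)
print(robust)
```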

CA2: Improving failure resilience of NNs
Studies under this theme focused on improving the resilience of NNs, so that the NN-based CPSs are more tolerant of possible hardware and software failures.
Studies [74,76,77] investigated error detection and mitigation mechanisms, while studies [75,79] focused on understanding error propagation in DNN accelerators. Vialatte et al. [74] demonstrated that faulty computations can be mitigated by increasing the size of NNs. Santos et al. [76] proposed an algorithm-based fault tolerance (ABFT) strategy to detect and correct radiation-induced errors. In [77], a binary classification algorithm based on temporal and stereo inconsistencies was applied to identify errors caused by single-frame object detectors. Li et al. [75] developed a general-purpose GPU (GPGPU) fault injection tool [136] to investigate error propagation patterns in twelve GPGPU applications. Later, Li et al. revealed that the error resilience of DNN accelerators depends on "the data types, values, data reuse, and the types of layers in the design" [80]. Based on this finding, they devised guidelines for designing resilient DNN systems and proposed two DNN protection techniques, Symptom-based Error Detectors (SED) and Selective Latch Hardening (SLH), to mitigate soft errors that are typically caused by high-energy particles in hardware systems [137].
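The soft-error studies above can be illustrated by a toy fault-injection experiment: flipping one bit of a weight (mimicking a radiation-induced upset) and comparing the corrupted output against a golden run. The neuron, weights, and chosen bit below are hypothetical:

```python
import math
import struct

def flip_bit(value, bit):
    """Flip one bit of a 64-bit float, mimicking a soft error in memory."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def forward(w, x):
    # A single ReLU neuron stands in for a network layer.
    return max(0.0, sum(wi * xi for wi, xi in zip(w, x)))

w = [0.5, -0.25, 1.0]
x = [1.0, 2.0, 3.0]
golden = forward(w, x)          # fault-free reference run

# Inject a fault into w[2] and compare against the golden run; large
# deviations are the "symptoms" that error detectors look for.
w_faulty = list(w)
w_faulty[2] = flip_bit(w[2], 62)  # top exponent bit of 1.0 -> +Inf
faulty = forward(w_faulty, x)
print(golden, faulty)
```

Flipping a high exponent bit corrupts the output catastrophically, while low mantissa bits often perturb it only slightly, which is one reason error resilience depends on data types and values.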
Mhamdi et al. explored the error propagation mechanism in an NN [78], and theoretically and empirically showed that the key parameters for estimating the robustness of an NN are "the Lipschitz coefficient of the activation function, the distribution of large synaptic weights, and the depth of the network". The study [80] characterized fault propagation through an open-source autonomous vehicle control software (i.e., openpilot) to assess the failure resilience of the system. The Systems-Theoretic Process Analysis (STPA) [138] hazard analysis technique was used to guide fault injection. Rubaiyat et al. [80] showed that STPA is well suited to an in-depth identification of unsafe scenarios, which reduced the fault injection space.
Building on diversified redundancy strategies, the study [81] developed diverse networks (trained with different data sets, different network parameters, and different classification mechanisms) to strengthen the fault tolerance of the DNN architecture.
Studies [72,73] tried to improve computation efficiency without compromising error resilience, and also predicted the error resilience of DNN accelerators in order to make NN accelerators reconfigurable.
The study [72] demonstrated a more accurate neuron resilience assignment than state-of-the-art techniques and opened the possibility of moving parts of the neuron computations to unreliable hardware under a given quality constraint. Zhang et al. [73] proposed a framework that increases computational efficiency by approximating the computation of certain less critical neurons. Daftry et al. [82] posed an interesting question: "how to enable a robot to know when it does not know?" Their idea is to use the features learned by the controller's CNN to predict the failure of the controller, and then let the system self-evaluate and decide whether to execute or discard an action.
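The introspection idea of [82] (score the controller's learned features and discard actions when the predicted failure risk is high) might be sketched as below. The feature vectors, weights, and threshold are purely illustrative and do not come from [82]:

```python
import math

def failure_probability(features, w, b):
    """Introspection model: estimate the probability that the controller
    will fail, from a feature vector produced by its perception CNN.
    In practice w and b would be learned from logged failure data;
    the values used here are invented for illustration."""
    z = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))

def decide(features, w, b, threshold=0.5):
    """Execute the planned action only when predicted failure risk is low;
    otherwise discard it and fall back to a safe behavior."""
    return "execute" if failure_probability(features, w, b) < threshold else "discard"

w, b = [1.5, -2.0], 0.0            # illustrative introspection weights
print(decide([0.2, 0.9], w, b))    # low predicted risk  -> execute
print(decide([2.0, 0.1], w, b))    # high predicted risk -> discard
```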

CA3: Measuring and ensuring test completeness
The approaches and tools under this theme aim to ensure good coverage when testing NNs. The testing approaches include black-box testing (i.e., focusing on whether the tests cover all possible usage scenarios), white-box testing (i.e., focusing on whether the tests cover every neuron in the NN), and metamorphic testing, which focuses on both test case generation and result verification [139] .
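The metamorphic idea can be illustrated with a simple metamorphic relation: a uniform brightness change should not alter a classifier's prediction, so a prediction change exposes an erroneous behavior without needing a ground-truth label. The toy classifier and transformation below are our own invention, not a method from the reviewed papers:

```python
def classify(image):
    # Stand-in for an NN classifier: predicts class 1 when the mean
    # pixel value exceeds a fixed threshold.
    return 1 if sum(image) / len(image) > 0.5 else 0

def metamorphic_violations(f, image, deltas=(-0.2, -0.1, 0.1, 0.2)):
    """Metamorphic relation: a uniform brightness shift should not change
    the predicted class.  Returns the shifts that expose inconsistent
    behavior (no test oracle / ground-truth label is required)."""
    original = f(image)
    return [d for d in deltas
            if f([min(1.0, max(0.0, p + d)) for p in image]) != original]

image = [0.4, 0.5, 0.45, 0.55]            # mean 0.475 -> class 0
print(metamorphic_violations(classify, image))
```

Here the brightness shifts +0.1 and +0.2 flip the toy classifier's output, so the relation is violated; this is exactly the kind of inconsistency that metamorphic testing surfaces for real DNNs.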
O'Kelly et al. [83] proposed to ensure good usage coverage by first defining a formal Scenario Description Language (SDL) to create driving scenarios, and then translating the scenarios into S-TALIRO, a specification-guided automatic test generation tool, to generate and run the tests. Raj et al. [86] showed that the generation of new and interesting counterexamples can be sped up by introducing fuzzing patterns obtained from an unrelated DNN on a different image database, although the proposed method provides no guarantee of test completeness.
DeepXplore [84] first introduced neuron coverage as a testing metric for DNNs, and then used multiple DNNs with similar functionality to identify erroneous corner cases. Compared to [84], DeepTest [85] and DLFuzz [89] aim at maximizing neuron coverage without requiring multiple DNNs. The study [85] employed metamorphic relations to identify erroneous behaviors. The study [89] proposed a differential fuzzing testing framework to generate adversarial inputs. However, the methods proposed in Pei et al. [84], Tian et al. [85], and Guo et al. [89] cannot guarantee the generation of test cases that precisely reflect real-world cases (e.g., driving scenes in various weather conditions for a DNN-based autonomous driving system). DeepRoad [88] employed Generative Adversarial Network (GAN) based techniques and metamorphic testing to synthesize diverse real driving scenes and to test for inconsistent behaviors in DNN-based autonomous driving systems. In contrast to earlier works, DeepGauge [87] argued that the testing criteria for traditional software are no longer applicable to DNNs. Ma et al. [87] proposed neuron-level and layer-level coverage criteria for testing DNNs and for measuring testing quality.
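The neuron coverage metric introduced by DeepXplore [84] counts a neuron as covered once its activation exceeds a threshold for at least one test input. A minimal sketch on a toy single-layer ReLU network (weights invented for illustration):

```python
def relu(z):
    return max(0.0, z)

def layer(x, W, b):
    """Activations of one fully connected ReLU layer."""
    return [relu(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def neuron_coverage(test_inputs, W, b, threshold=0.0):
    """Fraction of neurons whose activation exceeds `threshold` for at
    least one test input (the neuron-coverage metric of [84])."""
    covered = [False] * len(W)
    for x in test_inputs:
        for i, a in enumerate(layer(x, W, b)):
            if a > threshold:
                covered[i] = True
    return sum(covered) / len(covered)

# Toy hidden layer with four neurons over a 2-D input.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b = [0.0, 0.0, 0.0, 0.0]
print(neuron_coverage([[1.0, 1.0]], W, b))                # 0.5: two neurons fire
print(neuron_coverage([[1.0, 1.0], [-1.0, -1.0]], W, b))  # 1.0: all four fire
```

Coverage-guided tools then search for inputs that activate the still-uncovered neurons, on the assumption that exercising more of the network's internal states exposes more corner-case behaviors.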

CA4: Assuring safety property of NN-based SCCPSs
Formal verification can provide a mathematical proof that a system satisfies some desired safety properties (e.g., that the system always stays within some allowed region, namely a safe region). Formal verification usually represents NNs as models and then applies a model checker, such as a Boolean satisfiability (SAT) solver (e.g., Chaff [140], SATO [141], GRASP [142]), to verify the safety property. Pulina et al. [92] developed NeVer (Neural networks Verifier), which solves Boolean combinations of linear arithmetic constraints, to verify safe regions of MLPs. By adopting an abstraction-refinement mechanism, NeVer can verify real-world MLPs automatically. In an extended experimental analysis of the results of [92], Pulina and Tacchella [90] compared the performance (e.g., competition-style performance and scalability) of state-of-the-art Satisfiability Modulo Theories (SMT) solvers [143] and demonstrated that scalability and fine-grained abstractions remain challenges for networks of realistic size. The studies [91,97] verified feed-forward NNs with piece-wise linear activation functions by encoding the verification problem as a linear approximation of the network behavior that is then solved with an SMT solver.
The next generation of collision avoidance systems for unmanned aircraft (ACAS Xu) adopted DNNs to compress a large score table [5]. Julian et al. [95] explored the performance of ACAS Xu by measuring a set of safety and performance metrics. A simulation in study [95] showed that the DNN-based system behaved as correctly as the original large score table, with better performance. Reluplex [97] has been successfully used to verify the safety property of a DNN for the prototype of ACAS Xu. Although the outcomes of Reluplex [97] are limited to verifying the correctness of NNs with specific types of activation functions (i.e., ReLUs and max-pooling layers), the study sheds light on which types of NN architectures are easier to verify, and thus paves the way for verifying real-world DNN-based controllers.
The methods proposed in studies [99,100] verify Binarized Neural Networks (BNNs) and scale to moderate-sized BNNs. Study [99] represented BNNs as Boolean formulas and then verified the robustness of BNNs against adversarial perturbations. In study [100], BNNs and their input-output specifications were translated into equivalent hardware circuits, consisting of a BNN structure module and a BNN property module. The authors of [100] then applied a SAT solver to verify properties of the BNN (e.g., whether it can "simultaneously classify an image as a priority road sign and as a stop sign with high confidence") in order to identify risky behavior of the BNN.
When verifying an SCCS, one of the fundamental concerns is to make sure that the SCCS will never violate a safety property. An example of a safety property is that the system should never reach an unsafe region. The main idea of the studies under this sub-theme is to calculate the output reachable set of MLPs, as in studies [94,96], or of DNNs, as in study [93], to verify whether unsafe regions can be reached. Xiang et al. [96] proposed a layer-by-layer approach to compute the output reachable set, assisted by polyhedron computation tools. The safety verification of a ReLU MLP is then reduced to checking whether the output reachable set has a non-empty intersection with the unsafe regions. In a later work, Xiang et al. [94] introduced maximum sensitivity to perform a simulation-based reachable set estimation with few restrictions on the activation functions. By combining local search and linear programming, Dutta et al. [93] developed an output bound searching approach for DNNs with ReLU activation functions, implemented in a tool called SHERLOCK, to check whether the unsafe region is reachable. Study [98] focused on the safety verification of image classification decisions. In [98], Huang et al. employed discretization to enable a finite exhaustive search for adversarial misclassifications. If no misclassifications are found in any layer after the exhaustive search, the NN is regarded as safe.
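The layer-by-layer reachability idea can be approximated with simple interval arithmetic: propagate an input box through each layer and check whether the resulting output box intersects the unsafe region. The sketch below is a coarse over-approximation (axis-aligned intervals rather than the polyhedra of [96] or the bound search of [93]), on an invented toy network:

```python
def interval_affine(lo, hi, W, b):
    """Propagate an input box [lo, hi] through x -> W x + b."""
    out_lo, out_hi = [], []
    for row, bi in zip(W, b):
        l = h = bi
        for w, li, ui in zip(row, lo, hi):
            l += w * li if w >= 0 else w * ui
            h += w * ui if w >= 0 else w * li
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def relu_interval(lo, hi):
    return [max(0.0, l) for l in lo], [max(0.0, u) for u in hi]

def reachable_box(lo, hi, layers):
    """Layer-by-layer over-approximation of a ReLU network's output
    reachable set; if the box misses the unsafe region, the network is
    safe for this input set (the converse may be a false alarm)."""
    for W, b in layers[:-1]:
        lo, hi = relu_interval(*interval_affine(lo, hi, W, b))
    W, b = layers[-1]
    return interval_affine(lo, hi, W, b)

layers = [([[1.0, -1.0], [1.0, 1.0]], [0.0, 0.0]),  # hidden ReLU layer
          ([[1.0, 1.0]], [0.0])]                    # linear output layer
lo, hi = reachable_box([0.0, 0.0], [1.0, 1.0], layers)
print(lo, hi)  # output box: safe w.r.t., e.g., the unsafe region y > 4
```

Because every interval step over-approximates, an empty intersection with the unsafe region is a sound safety proof, while a non-empty one only warrants refinement, which is why the polyhedral and optimization-based methods above trade more computation for tighter sets.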
The idea of [101] was to formulate the formal verification of temporal logic properties of a CPS with Machine Learning (ML) components as a falsification problem (finding a counterexample that does not satisfy the system specification). The study [101] adopted an ML analyzer to abstract the feature space of the ML components (approximately representing the ML classifiers). The identified misclassifying features are then used to drive the falsification process. The ML analyzer narrows down the search space for counterexamples and establishes a connection between the ML component and the rest of the system.
Another direction for making sure the system will not violate safety properties is to use run-time monitoring. The study [102] envisioned an approach named WISEML, which combines reinforcement learning and run-time monitoring techniques to detect invariant violations. The purpose of this work is to create a safety envelope around NN-based SCCPSs.

CA5: Improving interpretability of NNs
NNs have proved to be an effective way to generalize the relationship between inputs and outputs. As the models of NNs are learned from training data sets without human intervention, the relationship between the inputs and outputs of an NN is like a black box. Due to this black-box nature, it is difficult for people to understand and explain how an NN works. Studies under this theme focus on facilitating the understanding of how NNs generate outputs from inputs. They can be classified into the following three sub-themes, which may overlap; the sub-themes nevertheless capture the different motivations for the interpretability of NNs.
Studies focusing on understanding how a specific decision is made. This line of work mainly focuses on providing explanations for individual predictions (also called local interpretability). One such study is Local Interpretable Model-agnostic Explanations (LIME) [129]. LIME approximates the original NN model locally to provide an explanation for a specific prediction of interest. The limitation of LIME is that it assumes local linearity of the classification boundary, which does not hold for most complex NNs. The creators of LIME later extended their work by introducing high-precision rules (i.e., if-then rules), which they called anchors [104]. The study [130] developed an explanation system named LEMNA for security applications and Recurrent Neural Networks (RNNs). LEMNA can locally approximate a non-linear classification boundary and handle feature dependency, and is therefore able to provide high-fidelity explanations.
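The core of the LIME idea can be sketched in a few lines: perturb the instance of interest, query the black-box model, and fit a local linear surrogate whose coefficients serve as the explanation. The model and instance below are invented for illustration, and the sketch regresses one feature at a time rather than using LIME's weighted multivariate fit:

```python
import random

def black_box(x):
    # Stand-in for an opaque NN's output score (unknown to the explainer).
    return x[0] + 2.0 * x[1]

def local_slope(f, x0, feature, n=200, sigma=0.1, seed=0):
    """Fit a local linear surrogate for one feature: perturb x0, query
    the black box, and regress its output on the perturbed feature."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = list(x0)
        x[feature] += rng.gauss(0.0, sigma)   # sample near the instance
        xs.append(x[feature])
        ys.append(f(x))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var        # local slope = the feature's local importance

x0 = [0.5, 0.25]
s0 = local_slope(black_box, x0, feature=0)
s1 = local_slope(black_box, x0, feature=1)
print(s0, s1)   # feature 1 is locally about twice as influential
```

For a genuinely non-linear black box the recovered slopes are only valid in the sampled neighborhood, which is precisely the local-linearity assumption that LEMNA [130] relaxes.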
In the case of an image classifier, it is also common to use gradient measurements to estimate the importance value of each pixel for the final classification. DeepLIFT [115] , Integrated Gradients [105] , and more recently, SmoothGrad [120] fall into this category. The study [121] proposed a unified framework, SHapley Additive exPlanations (SHAP), by integrating six existing methods (LIME [122] , DeepLIFT [115] , Layer-Wise Relevance Propagation, Shapley regression values, Shapley sampling values, and Quantitative Input Influence) to measure feature importance.
Several approaches attempted to decompose the classification decision (output) into the contributions of individual components of an input based on specific local decomposition rules (i.e., Pixel-Wise decomposition [106,116] , and deep Taylor decomposition [108] ).
Szegedy et al. [103] investigated the semantic meaning of individual units and the stability of DNNs when small perturbations are added to the input. They pointed out that individual neurons do not contain the semantic information; rather, the entire space of activations does. They also experimentally showed that the same small perturbation of the input can cause different DNN models (e.g., trained with different hyperparameters) to generate wrong predictions.
Several methods improve on the above-mentioned approaches to local explanation of NN models. The study [113] argued that explanation approaches for NN models should have sound theoretical support. Ross et al. [118] presented the idea of being "right for the right reasons," meaning that the output of an NN model should be correct and come with the right explanation. In Ross et al. [118], incorrect explanations for particular inputs can be identified, and NN models can be guided to learn alternate explanations. Both [113,117] targeted real-time explanation, since their approaches can generate accurate explanations quickly.
Studies focusing on facilitating understanding of the internal logic of NNs. Studies in this sub-theme are also known as global interpretability. To help interpret how NN models work, model distillation is used in Frosst and Hinton [122], Che et al. [123], Hinton et al. [124], and Tan et al. [126]. The initial intention of distillation was to reduce computational cost; for example, Hinton et al. [124] distilled a collection of DNN models into a single model to facilitate deployment. The knowledge distilled from NN models was later applied to interpretability. Some studies compressed information (e.g., decision rules) from deep learning models into transparent models, such as decision trees [122,131] and gradient boosting trees [123], that mimic the performance of the original models. Others explained the inner mechanisms of NN models by analyzing the feature space. Study [126] distilled the relationship between input features and model predictions (the outputs of the model) into a feature shape to evaluate each feature's contribution to the model.
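A minimal illustration of the distillation-for-interpretability idea: train a transparent "student" (here, a one-feature threshold rule) to mimic a black-box teacher's predictions. The teacher, data, and stump learner are invented for illustration and are far simpler than the tree-based distillation of [122,123]:

```python
def teacher(x):
    # Stand-in for a trained NN's class prediction (treated as a black box).
    return 1 if 0.8 * x[0] + 0.2 * x[1] > 0.5 else 0

def distill_stump(f, data):
    """Distill f into the single-feature threshold rule ("stump") that
    agrees with the teacher on the most inputs.  Returns the chosen
    feature, threshold, and fidelity (agreement rate with the teacher)."""
    labels = [f(x) for x in data]
    best = None
    for feat in range(len(data[0])):
        for x in data:              # candidate thresholds: observed values
            t = x[feat]
            agree = sum(1 for xi, y in zip(data, labels)
                        if (1 if xi[feat] > t else 0) == y)
            if best is None or agree > best[0]:
                best = (agree, feat, t)
    agree, feat, t = best
    return feat, t, agree / len(data)

data = [[0.1, 0.9], [0.9, 0.1], [0.2, 0.2],
        [0.8, 0.8], [0.6, 0.4], [0.4, 0.6]]
feat, t, fidelity = distill_stump(teacher, data)
print(f"if x[{feat}] > {t} then 1 else 0  (fidelity {fidelity:.2f})")
```

The extracted rule is readable by a human reviewer, and its fidelity score quantifies how faithfully the transparent student reproduces the teacher, which is the trade-off at the heart of distillation-based interpretability.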
Another attempt to produce global interpretability is to reveal the features learned by each neuron. For example, in Nguyen et al. [127], the authors leveraged deep generator networks to synthesize the input (i.e., an image) that highly activates a neuron. Dong et al. [110] adopted an attentive encoder-decoder network to learn interpretable features, and then proposed an algorithm called prediction difference maximization to interpret the features learned by each neuron.
One interesting work [119] used an additional NN module suited for relational reasoning to reason about the relations between the inputs and responses of NN models. There is also another promising line of work (e.g., [109,114]) that combines local and global interpretability to explain NN models.

Studies focusing on visualizing internal layers of NNs to help identify errors in NNs
In study [128], activities such as the operation of the classifier and the function of intermediate feature layers within a CNN model were visualized using a multi-layered deconvolutional network (named DeconvNet). These visualizations are useful for diagnosing model problems. Unlike [128], which visually depicted neurons in a convolutional layer, the study [107] visualized neurons in a fully connected layer. Zhou et al. [112] proposed Class Activation Mapping (CAM) for CNNs to visualize the discriminative object parts in any given image. Fong and Vedaldi [111] highlighted the part of an image most responsible for a decision by perturbing meaningful images. DarkSight [125] combined the ideas of model distillation and visualization to visualize the predictions of an NN model. Thiagarajan et al. [132] built a TreeView representation via feature-space partitioning to interpret the predictions of an NN. Mahendran et al. [133] reconstructed semantic information (images) in each layer of a CNN using information from the image representation.

RQ3. What are the limitations of current research with respect to testing and verifying NN-based SCCSs?
Analyzing failure modes and how the system reacts to failures are crucial parts of safety analysis, especially in safety-critical domains. When testing and verifying the safety of NN-based SCCPSs, we need to rethink how to perform failure mode and effect analysis, how to analyze inter-dependencies between sub-systems of SCCPSs, and how to analyze the resilience of the system. We need to ensure that even if some of the system's hardware or software does not behave as expected, the system can sense the risk, avoid the risk before an incident, and mitigate the risk effectively when an incident happens. Looking at T&V activities across software development, the ideal situation is to find appropriate T&V methods to verify whether the design and implementation are consistent with the requirements, construct complete test criteria and a test oracle, and generate test data and test any objects (such as code modules and data structures) that are necessary for the correct development of the software [144]. Unfortunately, complete T&V is hard to guarantee. In order to investigate the gap between industry needs for T&V of NN-based SCCPSs and state-of-the-art T&V methods, we performed a mapping of the identified approaches to the relevant standard.

Mapping of reviewed approaches to the software safety lifecycles in IEC 61508
An increased interest in the application of NNs within safety-critical domains has encouraged research in the area of T&V of NN-based SCCSs. Research institutions and industry T&V practitioners are working on different aspects of this problem. However, we have not found strong connections between those potentially useful methods for T&V of NNs and the relevant safety standards (such as IEC 61508 [45] and ISO 26262 [46]). We hereby adopt IEC 61508 [45] as the reference standard for the mapping analysis, since ISO 26262 [46] is an adaptation of IEC 61508 [45]. We found that the major T&V activities listed in the software safety lifecycles of IEC 61508-3 (including evaluation of software architecture design, software module testing and integration, programmable electronics integration, and software verification) are still valid when conducting T&V for NN-based SCCSs, but for most of them, new techniques and measures for supporting the T&V of NN-based software are needed. Therefore, we decided to employ safety integrity properties (which are explained in IEC 61508-3 Annex C and IEC 61508-7 Annex F) as indicators to judge to what extent these desirable properties have been achieved by state-of-the-art methods for T&V of NN-based SCCSs. The detailed mapping information can be found in Table 5.
In Table 5, we mapped existing T&V methods for NN-based SCCSs (columns 3 and 4) to the relevant properties (column 2) of the four major T&V phases (column 1) in the software safety lifecycles of IEC 61508-3. In column 5 of Table 5, we summarized the remaining challenges in testing and verifying NN-based SCCSs based on the reviewed papers. These overviews of remaining challenges can help researchers identify future research directions.

Limitations and suggestions for testing and verifying NN-based SCCSs
In Table 5, we show the limitations and gaps of state-of-the-art T&V approaches for NN-based SCCSs. In this section, we take two T&V phases (evaluation of software architecture design, and software module testing and integration) as examples to provide a detailed analysis of identified limitations and corresponding suggestions on the basis of the required safety integrity properties. For the other two T&V phases (programmable electronics integration and software verification), only summaries of limitations and suggestions are presented, to avoid duplication.

Evaluation of software architecture design.
The top three properties that have been addressed are: simplicity and understandability (31 papers), freedom from intrinsic design faults (10 papers), and fault tolerance (5 papers). Correctness with respect to software safety requirements specification (1 paper) and verifiable and testable design (2 papers) have drawn little attention in the reviewed studies. Two properties, i.e., completeness with respect to software safety requirements specification and defense against common cause failure from external events, have not been addressed in the reviewed papers.
Completeness with respect to software safety requirements specification.

No study contributes to the achievement of completeness, which requires the architecture design to address all safety needs and constraints. The achievement of completeness depends on the achievement of other properties, such as fully understanding the behavior of NN models. The design and deployment of NN-based SCCSs are still in their infancy. When NN-based SCCS design becomes more practical, more studies may address this topic.

Correctness with respect to software safety requirements specification.
To achieve correctness, the software architecture design needs to respond appropriately to the specified software safety requirements. Study [95] reported a successful design of a DNN-based compression algorithm for aircraft collision avoidance systems. Even though they demonstrated that the DNN-based algorithm preserves the required safety performance, the training process is still time-consuming.

Freedom from intrinsic design faults.
Intrinsic design faults can be interpreted as failures deriving from the design itself. State-of-the-art NNs have proved to be vulnerable to adversarial perturbations due to some intriguing properties of NNs [56]. Most of the studies in this category were aimed at understanding, detecting, and mitigating adversarial examples. Study [98] reported that their approach generalizes well to several state-of-the-art NNs and successfully finds adversarial examples. However, the verification process for the found features is time-consuming, especially for larger images. In this sense, the scalability and computational performance of adversarial robustness methods need to be addressed in future work. In addition, adversarial robustness does not imply that the NN model is truly free from intrinsic design faults. How to assure freedom from interferences other than adversarial perturbations (e.g., signal-to-noise ratio degradation) is a research gap that remains to be filled.

Understandability.
This property can be interpreted as the predictability of system behavior, even in erroneous and failure situations. In this category, studies focusing on providing explanations for individual predictions (e.g., [103]) and on visualizing internal layers of NNs (e.g., [128-130]) are not meaningful for safety assurance. Studies focusing on facilitating understanding of the internal logic of NNs (such as presenting NNs as decision trees [122]) could be a way to improve the understandability of NN-based architecture designs. However, this line of work is rare, and most methods are only applied to small-scale DNNs with image input or to specific NN models. Besides, even when an explanation of an NN is available, confirming the correctness of the explanation is still a challenge. Interpretability of NNs is undoubtedly a crucial need in safety-critical applications. Methods in this line should be capable of explaining different types of sensor data (e.g., image, text, and point data) and both local and global decisions.

Verifiable and testable design.
The evaluation metrics of verifiable and testable design may be derived from modularity, simplicity, provability, and so on. We observed that existing verifiable and testable designs are limited to specific NN architectures (e.g., [91]) or specific tasks (e.g., [83]). There is no standard procedure for determining which types of NNs are easier to verify. Ehlers [91] argued that NNs adopting piece-wise linear activation functions are easier to verify, but their method still faces a trade-off between efficient verification and the accuracy of the linear approximation of the NN behavior.

Fault tolerance.
Fault tolerance implies that the architecture design can assure the safe behavior of the software whenever internal or external errors occur. To achieve fault tolerance, features such as failure detection and failure impact mitigation for both internal and external errors should be included in the design. Existing methods showed that unexpected environmental failures are hard to detect and mitigate. Besides, many of the proposed approaches in this category have not yet been evaluated in the real world. Some studies formulated approximate computational models to represent real-world systems (e.g., [73]). The study [82] did not use any test oracle when executing system flight tests. Some studies used simulation models to verify the performance of the original NN (e.g., [74]); they were not able to prove the fidelity of the model compared with the real-world system.

Defense against common cause failure from external events.

A software common cause failure is a concurrent failure of two or more modules in the software, caused by software design defects and triggered by external events such as timing, unexpected input data, or hardware abnormalities [145]. Many safety-critical systems adopt redundant architectures (two or more independent subsystems with identical functions that back each other up) to prevent a single point of failure. However, redundant architectures are vulnerable to common cause failures. In the context of NN-based SCCSs, it is common to employ multiple NNs with similar architectures in order to improve prediction accuracy. If a common cause failure occurs in this kind of software design, the predictions might all be wrong, and thus the control software might make the wrong decision. DeepXplore, reported in Pei et al. [84], used more than two different DNNs with the same functionality to automatically generate test cases.
If all the DNNs in DeepXplore are affected by a common cause failure, for example if a sensor failure causes all the DNNs to make the same misclassification, then the corresponding test case cannot be generated. No method in the reviewed papers can identify common cause failure modes and defend against such failures. To defend effectively against common cause failures, designers need to inspect the completeness and correctness of the safety requirements specification, trace the implementation of the safety requirements specification, and make a thorough T&V plan that reveals common cause failure modes at an early stage.

Software module testing and integration.
The top two properties that have been addressed are: completeness of testing and integration with respect to the design specifications (9 papers) and correctness of testing and integration with respect to the design specifications (8 papers). Repeatability has drawn little attention (3 papers) in the reviewed studies. One property, precisely defined testing configuration, has not been addressed in the reviewed papers. This property aims to evaluate the precision of T&V procedures, which is not in the scope of our selected papers; therefore, we do not discuss it further.

Completeness of testing and integration with respect to the design specifications.
We observed some efforts that tried to find a systematic way to generate test cases (e.g., [85,88]), to measure testing quality (e.g., [87]), or to connect different T&V stages in the development of SCCSs (e.g., [146]). As analyzed in Section 4.2, we can infer that NN-based control software is intrinsically different in design workflow and software development from traditional control software. We suggest that the testing criteria should thoroughly align with the software design. To be more specific, the intrinsic features of NN-based software (e.g., the NN model's architectural details and the working mechanism of NNs) should be carefully considered when setting the testing criteria. That is, testing criteria should be defined comprehensively and explicitly, considering not only test case coverage but also the robustness of NN-based system performance (for instance, testing how an NN responds when input data change slightly) and the features of the training data sets, such as the data density issue mentioned in Ashmore and Hill [147].

Correctness of testing and integration with respect to the design specifications.
Several studies (e.g., [55,62,63]) reported that their methods are vulnerable to variations of adversarial examples. Another common limitation is that most methods are model-specific, meaning that they apply only to a single type or class of NN model. To achieve correctness of testing and integration, the module testing task should be complete, meaning that testing should cover both NN models and external inputs. However, few studies focused on the validation of input data. One study [77] identified that sufficient validation of raw input data remains a challenge.

Repeatability.
The complexity and uninterpretable nature of NNs make manual testing almost infeasible. In order to generate consistent results from repeated testing, some studies were dedicated to achieving automatic test execution or even automatic test generation. We found three papers (i.e., [83-85]) addressing automatic test generation. However, generating test cases automatically is still a challenge. For instance, studies [84,85] acknowledged that the test cases generated by an automated testing tool may not cover all real-world cases.

Programmable electronics integration.
The major limitation of this line of work is insufficient testing of hardware accelerators. NN-based SCCPSs typically require high-performance computing systems, such as Graphics Processing Units (GPUs). Some industry participants have provided specialized hardware accelerators for NN-based computations. For example, Google deployed a DNN accelerator (called the Tensor Processing Unit) in its data centers for DNN applications [148]. NVIDIA introduced an automotive supercomputing platform named DRIVE PX 2 [34], which has been used by over 370 companies and research institutions in the automotive industry [149]. However, little research effort has been put into the T&V of the reliability of hardware accelerators for NN applications. We found seven studies (i.e., [72-77,79]) addressing the evaluation of the error resilience of hardware accelerators. However, the testing is limited to specific error types (e.g., radiation-induced soft errors, as in Schorn et al. [72], Santos et al. [76], and Li et al. [79]). The mitigation method proposed in Santos et al. [76] (ABFT: Algorithm-Based Fault Tolerance) can protect only portions of the accelerator (e.g., sgemm kernels, one kind of matrix multiplication kernel). The study [77] identified errors made by single frame object detectors, but the results showed that the method is not capable of detecting all mistakes. The studies [72,79] investigated the propagation characteristics of soft errors in DNN systems, but they used a DNN simulator instead of a real DNN accelerator for fault injection.

Software verification.
In general, there is a lack of a comprehensive and standardized framework for verifying the safety of NN-based SCCSs. Formal verification procedures are highly demanding, and their common limitation is scalability. Most proposed methods are limited to a specific NN structure and size (e.g., [91,92,97,99,100]). The study [92] reported that their approach can only verify small-scale systems (i.e., NNs with 3 layers and at most 64 input neurons). The approach reported in Narodytska et al. [99] can verify medium-sized NNs; the verification of large-scale NNs is still a challenge. Another limitation is that the proposed approaches are not robust to NN variations. For example, the verification methods in studies [91,97] are only adapted to specific network types and sizes.

Discussion
In this section, we first discuss industry practices for T&V of NN-based SCCPSs. Then, we compare this SLR with related works. At the end of this section, we present the threats to the validity of our study.

Industry practice
Our findings on the research questions (RQ1 to RQ3) mainly reflect the academic efforts addressing T&V of NN-based SCCPSs. NN-based applications have drawn a lot of attention from industry practitioners. Taking the automotive industry as an example, several carmakers (e.g., GM, BMW, and Tesla) and some high-technology companies (e.g., Waymo and Baidu) are leading the revolution in autonomous driving safety.

Safety of the intended functionality
At the beginning of this year, ISO/PAS 21448:2019 [47] was published. It lists recommended methods for deriving verification and validation activities (see ISO/PAS 21448:2019, Table 4). In Table 6, we highlight six of the recommended methods, which share similar verification interests with existing academic efforts.

Safety reports
In 2018, three companies (Waymo, General Motors, and Baidu Apollo) published their annual safety reports. As a pioneer in the development of self-driving cars, Waymo proposed the "Safety by Design" [150] approach, which entails the processes and techniques they used to face the safety challenges of a level 4 autonomous car on the road. For cybersecurity considerations, Waymo adopted Google's security framework [151] as the foundation. After that, General Motors (GM) released their safety report [152] for the Cruise AV (also level 4). GM's safety process combined conventional system validation (such as vehicle performance tests, fault injection testing, intrusive testing, and simulation-based software validation) with SOTIF validation through iterative design. Baidu adopted the Responsibility-Sensitive Safety model [153] proposed by Mobileye [154] (an Intel company) to design the safety process for the Apollo Pilot for a passenger car (level 3).
In addition, we noticed that Tesla started releasing quarterly safety data in October 2018 [155]. Tesla seems to have a completely different approach to self-driving cars than the other companies. According to TESLA NEWS [156], Autopilot relies on cameras rather than LIDAR for its self-driving functions, and the Autopilot software is trained online (which means that the NN keeps learning and evolving during operation). Autopilot's safety features are continuously evolved and enhanced through understanding real-world driving data from every Tesla.
Referring to these safety reports of existing autonomous cars, we should be aware that when testing DNN-based control software (the core part of autonomous vehicles), black-box system-level testing (observing inputs and their corresponding outputs, e.g., closed-course testing and real-world driving) is still the leading method. More systematic T&V criteria and approaches are needed for more complete and reliable testing results.
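The black-box system-level approach can be sketched in a few lines (a hypothetical stand-in controller and invented function names; not any company's actual process): sample operational inputs, run the model as an opaque box, and check a safety property on the outputs.

```python
import numpy as np

rng = np.random.default_rng(42)

def controller(state):
    """Stand-in for a trained NN controller: maps a 4-d state to a
    steering command in [-1, 1]. Any opaque model fits here."""
    w = np.array([0.3, -0.2, 0.5, 0.1])
    return float(np.tanh(state @ w))

def black_box_safety_test(model, n_trials=1000, bound=1.0):
    """System-level black-box test: sample random states from the
    operational domain and record violations of |command| <= bound."""
    failures = []
    for _ in range(n_trials):
        state = rng.uniform(-5.0, 5.0, size=4)
        command = model(state)
        if abs(command) > bound:
            failures.append((state, command))
    return failures

failures = black_box_safety_test(controller)
print(f"safety violations in 1000 random trials: {len(failures)}")
```

The weakness this sketch exposes is exactly the one noted above: random sampling can only show the absence of *found* failures, never completeness, which is why more systematic T&V criteria (e.g., coverage-guided test generation) are needed.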

Verification and validation of NNs
Taylor et al. [15] conducted a survey on the Verification and Validation (V&V) of NNs used in safety-critical domains in 2003. Study [15] is the closest work we found, although they did not adopt an SLR approach. Our study covered new studies from 2011 to 2018. The authors of [15] also made a classification of methods for the V&V of NNs. They grouped the methods into five traditional V&V technique categories, namely, automated testing and testing data generation methods, run-time monitoring, formal methods, cross validation, and visualization. In contrast to [15], our study adopted a thematic analysis approach [54] and identified five themes based on the research goals of the selected studies. We thought it better to classify the proposed T&V methods of NNs based on their aims rather than on the traditional technique categories, since many traditional V&V techniques are no longer effective for verifying NNs in many cases. New methods and tools should be explored and developed without being limited by the traditional V&V categories. Another difference is that our study specialized more in the T&V of modern NNs, such as MLPs and DNNs, whereas the study [15] provided a more in-depth analysis of V&V methodologies for NNs used in flight control systems, such as Pre-Trained Neural Networks (PTNNs) and Online Learning Neural Networks (OLNNs). Our study and [15] have some common findings. For example, one category, named Visualization in Taylor et al. [15], falls into our category CA5 Improving interpretability of NNs.

Surveys of security, safety, and productivity for deep learning (DL) systems development
Hains et al. [16] surveyed existing work on "attacks against DL systems; testing, training, and monitoring DL systems for safety; and the verification of DL systems." Our study and [16] shared a similar motivation. The critical differences between our SLR and [16] are threefold: 1) We conducted our literature review on 83 selected papers based on specific SLR guidelines, while they used an ad hoc literature review (ALR) approach and reviewed only 21 primary papers. 2) They only focused on DL systems, whereas our scope covered modern NN-based software systems, which encompasses DL-based software systems. 3) They inferred that formal methods and automated verification are the two promising research directions based on the reviewed works. In contrast, we focused more on safety issues and found more categories to be addressed for safety purposes.

Surveys of certification of AI technologies in automotive
Falcini et al. [17,18] reviewed the existing standards in the automotive industry and pointed out the related applicability issues of automotive software development standards to deep learning. Although our SLR takes the automotive industry as an example, we are concerned with SCCPSs in general. This concern is reflected in the distribution of the selected papers (only 13 of the 83 selected papers are oriented to automotive CPSs).

SLR of explainable artificial intelligence (XAI)
There are two very recent SLRs, Adadi and Berrada [157] and Hohman et al. [158], on the interpretation of artificial intelligence. Both [157,158] employed similar commonly accepted guidelines to conduct their SLRs. The fundamental difference between our study and [157,158] is the scope. Adadi and Berrada [157] reviewed 381 papers on existing XAI approaches from interdisciplinary perspectives. As reported in Hohman et al. [158], the scope of their SLR is visualization and visual analytics for deep learning. The study [158] focused on studies that adopted visual analytics to explain NN decisions. Our study has more comprehensive coverage of T&V approaches, which were employed not only to interpret NN behaviors but also to assure the robustness of NNs, to improve the failure resilience of NNs, to ensure test completeness, and to assure the safety properties of NN-based SCCPSs. In summary, our SLR tried to provide an overview of key aspects related to T&V activities for NN-based SCCSs.

Threats to validity
In this section, we discuss some threats to the validity of our study.

Search strategy
The most likely threat in this step is missing or excluding relevant papers. To mitigate this threat, we used six of the most relevant digital libraries to retrieve papers. Additionally, we employed two strategies to mitigate potential limitations in the search terms: 1) adopting the PIOC criteria to ensure the coverage of search terms; and 2) improving the search terms iteratively. Further, we conducted an extensive snowballing process on the references of the selected papers to identify related papers. The search keywords were cross-checked and agreed on by both authors.

Study selection
Researchers' subjective judgment could be a threat to the study selection. We strictly followed the pre-defined review protocol to mitigate this threat. For example, we started recording the inclusion and exclusion reasons from the 3rd stage. Both authors validated the inclusion and exclusion criteria on the basis of the pilot search. Furthermore, the second author performed a cross-check of all selected papers. Any paper that raised doubts about its inclusion or exclusion decision was discussed between the first and second authors. For example, "smart grid" is included in the search terms, but no relevant papers were found after the 3rd stage. Then, we conducted a snowballing search to identify papers that presented how to use NNs in smart grids. We found out that AI is mainly used to solve the economically relevant problems [159] of the smart grid system (e.g., prediction of energy usage and efficient use of resources). AI is not involved in the safety-critical applications (e.g., decision making on optimal provision of power) of smart grids. Therefore, there were no relevant papers addressing safety analysis or testing/verification (refer to Inclusion criterion I2).

Data extraction
The first author was responsible for designing the data extraction form and conducting the data extraction from the selected papers. To avoid the first author's bias in data extraction, the two authors continuously discussed data extraction issues. The extracted data were verified by the second author.

Data synthesis
Data analysis outcomes could vary between researchers. To reduce the subjective impact on data synthesis, we strictly followed the thematic synthesis steps [54], and the data synthesis was first agreed on by both authors. We disseminated our preliminary findings to two internal research groups at our university (i.e., the autonomous vehicle lab and the autonomous ships lab) and presented them at a Ph.D. seminar on IoT, Machine Learning, Security, and Privacy for comments and feedback. In summary, the audience agreed with our research design and results, and they thought that the mapping of reviewed approaches to IEC 61508 is a valuable attempt. Several researchers working in formal verification and safety verification thought that safety cases would be a promising direction to address the challenges of T&V of NN-based SCCSs. One suggestion was adding information about self-driving car simulators. Based on these comments and feedback, we revised our paper accordingly.

Conclusion and future work
In this paper, we have presented the results of a Systematic Literature Review (SLR) of existing approaches and practices on T&V methods for neural-network-based safety-critical control software (NN-based SCCS). The motivation of this study was to provide an overview of the state-of-the-art T&V of safety-critical NN-based SCCSs and to shed some light on potential research directions. Based on pre-defined inclusion and exclusion criteria, we selected 83 papers that were published between 2011 and 2018. A systematic analysis and synthesis of the data extracted from the papers and comprehensive reviews of industry practices (e.g., technical reports, standards, and white papers) related to our RQs were performed. Results of the study show that:

1. The research on T&V of NN-based SCCSs is gaining interest and attention from both software engineering and safety engineering researchers/practitioners, as reflected in the impressive upward trend in the number of papers on T&V of NN-based SCCSs (see Fig. 5). Most of the reviewed papers (68/83, 81.9%) were published in the last three years.

2. The approaches and tools reported for the T&V of NN-based control software have been applied to a wide variety of safety-critical domains, among which "automotive CPSs" has received the most attention.

3. The approaches can be classified into five high-order themes, namely, assuring robustness of NNs, improving failure resilience of NNs, measuring and ensuring test completeness, assuring safety properties of NN-based SCCPSs, and improving interpretability of NNs.

4. The activities listed in the software safety lifecycles of IEC 61508-3 are still valid when conducting safety verification for NN-based control software. However, most of the activities need new techniques/measures to deal with the new characteristics of NNs.

5. Four safety integrity properties within the four major safety lifecycle phases, namely, correctness, completeness, freedom from intrinsic faults, and fault tolerance, have drawn the most attention from the research community. Little effort has been put into achieving repeatability. No reviewed study focused on precisely defined testing configurations or defense against common cause failures, which are extremely crucial for assuring the safety of a production-ready NN-based SCCS [160].

6. It is common to combine standard testing techniques with formal verification when testing and verifying large-scale, complex safety-critical software [15,144]. As explained in Section 4.3, we found that an increasing concern of the reviewed works is the integration of different T&V techniques in a systematic manner to gain assurance for the whole lifecycle of the NN-based control software.
This SLR is just a starting point in our studies on testing and verifying NN-based SCCPSs. In the future, we will focus on improving the interpretability of NNs. To be more specific, we plan to develop a method for explaining why an NN model is more (or less) robust than other models. Such a method can guide software designers in designing an NN model with an appropriate robustness level, which will greatly support safety by design.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.