Towards AI Forensics: Did the Artificial Intelligence System Do It?

Artificial intelligence (AI) makes decisions impacting our daily lives in an increasingly autonomous manner. Its actions might cause accidents, harm, or, more generally, violate regulations. Determining whether an AI caused a specific event and, if so, what triggered the AI's action, are key forensic questions. We provide a conceptualization of the problems and strategies for forensic investigation. We focus on AI that is potentially 'malicious by design' and on grey box analysis. Our evaluation using convolutional neural networks illustrates challenges and ideas for identifying malicious AI.


Introduction
Today AI is ultimately controlled by humans, but it is already capable of conducting tasks autonomously by learning from examples, which makes it superior to traditional computer programs in two ways: performance and ease of engineering. Deep learning (DL), the key AI technology and therefore the focus of this paper, has dramatically improved on prior systems, e.g., in computer vision and speech recognition. Access to AI is often simple since standard DL models can be trained with little human engineering by uploading data to a cloud platform like Google CloudML. The progress of AI technology is also driven by open-source initiatives providing both datasets and ready-to-use models. Consequently, AI provides novel opportunities for abuse: it may be easier to build or manipulate AI to perform malicious acts than to engineer a system otherwise. For instance, altering training objectives or choosing training data instead of building native software likely requires less time and leads to better-performing systems. This fosters the paradigm that a system is 'malicious by design', i.e., it has been trained, designed, or changed to exhibit malicious behavior. An example of 'malicious by design' is shown in Figure 1 and also in Figure 2, where an investigative question could be: Has the AI system been built to cause the incident?
This work provides the first directions to answer such questions, which is tricky since AI is often seen as a 'black box' that adapts through learning. Understanding AI is notoriously difficult, as witnessed by a large body of works on explainable AI (XAI). This paper contributes as follows: (i) We conceptualize 'malicious by design' and AI forensics; (ii) We provide two methods for investigation using case studies based on convolutional neural networks (CNNs). Specifically, we elaborate on grey box analysis, i.e., activation patterns of features.

Related work
One can differentiate between 'AI for digital forensics' and 'digital forensics for AI' (short: AI Forensics). In the former case, digital forensics has benefited from AI [10] in some areas such as multimedia forensics (copy-move forgery [4] or deep-fake video detection [15]) or facial age estimation [6]; a comprehensive overview is provided by [12]. While relevant, this is not the focus of our work. On the other hand, AI Forensics can be treated as a sub-discipline of digital forensics as defined by [8]. The authors provide a brief high-level conceptualization focusing on AI safety, i.e., incidents due to failures of AIs built with "good intentions". In contrast, we discuss actual forensic work for systems designed for a malicious act.
Security threats and defensive techniques from a data perspective are surveyed in [21] focusing on adversarial examples. The general question 'Can machine learning (ML) be secure?' was discussed in [9], where the authors present different types of attacks on ML techniques and possible defenses. Involving an expert, i.e., an investigator in our case, is not uncommon, e.g., for fake reviews [1,35].
Our work touches on the emerging field of reverse-engineering DL models. Weight extraction is the focus of [17,18,27]. [17,24] identify DL architectures and [25] approximates a confidential model. We identify (incident-related) data to which a model reacts. All works except [17] discuss black box models. [17] showed that even when using data encryption, architectural information and weights might be obtainable using memory access patterns. XAI and other analytics techniques [7,16,22,29,34], specifically for CNNs [36], are valuable tools for AI forensics. Many works deal with model introspection relying on fine-grained access to variables. For example, iterative optimization using full model access, i.e., gradients, can reveal inputs that maximize feature activation (see [13] for a seminal work). In this work, we elaborate on the general process for AI forensics, questions of specific interest in forensics, and grey-box analysis. We also use ideas from XAI, such as explaining by samples and analyzing layer activations [28].
Figure: Steps to conduct an attack with an AI system.

Malicious by Design
AI can be defined as the "ability to learn from data and to use those learnings to achieve specific goals" [19]. An AI system is a union of components out of which at least one contains AI (we shall use the term "AI" denoting AI systems and AI components if its meaning is clear in context). A system is "malicious by design" if trained, designed, or changed to exhibit malicious behavior, which means that an attacker alters or builds a system to be used as a tool to perform a malicious act. An AI component can trigger a malicious action or provide deceptive information, so the system is tricked into acting maliciously.
This work focuses on attacks on a subset of AI techniques [23], i.e., deep learning [31]. Manipulations for attacks to those can be based on the training data, inputs to the AI system, the AI's objective, the model, or the AI's (self-)learning mechanism. For attacks based on training data, an attacker can create or alter a dataset and use it to train or alter a model, e.g., using transfer learning. Attacks based on (adversarial) inputs might trick the AI into making incorrect predictions. An attacker aims to identify inputs being misclassified without altering a non-malicious system. The field of adversarial attacks is well-researched [2,3,14,37] and not considered in this work.
An attacker might employ (and train) an unmodified, general-purpose AI component within the malicious system. The attacker might also build, alter or trick an AI, e.g., by architectural changes of a DL model. A malicious system might originate from a non-malicious AI system by exchanging the AI with a manipulated version. For example, a drone's vision system might be altered to drop a parcel not only once it recognizes a dedicated landing zone but also once it recognizes a specific person.
Zero-day malicious AI systems are malicious systems based on substantially novel ideas. The system's design is highly innovative, or it appears at least very unusual. For example, the first application of a recurrent neural network, mainly used for sequence data such as text, to image recognition constitutes a zero-day system. Zero-day systems require expertise and effort. Non-zero-day designs constitute (variations of) standard architectures.
Generic steps in the malicious design of the AI are illustrated in Figure 2. First, an AI has to be acquired or built. Then it is tampered with, e.g., by re-training on a dataset of the attacker. Finally, the AI conducts the attack. Steps can vary, e.g., a step can involve counter forensics to obstruct the forensic investigation.

AI Forensics
Forensic investigators collect, preserve, and analyze evidence, which refers to digital information such as data, models, and (software) systems [32]. While the forensic process involving classical (traditional) software is well-understood, investigating AI systems from a forensics perspective is novel. Traits highly relevant for forensic work, such as computation flow or storage of knowledge related to behavior, differ considerably between AI systems and traditional software. AI learns features and complex behavior from collected experience. While AI can uncover features automatically from domain knowledge given training data, they have to be defined by domain experts in classical engineering. DL models typically follow a well-defined modular (layered) structure. A layer is an instance of one of relatively few layer types. In contrast, classical software with comparable functionality often comprises hand-crafted, domain-specific algorithms and data structures.
Table 1: Evidence Typology
For example, in a fully-connected network, in each layer, the input is multiplied with a weight (or parameter) matrix to which another matrix is added (aggregation) and passed through an activation function. Each entry in a weight matrix typically corresponds to a parameter of a neuron. Control flow and data access patterns are typically simpler for AI and, in the case of DL, more easily predictable than for classical software. Computation in DL is often characterized by many relatively simple units, i.e., neurons that operate in parallel in a well-defined order, and by standardized data structures.
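The fully-connected layer computation described above can be sketched in a few lines. This is a minimal illustration in plain Python, not code from our system; the function names `relu` and `dense_layer` and all numeric values are assumptions chosen for the example.

```python
def relu(z):
    """Rectified linear unit, a common activation function."""
    return max(0.0, z)


def dense_layer(x, weights, bias, activation=relu):
    """One fully-connected layer: activation(W x + b), one output per neuron."""
    outputs = []
    for neuron_weights, neuron_bias in zip(weights, bias):
        # Each row of the weight matrix holds the parameters of one neuron.
        z = sum(w * xi for w, xi in zip(neuron_weights, x)) + neuron_bias
        outputs.append(activation(z))
    return outputs


# Two inputs feeding two neurons; the values are arbitrary illustration data.
x = [1.0, 2.0]
W = [[0.5, -1.0], [1.0, 1.0]]
b = [0.0, 0.5]
hidden = dense_layer(x, W, b)
print(hidden)
```

The control flow is the same for every input: a fixed sequence of multiply, add, and activate steps. This regularity is what makes memory access patterns of DL models comparatively predictable.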
Both AI models and classical software might be complex, and difficult to understand, where complexity arises for DL primarily due to the interaction of many simple units [7] and for classical software due to complex, problem-dependent control flow.
The characteristics of AI influence the types of evidence, as well as the investigation of AI systems, which can be summarized as (i) Analyzing system internals is more tangible for AI systems due to the more tractable control flow and (ii) Training data is highly important in understanding the AI's behavior. Types of Evidence: We focus on evidence that is AI-specific: the AI system itself, any data the system was exposed to as well as its reactions. We exclude any form of direct identification information such as IP or MAC address. The evidence typology is summarized in Table 1. Training data of the model is all data that contributed to learning or updating the model parameters. In contrast, for testing and observational data, the model only computes outputs, but model internals remain unaltered. Testing data consists of inputs and outputs that the model should produce. Observational data consists of inputs and, possibly, outputs the model produced. Observational data often includes data from an incident. We differentiate between access to the model (suspect AI is accessible) or just a similar version thereof (e.g., the drone type was identified, which allows investigators to acquire an identical type). In terms of system internals, we distinguish between white-box access (source code is available), grey box (a compiled model that allows observing interactions with resources such as memory), and black box (only outputs are accessible for arbitrary inputs).

Investigative Questions
From a forensics perspective, the ultimate questions to be answered are: 'Did the AI system cause an incident?' and 'Why?' Such questions demand an understanding of AI as well as forensics, and, as for any criminal case, answers might not be a simple yes or no. For example, given a set of observed actions or decisions by the perpetrator during the incident, we need to answer 'what is the likelihood of a given AI suspect performing these decisions compared to those of other models?' and 'what triggers the action?' Often (only) circumstantial evidence might be produced by answering questions like 'is the AI suspect reacting to objects related to the incident more strongly than other models?' Given multiple AI suspects, the first question helping to prioritize which AI to investigate more closely might be: 'is the AI suspect behaving normally?' These questions are discussed as part of our case studies.

Strategies and Techniques for Investigation
Strategies can focus on each component that determines behavior, mainly model, model objective, and data. Data analysis plays a key role. Since training data determines model behavior and operational data reflect model behavior, data on its own might be sufficient to determine malicious intent. An investigator can search for samples in the training data that are malicious or abnormal for any non-malicious system. For instance, the training data for a self-driving car should not contain images showing a human on a street crossing with the label "accelerate". A large amount of data calls for data mining techniques, e.g., filtering relevant data. Technical challenges to determine the relevance of data include quantifying relatedness to the incident. Domain experts might be needed to investigate (filtered) data and ultimately decide whether an AI suspect caused the incident and what information triggered the action.
In the absence of model access, training data can be used to reconstruct an 'approximate model' used for analysis.
Model analysis might use abstract reasoning based on model definitions (e.g., as found in static software verification). Models could also be analyzed through empirical investigation (input-responses) in two ways:
1. Investigate the input-output relationship of a model: The model can be treated as a black box. The analysis relies on investigating model behavior based on its decisions. For example, data from the incident can be used to see if the system triggers actions causing the incident. An investigator might also aim to generate or search for samples that trigger malicious actions.
2. Investigate the reaction of model internals to inputs: This strategy requires white or grey-box access. For example, DL activations of neurons can be investigated.
AI forensics techniques still need to be developed. Techniques from emerging areas in AI such as reverse engineering [17,18,27], explainability (XAI) [7,30], adversarial analysis [11], testing [37] and data mining play an integral part in the investigation. Additional algorithmic work is possibly needed since none of these areas is yet mature. Furthermore, AI forensics comes with its particularities, as shown in our case studies. For example, XAI does not cover analysis of grey-box models or the possibility that an investigator does not have access to any samples that cause malicious behavior but only to samples that share similarities with such samples. Testing aims primarily at verifying given requirements or specifications, whereas AI forensics seeks to uncover requirements and specifications built into an AI (by an attacker). Reverse engineering aims at turning a black-box model into a white-box model, e.g., using memory access patterns, but it does not answer any investigative question.

Grey Box Analysis
We focus on grey box analysis for two reasons: (i) It is particularly interesting for forensics since often an investigator has access to a system but not to its source code, and (ii) 'grey boxes' are not well-studied, e.g., techniques for XAI either assume black-box access or white-box access. We focus on data-driven attacks as they are appealing from an attacker's perspective: They can leverage the AI's strength to learn from samples rather than being forced to explicitly code "malicious behavior".
Grey box access to system internals: For the grey box model, activations of neurons can be obtained for multiple layers at once for a given input. We assume a coarse understanding of the model, i.e., it is a layered architecture consisting of common layers like convolutional, relu, and batchnorm layers. We assume that we can interpret memory cells as values. This might be possible since memory cells typically only hold a few standard data types, such as float32 and float64. We assume that the memory location of inputs and outputs of a neuron, i.e., a feature, remains fixed. Lastly, memory cells can be assigned to a lower or upper layer. This is reasonable since data is processed using a fixed set of operations in a fixed order (in DL one layer after the other is evaluated). To access outputs of layer i for input X, we access a set of memory cells M, yielding a set of activation values M(X). The number of values |M| is the same for each X, and the values are indexed as M_j(X) with j ∈ [0, |M| − 1]. These values appear to be a random permutation of outputs of three consecutive layers, including layer i, which is of interest. Memory access patterns have also been used in reverse engineering of DL [17].
Access to training and test data: We distinguish three datasets: the unknown and non-accessible labeled dataset T = {T_i} used by the attacker to train the malicious drone, the unlabeled dataset U = {U_i}, and the labeled dataset L = {L_i}. A label gives the output a network should produce. T_i, U_i, and L_i denote sets of samples, so that all samples in a set share some commonalities. In particular, for the labeled dataset L, the set L_Y corresponds to all samples of class Y, i.e., it consists of images with the corresponding outputs of a non-malicious system. The dataset L might be public and used to develop and test a drone, e.g., for industrial or research purposes. The set U comprises inputs without labels, e.g., images related to the incident. At least one of U and L is available for forensic investigation. The forensic investigator can label inputs, i.e., they can determine what output an input should trigger. But human labeling is costly.
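To make the grey-box access assumption concrete, the following sketch simulates what an investigator reads from memory: the outputs of several consecutive layers in a fixed but unknown order. It is a toy simulation under our stated assumptions, not an actual memory-inspection tool; `make_grey_box_reader` and the toy layers are hypothetical names introduced for illustration.

```python
import random


def make_grey_box_reader(layers, seed=0):
    """Simulate reading the memory cells M of a layered model.

    `layers` is a list of functions, each mapping a list of floats to a list
    of floats. The returned reader yields the outputs of all layers in one
    fixed order that is unknown to the investigator.
    """
    rng = random.Random(seed)
    permutation = None  # fixed on the first read, then reused for every input

    def read(x):
        nonlocal permutation
        values = []
        for layer in layers:
            x = layer(x)
            values.extend(x)
        if permutation is None:
            permutation = list(range(len(values)))
            rng.shuffle(permutation)
        # Cell j always holds the output of the same neuron.
        return [values[j] for j in permutation]

    return read


# Three toy "layers" standing in for consecutive layers of a network.
toy_layers = [
    lambda v: [2 * a for a in v],
    lambda v: [a + 1 for a in v],
    lambda v: [sum(v)],
]
read_cells = make_grey_box_reader(toy_layers)
print(read_cells([1, 2]))
```

Because the permutation is fixed, M_j(X) refers to the same neuron for every input X, which is the property the later analysis relies on.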

Scenarios
We provide three cases. Two cases are more specific and illustrative, while the third is more general. We selected two illustrative, exemplary scenarios (summarized in Table 2 using our prior conceptualization), i.e., two image recognition systems that are part of drones, where CNNs are commonly used [5,20]. The goal is to assess if a system is likely involved in the incident and manipulated. Models are assumed to be based on a standard, layered CNN architecture, but details are unknown to the investigator. We assume manipulation through training data and grey box access to the suspect system.
The choice of drones is for illustrative purposes only, since drones are an easily accessible consumer product, well-known to a wide audience. However, similar image recognition technologies are also used in many other systems, such as self-driving cars, healthcare, or industrial manufacturing. Drones also allow for many potential malicious scenarios (see the right panel in Figure 1). They are frequently used to capture sports events where incidents have already happened, as depicted in the left panel of Figure 1. Soccer is one of the most prominent sports. Soccer games have been subject to numerous bribery scandals from professional to amateur leagues in multiple countries such as Germany, and assaults on players have occurred with fatal consequences. Thus, malicious acts have already taken place.
The third case presents a different approach to identifying if a classifier was trained to identify specific concepts related to a class and if the investigative data covers all concepts of the actual training data. It seeks to answer these questions in a more general framing, also evaluated with two classifiers and two datasets. However, this case is less graspable and concrete.

Case 1: Drone Sports Filming
Drones frequently capture sports events where incidents have already happened (see Figure 1). Our case is based on actual drone footage, where a non-malicious, human-controlled drone moves along the sideline. We assume an autonomous drone with a simplified architecture shown in Figure 3. A non-malicious drone tracks the referee in the center of the camera. The referee is supposedly near the main action of the game. A maliciously designed drone attacks a specific player, recognized by her jersey number, by dropping on her once she is close to the sideline to make it look like an accident. An image from the drone's camera is processed as follows: A standard object detector provides a rough categorization and bounding boxes for objects on the image. We used Faster-RCNN [26] trained on the COCO dataset. Objects from the category "person" are further classified using a custom CNN designed by an attacker. The custom CNN is fed one (sub)image showing a detected object after the other. (More design options and information on data are in the Appendix.)
Forensic Goals and Evidence: The investigator wants to understand the custom CNN: "Is the drone's vision system detecting non-expected, incident-related classes?". Analyzing non-AI-based control logic is not part of this study. The investigator knows that the system tracks the referee or assumes so based on statements from witnesses. They also know that the victim should not be tracked. The investigator has access only to an unlabeled dataset U made of similar footage as captured by the drone camera during the incident. It might be gathered by asking players to reenact the situation of the incident or, more easily, at the next game between the teams. The investigator correctly assumes that an object detector followed by a CNN has been employed. Since the CNN is difficult to separate from the remaining system, one does not know what and how many classes the custom CNN outputs. Also, disguise layers (Figure 3) might be used and evaluated last, so that the layers computed last might not be the actual output. Thus, the investigator can neither directly provide inputs to the custom CNN, nor obtain its outputs, nor directly access the detected objects O identified by the object detector. The investigator can feed arbitrary raw images X_R to the system, e.g., replacing the drone's camera. For a raw input X_R, one obtains for each detected object o ∈ O and layer i of the CNN a superset of activations M_i(o), including those of layer i.

Forensic Investigation:
The strategy is to determine if and what characteristics the system encodes that are specific to the incident but not to normal operation. For instance, if the drone crashes on a player, characteristics such as her face or jersey number are of interest. The procedure has four steps:
1) Identifying detected objects and getting their activations: For a single raw image X_R, the system detects potentially many objects O. The investigator obtains a set S(X_R) = {M_i(o) | o ∈ O} of activations without a mapping to objects on the image X_R. Investigators can run their own standard object detector on the image X_R to identify all relevant objects and potentially more. They get objects O′ ⊇ O. They replace one detected object o′ ∈ O′ in X_R to get image X′_R, e.g., erasing it using image in-painting. If activations S(X′_R) remain unchanged due to the removal of object o′, i.e., S(X′_R) = S(X_R), the object is not detected. Otherwise, the object is detected by the system, and they obtain its activations.
2) Reduce activations through clustering: The number of activations to investigate is large. All locations are treated as independent for a feature map since we do not know the beginning, end, and layout of a feature map in memory.
To reduce the number of activations, we group them into clusters G and investigate a representative m_l for each cluster G_l ∈ G. This is justified since nearby locations (such as pixels in the inputs) are highly correlated. Thus, clustering helps remove redundancy. We cluster using k-Means into k = |G| = 500 clusters. The number of clusters k is a parameter that trades off time to investigate and the risk of misjudging a malicious AI as non-suspicious. To cluster, for each m ∈ M, we compute activations of all samples U and use the resulting concatenated vector of size |U| as a datapoint for clustering. For each cluster G_l, the representative m_l is the point closest to the center of G_l. That is, m_l corresponds to all activations of samples in U of a specific location of a feature map. Pseudocode for steps 2) and 3) is shown in Figure 4.
3) Computing top activating samples per cluster representative: For each cluster G_l ∈ G, we compute the top activating samples T_l ⊂ U, e.g., n = |T_l| = 6. The n-top activating samples T_l are those inputs that yield the n largest entries, i.e., activations, in the representative m_l. For each G_l, the samples T_l are then presented to the investigator for visual inspection.

4) Manual investigation:
An investigator assesses for each group G_l whether all top images T_l share one or more characteristics relevant to the incident. The challenge is that samples might be grouped due to other irrelevant shared characteristics. Using a larger number of top samples n = |T_l| increases the confidence in the analysis, i.e., increases true positives at the expense of false negatives and more time to investigate. For example, assume a feature activates for the color "orange". The most activating samples could be 3 samples, all showing the same player. Suppose more samples |T_l| are used for investigation. In that case, the likelihood decreases that the same player is shown for a feature if the feature is similarly prevalent for other players' images, i.e., if pictures of other players also contain orange. Visual inspection is fast, i.e., one can mostly assess within seconds for samples T_l whether images share one or more characteristics and, if so, to what extent they are relevant. The question of what exactly to look for is incident-specific. Implications can be case-dependent, e.g., whether the findings serve as proof or merely as circumstantial evidence.
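Steps 2) and 3) of the procedure can be sketched as follows. This is a minimal, standard-library illustration under our assumptions, not the exact implementation used in the evaluation: it uses a plain Lloyd's k-means with random initialization, and the names `kmeans`, `representatives`, and `top_activating` as well as the toy activations are hypothetical. Each datapoint is one memory location m, described by its activations over all samples in U.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda c: sq_dist(point, centers[c]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (cluster index per point, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        assignment = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster runs empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centers) for p in points], centers


def representatives(points, assignment, centers):
    """For each non-empty cluster, the index of the member closest to its center."""
    reps = {}
    for c, center in enumerate(centers):
        members = [j for j, a in enumerate(assignment) if a == c]
        if members:
            reps[c] = min(members, key=lambda j: sq_dist(points[j], center))
    return reps


def top_activating(points, rep_index, n):
    """Indices of the n samples with the largest activation at one location."""
    row = points[rep_index]
    return sorted(range(len(row)), key=lambda s: row[s], reverse=True)[:n]


# Rows: memory locations m; columns: activations for samples in U (toy values).
activations = [
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.9],
]
assignment, centers = kmeans(activations, k=2)
for cluster, rep in representatives(activations, assignment, centers).items():
    print(cluster, rep, top_activating(activations, rep, n=2))
```

In the toy data, the first two locations respond to the same samples and are therefore redundant; clustering leaves the investigator with one representative per group instead of every location.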

Case 2: Drone Parcel Delivery
A (non-tampered) drone is supposed to deliver parcels and drop them onto a dedicated area like a helipad. The outputs of the vision system directly trigger actions. The attacker wants to drop off the parcel on a person, i.e., action A should be triggered given the image of a specific person's face. Other actions are not of interest for manipulation. The tampered drone performs these actions similarly to a non-tampered drone. The drone is trained with the unknown dataset T, which shares similarities with the public dataset L. It is of interest to the attacker to choose training data T similar to L so that the drone behavior appears "typical". Therefore, if not stated differently, we assume that training samples T_i triggering action i bear similarity to those in the public training data L_i. We discuss three scenarios for choosing the training data T:
i) No Tampering (NT): Baseline that has not been tampered with. It is trained using a dataset that contains no attacker-chosen samples.
ii) Replacement Tampering (RT): The training data triggering action A has been replaced by the attacker. That is, the (malicious) action A is triggered for samples S_A, i.e., T^RT_A = S_A. Samples S_A differ, though possibly in a subtle manner, from those in the public samples L_A that trigger A for appropriate inputs in contexts where action A is non-malicious.
iii) Enhancement Tampering (ET): The attacker uses a dataset T^ET, where action A is triggered for both the samples S_O and the attacker-chosen samples S_A, i.e., T^ET_A = S_O ∪ S_A. This ensures that the AI reacts to the targets that should trigger an action for a non-tampered drone as well as to samples that are needed for malicious behavior chosen by the attacker. That is, the set S_O is similar to those a normal non-tampered drone would be trained on to trigger action A, i.e., S_O is similar to L_A. The motivation is to hamper forensics.
Forensic Goals and Evidence: The investigator has access to the public data L and a set U of unlabeled items. The latter consists of sets of related items, i.e., U := {ζ_i}, where each category ζ_i is composed of a set of related samples. Relatedness refers to some form of visual similarity. For example, sets of samples are available for many real-world categories, such as humans, cars, etc. Sets might consist of images related to the incident, like images of the victim or any other images that the drone might encounter before or during the incident. The investigative goal is to decide if a suspicious AI system is subject to NT, RT, or ET.
Investigation: Abnormal AI? RT? We first assess whether the suspect system deviates from expected or normal behavior. The aim is to detect replacement tampering (RT), where training data triggering a specific action has been replaced. In this case, the system is not processing inputs like a non-tampered system. RT implies a large error err(L_A) for samples L_A of some class A from the public training data compared to a baseline obtained from the public training data.
Investigation: NT or ET? An attacker anticipating an investigation might counteract by enhancement tampering (ET). Thus, the system treats a set of pre-defined inputs according to the specification, but it acts maliciously for other unknown inputs related to the incident. Given a large set of unlabeled samples U, we aim to identify and analyze those inputs related to malicious behavior S_A. That is, input samples U can be partitioned into three disjoint sets ζ_O, ζ_A, and ζ_R. The sets are illustrated in Figure 5. The categorization into sets {ζ_O, ζ_A, ζ_R} is unknown to the investigator, particularly the set ζ_A. Thus, in its most simplified form, the investigator's goal is to identify ζ_A out of three unlabeled sets in U. We do so by identifying ζ_R first and then distinguishing ζ_O from ζ_A (and ζ_R), leaving us with the desired attacker set ζ_A.
We assume that ζ_O is similar to the public data L_A that triggers action A in appropriate contexts. The method consists of two steps:
i) Identify ζ_R using misclassification analysis: We first eliminate samples ζ_R that do not trigger A. We use the rank Rk(ζ, A) indicating the position of the class A, if classes are sorted depending on the fraction of samples in ζ classified as A. ζ_R should have a significantly higher rank than ζ_O and ζ_A. Computation of Rk(ζ, A) is as follows: (1) Compute the class of each sample in ζ; (2) For each class j, count how many samples N_j were predicted to be of class j; (3) Sort classes using count N_j in descending order. The rank Rk(ζ, A) is the position of class A in that ordering.
ii) Separate ζ_O and ζ_A using feature-based analysis: So far, we have assumed that we have unlabelled data triggering action A. The unlabelled set ζ_A ∈ U might bear some similarity with objects S_A, but the similarity might be insufficient to commonly trigger a misclassification. For example, a malicious system might trigger an action if an image shows the victim and no potential witnesses, i.e., no other people. However, if the set ζ_A available to the investigator as part of U only shows the victim with other people, action A will never be triggered. We propose a feature-based analysis that is more fine-grained than considering inputs as a whole. It allows identifying features associated with action A even if the samples with such features do not trigger action A. A feature-level analysis requires investigating model internals, i.e., activations of layers M. We identify characteristics/features that are relevant to L_A, i.e., features F_{L_A} that activate more often for samples from L_A (and likely also for samples S_O) than for other samples. Say samples from ζ activate features F_ζ. If there are features that are only in F_ζ but not in F_{L_A}, then F_ζ \ F_{L_A} is non-empty. This indicates that samples ζ contain images that exhibit characteristics associated with action A that are not found in the public training data L_A for A. We say that a feature m ∈ M is activated for a set ζ if the mean of all activations of set ζ is larger than the mean of all activations of the entire data L plus the standard deviation.
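Both steps can be sketched with the Python standard library. The sketch is illustrative only: the function names are hypothetical, the choice to rank a class that is never predicted behind all observed classes is our assumption, and so is the use of the population standard deviation, since the text does not fix these details.

```python
from collections import Counter
from statistics import mean, pstdev


def rank_of_class(predicted_classes, target_class):
    """Rk(zeta, A): 1-based position of class A when classes are sorted by
    how many samples of the set were predicted as that class."""
    counts = Counter(predicted_classes)
    ordering = [cls for cls, _ in counts.most_common()]
    if target_class not in ordering:
        # Assumption: a class that is never predicted ranks last.
        return len(ordering) + 1
    return ordering.index(target_class) + 1


def activated_features(set_activations, reference_activations):
    """Indices of features whose mean activation on the set exceeds the
    mean plus standard deviation over the reference data L."""
    active = set()
    for m in range(len(reference_activations[0])):
        reference = [row[m] for row in reference_activations]
        threshold = mean(reference) + pstdev(reference)
        if mean(row[m] for row in set_activations) > threshold:
            active.add(m)
    return active


# Toy data: rows are samples, columns are features (memory locations).
L_all = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
L_A = [[0.5, 0.0], [0.5, 1.0]]
zeta = [[0.2, 1.0], [0.4, 1.0]]

# Features activated by zeta but not by the public samples of class A.
suspicious = activated_features(zeta, L_all) - activated_features(L_A, L_all)
print(rank_of_class(["A", "B", "A", "C", "B", "A"], "A"), suspicious)
```

A non-empty difference corresponds to F_ζ \ F_{L_A} being non-empty, i.e., the set ζ activates features that the public data for action A does not.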

Evaluation
We discuss each case study separately following the same structure: We provide details of the setup followed by results and a discussion.

Evaluation Case 1
We used PyTorch's pre-trained Faster-RCNN [26], trained on the COCO dataset. We trained the custom CNN, a VGG [33] variant (Table 3), on our labeled dataset for 100 epochs with data augmentation (rotation, random crop, and horizontal flipping) and an L2 weight regularization factor of 0.003. The batch size was 64. We used stochastic gradient descent with learning rate decay starting from 0.1, decaying by a factor of 10 after 50 and 80 epochs.
40% of the data (without labels) constitute the unlabeled dataset used by the investigator. Labels of this 40% were used to compute evaluation metrics. The dataset with the number of samples per class is shown in Figure 3. We considered two scenarios, each with different datasets, i.e., (T, U) and (T′, U′). Both T and T′ (as well as U and U′) had the same images but differed in their labeling. The dataset RefTrack for non-malicious networks consists of two classes (referee and any other object). RefTrack corresponds to our described scenario. For the second dataset, AllTrack, attack objects and other detection objects are very similar. It consists of nine classes (all classes except the jersey number of the subject to attack). A malicious dataset has one more output class than a non-malicious dataset. The extra class is one of the five jersey numbers, i.e., 10, 11, 90, 4, or 20. The AllTrack dataset is designed to be more difficult for forensic investigation.
For non-malicious networks, the images of attack classes were assigned to either #homeAway or #otherAway. We trained with image augmentation and class-weighing, yielding test accuracies consistently above 90%. The confusion matrix showed that all classes were learned well. For each hyperparameter setting, we trained three models.
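The learning rate schedule stated above (start at 0.1, divide by 10 after 50 and after 80 epochs) can be written as a small function. This is a sketch under assumptions: the name `step_decay_lr` is hypothetical, epochs are counted from 0, and "after 50 epochs" is read as "from epoch index 50 onward". The text does not say which scheduler implementation was used.

```python
def step_decay_lr(epoch, base_lr=0.1, milestones=(50, 80), factor=10.0):
    """Learning rate for a 0-indexed epoch under step decay.

    The rate is divided by `factor` once for every milestone that has
    already been reached (assumption: a milestone m applies to epoch >= m).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)
```

In PyTorch, a schedule of this shape is commonly expressed with `torch.optim.lr_scheduler.MultiStepLR` using `milestones=[50, 80]` and `gamma=0.1`.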
For quantitative assessment, we introduce a measure denoted feature consistency F c . It captures to what extent a feature is associated with a suspicious concept (or class) only. We expect that malicious networks exhibit more features with high feature consistency for suspicious concepts than non-malicious networks. Feature consistency F c is computed on representatives m j ∈ G j as the ratio of the number of samples S ⊆ T j that activate a particular feature or concept out of all samples T j . It is expressed as a percentage, i.e., $F_c(m_j) := 100 \cdot |S| / |T_j|$. A suspicious concept is any of the jersey numbers 10, 11, 4, 90, and 20 (see classes in Figure 3). Figure 4 illustrates the most activating test samples for three features from G j and their feature consistencies F c . Thus, if feature consistency is large for multiple groups G j , the network is likely sensitive to the suspicious concept and, thereby, possibly malicious.

Results: A qualitative assessment can be done using Figure 6 for the dataset RefTrack. The figure shows the most activating, i.e., top n = 6 samples, for a subset of all 500 cluster representatives for a malicious and a non-malicious network trained with jersey number "10 home" as attack samples. Each row contains 6 samples T j for a specific m j . An investigator checks whether each set of samples T j consistently shows the same suspicious characteristic, e.g., a jersey number. Some sets T j in Figure 6 for the malicious network indeed show consistent features that resemble the attack class, i.e., images with jersey number 10. In Figure 6 for the non-malicious network, no such sets can be identified.
For quantitative assessment, Table 4 shows the average feature consistencies across all attack classes. An investigator aims to distinguish malicious from non-malicious networks. Malicious networks show higher feature consistencies for attack classes than non-malicious networks. Thus, they can be separated. If an investigator looks at the top activating samples for the groups G j identified by our algorithm and illustrated in Figures 4 and 6, they will more often find groups where all samples share a malicious concept for malicious networks. The effectiveness depends on the chosen layer and the similarity of attack objects and non-attack objects. For a non-malicious network, feature consistency is 0 for all layers, indicating that no features exist corresponding to attack objects, i.e., suspicious jersey numbers. This holds although the training data consisted of objects of the incident (showing jersey numbers). But the suspicious concepts, i.e., jersey numbers, were not needed or beneficial for the task at hand, i.e., distinguishing between the referee and players. Thus, an investigator will not detect a suspicious concept in a group stemming from the non-malicious network. In contrast, malicious network results depend on the layer. There are no groups with a high feature consistency for lower layers. For other layers, a few percent of all features have a feature consistency of 1, indicating that the network learns features related to the incident. The top layers encode semantically rich information, whereas the lower layers encode more basic, generic information common to many classes. Thus, the outcome is aligned with existing knowledge on DL. An investigator looking into all groups will detect such a group. One can be confident that the network was specifically trained to detect a suspicious concept.

Table 4: Feature consistency F c for non-malicious and malicious nets
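The feature consistency measure defined above reduces to a simple ratio once the investigator has labeled the top activating samples of a group. The following sketch assumes that each group is given as a list of concept labels for its top samples T j ; the function names and the toy labels are hypothetical and only serve as illustration.

```python
def feature_consistency(top_sample_labels, concept):
    """F_c in percent: share of the top activating samples T_j of one
    representative m_j that show the given (suspicious) concept."""
    if not top_sample_labels:
        raise ValueError("T_j must contain at least one sample")
    hits = sum(1 for label in top_sample_labels if label == concept)
    return 100.0 * hits / len(top_sample_labels)


def count_consistent_groups(groups, concept, threshold=100.0):
    """Number of groups whose feature consistency reaches the threshold.
    Assumption: the threshold is a free choice of the investigator."""
    return sum(
        1 for labels in groups
        if feature_consistency(labels, concept) >= threshold
    )


# Made-up example: two groups with n = 6 top samples each.
GROUPS = [
    ["10", "10", "10", "10", "10", "10"],
    ["10", "referee", "10", "other", "10", "10"],
]
```

A network with many groups counted by `count_consistent_groups` for a suspicious jersey number would, following the argument above, warrant closer inspection.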

Results for the second dataset for Case 1
Feature consistencies are stated in Table 4. In the paper we discussed results for RefTrack, showing that malicious networks are easy to identify. The situation is more intricate for the dataset AllTrack, where attack classes and non-attack classes bear more similarities. More precisely, similar concepts are used both to distinguish among classes of non-attack objects and to distinguish attack from non-attack objects. In our case, jersey numbers are needed to distinguish among non-attack classes, and a jersey number (though a different one) is needed to identify the attack class. Such an overlap raises challenges, i.e., we observe that the non-malicious networks also contain some features with large feature consistencies. The malicious network still contains significantly more (a t-test gave a p-value < 0.001). Malicious and non-malicious networks must discriminate between very similar classes, i.e., both have multiple classes focusing on jersey numbers. Thus, the investigation becomes more difficult if the attack objects are very similar to the classes that the network should detect. It is not sufficient to detect a group where all top samples share a suspicious concept. An investigator must compare the outcome of a potentially "malicious" system to an adequate reference, e.g., to a non-malicious system. With such a reference, our method allows distinguishing malicious from non-malicious networks. Investigators can use the proposed method to identify groups from unlabeled data. They can compare the top samples and identify shared concepts relevant to the incident, i.e., if top samples share a suspicious concept, the network is likely malicious (see Figure 6). While the method is widely applicable, as a limitation, it requires data related to the incident and manual work.
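The comparison against a reference can be made concrete with a two-sample test on the number of highly consistent features per network. The text does not state which variant of the t-test was used; the sketch below assumes Welch's unequal-variance form, and the counts are made-up numbers for illustration, not results from Table 4. Only the test statistic and the degrees of freedom are computed; obtaining a p-value additionally requires the t-distribution, e.g., from a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.
    Requires at least two values per sample and non-zero total variance."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # squared standard error of mean A
    vb = variance(sample_b) / nb  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical counts of high-consistency features in three malicious
# and three non-malicious (reference) networks.
MALICIOUS = [12, 15, 14]
REFERENCE = [3, 5, 4]
```

A large positive statistic indicates that the suspect networks contain more features tied to the suspicious concept than the reference networks do.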

Discussion Case 1
On the technical side, the evaluation showed that malicious networks can be identified using visual investigation if the features required to distinguish among non-attack objects differ from those required to distinguish attack objects from non-attack objects. For the dataset RefTrack, the non-malicious network did not learn any features related to jersey numbers, although jersey numbers were shown in the training data and would, in principle, have been helpful to distinguish the referee from the players. In contrast, for the dataset AllTrack, the non-malicious network learned such features since it had to distinguish between different jersey numbers. In the latter case, simply identifying a shared suspicious concept is not sufficient. Once a concept is identified, it must be shown that it is more frequent than in a non-malicious reference. This increases the investigator's effort since a reference is not readily available. An investigator might follow our evaluative process, including data labeling, training a malicious and a non-malicious classifier (serving as reference), and comparing feature consistencies between networks as in Table 4. Furthermore, it is not necessarily evident what constitutes a suspicious concept. The top samples of a group might share multiple similarities. It is up to the investigator to identify those relevant to the incident. Thus, the investigator's judgment remains highly relevant.

Evaluation Case 2
We trained a VGG [33] variant (Table 5) for 30 epochs without data augmentation, with an L2 weight regularization factor of 0.0005 and a batch size of 128, leading to test accuracies of about 45%. We used stochastic gradient descent with learning rate decay starting from 0.12, decaying by a factor of 10 after epochs 20 and 30. For k-Means we used Python's sklearn library. Note that better-performing models would be to our advantage, i.e., to that of an investigator. Low accuracy might be used as a means of disguise: low accuracy relates to noisy classifications, which make the differences between models (and features) used for investigation less pronounced.
We trained on the Cifar-100 dataset. For N T , RT , and ET , we choose distinct sets of samples S O and S A , each corresponding to those of a random class. Cifar-100 consists of 100 classes summarized by 20 more general categories, e.g., 'people' contains baby, boy, girl, man, and woman. For the three classes U := {ζ O , ζ A , ζ R }, we use for ζ A samples of a class from the same category as S A , for ζ O samples of a class from the same category as S O , and for ζ R samples of a class not in any category of S O or S A . We train the networks without the three classes U; they are used for evaluation. We train 30 networks for each of N T , RT , and ET . For each network, we randomly choose the sets U.

Results: Abnormal AI? RT? The overall error err(L) in Table 6 is similar for N T , RT , and ET , indicating that manipulations do not distort overall performance.

Results: NT or ET? Identify ζ R using misclassification analysis: Table 7 shows the output for assessing unlabeled samples U. As expected, samples from ζ O and ζ A have a lower rank and lower error than random. However, the rank of the random class seems fairly low, i.e., under the assumption that on average a random class should have the mean rank of all classes, which is 48.5 (we trained using 97 classes). The reason is that class A is trained with

Table 7: Median rank for action A for ζ ∈ U; mean number of features in intersection with L A
0±6.61 12.0±17.15 4.0±9.58 2.14 0.10
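The choice of the three held-out classes can be sketched as follows. The mapping below is only an excerpt of three of the 20 Cifar-100 categories, and the helper names `category_of` and `choose_held_out` are hypothetical. The sketch further assumes that the held-out classes must differ from the classes of S O and S A and from each other, which the setup implies but does not spell out.

```python
import random

# Excerpt of the Cifar-100 category structure (3 of 20 categories).
SUPERCLASSES = {
    "people": ["baby", "boy", "girl", "man", "woman"],
    "flowers": ["orchid", "poppy", "rose", "sunflower", "tulip"],
    "trees": ["maple_tree", "oak_tree", "palm_tree", "pine_tree",
              "willow_tree"],
}


def category_of(cls):
    """Return the category that contains the given class name."""
    for category, members in SUPERCLASSES.items():
        if cls in members:
            return category
    raise KeyError(cls)


def choose_held_out(s_o, s_a, rng):
    """Pick class names (zeta_O, zeta_A, zeta_R) for the unlabeled sets U.

    zeta_O shares a category with S_O, zeta_A shares a category with S_A,
    and zeta_R comes from a category of neither S_O nor S_A.
    """
    cat_o, cat_a = category_of(s_o), category_of(s_a)
    zeta_o = rng.choice(
        [c for c in SUPERCLASSES[cat_o] if c not in (s_o, s_a)])
    zeta_a = rng.choice(
        [c for c in SUPERCLASSES[cat_a] if c not in (s_o, s_a, zeta_o)])
    unrelated = [c for cat, members in SUPERCLASSES.items()
                 if cat not in (cat_o, cat_a) for c in members]
    zeta_r = rng.choice(unrelated)
    return zeta_o, zeta_a, zeta_r
```

Removing the three returned classes from the 100 classes leaves the 97 training classes mentioned in the results.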

Discussion Case 2
Our evaluation highlighted that an attacker could easily associate an output with different stimuli, e.g., in our scenario, an attacker can trigger dropping a parcel on a person rather than on a heliport. However, this can also be identified. In practice, the evaluation is more intricate since there are likely more than three sets of samples. However, they all fall into one of the three considered types. Thus, this should not hinder the application of our method. An adversary has multiple means for counter forensics, possibly at the price of making the attacks "less targeted". For example, the tampered system might be trained only briefly or with very few samples of the attack class. This could cause the system to only occasionally trigger the malicious action, although the attacker would want it to always trigger in this situation. But, in turn, detection is much more difficult, i.e., notable differences between the attacker dataset and other sets become more subtle.

Case 3: Generalized Setting
Forensic Goals and Evidence: The aim is to answer two general questions: i) Was the network trained to detect a specific set of samples sharing similarities, e.g., constituting a class? ii) Does the investigative data contain all classes used for training a classifier? We assume that the investigator has access to data D ⊆ (U ∩ L). In particular, D contains some sets that are similar or even identical to some sets used for training the model T . The two questions are illustrated in Figure 7.
For Question (i), the investigator wants to assess if a network was trained to identify samples D ∈ D as a class, i.e., if the network was trained specifically to treat samples D as belonging to one class Y . For example, a drone might be trained to identify a heliport for dropping cargo and a general class constituting obstacles. Still, it might not be trained specifically to identify humans, let alone specific individuals. Knowing whether the network recognizes a specific person can help judge a network as suspicious. In the prior section, we employed a semi-automatic approach that identified samples leading to the strongest activations for each neuron. Here, we present another methodology that relies on assessing mean activations of samples of a set.

Question (ii) is motivated by the observation that an investigator certainly does not want to miss relevant classes in her investigative data. If classes are missing, it is not fully known what the network reacts to, and the network cannot be reliably judged as being (non-)malicious.

Investigation: A network learns features that allow discrimination among classes. While lower layers encode characteristics that are often common to many classes, upper layers contain more specific features that are typically characteristic of one (or at most a few) classes. A feature j ∈ [0, |M | − 1] is characteristic of a set D if activations M j (X) for X ∈ D are frequently considerably larger than those of other sets D ′ . To capture this intuition, we compute averages and compare them. We define the average activation for samples D and feature j as

$M_j(D) := \frac{1}{|D|} \sum_{X \in D} M_j(X)$

and denote the corresponding standard deviation by s. To address Question (i), let D all := ∪ S∈D S be the union of all samples used for investigation. We say a feature j is characteristic for samples D if its average activation M j (D) is sufficiently larger than the average activation over D all , where the required margin is controlled by a fixed constant c 1 ; we use c 1 = 1.5.
We define the number of characteristic features N (D) as the number of features j that are characteristic for D. A set of samples D was not used for training if it has fewer characteristic features N (D) than sets D ′ that are known to be used for training (or are very similar to such sets), i.e., fewer than N (D ′ ). Let $\overline{N}(\mathcal{D})$ denote the average number of activated (characteristic) features over the sets in the investigative data $\mathcal{D}$ and $sd(N(\mathcal{D}))$ the corresponding standard deviation. We say that samples D have not been used for training if

$N(D) + c_2 \cdot sd(N(\mathcal{D})) < \overline{N}(\mathcal{D})$,

where c 2 is a constant; we used c 2 = 1.5. We discuss finding the thresholds in our evaluation.
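The decision rule for Question (i) can be sketched as below. Because the exact condition for a characteristic feature is not legible in this version of the text, the sketch assumes one plausible form: feature j is characteristic for D if its mean activation on D exceeds the mean over all investigative samples plus c 1 times their standard deviation. This condition, the function names, the use of the population standard deviation, and the toy activations are all assumptions.

```python
from statistics import mean, pstdev


def characteristic_count(acts_by_set, name, c1=1.5):
    """N(D): number of features characteristic for the set `name`.

    Assumed condition: mean_j(D) > mean_j(D_all) + c1 * std_j(D_all).
    `acts_by_set` maps a set name to a list of activation vectors.
    """
    all_rows = [row for rows in acts_by_set.values() for row in rows]
    count = 0
    for j in range(len(all_rows[0])):
        col_all = [row[j] for row in all_rows]
        col_set = [row[j] for row in acts_by_set[name]]
        if mean(col_set) > mean(col_all) + c1 * pstdev(col_all):
            count += 1
    return count


def untrained_sets(acts_by_set, c1=1.5, c2=1.5):
    """Sets D with N(D) + c2 * sd(N) < average N over all sets."""
    counts = {n: characteristic_count(acts_by_set, n, c1)
              for n in acts_by_set}
    values = list(counts.values())
    avg, sd = mean(values), pstdev(values)
    return [n for n, c in counts.items() if c + c2 * sd < avg]


# Toy activations (made up): three sets each own one feature; the
# fourth set activates none of them strongly.
TOY = {
    "shirt": [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    "bag": [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    "boot": [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    "left_out": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
}
```

In the toy data, only `left_out` falls below the average count by more than c 2 standard deviations, so it is the only set reported as not used for training. Note that with c 2 = 1.5 a single outlier can only be flagged when at least four sets are compared.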
To address Question (ii), the idea is to assess if the network contains features that never strongly activate for any of the sets of the investigative data. This question is more challenging than Question (i): average activations of two features can vary strongly, even when using the actual training data T , because some features (or patterns) tend to be more common among samples in general than others. Thus, detection tends to be more brittle. We aim to identify features that are not activated for any of the sets in the investigative data. We compute the maximum mean activation of a feature across all sets D in the investigative data $\mathcal{D}$, i.e., $M_{j,\max}(\mathcal{D}) = \max_{D \in \mathcal{D}} M_j(D)$. We consider a feature as non-activated (for all sets) if this maximum is below a threshold c 3 . We assessed two values of c 3 , i.e., 0.025 and 0.05. We discuss finding the thresholds in the evaluation. The number of non-activated features for the investigative data is

$NA(\mathcal{D}) := |\{ j : M_{j,\max}(\mathcal{D}) < c_3 \}|$.

We say that the investigative data $\mathcal{D}$ misses at least one set D ∈ T if $NA(\mathcal{D}) > c_4$, where c 4 = 3 is a threshold.
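The completeness check for Question (ii) follows directly from the definitions above: take the largest per-set mean activation of each feature, count the features that stay below c 3 , and compare the count with c 4 . The sketch below uses hypothetical function names and made-up activations.

```python
from statistics import mean


def non_activated_count(acts_by_set, c3=0.05):
    """NA: number of features whose maximum mean activation over all
    sets of the investigative data stays below c3."""
    first_rows = next(iter(acts_by_set.values()))
    n_features = len(first_rows[0])
    count = 0
    for j in range(n_features):
        max_mean = max(
            mean(row[j] for row in rows) for rows in acts_by_set.values()
        )
        if max_mean < c3:
            count += 1
    return count


def misses_a_set(acts_by_set, c3=0.05, c4=3):
    """True if the investigative data is judged to lack a training set."""
    return non_activated_count(acts_by_set, c3) > c4


# Toy activations (made up): set "a" drives features 0-1, set "b"
# drives features 2-5.
FULL = {
    "a": [[0.9, 0.8, 0.0, 0.0, 0.0, 0.0], [0.7, 0.9, 0.0, 0.0, 0.0, 0.0]],
    "b": [[0.0, 0.0, 0.6, 0.9, 0.8, 0.7], [0.0, 0.0, 0.8, 0.7, 0.9, 0.6]],
}
WITHOUT_B = {"a": FULL["a"]}
```

With set `b` removed, four features never activate, which exceeds c 4 = 3, so the data is reported as incomplete. As the text notes, the outcome is sensitive to c 3 because some features have low mean activation for every class.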

Evaluation Case 3
Setup. We used two classifiers, i.e., VGG-11 (V11) and ResNet-10 (R10). We employed two datasets, namely Fashion-MNIST and Cifar-10. Fashion-MNIST consists of 70k 28x28 images of clothing stemming from 10 classes. As data preprocessing for Fashion-MNIST, we scaled all images to 32x32 and performed standardization. We assume that activations M (X) stem from the second-to-last layer before the linear layer, i.e., consisting of 512 neurons. We also investigated using three layers, i.e., the fifth-to-last layer up to the second-to-last layer; we do not include detailed results but discuss their differences qualitatively. To investigate Question (i), we train a classifier on T ′ = T \ D, i.e., with all but one class D ∈ T . We assume investigative data D = T . Thus, our method correctly detects the non-trained class corresponding to samples D if (a) it says the classifier was not trained on samples D and (b) it says that the classifier was trained on all other classes T ′ .
In our evaluation, we leave out each class D ∈ T and train per left-out class three classifiers of each classifier type, i.e., V11 and R10. That is, we trained a total of 60 classifiers for the 10 classes. We report the mean detection accuracy and the standard deviation as well as the model accuracy and standard deviation though this is not a primary concern. To investigate Question (ii) we could train using all classes T . However, to get more diversity in outcomes and show that our method works robustly, we reuse more diverse training data, i.e., we reuse the same classifiers as for Question (i), i.e., trained on T ′ = T \ D. The investigative data lacks one additional class D ′ , i.e., D = T ′ \ D ′ with D ′ ∈ T ′ . We let each class D ′ ∈ T ′ be missing once and say that the non-completeness of D was correctly detected, if N A(D) > c 4 , i.e., the algorithm outputs that the investigative data lacks a class D and it outputs that actual training data T ′ does not lack a class.
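The evaluation protocol can be enumerated explicitly, which also confirms the stated count of 60 classifiers (10 left-out classes, 2 architectures, 3 runs each). The sketch assumes that the count refers to a single dataset; the seed values, dictionary keys, and function names are hypothetical.

```python
from itertools import product

CLASSES = list(range(10))       # class indices of a 10-class dataset
ARCHITECTURES = ["V11", "R10"]
SEEDS = [0, 1, 2]               # three runs; seed values are assumed


def question_i_runs():
    """One training run per (architecture, left-out class D, seed)."""
    return [
        {
            "arch": arch,
            "left_out": d,
            "seed": seed,
            "train_classes": [c for c in CLASSES if c != d],  # T' = T \ D
        }
        for arch, d, seed in product(ARCHITECTURES, CLASSES, SEEDS)
    ]


def question_ii_cases(run):
    """Investigative data for Question (ii): each class D' in T' is
    withheld once from the classes the classifier was trained on."""
    return [
        [c for c in run["train_classes"] if c != d2]
        for d2 in run["train_classes"]
    ]
```

Each of the 60 runs yields nine incomplete investigative datasets for Question (ii), plus the complete set T ′ as the negative case.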
Results Question (i): The threshold constants c 1 and c 2 can be identified through exploratory analysis, i.e., by creating plots like Figure 9 for different values of c 1 and coloring activated and non-activated features, increasing c 1 until only features with large means are characteristic for a class. (Note that we do not use any information on the set D we aim to assess for that purpose.) Qualitatively, Figure 9 shows that if a class is left out from training, few or no mean activations of features tend to be large. Table 8 shows that we can correctly detect the non-trained class (among all classes) in the majority of cases, even when using the same parameters c 1 and c 2 . This procedure also works well in the scenario where activations M consist of three (different) layers. If activations of multiple layer types are merged, it generally suffices to focus on neurons with large mean activations, which tend to be those after batchnorm, i.e., means of conv-layers tend to be near 0. If neurons of different layers are approximately of the same magnitude, they do not have to be distinguished and can be analyzed jointly.

Results Question (ii): To find thresholds c 3 and c 4 , we created a plot like Figure 9, but intentionally left out a set D ′ \ D. If the number of features with very low activations increased significantly, i.e., among the lowest activating features, the majority intersects with activated features of set D ′ (see Figure 9), then it can be concluded that the investigative data is complete. If only a small fraction, i.e., less than 1/2, of the lowest activating features stem from the left-out set D ′ , then many other features have not been activated for the investigative data. The challenge is that some features can have low mean activation for all classes. Thus, among the lowest activating features, there can be some features that are activating for a set in D, but the activation is small.
In our analysis, we aim to distinguish between the case that the dataset D is complete, i.e., it is equal to the training data, or at least one set D ′ ∈ D is missing. A set is missing if the number of non-activated features is above a threshold c 4 that captures the number of anticipated "outliers", i.e., features that have very low activation although they are part of the data D. As can be seen in Figure 9, this approach works well if we have access to activations of a specific layer, but noise significantly increases if we assume that activations stem from multiple layers. The separation of activating and non-activating features based on means is a challenge. The good thing is that means of features of different layers of a conv-layer, ReLU, and a batchnorm layer tend to differ. The problem is that there is still overlap making a simple separation difficult. Under the assumption that activations are randomly permuted, this is non-trivial.
As can be seen in Figure 9, the mean activations fluctuate depending on the missing class. Still, for all left-out classes during training, there are multiple non-activated features below the threshold (blue dots) in the resulting networks. If we train on all data (violet dots), we can see that in particular, towards the lower end, there are clear differences between the blue dots. Furthermore, while some features that strongly activate for the missing class (black dots) also yield fairly strong activations for other classes, most features remain non-activated if the mean is large for some other class. Quantitatively, Table 9 shows that accuracy tends to be high across classifiers, but they are sensitive to the threshold c 3 .

Table 9: Model accuracy and detection accuracy for a (sample) set contained in training but missing from investigative data

General Discussion and Future Work
Our conceptualization provides a rich set of options to investigate. Still, this set of options is by no means comprehensive, and it cannot be so since forensics and attackers are both constantly evolving. Additional questions for research include "How can suspects based on other AI techniques such as reinforcement learning, other scenarios (see Figure 1) and other deep learning network architectures such as LSTMs be identified?", "How can explainability methods be leveraged (modified) to support analysis of grey-box models ?", "How can data mining techniques, e.g., to identify anomalous decisions be leveraged ?", "How can other types of evidence such as operational data be used?". Our work is only a first step towards AI forensics. We demonstrated that perpetrators could often be  identified in the given attack scenario. We showed how properties of AI systems, e.g., memory access patterns of deep learning and knowledge of alleged behavior (encoded via a public dataset), can be leveraged for forensics. Our case studies are just a small selection of countless options. Furthermore, various design decisions must also be made for these options. This naturally raises questions with respect to generalizability. We believe that our techniques are widely applicable since systems for very different applications use the same deep learning technology, e.g., object detection is used in self-driving cars, manufacturing, healthcare, etc. Most effort for investigation might not be related to applying our methods but to collecting adequate data, which holds true for many machine learning projects. For example, an investigator might rely on data similar to what the system observed during the incident to understand what caused the AI to act in a certain way. However, getting such data might be difficult and costly. Even if data is acquired, it might differ from the incident in subtle but important ways. 
That is, the acquired data might not trigger malicious actions during the incident due to differences in data. To this end, evaluating more DL architectures on more datasets might be beneficial, but this alone is not sufficient given the many decisions to be made for any case study. More foundations are needed in deep learning in general, and more studies are needed that are only dedicated to a specific forensic case or question.
Our work is not without limitations. For one, both our image classification networks seem fairly inaccurate. Note that our proposed techniques for forensic investigation become more reliable given higher accuracy since this reduces noise. As of today, in safety-critical applications like self-driving cars, outputs of computer vision components are used in conjunction with other sensors. Thus, an attack on a real system might be more complex than just retraining a single classifier. Still, modifying a classifier like a computer vision component seems an important step. Furthermore, one might question whether an attacker would attempt to manipulate an AI rather than just perform physical modifications to a device or adjust control logic in other ways. There is no definitive answer to this question yet. We believe there are multiple decisive factors: i) Maturity of AI forensics. The availability of methods to identify attacks makes attacks less attractive. Conversely, if law enforcement is not prepared to investigate an AI, it becomes more attractive for attackers to manipulate AI; ii) The evolution of AI. AI is likely to become easier to use, more autonomous, and cover more applications. We believe that this will make it more attractive to manipulate AI; iii) Regulation of AI. If AI systems are regulated to follow specific design guidelines (that also ease forensic investigation), this might make AI more difficult to abuse. In summary, for some time to come, AI might not be designed to be malicious, but the risk of this eventually happening is large, and, thus, forensic work should start now to be one step ahead of attackers.
The overarching approach of our methods was to identify data samples that trigger malicious actions and present data samples having similar activations to the investigator. Our methods emphasize that the role of the digital forensic investigator remains highly important. In particular, for our first case study, a forensic investigator must identify concepts related to the incident, label data, etc. While some of these tasks might be further automated, we believe that the black-box nature of AI combined with its novelty makes the role of forensic investigators more difficult and more relevant in the context of AI.
Rather than working on data samples, one might also work with features, e.g., reconstructing features of a neural network. There is a rich set of explainability methods [7,29] that might be leveraged as part of the investigative process. Many common explainability methods such as LIME or SHAP that compute proxy models assuming black-box access or methods that require gradients such as GRAD-CAM can be helpful. However, methods for black-box models might not give the best results for grey-box models, since they do not leverage internal information. Furthermore, outputs (used by the system) might not always be easy to access, e.g., due to disguise layers as shown in the architecture for one of our case studies. Methods for white-box models cannot be applied directly since they often require internal information, e.g., gradients that cannot be computed. Still, adjusting XAI methods to the grey-box model might be helpful. For example, [28] proposed to decode layer activations. Potentially, this might allow identifying concepts of the input a classifier reacts to without the need for gradients. Since this method was developed assuming access to a single layer, more work is needed to clarify whether this or other techniques work when exact access to a single layer is unavailable.

Drone design options
On-board processing vs. communicating to server: A key challenge for drones, in general, is limited battery power. From a system design perspective, two fundamental design options strongly impact energy consumption: transmit the image to a server or process it on the drone. While both options have their strengths and weaknesses, in the paper, we discussed the latter, more energy-consuming option, because arguably, a non-autonomous drone would leave digital traces if it had to communicate with the nearest cell tower or WLAN access point. This poses a high risk for an attacker since the server the drone communicates with might be identifiable. Furthermore, if the server does a similar type of processing as the autonomous drone would, the investigative methods are the same once it is identified. From an investigative point of view, where the emphasis is on analyzing the AI system, the design options are the same. Also, both are found in practice.

Object tracking vs. object recognition: We employ object recognition, assuming that each image of a drone is analyzed independently. There exist techniques designed for object tracking. However, these networks also rely on CNN layers, and they are not necessarily advantageous, e.g., in terms of ease of use and computational needs.

Additional Custom CNN vs. Fine-Tuning of object detector: For the drone design, one might fine-tune an object detector or use a custom CNN to refine the output of the object detector. The former has slightly lower energy consumption for the drone, but arguably requires more effort. We opted for a custom CNN, but both options are very similar, also from an investigative perspective.

Choice of Object Detector: We opted for Faster-RCNN, but acknowledge that any other object detector could be used and that the choice is not a key factor for our analysis. For example, YOLO provides a computationally lightweight object detector. Also, our neural network designs do not require much computing (our custom CNN only has a few layers). However, our networks might be further optimized toward less energy consumption. Furthermore, other architectures like MobileNet might lead to even lower energy consumption, though potentially at the cost of reduced detection/classification performance.

Drone control logic: Controlling a drone is complex. Therefore, our control logic, consisting of a few simple rules, is on a rather high level. The reason is that drone control algorithms are not the focus of investigation and, thus, are not of primary interest, though one could envision other scenarios where they are.

Tracking referee vs. ball: From a conceptual point of view, there is little difference. Also, the drone image covers a fairly large area of the sports ground (much larger than on the cropped images in the paper). Therefore, it is rarely the case that any action is missed if the focus is on the referee.

Disguise and Impact on investigator: Another reason why an investigator might not rely only on the last outputs based on the order of computation is disguise layers, as shown in one of the figures in the paper: the network might intertwine genuine and non-genuine layer computations, where the last layers might consist of non-genuine layers. Thus, the investigator has to assess all layers.

Dataset(s) for Case 1
Data labeling: The dataset was collected by extracting images from the drone video lasting about 35 minutes. For a frame of the video, we obtained all objects (as returned by Faster-RCNN). We retained those with heights between 80 and 128 and labeled them according to the 10 classes shown in the figure in the paper. Objects with lower heights are hard to assess visually. After processing each frame, we skipped 50 frames to have some variation between frames and reduce the overall labeling load. This left us with 2013 labeled samples. After a first labeling round, we checked for mislabelings and fixed those we identified.
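The frame sampling and height filter described above can be sketched as follows. The sketch assumes that detections are given per frame as boxes in (x1, y1, x2, y2) form, that the height bounds are inclusive and in the same unit as the box coordinates, and that "skipping 50 frames" means a stride of 51 frames. The function name and the toy boxes are hypothetical.

```python
def select_detections(frames, min_h=80, max_h=128, skip=50):
    """Return (frame_index, box) pairs kept for labeling.

    frames: list with one entry per video frame; each entry is a list of
            boxes (x1, y1, x2, y2) as returned by an object detector.
    After a frame is processed, the next `skip` frames are ignored.
    """
    kept = []
    for idx in range(0, len(frames), skip + 1):
        for box in frames[idx]:
            height = box[3] - box[1]
            if min_h <= height <= max_h:
                kept.append((idx, box))
    return kept


# Toy input (made up): detections in frames 0 and 51, empty in between.
FRAMES = (
    [[(0, 0, 10, 100)]]
    + [[] for _ in range(50)]
    + [[(0, 0, 10, 60), (0, 0, 10, 128)]]
)
```

In the toy input, the box of height 60 is discarded as too small, while the boxes of height 100 and 128 are kept.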
Data source: The data stems from a public event in a public setting uploaded on YouTube, where spectators are welcome, and filming is commonplace. We are happy to share more information, including our labeled dataset, or upload it publicly upon reviewer request. We only show very low-resolution images of players in the paper to respect their privacy, though millions of YouTube videos covering more private themes have been used in research, e.g., see the 8 million video YouTube-8M dataset. We also contacted the uploader and carefully took into account ethical pros and cons.