Agent-based vulnerability assessment at airport security checkpoints A case study on security operator behavior

Despite enormous investments in airport security, terrorists have been able to ﬁ nd and exploit vulnerabilities at security checkpoints. Existing vulnerability assessment methodologies struggle with accounting for human behavior, and agent-based modelling forms a promising technique to overcome this limitation. This paper investigated how the decision-making and performance of human operators can be taken into account while assessing vulnerability at an airport security checkpoint. To this end, an agent-based model was designed, in which the performance of security operators was modelled using a functional state model, while decision making was modelled using decision ﬁ eld theory. Passengers and an attacker that brings a weapon to the security checkpoint were also explicitly modelled as agents. Simulation results indicate that the highest skilled operators outperformed their lowest skilled counterparts on analyzing X-ray images, but performed worse on both searching luggage and performingpatdowns.Furthermore,resultsshowedthatahighfocusonspeedofsecurityoperatorsleadstoadecrease in luggage searches and therefore increased vulnerability. Moreworkis neededto calibrateandvalidate thesimulationresults, but initialresultsarepromising.Theagent-based model can be used by airport regulators and managers to understand the workings of their security checkpoint better and ultimately to reduce vulnerabilities.


Introduction
Despite enormous investments in airport security, terrorists have been able to find and exploit vulnerabilities at security checkpoints. In the years after 9/11, aviation has been targeted by several bombing attempts (BBC News, 2006;Edmunds, 2010;Burns, 2010), such as the shoe bomber (CNN, 2001). Each of those attempts exploited new vulnerabilities and bypassed the security checkpoint successfully. It is only after such an attempt that new regulations and procedures are developed to address the exploited weakness in the security checkpoint. This reactive approach leaves airports vulnerable to innovating attackers. This problem is well recognized within the scientific literature, but developing a method that accurately assesses all vulnerabilities in a security checkpoint is a challenging task.
The security checkpoint is operated by security operators that constantly have to perform cognitive tasks, such as detecting illegal items on an X-ray image (IATA, 2012). These operators also continuously have to make decisions, such as the decision to confiscate a potential weapon or not. Empirical research has shown that security operators do not necessarily follow protocol, but regularly bend and break the rules (Kirschenbaum et al., 2012a;Kirschenbaum et al., 2012b;Kirschenbaum, 2013;Kirschenbaum, 2015). They commonly ignore potential threats and alarms are often processed as false. Furthermore, the performance of security operators is dependent on a variety of factors, of which cognitive task demands and personality are two examples. These human factors affect the performance of the checkpoint as a whole and additional vulnerabilities may emerge from their behavior. Therefore any method that aims to systematically identify all vulnerabilities in a security checkpoint should include these cognitive aspects in the analysis.
The objective of this work therefore is to understand how the decisionmaking and performance of human operators influence vulnerability at an airport security checkpoint. To this end, we employ an agent-based modelling approach to identify and quantify the vulnerabilities of two typical airport security checkpoint setups. The contribution of this work is twofold. First, we define a novel agent-based model to assess vulnerability, in which we specify security operators' behavior by combining two different cognitive models. The performance of security operators on different tasks in the checkpoint is modelled using the functional state model (Bosse et al., 2008), and their decision-making process is modelled using decision field theory (Busemeyer and Townsend, 1993). The developed model can easily be adapted to test future concepts of security checkpoints, such as X-ray operators working remotely. These types of experiments are hard to perform directly at airports, as it may interrupt security operations. Secondly, by performing experiments with the model, we generate new insights with Transportation Research Interdisciplinary Perspectives 5 (2020) 100139 ⁎ Corresponding author.

Contents lists available at ScienceDirect
Transportation Research Interdisciplinary Perspectives j o u r n a l h o m e p a g e : h t t p s : / / w w w . j o u r n a l s . e l s e v i e r . c o m / t r a n s p o r t a t i o n -r e s e a r c hi n t e r d i s c i p l i n a r y -p e r s p e c t i v e s respect to vulnerabilities at the security checkpoint. Three types of experiments are performed: experiments related to operator performance, experiments related to operator decision making and experiments related to different airport security checkpoint setups. This paper is structured as follows. First, related literature about existing vulnerability assessment methodologies and human performance and decision making is reviewed in Section 2. Then, the agent-based model that we developed for this work is described in Section 3 and calibrated in Section 4. Three experiments were performed with the model and are described in Section 5. The first experiment is used to understand the influence of security operator performance on vulnerability, and the second for understanding the influence of security operator decision making on vulnerability. The third experiment is then used to investigate the effect of using different security checkpoint setups on vulnerability. Finally, the work is concluded in Section 6.

Related work
This section provides an overview of existing vulnerability assessment techniques, with a special focus on agent-based modelling. Furthermore, existing human performance and decision making models are discussed.

Existing vulnerability assessment methodologies
Following the ISO (I. Guide, 73, 2009) standards, vulnerability is defined as "intrinsic properties of something resulting in susceptibility to a risk source that can lead to an event with a consequence". Most vulnerability assessment methodologies are interested in quantifying vulnerability as a value between 0 and 100%.
The most common method to estimate vulnerabilities is to consult security experts involved in the field. These experts know about the equipment, procedures, and performance of the operators. They use this knowledge to quantify vulnerabilities on a predefined scale. Unfortunately, expert elicitation has proven to perform poorly in parameter estimation when there are dependencies in the system (Cooke and Goossens, 2008), while these dependencies inevitably exist (Cole and Kuhlmann, 2012).
Wu and Mengersen provide an overview of models that were used to aid security policy and planning (Wu and Mengersen, 2013). The discussed models range from simple probabilistic models (Chawdhry, 2009), to Bayesian models (Babu et al., 2006;Nie et al., 2009) and a fuzzy model (Akgun et al., 2010). None of these models explicitly incorporate the behavior of security operators, and therefore lack the capability to understand how the decision-making and performance of human operators influence vulnerability at an airport security checkpoint.
Several other methods exist as well. A failure tree models events in the form of an event tree (Benner, 1975), in which the nodes of the tree are events and the branches of each node specify the possible outcomes of the event. For each event in the tree, the likelihood of the different outcomes must be estimated, and vulnerability is finally calculated based on the combination of these events. The reliability of these methods hinges on the accuracy of those estimates which are not always accurate (Stroeve et al., 2013). Furthermore, a failure tree is unable to accurately model a complex socio-technical environment such as a security checkpoint.
The scenario-based approach is an approach that tries to identify threat scenarios to which airport security is most vulnerable (Cole and Kuhlmann, 2012). This method starts with identifying all threat elements that together describe the airport environment. For each of those scenarios, the countermeasures are identified as well. This allows users to identify scenarios in which a limited number of countermeasures are available. This approach has two major limitations. First of all, the airport environment is reduced to a linear set of relations between threat elements and countermeasures. Second, this method aims to identify the threat scenarios to which the system is most vulnerable, but only identifies the number of countermeasures related to each threat scenario, without factoring in the effectiveness of each measure.
Using penetration testing, airport defenses are physically tested. The results from those tests can be used to give better vulnerability estimates and develop better protocols. The big advantage of penetration testing is that it does not require any assumptions about the complexity of the system nor the behavior of the security operators. This guarantees that the vulnerabilities found during these tests coincide with reality, and can be beneficial to calibrate or validate other methodologies. However, a problem with penetration testing that it is labor-intensive and thus expensive (Bennett, 2015;Airport data & contact information, 2018). Due to these problems, airports can only test their defenses on a small subset of security scenarios.

Vulnerability assessment using agent-based modelling
Over the last few years, agent-based modelling has gained interest as an approach to assess airport security (Wilson et al., 2006). This is proposed as a risk management methodology called AbSRiM by Janssen et al. (2019b). In this methodology, security risks are assessed by building an agent-based model that can be used to simulate a threat scenario. Using Monte Carlo simulations, vulnerabilities can then be assessed. The agent-based modelling approach allows for modelling complex environments that emerge from the interactions between autonomous agents. Furthermore, human aspects can be explicitly modelled and therefore be taken into account in the assessment of vulnerability. On the contrary, agent-based models require an extensive modelling effort and this is a time-consuming process. As this approach forms a promising alternative to overcome the limitations of the other methodologies, we employed this methodology to assess vulnerabilities at security checkpoints.

Modelling security operator performance and decision making
Different studies have shown that the performance of security operators is not always optimal and that it is common for them to bend and break the rules (Kirschenbaum et al., 2012a;Kirschenbaum et al., 2012b;Kirschenbaum, 2013;Kirschenbaum, 2015;United States House of Representatives, 2011).
The performance of humans is dependent on a variety of factors, of which cognitive demands of a task and personality are two examples (Gonzalez, 2005;Hancock, 1989). Several computational models have been proposed in literature to model the performance of humans. Many of these models have a specific focus on aspect, like situation awareness (Endsley, 1995). The Functional State Model is a dynamic performance model that describes the performance of an agent as a function of task complexity, the state of the agent and its characteristics (Bosse et al., 2008). The model incorporates a large set of different factors, such as stress, exhaustion and situation awareness. The model was validated by empirical experiments with human operators from defense. This is also the model that we use in this work to model human performance, as it aims to incorporate a diverse set of factors.
Human decision making has been a long-studied field, and an extensive overview of modelling human decision making can be found here (Osman, 2010;Canellas, 2017). Two main streams can be distinguished when modelling human decision making: bounded and unbounded rationality (Canellas, 2017). In bounded rationality, decisions are made within a set of human constraints such as limited information or processing speed of the brain, while in unbounded rationality these constraints are not present. In this work, we also use the bounded rationality paradigm. Within the bounded rationality paradigm, several types of models are developed in literature: linear decision making models (Canellas, 2017), machine learning approaches (Gibson et al., 1997) and diffusion models (Busemeyer and Townsend, 1993;Ratcliff and McKoon, 2008). We focus on probabilistic decision making, which is shown to be well capable of account for human irrationality. Important models in this area are the decision field theory model (Busemeyer and Townsend, 1993) and the Ratcliff diffusion model (Ratcliff and McKoon, 2008). We used the decision field theory model in this work to model decision making of security operators, as it has strong empirical support and is famous for its ability to reproduce many known irrationalities in human decision making.
Only a few works exist that aim to model the behavior of security operators (Skorupski and Uchroński, 2015;Skorupski and Uchroński, 2017). This research models the effects of human factors on the performance of the security system by using a fuzzy inference system. However, their system is mostly based on expert opinions. Our work focuses on more detailed cognitive models of human security operators, and how they can be used to estimate vulnerability. Furthermore, we explicitly represent interactions between agents (security operators and attackers), and important security devices, such as body scanners.

Modelling the security checkpoint
This section describes the agent-based model that was developed to assess vulnerabilities at an airport security checkpoint, while focusing on human performance and decision making. The specification of the environment is discussed in Section 3.1, and the different types of agents are discussed in Section 3.2.

Environment
The environment of the model contains four different objects: luggage, weapons, sensors and equipment. Luggage has a complexity level that influences the task complexity of operators that interact with this luggage. The complexity level can either be high or low. Furthermore, luggage is owned by a passenger and may contain explosive traces (represented as a Boolean value) and/or a weapon.
Then, the weapon object conceptualizes a weapon that an attacker agent aims to bring past the security checkpoint. A weapon is of a certain type. The type of weapon can, for instance, be a ceramic knife or an explosive liquid. Similar to luggage, a weapon can contain explosive traces. Furthermore, a weapon has a perceived risk, r perc , that indicates to which extent a security operator perceives objects that closely resemble the weapon as a risk to the airport. For instance, explosive liquids resemble a bottle of water and are therefore perceived as a low risk. Perceived risk is formalized as a real number between 0 and 1. While r perc is different for each operator, we assumed that it is the same for everyone, and, therefore, included it as part of the weapon. Finally, a weapon can be on the body of an attacker, or in the luggage of an attacker. A full list of weapon types and their corresponding parameters is shown in Section 4. Different sensors were defined in the model: X-ray sensor, Walkthrough metal detector (WTMD), explosive trace detector (ETD) and body scanner. Each sensor has a probability to detect a specific weapon type, called base detection probability p detect sensor (weapon_type). This parameter is calibrated and shown in Section 4. Based on this detection probability, the sensor either detects or does not detect a weapon when presented, which can then be observed by operators. The X-ray sensor is an exception to this standard, as this sensor only allows operators to observe the luggage that is currently sensed by the sensor. In this case, the likelihood of detection is determined by the skill of the x-ray operator. Finally, two types of equipment were defined in the model: queue separators and X-ray systems. Queue separators are used to guide passengers to the security checkpoint, while the X-ray system moves luggage forward through an X-ray sensor.

Agents
Three different agent types were defined: passengers, attackers, and operators. Each of these agents is human agents and is discussed in more detail below.

Passengers and attacker agents
Passengers and the attacker were defined similarly. They both do not exhibit sophisticated strategical behavior. Passengers carry luggage that they bring to the security checkpoint. Furthermore, passengers could carry explosive traces and they can own a weapon. This weapon can, as defined above, either be on the body of the passenger or in its luggage. We refer to a passenger that owns a weapon as an attacker, and as a passenger otherwise.

Security operators agents
A set of security operators that execute activities at the security checkpoint were defined: patdown operator, ETD check operator, luggage check operator, and X-ray operator. This section discusses the definition of the X-ray operator, as the other operator types are defined similarly.
Airport security is largely defined by regulations and guidelines defined by different regulatory institutes. For instance, the European Union has regulations for its member countries (Council of European Union, 2008a; Council of European Union, 2008b), the United States has the Aviation and Transportation Security Act (107th Congress, 2001), and the ICAO developed a security manual (ICAO, 2017).
Following these regulations and guidelines, each of these operators executes a fixed set of tasks and decisions. An X-ray operator inspects output generated by the X-ray machine and determines if there is a potentially illegal item. When this is the case, (s)he has to inform the luggage check operator, who then searches the luggage. Security operators do not necessarily follow this protocol, but regularly bend and break the rules. They commonly ignore potential threats and alarms are often processed as false (Kirschenbaum et al., 2012a;Kirschenbaum et al., 2012b;Kirschenbaum, 2013;Kirschenbaum, 2015; United States House of Representatives, 2011). Furthermore, humans cannot continuously perform optimally. It is dependent on a variety of factors, of which cognitive demands of a task and personality are two examples (Gonzalez, 2005;Hancock, 1989).
The performance of security operators on different tasks in the checkpoint is modelled using the functional state model (Bosse et al., 2008), and their decision-making process is modelled using decision field theory (Busemeyer and Townsend, 1993). In the case of the X-ray operator, the modelled task is inspecting output generated by the X-ray machine, while the modelled decision is that of informing or not informing the luggage check operator.
The functional state model, the decision field theory model, and their integration are discussed below.
3.2.2.1. Functional state model. To model the performance of security operators, the functional state model was selected (Bosse et al., 2008). While the model contains a set of 37 parameters, we only discuss the most important parameters here. For the other parameters, the reader is referred to the work of Bosse et al. (2008). The input for the model is the task level (TL), which is dependent on the skill level (SL) of the operator and the task complexity (TC) of the task at hand. For an X-ray operator, the task complexity represents how complex the luggage (s)he is currently investigating is. This dependency is modelled as follows: The output of the model is the performance quality (PQ) of the agent, indicating how well the operator is performing. A PQ of 1 corresponds to the baseline performance of an agent, while values lower than 1 correspond to performances that are worse than this baseline and values higher than 1 correspond to performances better than baseline. No theoretical bounds of PQ were provided in Bosse et al. (2008). However, typical PQ values in our simulation results are in the range of 0.5 and 1.5.
PQ is dependent on two factors: provided effort (PE) and task level (TL): PE is determined by the generated effort (GE) of the agent, recovery effort (RE) and noise effort (NE). The latter two parameters correspond to the ability of humans to decreases exhaustion, and the effort the human has to contribute to the noise in the environment respectively.
GE is the most important contributor to PE, and is ultimately defined by effort motivation (EM), among many other parameters. We refer to Eqs. (2) and (4) in the work of Bosse et al. for a complete deduction of GE (Bosse et al., 2008;Bosse et al., 2011).
Effort motivation is based on the current task level and the difference between experienced pressure (EP) and optimal experienced pressure (OEP). EP is similar to a person's stress level, while OEP determines how well a person can cope with a high EP. Finally, EP is, among other terms, related to generating effort above and below a critical point. The critical point is the amount of effort someone can generate without becoming exhausted. For an X-ray operator, PQ is reflected in the likelihood (s)he observes a weapon from the observations of the X-ray sensor. This is modelled as follows.
The value p detect x−ray (weapon_type) corresponds to the base likelihood that a specific type of weapon is detected by an X-ray operator, which is calibrated in Section 4.2. The value 1 − p detect x−ray (weapon_type) corresponds to the base probability of not detecting the weapon: the base false-negative rate. When performing well (i.e. a high PQ), X-ray operator improves on this base false-negative rate, and vice versa. We model this by dividing the base false-negative rate by the performance quality and a scaling factor k. The underlying assumption here is that the false-negative rate linearly decreases with increasing PQ. This false-negative rate is then transformed back to a detection probability by subtracting it from 1. To ensure that the value falls between 0 and 1, we take the maximum of 0 and the value obtained above.
The other operators at the security checkpoint use this performance model to execute the patdown activity, search luggage and perform an ETD test. The value of k, and other related parameters of the functional state model are calibrated in Section 4.
Two different personality types are introduced based on the work of Bosse et al.: personality I and personality II (Bosse et al., 2008). Bosse et al. extensively experimented with these two personality types and performed an in-depth analysis of their behavior. Type I has a relatively high OEP, meaning that it can cope well with high EP levels, while type II does not. This allows the first personality type to perform better under high pressure. We experiment with these personality types in our analysis.
3.2.2.2. Decision field theory. The decision-making process of the security operators was modelled based on the work of Busemeyer and Townsend (1993). The decision-making process in this model is an iterative process in which the operator constantly updates their preferences until the preference for one of the options exceeds a decision threshold value. This threshold value is one of the inputs of the model and its magnitude is related to the effort an agent spends on a decision. The higher the threshold value, the more time and energy the security operator needs to reach it.
During each iteration, the agent focuses on one of his goals. The selection of this goal is a random process, but the likelihood of the agent focusing on a goal depends on the attention weight. Once the attention of the agent is focused on one of his goals, the agent's preferences are updated based on the agents beliefs about how each of the options helps him in achieving the goal (s)he currently focuses on. The magnitude with which the preference for each of the goals is updated is known as valence. This valence is defined for each combination of goals and options.
Finally, the decision-making process is influenced by the agent's initial beliefs. This initial preference is the preference the agent has for each outcome before the decision process starts. An overview of this process is shown in the bottom part of Fig. 1.
For an X-ray operator, one decision is identified. If the X-ray operator observes a potential weapon (see also Section 3.2.2), (s)he has to decide if the luggage requires a search from the luggage check operator. The options for the X-ray operator are inform or ignore. Furthermore, three goals are defined for the X-ray operator based on existing literature (Sharpanskykh and Haest, 2016;Fairbrother, 2010).
• Accuracy -The operator wants to do their work as well and accurate possible. The importance of this goal may be dependent on pressure within the organizations or the agent's standards.
• Speed -The operator wants to do their job as fast as possible. The importance of this goal may be due to pressure within the organization to reach a certain throughput or the security operator wanting to minimize effort.
• Perceived Risk -It is the job of the security operator to minimize the risk of an attack. Perceived risk represents the beliefs an agent has about the potential consequences of the observed prohibited item. The importance of this goal may be dependent on the agent's beliefs about the likelihood of an attack and his risk aversion.
Both luggage check operators and physical check operators use this decision mechanism to determine if a passenger requires secondary screening when an illegal object was found. The ETD operator makes the same decision when explosive traces were observed. The other related parameters of this model are calibrated in Section 4.
3.2.2.3. Integration of models. We integrated the models by relating parameters of the functional state model to the decision field theory model. The relation between the models is shown in Fig. 1.
The decision threshold was set to be equal to the provided effort (PE) as defined in the functional state model. Provided effort denotes the effort that is contributed to the task by the agent. This relation means that the higher the provided effort, the more effort the agent wants to invest in making an accurate decision. This is based on findings by Busemeyer and Townsend (1993). Furthermore, we assumed that the initial preference of the X-Ray operator is according to regulations present at the security checkpoint, meaning that there is a strong initial preference to request a luggage check if needed. The next section describes how these parameters are calibrated.

Model sensitivity and calibration
In this section, the sensitivity of the functional state model and the decision field theory model is discussed. Furthermore, it is described how the overall model was calibrated. Different parameters had to be calibrated: parameters related to weapons, sensors, airport configurations, and operators. These are discussed in detail below.

Sensitivity analysis
We performed sensitivity analysis of both the functional state model and the decision field theory model. Fig. 2(a) shows how different task levels affect the performance quality and the provided effort in the functional state model. Results were obtained after the task level was kept constant for 20 s. At this point, the performance quality converged to an equilibrium value for any task level. From the figure, it becomes clear that both personality types have the highest performance quality around a task level of 250. The peak performance of personality type I is at a task level 230. At this point, it outperforms personality type II by 24%.
At task levels lower than 225 the performance quality of both personalities rapidly drops. This is mainly due to a lack of provided effort as can be seen in Fig. 2(b). In this range, personality type II outperforms personality type I by 20%. At task levels above 275, the performance of both personalities exponentially decreases. The provided effort of both agents stays approximately stable around 230, meaning that the agent cannot provide more effort. At even higher task levels the performance quality drops.
We investigated the sensitivity of the decision field theory model as well. The values in Table 3 were used with c = 30, but the initial preference for the inform decision was varied. The decision threshold was set to a uniform random value between 70 and 250, which is the range of provided effort values as observed above. Two scenarios were investigated: 1) a weapon with a perceived risk of 0 was observed, and 2) a weapon with a perceived risk of 1 was observed. A total of 1000 simulations were performed for each data point. Fig. 3 shows how different initial preferences for the inform decision influence the decision of the X-ray operator. Both graphs have the same general shape. The choice to inform the luggage-check operator increases from a baseline value to 100% when the initial preference becomes 250. At initial preferences above 250, the operator chooses to inform the luggage-check operator 100% of the time. This is because the initial preference already exceeded the threshold value. The range of values for both scenarios is different. In scenario 1, the luggage-check operator is informed 93% of the time without any initial preference, while this is only 50% in scenario 2. In scenario 1 the dominant decision is to inform, as two out of three goals favor this decision. As the perceived risk is zero in scenario 2, there are effectively only two goals: accuracy and speed.
Neither of these goals dominates the other, leading to a baseline of 50% inform decisions. Table 1 shows the different weapon types used in this work. For each of the weapon types, it is indicated if they contain explosive traces and their perceived risk. It should be noted that the perceived risk represents the risk that is perceived by operators for objects that resemble the weapon. For instance, if a bomb is not recognized as such (like explosive liquids), the perceived risk is much lower. Bombs and fire-arms were assumed to have the highest perceived risks, while liquids were not perceived as a large risk, as operators continuously confiscate water bottles. Knives were perceived as a larger risk, but they are still commonly observed. Table 2 shows the different detection probabilities for weaponsensor combinations and weapon-activity combinations. The values for X-ray performance is based on literature (Wales et al., 2009), as well as the explosive bulk detection probabilities for body scanners (Grabell, 2011). No data could be found on how security operators perform on searching luggage and patdowns. These values were therefore based on assumptions.  (Bosse et al., 2008) (top) and the decision field theory model (Busemeyer and Townsend, 1993) (bottom) used in this work. The integration between the two models is shown as well.

Airport configurations calibration
Two different airport configurations were defined based on IATA documentation (IATA, 2012): the regional airport and the international airport. There are two main differences between these configurations. First, the regional airport uses a WTMD whereas the international airport uses a body scanner. This choice of equipment impacts the detection rates of weapons hidden on the body of the attacker. The second difference is the communication between the X-ray operator and the luggage check operator. At the regional airport, there is the possibility to communicate directly, while at the international airport the luggage check operator is not in direct contact with the X-ray operator. The luggage check operator has to perform an X-ray himself/herself to determine where the weapon can be found.

Operator performance calibration
The task complexity (TC) of operators for the different tasks that they perform are calibrated in this section. To this end, we assumed three different skill levels (SL) for operators: 0.8 (low), 0.9 (medium), and 1.0 (high). These levels correspond to a realistic variation in skill between agents in the functional state model, based on experimentation with the functional state model and values found in literature (Wales et al., 2009). Furthermore, we assumed a base task level (TL base ) of 150, corresponding to a performance quality (PQ) of around 0.5. The base task level is the task level when the agent is not performing its activity and is based on the work of Bosse et al. (2008). An X-ray operator has about 1 s to identify potentially prohibited items in luggage and research has shown that the number of false negatives increases when the images become harder to interpret (Wales et al., 2009). Based on the same work, we assumed that the performance of an X-ray operator decreases with 5.47% for more complex luggage.
The scaling factor k was calibrated as follows. The value for k × PQ should be equal to 1, as an average operator performs according to the    base detection probability. We have performed 1000 simulations for each skill level, personality type, and task level to determine the mean performance quality of operators, and found that it corresponds to 0.59. We use this mean performance quality to finally find k to be equal to 1.68.

Operator decision calibration
For operator decision making, the following parameters had to be calibrated: initial preference, decision threshold, and valence. We used attention weight as a parameter to experiment with. The decision threshold is equal to the provided effort PE as suggested in the work of Busemeyer and Townsend (1993). The valences and initial preferences are shown in Table 3. The initial preference for the inform decision was assumed to be the decision threshold of the agent multiplied with a constant of c pref = 0.95. This indicates that the X-ray operator has a strong preference to follow rules and regulations. Furthermore, the valences related to speed and accuracy are ±c for both options. We chose c = 30 such that the mean decision time of X-ray operators corresponds to times reported in literature (Wales et al., 2009). Finally, the valences for the perceived risk goal were made dependent on the perceived risk of the observed weapon. We assumed this to be a multiplication between c and the perceived risk. The parameters of the other operators were determined similarly.

Experiments and results
We performed experiments with the model to assess vulnerabilities at different security checkpoint setups. The setup of the experiments is discussed first, followed by a discussion of results.

Experimental setup
The model was implemented in the AATOM simulator, which is a Javabased airport terminal operations simulator (Janssen et al., 2019a). It is agent-based and contains several calibrated presets and templates of basic airport terminal components that can readily be used. No other simulator that we know of contains such a combination of agent-based modelling and pre-calibrated airport-specific components.
As specified in the model description, four operator agents were defined in the model: patdown operator, ETD check operator, luggage check operator, and X-ray operator. We used a single security lane setup, and passengers were generated for a single flight with up to 100 seats. A single attacker was introduced among the passengers that went through the security checkpoint.
The following parameters were varied in the execution of the experiments.
• Attacker parameters -Weapon. The weapon the attacker uses is one of the weapons shown in Table 1. -Weapon Location. The attacker has the option to hide the weapons on his body or in his luggage.
• Checkpoint Configuration. The checkpoint configuration is either the regional airport or the international airport.
• Operator parameters -Skill Level. The skill level of the agents is either 0.8 (low), 0.9 (medium) or 1.0 (high). -Personality Type. The agents either have personality I or II, based on the work of Bosse et al. (2008).
-Attention Weights. The attention weight for each goal is set to 0.33 (low), 0.5 (medium) or 0.67 (high). The weights are normalized so that they add up to one after they are selected.
A total of N = 15,000 simulation runs were performed, while using a uniform random assignment of the above parameter values.

Results
The results are discussed as follows. We define vulnerability as the proportion of attackers that moved past the security checkpoint with their weapon. These attackers did not receive secondary screening and their weapon was not confiscated. We first show how the skill level and personality type of security operators influence their performance. Then, we show how different attention weights of the decision field theory model influence the decisions made by the operators. Both these results are an indication of the vulnerability of the security checkpoint, as both performance and decision making directly influence the number of secondary screenings and weapon confiscations. Finally, an overall vulnerability assessment of the different checkpoint configurations is conducted and a discussion is provided.

Performance of operators
The performance quality of X-ray operators and luggage check operators can be found in Fig. 4. The performance quality of security operators is directly related to the vulnerability of the security checkpoint. A low performance quality of any of the operators leads to a higher vulnerability, as items are detected with a lower probability. As can be seen in the figure, PQ increased with skill level for X-ray operators. The agents with the highest skill level (of 1) outperformed the agents with the lowest skill level (of 0.8) with 5.2%.
Different results were observed for luggage check operators. The operators with the highest skill level were outperformed by the agents with the lowest skill level by 4.0%. This also seems counter-intuitive but can be explained from the mechanisms of the functional state model. If the task level becomes too low, the performance quality drops, as skilled agents are not motivated enough to generate effort. For operators with a lower skill level, the task is more challenging and they are more motivated to put in the effort. This lead to the counter-intuitive result that the most skilled agents were not top performers on this relatively simple task. This result may seem counter-intuitive but is caused by the fact that agents perform (relatively) simple tasks and find it hard to motivate themselves to put in enough effort. Following the functional state model, operators with a higher skill level, experience a lower task level for the same task as their lower-skilled counterparts. Generated effort is, among other parameters, based on the motivation of the operator, which in turn is partially determined by the task level. Because the task level is lower for higher-skilled operators, the effort motivation decreases, which decreases the provided effort. Our simulation results have shown that this negative effect on performance quality of decreased motivation is larger than that of an increased task level for lower-skilled operators. Section 3.2.2 provided a discussion of the different variables in the functional state model.
These results are not unique to the Functional State Model and our model. Hackman and Oldham proposed a so-called Motivating Potential Score (Hackman and Oldham, 1976) which is a framework that is widely used in literature. MPS is, among other terms, composed of skill variety. This is strongly related to what we have defined as skill level in our paper and explains the connection between motivation and skill level. Furthermore, jobs with a high MPS, have a positive effect on motivation, performance and job satisfaction (Singh et al., 2016). This then relates motivation to task performance.
The differences between the performance quality of analyzing X-ray images and checking luggage can be explained as followed. Analyzing X-ray images is a difficult cognitive task for humans. A large number of stimuli have to be processed and illegal items have to be identified at a high speed. An operator performing a luggage check has more time to Table 3 Calibration of the decision parameters for the X-ray operator. Initial Pref. Accuracy Speed Perc. risk execute the task at hand. Typically, they take around 90 s, while X-ray operators only have a few seconds for their task. This allows the luggage check operators to generate more effort and therefore reaching higher performance quality. Furthermore, agents that cannot cope with pressure well (type II) outperform agents that can better cope with pressure (type I) by 20%. The results of the different personality types shown in this figure are an aggregate of all skill levels. Personality type II has a relatively low OEP, which is closer to the actual experienced pressure than the high OEP of personality type I. The difference between these values determines the effect on effort motivation. A low difference leads to a low reduction in effort motivation, while a high difference leads to a high reduction of effort motivation. As mentioned before, a lower effort motivation finally leads to a lower performance quality.

Decision making of operators
We analyze the decision-making process of X-ray operators. When an X-ray operator detects a potential weapon, the agent has two options. The first option is to ignore that the potential weapon was observed, while the second option is to inform the luggage check operator. When a potential weapon was detected, luggage check operators were informed correctly 93.7% of the time on average. This number varied based on the attention weights for each of the goals, as shown in Fig. 5. Not searching luggage when it contains a weapon, directly increases the vulnerability of the system.
One of the reasons for an X-ray operator to not inform the luggage check operator is that it might not perceive the potential weapon as an actual weapon. For instance, liquid explosives might resemble a water bottle. While a water bottle is illegal according to checkpoint regulation, regulations are not always strictly enforced by security operators (Kirschenbaum et al., 2012a;Kirschenbaum et al., 2012b;Kirschenbaum, 2013;Kirschenbaum, 2015). Not informing the luggage check operator then leads to faster processing of passengers, which is of enormous economic importance for airports.
From the figure, it becomes apparent that the attention weight for speed was the most dominant parameter in the inform decision. Varying this parameter from low (0.33) to high (0.67), lead to a 12% decrease in luggage searches. The second most important parameter is the attention weight for accuracy. Increasing this parameter from low to high, caused an 11% increase in luggage searches. The attention weight for risk was less dominant. An increase from low attention to high caused a 5.3% increase in luggage searches. This parameter was less influential, as many potential weapons are not perceived as a large risk by the operators. Speed and accuracy, on the other hand, played a more important role in the decision-making process. While not shown, results for decisions by other types of operators followed similar trends.
Jesus performed a questionnaire among security operators at a regional airport to determine how they make trade-offs between security and efficiency (Jesus, 2018). One of the main findings of his research was that operators could be classified into three categories: 1) passenger level of service operator, 2) security-focused operator, and 3) efficiency-focused operator. About 13% of the surveyed employees fell into the last category. These employees mostly focused on improving the efficiency of checkpoint operations and barely on security.
While the results of Jesus are not readily comparable to our results, we do observe an interesting similarity between them. Both results indicate that some employees mostly focus on executing their work efficiently (i.e. high attention weight for speed), which then results in increased vulnerabilities.

Different checkpoint setups
The performance of the security checkpoint for different weapons and locations are shown in Fig. 6. In this figure, the distribution between three potential outcomes of a scenario are shown: vulnerability (weapon not confiscated and no secondary screening), secondary screening (regardless of weapon confiscation) and the situation in which the weapon was confiscated while no secondary screening was conducted.
From this figure, it becomes clear that some weapons were never confiscated at the regional airport. These weapons cannot be detected by the equipment used to scan the passengers. None of the explosives smuggled on the body get detected by the WTMD and the same holds for ceramic knives. Explosives only got detected by a random ETD check, which lead to a secondary screening in 10.1% of the cases. Furthermore, knives can be taken through the checkpoint at the regional airport without large consequences. Most often, the knife got confiscated and the attacker could try again at a different time as the chances on a secondary screening were found to be almost zero. The regional airport performed best on detecting guns in luggage. These weapons were confiscated 84.8% of the time when they were located in the luggage (as compared to 70.0% in the international airport) and immediately lead to a secondary screening. This becomes 90.7% when the attacker carried the weapon on their body.
At the international airport, only one type of weapon remained undetected. Smuggling explosive powder through a body scanner had a success rate of 88.6%. The only measure against it was a random ETD check. Furthermore, liquid explosives and powders hidden in luggage were confiscated only 32-34% of the time. Even when these items were confiscated, the security operator did not necessarily recognize these items as bomb parts and allowed the attacker to move on. Bulk explosives, on the other hand, were detected in 50% of the cases and lead to immediate secondary screening. An attacker bringing a gun was very unsuccessful at the international airport. The attacker was most successful when locating the gun in their luggage, but this only had a success rate of 30%. Knives could best be brought hidden in the luggage as well. In that case, they were only confiscated 36% of the time and the chances of secondary screening were minimal. However, the potential impact of a knife past the security checkpoint is far more limited than that of other weapons investigated in this work.
The regional airport outperformed the international airport on checking luggage for all weapons. In the regional airport, 62.6% of the weapons in the luggage are confiscated, whereas in the configuration of the Fig. 4. The mean performance quality (and their 95% confidence intervals) for the different skill levels and personality types. international airport this is only 42.6%. The main reason for this is the lack of communication in the configuration of the international airport. The Xray operator flags luggage for a search, but the luggage check operator has to identify the weapon on the X-ray image himself. This extra step in the process caused a loss in performance of 32% and occurs solely because two officers independently had to recognize a weapon on an X-ray image instead of just one. 5.2.3.1. Relation with related work. ABC News reported in 2015 that in 95% of trials undercover investigators were able to smuggle mock explosives or banned weapons through checkpoints (Fishel et al., 2015). Two years later, this percentage decreased to a still extremely high value of around 80% (Kerley and Cook, 2017). These problems do not only exist in the United States. The Telegraph reported late 2014 that airport security failed to detect half of the dangerous weapons at Frankfurt airport (Huggler, 2014). The vulnerabilities that we found in this work (see Fig. 6) are close to the vulnerability as described in these public reports. However, to the best of our knowledge, there is no public data available that evaluates the effectiveness of security checkpoints specifically for different weapon types, as we do in our work. We have used all data that was available to calibrate the capabilities of the different sensors to detect each weapon type in our model. However, more work is needed to validate our results. This can be done by performing penetration tests with different weapon types at security checkpoints, and comparing these results with our simulation results.

Discussion of results
As mentioned in Section 4, the calibration of the model was based on a set of simplifying assumptions. These assumptions influenced the magnitude of the resulting vulnerabilities in different checkpoint configurations. The calibration of the model can be improved by performing field tests to determine different parameters. For instance, the performance of operators for searching luggage can be evaluated by providing security operators a set of luggage containing legal and illegal objects. Furthermore, perceived risks of objects can be evaluated for different operators using a similar method. Decision making of operators can be calibrated better by performing choice experiments, such as the one performed by Jesus (2018).
While vulnerability estimates are inherently hard to validate, some researchers performed real-life experiments (Ford, 2017;Gholami et al., 2017). This form of validation is a direction of further research for this work. Both calibration and validation of the model can still be improved, but the proposed model is still valuable for airport security practitioners, as it can be used to generate improved results when more data becomes available.
We did not consider all security mechanisms that are present in airports. For instance, intelligence agencies can detect attackers before they arrive at the security checkpoint. Behavior detection officers (United States Government Accountability Office, 2013; Winter and Cora, 2015) are also capable of detecting suspicious behavior at the security checkpoint and perform secondary screenings based on that. Furthermore, more strategic attacker behavior in which the attacker chooses the right type of weapon for the checkpoint configuration can be considered as well.
Agent-based modelling is an important tool to better understand complex systems. Using our model, vulnerabilities caused by imperfect human decision making and performance were identified. Understanding how these vulnerabilities emerge enables airport security managers and policymakers to improve their security policies and reduce vulnerabilities. For instance, in Section 5.2.1 we found that operators with a low skill level outperform their higher-skilled counterparts on checking luggage. This is a surprising result that hiring officials can take into account while hiring security operators. Furthermore, higher-skilled operators outperform lower-skilled operators on the more difficult task of examining X-ray images. We also quantify the effect of different skill levels on the performance of operators (see Fig. 4). This result can be used as a basis for hiring officials to hire operators for specific positions. Furthermore, they can use this result to better understand the effects of the composition of their current team on the performance of the security checkpoint.
Our model can also be used to test future concepts of security checkpoints. For instance, when X-ray officers do their work remotely, our model can be adapted with relative ease to determine the performance of such a setup. These types of experiments cannot easily be performed at airports, as it may interrupt security operations. Furthermore, experiments with humans are known to be hard to perform due to the diversity of human behavior. Using our model, these experiments can be performed more easily. This can, for instance, be done by hiring operators with the right personality type and skill set, or by taking these aspects into account while planning operators. Fig. 6. The performance of the checkpoint setups for the different weapons defined in this work. Performance is shown in terms of vulnerability (weapon not confiscated and no secondary screening; red bar), percentage of secondary screening (with or without weapon confiscation; green bar), confiscated weapons (without secondary screening; blue bar).
Our agent-based approach is more time consuming to perform than most other vulnerability assessment methodologies and requires a large amount of data for calibration. Other vulnerability assessment methodologies form better alternatives in cases with a lack of time or data, but our approach is particularly suitable to investigate vulnerabilities in which human behavior plays a role. A more in-depth discussion about the advantages and disadvantages of the use of agent-based modelling is discussed by Janssen et al. (2019b).

Conclusion
In this paper, we investigated how the decision-making and performance of human operators can be taken into account while assessing vulnerability at an airport security checkpoint. We developed an agent-based model, in which the performance of these operators was modelled using the functional state model, while decision making was modelled using decision field theory.
Simulation results indicated that the highest skilled operators outperformed their lowest skilled counterparts on analyzing X-ray images, but performed worse on both searching luggage and performing patdowns. This lead to similar differences in security checkpoint vulnerabilities as well. These skilled operators found their tasks too easy and are unable to motivate themselves to put in the required effort. Furthermore, the goals the operator focuses on during the decision-making process were found to influence vulnerability. A high focus on accuracy or perceived risk for the X-ray operator lead to an increase in luggage searches, and therefore reduced vulnerabilities. However, a high focus on speed leads to a decrease in luggage searches and therefore increased vulnerability. The developed model can be used to assess the effect of human behavior and decision making on the performance of current and future security checkpoint procedures, which is often impossible in real-life experiments. More work is needed to calibrate and validate the model and simulation results, but initial results are promising.
This work can be extended by investigating how other types of security measures (i.e., behavior detection officers) influence the vulnerability of the security checkpoint. The vulnerability with respect to other threat scenarios (i.e., a bomb attack before the security checkpoint) can be investigated as well. Finally, the model can be calibrated better by using classified data on sensor performance, operator performance and attacker behavior.