Assessment of Training Effectiveness Adjusted for Learning (ATEAL) Part I: Method Development and Validation



Introduction
Employee training in work environments is a popular way to increase competency and/or change expected behavior (Tai, 2006). Globally, organizations spent $359 billion on training in 2016 (Glaveski, 2019), with the US spending a total of $87.6 billion in 2018 (Freifeld, 2018). With this substantial amount of resources being invested, it is critical that organizations are able to ensure that the training is effective and is leading to the expected changes. Measuring training effectiveness using training evaluations or assessments is the most widely used method to understand and quantify the deficiencies in training programs and to develop prescriptions for improvement (Alvarez, Salas, & Garofano, 2004; Simkins & Allen, 2000). Of the various models presented by Campbell-Kyureghyan, Ahmed and Beschorner (2013) and in the meta-analysis conducted by Alvarez et al. (2004), Kirkpatrick's model (Kirkpatrick, 1967) remains one of the most frequently used in training environments to measure training effectiveness (Arthur Jr., Bennett Jr., Edens, & Bell, 2003; Salas & Cannon-Bowers, 2001). Kirkpatrick's model comprises four evaluation levels that measure participants' Reaction (Level 1), Learning (Level 2), Behavior (Level 3) and Results (Level 4). The evaluation of Learning (Level 2) is typically measured by scores attained in post-test assessments or by score changes between pre- and post-test assessments (Dimitrov & Rumrill Jr., 2003). The tests that are administered are typically Multiple Choice Question (MCQ) tests, as they are the most expeditious to administer (Bar-Hillel & Budescu, 2005).

Learni
The response categories shown in Figure 1 are defined as follows:
• CC: The question is answered correctly in both the pre- and post-test, indicating that the participants had pre-knowledge of the question or concept
• CI: The question is answered correctly in the pre-test and incorrectly or as IDK in the post-test, indicating that the participants experienced negative learning of the question or concept
• IC: The question is answered incorrectly or as IDK in the pre-test and correctly in the post-test, indicating that the participants learned the concept
• II: The question is answered incorrectly or as IDK in both the pre- and post-test, indicating that the participants did not learn the question or concept

Traditional Assessment Metrics
Assessment metrics are used to measure the effectiveness of the training and to help determine if there has been an increase in the level of knowledge for the learning objectives among the participants. There are several traditional metrics used to assess pre-/post-training effectiveness. It is important to note that learning is the measure of success, and that training is one technique that can be used to impart the desired knowledge. In this manuscript, training impact is the impact of the training on learning of the concepts that were trained.
The most common method to assess testing results for a certain question or concept is to report the number of participants who answered the question correctly in the post-test as a proportion of the total number of participants who answered the question (Campbell-Kyureghyan et al., 2013). This metric is referred to as TPC and, using the response categories from Figure 1, is given in (1).
\[ \mathrm{TPC} = \frac{CC + IC}{CC + CI + IC + II} \tag{1} \]
The key benefits of TPC are that it can be easily calculated, explained, and understood by the training participants and other organizational stakeholders. However, it gives only a broad-stroke representation of the learning of the participants and thus of the performance of the trainee. It is very difficult to discern participant pre-knowledge from actual learning, and the use of this metric to make improvements to the training programs is problematic. Additionally, this metric does not provide an understanding of the negative learning that any of the participants may have experienced, where negative learning is defined as answering the pre-training question correctly and answering incorrectly on the post-training assessment (CI).
Another method to assess learning is to examine the difference between the number of participants who answered the question correctly in the post-test and in the pre-test, which can only be used when the same questions are administered before and after the training (Bonate, 2000). Formula (2) below illustrates the calculation of this metric, referred to as PPPC, in the case of a pre-/post-training assessment model with or without an IDK option. As seen in Figure 1, both the IDK and an incorrect answer are treated identically.
\[ \mathrm{PPPC} = \frac{(CC + IC) - (CC + CI)}{CC + CI + IC + II} = \frac{IC - CI}{CC + CI + IC + II} \tag{2} \]
Similar to the TPC metric, this measure is easy to calculate, explain and understand. It can also be used to determine the number of participants who answered a certain question correctly. Additionally, it compensates for participants who might have experienced negative learning. However, it is difficult to discern what percentage of the participants actually learned the new concept, as this measure is insensitive to the prior knowledge of the participants. This also means that it does not allow for determination of the total knowledge of the participants.
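The two traditional metrics can be sketched in a few lines of Python. This is a minimal illustration, assuming TPC is the post-test proportion correct and that the pre/post difference (called PPPC later in the paper) is the post-test proportion correct minus the pre-test proportion correct. The function names and the counts are hypothetical and are not taken from the paper.

```python
def tpc(cc, ci, ic, ii):
    """Proportion of respondents answering correctly in the post-test (CC + IC)."""
    total = cc + ci + ic + ii
    return (cc + ic) / total


def pppc(cc, ci, ic, ii):
    """Post-test proportion correct minus pre-test proportion correct."""
    total = cc + ci + ic + ii
    post_correct = cc + ic  # correct after training
    pre_correct = cc + ci   # correct before training
    return (post_correct - pre_correct) / total


# Hypothetical counts for one question answered by 100 participants.
counts = dict(cc=30, ci=10, ic=40, ii=20)
print(f"TPC  = {tpc(**counts):.2f}")   # 0.70
print(f"PPPC = {pppc(**counts):.2f}")  # 0.30
```

In this example TPC reports 70% correct even though 30 of those 70 participants already knew the answer, while the pre/post difference nets the 40 learners against the 10 participants with negative learning.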

Assessment of Training Effectiveness Adjusted for Learning (ATEAL)
In an effort to overcome the deficiencies of the metrics detailed above, this paper introduces and validates the ATEAL method, which compensates the assessment scores for trainee prior knowledge and guessing. A description of this method starts with the introduction of the Learning Adjustment Coefficient (LAC) and the Net Training Impact Coefficient (NTIC). A number of intermediate metrics and parameters, which will subsequently be used in the calculation of these two coefficients, are defined first.

Prior Knowledge (PK)
This metric represents the proportion of all participants who answered a question correctly in the post-training assessment who also answered correctly in the pre-training assessment; it is calculated using formula (3) below.
\[ \mathrm{PK} = \frac{CC}{CC + IC} \tag{3} \]
This metric ranges from 0 to 1, where 0 implies that none of the participants who answered the question correctly in the post-training assessment had any prior knowledge of the concept and 1 implies that all of the participants who answered the question correctly in the post-training assessment had prior knowledge of the concept. That is, a higher PK indicates greater prior knowledge among the participants. This metric differs from CC expressed as a fraction of all the participants answering the question, since it better estimates the proportion of correctly answering participants with prior knowledge.

Positive Training Impact (PTI)
This metric represents the proportion of all the participants who needed to learn the concept (responded incorrectly or IDK in the pre-test assessment) who actually did learn the concept, as indicated by their response changing to correct in the post-test. It is described below in (4).
\[ \mathrm{PTI} = \frac{IC}{IC + II} \tag{4} \]
This metric ranges from 0 to 1, where 0 implies that none of the participants who could potentially learn actually learned the concept, and 1 implies that all of the participants who could potentially learn actually learned the concept. That is, a higher PTI indicates more learning among the participants who did not know the concept prior to training. This metric differs from IC expressed as a fraction of all the participants answering the question, since it better estimates the proportion of participants who did not know the concept prior to training and who learned the concept.

Negative Training Impact (NTI)
This metric represents the proportion of participants who presumably knew the concept prior to training (answered correctly in the pre-training assessment) who answered incorrectly or IDK in the post-test assessment, potentially due to confusion during the training or guessing. It is described below in (5).
\[ \mathrm{NTI} = \frac{CI}{CC + CI} \tag{5} \]
This metric ranges from 0 to 1, where 0 implies that none of the participants were negatively impacted by the training and 1 implies that all of the participants (who knew the material prior to training) were negatively impacted by the training. That is, a higher NTI indicates that more participants "unlearned" the material after training. This metric differs from CI expressed as a fraction of all the participants answering the question, since it better estimates the proportion of participants who experienced a negative impact from the training.
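The three intermediate metrics can be sketched as follows, reading the formulas directly from the verbal definitions above: PK is CC over all post-test correct answers, PTI is IC over all pre-test incorrect answers, and NTI is CI over all pre-test correct answers. The function names are hypothetical, and returning 0 when a denominator is zero is an assumption made here for illustration; the text does not state how empty categories are handled.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 for an empty denominator (assumption)."""
    return numerator / denominator if denominator else 0.0


def prior_knowledge(cc, ic):
    """PK: share of post-test correct respondents who were also correct in the pre-test."""
    return safe_ratio(cc, cc + ic)


def positive_training_impact(ic, ii):
    """PTI: share of pre-test incorrect respondents who were correct in the post-test."""
    return safe_ratio(ic, ic + ii)


def negative_training_impact(cc, ci):
    """NTI: share of pre-test correct respondents who were incorrect in the post-test."""
    return safe_ratio(ci, cc + ci)


# Hypothetical counts for one question answered by 100 participants.
cc, ci, ic, ii = 30, 10, 40, 20
print(f"PK  = {prior_knowledge(cc, ic):.3f}")           # 0.429
print(f"PTI = {positive_training_impact(ic, ii):.3f}")  # 0.667
print(f"NTI = {negative_training_impact(cc, ci):.3f}")  # 0.250
```

Each metric uses a different denominator, which is why these values differ from CC, IC and CI taken as fractions of all 100 participants (0.30, 0.40 and 0.10 here).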

Learning Adjustment Coefficient (LAC)
The LAC is intended to measure the necessity of the training. That is, it compares the positive impact of the training, determined through the PTI, to the prior knowledge (PK) of the participants. This difference between (actual) learning and prior knowledge is calculated as shown in (6):
\[ \mathrm{PTI} - \mathrm{PK} \tag{6} \]
This difference ranges from -1 to +1. To make the scale more intuitive, it is transformed to represent a proportional change, resulting in the Learning Adjustment Coefficient as shown in (7):
\[ \mathrm{LAC} = \frac{(\mathrm{PTI} - \mathrm{PK}) + 1}{2} \tag{7} \]
The LAC ranges from 0 to 1, where 0 implies that all the respondents had prior knowledge so there was no actual learning for that specific concept/question, and 1 implies that there was no prior knowledge and all the respondents who needed to learn the concept did learn the concept. Higher values of LAC thus indicate that the training was needed, and effective, for a higher proportion of the respondents. Lower values indicate that either the training was ineffective, or a substantial number of respondents had previous knowledge and did not require training on the concept.
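As a worked illustration, assume that the difference in (6) is PTI minus PK and that the transformation in (7) is the linear rescaling of the interval from -1 to +1 onto the interval from 0 to 1. For a hypothetical question with CC = 30, CI = 10, IC = 40 and II = 20 (counts chosen only for this example):

\[
\begin{aligned}
\mathrm{PK} &= \frac{30}{30 + 40} \approx 0.429, \qquad
\mathrm{PTI} = \frac{40}{40 + 20} \approx 0.667, \\
\mathrm{PTI} - \mathrm{PK} &\approx 0.238, \qquad
\mathrm{LAC} = \frac{0.238 + 1}{2} \approx 0.619 .
\end{aligned}
\]

A value above 0.5 indicates that learning among those who needed it outweighed prior knowledge among those who answered correctly after training.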

Net Training Impact Coefficient (NTIC)
The NTIC is intended to measure the negative impact of the training session. That is, it compares the positive training impact (PTI) to the negative training impact (NTI).
First, a set of hypothetical scenarios was modelled, which allows for clear expectations and intuitive insight into the meaning of the metrics. Second, a simulation was performed to allow for investigating a larger number of possible outcomes and scenarios, across the range of possibilities. The results of the traditional and proposed metrics were compared to determine their relationship and aid in interpretation of all metrics.

Hypothetical Scenarios
The scenarios, detailed in Table 1, were developed to represent the responses (using the categories from Figure 1) of a hypothetical group of 100 training participants. These scenarios were chosen as they represent the extremes of learning outcomes in a pre-/post-test assessment model as well as a middle ground of participant performance during a training assessment. The scenarios shown in Table 1 include various combinations of complete (C), high (H), moderate (M), and zero (Z) levels of baseline knowledge, positive learning, and negative learning. The LAC and NTIC were calculated for each of these scenarios and plotted on the matrix in Figure 3 (see Results section) to illustrate their quadrant placement and how they can be interpreted. Additionally, the TPC and PPPC were calculated for each of the scenarios so that a comparison can be made in terms of how each metric reports the effectiveness of the training (see Table 3 in the Results section).

Data Simulation
To further expand on the scenarios modelled and examine a larger population of questions and trainees, a random number generator (in MS Excel) was used to generate 100 participant responses on 1000 questions for both pre- and post-training. The MS Excel random number generator generates numbers from a uniform distribution, ranging from 0 to 1, and the generation technique produced data for CC, CI, IC and II. The uniform distribution was considered a good way to generate the data as it does not make any preconceived assumptions about how participants would respond in an assessment or whether they would learn a concept. That is, it allows for equal probabilities of the possible outcomes. The number of participants generated for any one quadrant ranged from 0 to all of the participants, and the counts across the four pre-/post-conditions sum to the 100 participants answering each question.
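A comparable data set can be sketched in Python. The text does not give the exact spreadsheet procedure, so the scheme below is an assumption: three uniformly drawn cut points split the 100 participants into the four categories, which lets each category range from 0 to 100 while the counts always sum to 100. The function name and the seed are hypothetical.

```python
import random


def simulate_question(n_participants, rng):
    """Split n_participants into (CC, CI, IC, II) using three uniform cut points."""
    cuts = sorted(rng.randint(0, n_participants) for _ in range(3))
    cc = cuts[0]
    ci = cuts[1] - cuts[0]
    ic = cuts[2] - cuts[1]
    ii = n_participants - cuts[2]
    return cc, ci, ic, ii


rng = random.Random(2024)  # fixed seed, used only so the sketch is repeatable
questions = [simulate_question(100, rng) for _ in range(1000)]

# Every simulated question accounts for all 100 participants.
assert all(sum(q) == 100 for q in questions)
print(questions[:3])
```

Other schemes, such as assigning each participant independently to one of the four categories, would give a different spread of counts; the cut-point approach is used here only because it reaches the extreme cases described in the text.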

Results
Results of the simulations are presented with an emphasis on comparing the traditional and newly proposed assessment metrics, and the relationship between the two new metrics.

Scenario Results
Table 3 illustrates the metrics calculated for each of the twelve scenarios detailed in Table 1. In Scenario 1, where all the participants have pre-knowledge of the concept taught, the TPC reports the score as 100%, implying that all the participants learned the concept, which is an incorrect interpretation of the training effectiveness.
The PPPC reports the score as 0%, implying that none of the participants learned the concept. Although this is a correct interpretation of training effectiveness, it is not distinguishable from Scenario 4, and it would not be possible to distinguish between concepts for which all participants had pre-knowledge and concepts for which there was zero learning. Looking at the two new coefficients for Scenario 1, the LAC is 0, implying that 100% of the participants had prior knowledge and none learned the topic during training, and the NTIC is 0, implying that there is an equal amount of positive and negative training impact. The two introduced coefficients must be examined together to clearly understand the performance of the participants for each scenario. To visualize the implication of each scenario, the Training Effectiveness Matrix (TEM) is provided in Figure 3 and includes each scenario labeled by its number. From the matrix we can easily see that Scenarios 2, 5, 6, 7, and 8 show a positive training impact on the participants, to varying degrees, and it is easy to visualize the magnitude of the impact based on how the points lie in the upper right quadrant. We can also see that Scenario 8, along with Scenario 1, consists of more prior knowledge than learning, representing cases in which the training was perhaps unnecessary. Scenario 4 shows zero learning impact, and participants had equal learning and prior knowledge. Finally, Scenarios 3, 9, 10, 11 and 12 show more negative training impact than positive impact. Similar interpretations for most, but not all, scenarios can be made by looking at the PPPC. However, it is not possible to make that same determination using the TPC. Thus, the LAC and NTIC provide a finer resolution than the PPPC and TPC. This additional information will help trainers and organizations better understand whether the concept needs to be taught and ensure that the participants experience more positive than negative learning due to the content presented or the method by which it was delivered. It can also be used to determine the amount of supervisor support and reinforcement needed to help support the use of skills (Russ-Eft, 2002).
In analyzing the same scenarios with TPC, we see that this metric is overly optimistic in its interpretation of the participants' performance. As shown in Table 3, for Scenarios 1, 5, and 6, TPC reports the performance of participants as 100%. This would imply that all participants learned these concepts; however, in these scenarios, all participants had prior knowledge. In using this metric, we would interpret the training as extremely effective, although the participants would feel that the training of the concept was a waste of time because they already knew it. The correct course of action for a concept that behaves like Scenario 1 is to either not train on the concept or do a cursory training without testing on the concept, and focus instead on concepts for which the participants have less prior knowledge. In Scenario 3, all the participants exhibited negative learning, but the TPC reports the performance as 0%, implying that there was no learning among the participants. In this scenario we know that the participants were, in effect, guessing or losing knowledge due to the training process, which would indicate that there were significant issues with the content or the method of delivery. It is not possible to distinguish between this outcome (Scenario 3) and Scenario 4, in which all the participants answered incorrectly in both the pre- and post-test assessments. Additionally, when using the TPC metric to measure training effectiveness, it is not possible to distinguish between Scenarios 9, 10, 11, and 12, which all had differing amounts of negative learning and of participants answering incorrectly in both the pre- and post-test assessments. This severely limits the understanding of participant performance and the determination of needed training improvements.
When examining the scenario results using PPPC, we observe that this metric performs better than the TPC metric in representing the learning of the participants. In Scenario 1, it reports that there was no learning by the participants since they had 100% prior knowledge; however, unlike the ATEAL method, it is not possible to easily discern whether the low score is due to prior knowledge, a lack of learning, or guessing. In Scenario 2, PPPC indicates that the participants experienced 100% learning, the same as the ATEAL method. This is distinctly different from the results illustrated by the TPC (100% in both Scenarios 1 and 2) and helps the trainers better understand the impact of the training. In Scenario 3, PPPC reports a result of -100% since all the participants experienced negative learning, the same as the ATEAL method, which plots Scenario 3 at the lowest score in Quad 3. In Scenarios 5, 6 and 7, the PPPC reports positive learning based on changes in the number of participants who have prior knowledge and those experiencing positive learning. When there is more negative learning than positive learning or prior knowledge (Scenarios 9, 10, 11 and 12), PPPC reports a negative value, thereby indicating that there is a significant issue with the training and that the participants are being affected in a negative manner. These negative results are similar to the ATEAL method, which plots these scenarios in Quad 3.
In Scenario 8, the PPPC reports that the participants experienced positive learning; however, using the ATEAL method, we are very quickly able to diagnose that Scenario 8 had more prior knowledge than positive learning. This is not readily apparent when looking at the PPPC results, and it requires the trainers/assessors to review the raw data to arrive at the conclusion that the ATEAL method readily provides. Additionally, in Scenarios 1 and 4, PPPC reports that no participants learned the concept trained; however, when using the ATEAL method, we observe that in Scenario 1 all the participants had prior knowledge of the concept taught and did not need to learn the concept, and that in Scenario 4 none of the participants exhibited any learning.
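The four extreme scenarios discussed above (all participants in CC, IC, CI and II for Scenarios 1 to 4, respectively) can be checked with a short sketch. It makes several assumptions not spelled out in the text: TPC and PPPC are computed as proportions of all respondents, LAC is (PTI - PK + 1) / 2, NTIC is PTI - NTI, and any ratio with an empty denominator is taken as 0. Table 1 is not reproduced here, so only these four scenarios are included.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 for an empty denominator (assumption)."""
    return numerator / denominator if denominator else 0.0


def metrics(cc, ci, ic, ii):
    """Compute the traditional metrics and the two ATEAL coefficients for one question."""
    total = cc + ci + ic + ii
    pk = ratio(cc, cc + ic)   # prior knowledge
    pti = ratio(ic, ic + ii)  # positive training impact
    nti = ratio(ci, cc + ci)  # negative training impact
    return {
        "TPC": (cc + ic) / total,
        "PPPC": (ic - ci) / total,
        "LAC": (pti - pk + 1) / 2,  # assumed linear rescaling to the range 0 to 1
        "NTIC": pti - nti,          # assumed difference of positive and negative impact
    }


# Scenario number -> (CC, CI, IC, II) for 100 participants.
scenarios = {
    1: (100, 0, 0, 0),  # everyone already knew the concept
    2: (0, 0, 100, 0),  # everyone learned the concept
    3: (0, 100, 0, 0),  # everyone experienced negative learning
    4: (0, 0, 0, 100),  # nobody knew or learned the concept
}
for number, counts in scenarios.items():
    print(number, metrics(*counts))
```

Under these assumptions Scenarios 1 and 2 share a TPC of 1.0 and Scenarios 1 and 4 share a PPPC of 0.0, whereas the (LAC, NTIC) pairs are (0.0, 0.0), (1.0, 1.0), (0.5, -1.0) and (0.5, 0.0), so all four scenarios are separated.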

Simulation Results
In interpreting the results of the simulation using the ATEAL methodology, we observe that the LAC is the most sensitive (slope of -0.82) of all the metrics to prior knowledge of the participants. This implies that as the prior knowledge among the participants increases, for a certain question or concept taught, the value of the LAC decreases. Similarly, the NTIC is the most sensitive (slope of -0.82) of all the metrics to negative training impact. As in the case of the LAC, this implies that as the participants experience more negative training for a certain question or concept, the value of NTIC decreases, and when 100% of the participants experience negative training, all associated NTIC values are negative. Thus, the use of these two coefficients to develop the TEM enables the matrix to be more sensitive to the effects of prior knowledge and negative training when reporting the training effectiveness for the concepts taught.
It is also important to note that the NTIC can be sensitive to the number of trainees with prior knowledge. If a small number of trainees have prior knowledge, the NTI can be large, even if only one or two trainees experienced negative learning. Conversely, if most trainees have prior knowledge, the PTI is greatly impacted by even a small number of trainees who learn the concept. Thus, either very high or very low values of NTIC must be further examined to determine the cause, since either extreme case may indicate problems with the training related to prior knowledge rather than the training quality.
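A small worked example shows this sensitivity, taking NTI as CI / (CC + CI) and PTI as IC / (IC + II) per the definitions above. The counts are hypothetical. Suppose only 3 of 100 trainees answer correctly in the pre-test and 2 of them answer incorrectly in the post-test:

\[
\mathrm{NTI} = \frac{CI}{CC + CI} = \frac{2}{1 + 2} \approx 0.67 ,
\]

even though only 2% of all trainees experienced negative learning. Conversely, if 97 trainees answer correctly in the pre-test and 2 of the remaining 3 learn the concept:

\[
\mathrm{PTI} = \frac{IC}{IC + II} = \frac{2}{2 + 1} \approx 0.67 ,
\]

so two trainees alone move the PTI, and hence the NTIC, by about two thirds of its range.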
The TPC metric is completely insensitive to participant prior knowledge and treats it as learning, which is troublesome as it does not give feedback to the trainers or the organization that would help improve the training and better focus on the participants' knowledge gaps. It paints an overly optimistic picture of the training when, in effect, the participants' and organizations' time might be wasted by the training. Additionally, the participants could be getting bored during the training, causing them to lose focus and pay less attention to the concepts that they actually do not know and need to learn. The TPC does illustrate a negative trend when the participants experience negative training. This is due to the fact that participants experiencing negative learning answer incorrectly in the post-test assessment, thus reducing the TPC score. The score, however, does not clearly show that this is due to negative learning, and it can be interpreted to mean that the participants did not learn the content being trained, which is a completely different scenario.
Finally, the PPPC metric is sensitive to prior knowledge, as it decreases with an increase in prior knowledge, as noted by several authors (e.g., Bonate, 2000; Dimitrov & Rumrill Jr., 2003; Tannenbaum & Yukl, 1992). The PPPC also has similar sensitivity towards negative training impact, in that it decreases with an increase in negative training impact. However, unlike the NTIC, when the negative training impact is close to 100%, a small percentage of the data points are greater than zero. This makes interpretation of the PPPC metric slightly more challenging than the NTIC, in which all the values are negative when 100% of the participants experience negative training. Additionally, it is difficult to discern participant performance when there is a low positive score; that is, we are not able to easily determine whether the low score was due to high prior knowledge or due to negative learning. Hence, it is difficult to quickly determine the countermeasures that are needed to improve the effectiveness of the training.
The comparisons of the scenario and simulation results using these metrics, and the associated discussions in this section, allow us to observe the following benefits of the newly introduced ATEAL:
- It is much more effective in helping determine the true performance of the participants in a training session for each concept taught.
- The metrics involved are easy to calculate and provide visual guidelines for the training providers and the organizations on the best and worst learned concepts.
- It is much more specific than the other two metrics and helps to quickly diagnose issues with participant performance by identifying whether the training should be improved (by making the content taught more challenging, to get around prior knowledge) or if the training is causing confusion among the participants and thus reducing their learning.

Conclusion/Future Direction
Metrics to quantify the amount of learning that training participants exhibit for a particular training course, or concepts within the course, are critical to understanding and quantifying the effectiveness of the training. The Assessment of Training Effectiveness Adjusted for Learning (ATEAL) method is introduced in this paper and defines new metrics to measure the level of prior knowledge, as well as positive and negative training impacts experienced by the participants. Additionally, it introduces two coefficients, the Learning Adjustment Coefficient (LAC) and the Net Training Impact Coefficient (NTIC), that are plotted in a novel way to create the Training Effectiveness Matrix (TEM). This matrix helps visually assess the performance of the participants for each question/concept introduced in the training. The method proves effective in quickly identifying the training gaps that the participants experienced and providing direction on the countermeasures that should be taken for each concept trained.
Validation of this new method and comparison of its performance to the traditional metrics of TPC and PPPC were conducted using scenario modelling and a simulation. Some recommendations that can be derived from this study are:
• Using only the TPC in the post-test assessment to assess training effectiveness (i.e., how much the participants learned) may give a highly inaccurate impression and does not provide clear guidance on areas of improvement.
• The PPPC is a much better metric than the TPC to assess training effectiveness, but it lacks the ability to quickly provide guidance on changes to be made to the training content or training delivery to improve training effectiveness.
• The use of the ATEAL method in calculation of the Learning Adjustment Coefficient and the Net Training Impact Coefficient is extremely easy and interpretation using the Training Effectiveness Matrix is intuitive and visual.

Figure 4. Simulation data points plotted on the TEM.

Table 1.
Scenario model data sets, where C = complete, H = high, M = moderate, L = low, Z = zero

Table 2.
Excerpt of the values for the simulation model and the calculated training effectiveness metrics
Table 2 is an excerpt of the values of CC, IC, CI and II for the simulation and illustrates the resulting training effectiveness metrics for each question.

Table 3.
Metrics calculated for each scenario