A New Scoring Procedure in Assessment Centers: Insights from Interaction Analysis

This paper proposes interaction analysis as an alternative scoring procedure in assessment centers (ACs). Interaction analysis allows for a more fine-grained scoring approach by which candidate behaviors are captured as they actually happen, thus avoiding judgment errors typically associated with traditional scoring procedures. We describe interaction analysis and explain how this procedure can improve the validity of ACs. In a short research example, we showcase how interaction analysis can be implemented in AC settings. Finally, we integrate our arguments in terms of three key propositions which we hope will inspire future research on more dynamic scoring procedures.


Personnel Assessment And decisions interAction AnAlysis in Acs
ACs), the codings of the behaviors are far less subjective; behavioral units are classified and not immediately judged on their effectiveness, nor linked to a trait or competency, thereby reducing the risks of rater errors. Therefore, we think interaction analysis will result in more accurate and less subjective evaluations than traditional AC scoring procedures. Table 1 provides an overview of the differences between the traditional AC scoring procedure and the new procedure.

Basic Steps in Interaction Analysis
Although specific research questions and applications across these different settings differ widely, the general approach to understanding and analyzing behavioral processes using interaction analysis is quite similar. The following basic steps have been described in detail in Lehmann-Willenbrock and Allen (2018) as well as Meinecke and Lehmann-Willenbrock (2015). Here, we apply them to the specific case of AC exercises. First, the interested researcher will need to set up the behavioral data gathering. To this end, most previous studies using a quantitative interaction analytical approach rely on videotaped behavioral data, which allows the identification of both verbal and nonverbal behavior. It can be played back repeatedly for additional or follow-up analyses and can also be used for training and feedback material at a later point. Previous research suggests that groups tend to ignore or forget the camera as soon as a group discussion is under way (Kauffeld & Lehmann-Willenbrock, 2012). As long as only verbal behavior is of interest or when videotaping is not possible, audiotaped data may also be an option (e.g., Meinecke, Lehmann-Willenbrock, & Kauffeld, 2017).
Once video (or, less ideally, audio) data become accessible, the specific phenomena that are to be identified from the data have to be defined. Subject matter experts could be asked to develop coding schemes for specific AC dimensions (e.g., integrity, valuing diversity, adaptability, problem solving, or conflict resolution). However, using an existing coding scheme is often preferable, as findings can then be related to theoretical models. Coding schemes generally focus on the occurrences of specific behaviors: coders label each specific behavior taking place during the exercise (e.g., "suggesting a solution" or "presenting an idea") without making inferences about the candidate's traits or competencies (see Table 2 for an example of a coding scheme). Here, the difference between the traditional AC scoring procedure and interaction analysis becomes clear. If, for example, the goal of an AC exercise is to assess how candidates approach and solve a complex problem, assessors would traditionally use some kind of Likert-type scale to score the candidate's overall skills based on the behaviors they have seen during the exercise. With interaction analysis, the occurrence of any specific behavior related to problem solving (e.g., "describing a problem," "defining the objective," or "describing a solution") is coded as it actually happened in time. Upon deciding on a coding scheme that is suitable for analyzing the relevant question(s) and capturing the behavioral units of interest, coders need to be trained in order to establish inter-rater reliability. In the case of interaction analysis, inter-rater reliability is examined by having several (i.e., at least two) trained raters code the same video material and calculating the degree to which they reach the same conclusions regarding each coded behavioral unit.
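As an illustration of the reliability check described above, the following minimal sketch computes Cohen's kappa for two coders who labeled the same behavioral units. It assumes that the units are already aligned (i.e., both coders worked from the same unitized material); the function name and the example codes are hypothetical and not taken from any study data.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labeled the same behavioral units."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must label the same non-empty set of units.")
    n = len(codes_a)
    # Observed agreement: share of units that received the same code from both coders.
    p_observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: product of the coders' marginal proportions, summed over codes.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both coders used one identical code for every unit
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codings of six behavioral units by two coders.
coder_1 = ["solution", "problem", "solution", "procedural question", "support", "solution"]
coder_2 = ["solution", "problem", "problem", "procedural question", "support", "solution"]
kappa = cohens_kappa(coder_1, coder_2)
print(f"Cohen's kappa = {kappa:.2f}")
```

Kappa corrects the raw percentage of agreement for the agreement that would be expected by chance given how often each coder used each category, which is why it is usually preferred over simple percent agreement for categorical coding schemes.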
Then, the question of unitizing needs to be addressed: Where does one behavioral unit start and stop, and when will a new unit be assigned? Unitizing rules can differ depending on the goal of the assessment but typically adhere to one of the following rules: (a) turns of talk (i.e., assign a new behavioral unit as soon as the speaker changes; e.g., Chiu, 2008), (b) utterances (i.e., assign a new behavioral unit when a functionally different statement begins; e.g., Lehmann-Willenbrock et al., 2015), or (c) specific temporal segments within a conversation (e.g., 2-minute segments, Barsade, 2002; or predefined group discussion phases). Within ACs, we would recommend unitizing based on either turns of talk or utterances, as this would allow focusing on candidate behaviors at the individual level. Note that a unitizing rule based on specific utterances could mean that two consecutive behavioral units are contributed by the same candidate, for example when a candidate voices an idea and immediately follows up with a question to the other candidates. Unitizing based on specific temporal segments could be useful when one is interested in answering more generalized research questions, such as how specific behaviors (e.g., humor) within certain time fragments (e.g., the start of the exercise) affect overall assessment ratings.
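The difference between utterance-based and turn-based unitizing can be sketched as follows. The example assumes that a transcript is already available as an ordered list of (speaker, utterance) pairs; the function name and the utterances are hypothetical.

```python
from itertools import groupby


def unitize_by_turns(utterances):
    """Collapse consecutive utterances by the same speaker into one turn of talk."""
    turns = []
    # groupby only merges adjacent items, so a speaker who talks again later
    # in the discussion starts a new turn.
    for speaker, group in groupby(utterances, key=lambda u: u[0]):
        turns.append((speaker, [text for _, text in group]))
    return turns


# Hypothetical utterance-level units: candidate B voices an idea and
# immediately follows up with a question (two units by the same speaker).
utterances = [
    ("B", "We could move two people to the night shift."),
    ("B", "What do the others think?"),
    ("A", "How much time do we have left?"),
    ("C", "About ten minutes."),
]
turns = unitize_by_turns(utterances)
print(len(utterances), "utterance units;", len(turns), "turns of talk")
```

Under the utterance rule, candidate B contributes two consecutive units; under the turn rule, these are merged into a single unit, so the functional distinction between the idea and the question is lost.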
Once inter-rater reliability is established and the data have been coded, there are several options for examining the annotated data. These include frequency analysis, co-occurrence analysis, lag sequential analysis, or pattern analysis for identifying behavioral triggers and emergent behavioral patterns (for an overview, see . There are several software solutions available that facilitate the coding process substantially, such that traditional transcripts of the observed behaviors are no longer required. These software solutions preserve the temporal order of the interaction data by registering time stamps (i.e., onset and offset times) along with each coded behavior (for an overview and comparison of possible software options, see Lehmann-Willenbrock and Allen, 2018).

Applying Interaction Analysis in ACs: An Example
To illustrate how interaction analysis can inform and improve ACs, we showcase the results of a laboratory study. In this study, we set up group discussions that exemplify the typical group setup in an AC exercise. These group discussions were videotaped, and an interaction analytical procedure was used to explore the utility of this method for measuring leadership. Note that although we applied interaction analysis to a leaderless group discussion, this new scoring procedure can be used for any type of AC exercise that involves actual interactions (e.g., interviews and role plays).

Sample and Procedure
We recruited 30 groups of three participants at a large university in the Netherlands. The majority of the participants were psychology students, and two-thirds of them were female (60 out of 90 participants). Their age ranged from 18 to 34 years (M = 22.64, SD = 3.67). Participants could choose between earning participation credits and receiving 10 euros of remuneration. The experiment was formally approved by the ethics committee at the participating university. Each participant was randomly assigned to one of three roles in a leaderless group discussion (i.e., HR manager, production manager, or sales manager) and provided with unique role-based background information that needed to be revealed and synthesized to reach a solution (Klehe et al., 2012; Klehe, König, Richter, Kleinmann, & Melchers, 2008). Participants were given 10 minutes to prepare and up to 30 minutes for the actual group discussion.

Measures
The discussions were rated by two randomly chosen assessors out of a pool of five trained graduate students, using traditional AC observer rating sheets (see Appendix; in this study we focused on the leadership items; α = .82; ICC = .78). In addition, we videotaped all interactions and analyzed the data using the act4teams coding scheme (Table 2) implemented in INTERACT software (Mangold, 2010). The videos were independently coded by two trained graduate students. As depicted in Table 2, the act4teams scheme describes (verbal) social interactions in terms of four broader dimensions: problem-focused behaviors, procedural behaviors, socio-emotional behaviors, and action-oriented behaviors. Although this coding scheme has been developed to score a broad range of behaviors in team contexts and not necessarily emergent leadership behaviors, there is considerable overlap between the behaviors in the act4teams coding scheme and emergent leadership behaviors (e.g., Kickul & Neuman, 2000; Lord, Phillips, & Rush, 1980). For details on the theoretical background of this coding scheme and its development and validation, see Kauffeld and Lehmann-Willenbrock (2012). In order to establish the reliability of our coding approach, five interactions were coded by both students, showing sufficient inter-rater agreement (Cohen's κ = .75).

Figure 1.
Sample segment (first 5 minutes) showing participants and specific verbal contents. Each line in the graph represents one specific behavior by one specific participant. For example, the first line shows instances (and time stamps) when participant B contributed solutions in the discussion process.

Individual Behaviors Related to Leadership Potential
To explore which behavioral patterns were indicative of leadership, we related each specific type of behavior coded with the act4teams scheme to the overall leadership score obtained from the traditional AC rating for each individual in the group. To do so, we counted the absolute frequency of each specific type of behavior coded with the act4teams scheme per observed individual participant. Because the discussions varied somewhat in duration (M = 19.77 minutes; range = 11-30), we rescaled the absolute frequency of each behavioral category per participant to a 20-minute period (i.e., dividing each frequency by the respective discussion length in minutes and multiplying by 20). We then calculated Pearson's correlations at the individual level (N = 90) to explore the relationship between the coded behaviors and the overall leadership rating for each participant.
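The rescaling and correlation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis script used in the study: the function names are hypothetical, and the five participants and their values are invented purely to make the example runnable.

```python
import math


def rate_per_window(count, duration_minutes, window_minutes=20.0):
    """Rescale a raw behavior count to a common observation window."""
    if duration_minutes <= 0:
        raise ValueError("Discussion length must be positive.")
    return count * window_minutes / duration_minutes


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("Need two lists of equal length with at least two values.")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("Correlation is undefined when a variable is constant.")
    return sxy / math.sqrt(sxx * syy)


# Invented data: (count of "summarizing", discussion length in minutes,
# overall leadership rating from the traditional observation sheet).
participants = [(6, 15.0, 4.2), (2, 20.0, 3.0), (9, 30.0, 3.9), (1, 11.0, 2.5), (5, 25.0, 3.4)]
rates = [rate_per_window(count, minutes) for count, minutes, _ in participants]
ratings = [rating for _, _, rating in participants]
r = pearson_r(rates, ratings)
print(f"r = {r:.2f}")
```

Rescaling before correlating matters because a participant in a 30-minute discussion has more opportunity to show any given behavior than a participant in an 11-minute discussion; without the adjustment, raw frequencies would partly reflect discussion length.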
When examining individual items in the leadership rating on the traditional observation sheet (Appendix), the additional findings obtained from interaction analysis yield more specific insights into otherwise relatively vague descriptions of leadership. For example, the item L2 in Appendix ("manages the discussion") does not specify how this is actually accomplished by the candidate. A closer inspection of the behavioral correlates of this item highlights procedural behaviors such as goal orientation (r = .35, p < .01), clarifying (r = .28, p < .01), procedural suggestions (r = .38, p < .01), procedural questions (r = .29, p < .01), time management (r = .32, p < .01), task distribution (r = .30, p < .01), and summarizing (r = .39, p < .01).

Graphing the Interaction Process
Figure 1 illustrates how these different behaviors unfolded over the course of one exemplary AC group discussion. Using INTERACT software, we can "zoom in" on particular discussion segments. For illustrative purposes, Figure 1 only shows the first 5-minute segment from this group's discussion. The top of Figure 1 shows the actual time line from this discussion. Each consecutive line shows specific behaviors by each of the participants in this group discussion. The discussion was initiated by participant B, who contributed a solution. Then B proceeded to describe a problem, followed by a procedural question by participant A, and so forth. Graphic depictions of the fine-grained details of the discussion process are helpful for exploring the communication dynamics during such AC exercises, for understanding the role of individual participants within the social context, and for identifying ("eyeballing") potentially critical statements or behavioral triggers, which can then be followed up by in-depth quantitative analyses.

Identifying Interaction Patterns
Whereas traditional ratings in ACs focus on the individual only, interaction analysis also allows us to consider sequences or patterns of behavior. Lag sequential analysis can test how specific behaviors by candidates during the AC exercise trigger other behaviors. Significant behavior patterns are identified by z-values larger than 1.96. In our research example presented here, at lag1 (i.e., behavior sequences from one behavior immediately to the next) we found that support by other group members was triggered by goal orientation (z = 13.68), by clarifying (z = 311.97), by task distribution (z = 3.18), time management (z = 5.69), and summarizing (z = 24.13). Hence, those procedural behaviors that were linked to overall leadership ratings by observers in fact also triggered support by other group members within the interaction process.
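To make the logic of a lag1 test concrete, the sketch below computes a z-value for one behavior sequence. The text above does not specify which z-statistic was used; the example assumes the adjusted residual that is commonly used in sequential analysis, which compares the observed number of given-to-target transitions with the number expected if consecutive behaviors were independent. The function name and the coded sequence are hypothetical.

```python
import math


def lag1_z(sequence, given, target):
    """Adjusted-residual z for the lag1 transition given -> target (assumed statistic)."""
    pairs = list(zip(sequence, sequence[1:]))  # each behavior with its immediate successor
    n = len(pairs)
    if n == 0:
        raise ValueError("Need at least two coded behaviors.")
    observed = sum(1 for a, b in pairs if a == given and b == target)
    row_total = sum(1 for a, _ in pairs if a == given)    # transitions starting with `given`
    col_total = sum(1 for _, b in pairs if b == target)   # transitions ending with `target`
    expected = row_total * col_total / n
    variance = expected * (1 - row_total / n) * (1 - col_total / n)
    if variance == 0:
        return float("nan")  # undefined when a behavior never occurs or always occurs
    return (observed - expected) / math.sqrt(variance)


# Hypothetical coded sequence from one discussion.
sequence = [
    "summarizing", "support", "solution", "problem",
    "summarizing", "support", "problem", "solution",
    "summarizing", "support", "solution", "problem",
]
z = lag1_z(sequence, given="summarizing", target="support")
print(f"z = {z:.2f}; exceeds 1.96: {z > 1.96}")
```

In practice, transitions would be pooled across many discussions without forming pairs across the boundary between two discussions, and a sequence as short as this toy example would be far too small for the normal approximation behind the 1.96 threshold to be trustworthy.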

Rater Errors and Gender Stereotypes
The current data allowed us to test both the occurrence of a halo effect and the use of gender stereotypes when contrasting the two different scoring procedures. For the traditional observation sheet (Appendix), the mean intercorrelation of .59 between the four dimensions (i.e., planning, cooperation, leadership, and communication) suggests a pervasive halo error. In contrast, the codings of the specific behaviors did not show such a halo effect: The mean absolute intercorrelation of the codings was only .11. At the overall dimension level, the mean intercorrelation was .34, which is still considerably lower than the intercorrelation of the traditional scoring procedure.
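The halo index used above, the mean intercorrelation among a set of scores, can be sketched as follows. The function names and the ratings are hypothetical; the `absolute` switch is an assumption added here to cover both the signed mean and the mean absolute intercorrelation mentioned in the text.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_intercorrelation(scores, absolute=False):
    """Average pairwise correlation among the variables in `scores`.

    `scores` maps a variable name to a list with one value per candidate.
    """
    values = []
    for name_a, name_b in combinations(scores, 2):
        r = pearson_r(scores[name_a], scores[name_b])
        values.append(abs(r) if absolute else r)
    return sum(values) / len(values)


# Invented dimension ratings for five candidates.
ratings = {
    "planning": [3, 4, 2, 5, 4],
    "cooperation": [3, 5, 2, 4, 4],
    "leadership": [2, 4, 2, 5, 3],
    "communication": [3, 4, 3, 5, 4],
}
print(f"mean intercorrelation = {mean_intercorrelation(ratings):.2f}")
```

A high mean intercorrelation among conceptually distinct dimensions is what suggests a halo effect: assessors appear to rate candidates on a general impression rather than separately on each dimension.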
Furthermore, the traditional leadership rating showed a substantial score difference in favor of male candidates

Future Research Agenda
ACs depend heavily on assessors' subjective ratings of candidates' behaviors. Several authors have therefore highlighted the need to move away from "gut feelings" and subjective ratings and toward a more fine-grained and objective scoring procedure (e.g., Silzer & Jeanneret, 2011). We have proposed an alternative scoring procedure in ACs: interaction analysis. Through interaction analysis, candidate behaviors are captured as they actually happen, thereby avoiding judgment errors typically associated with traditional scoring procedures. In this section, we discuss the validity and acceptability of this alternative approach and integrate our arguments in terms of three key propositions.

Predictive Validity
The most important difference between traditional AC scoring procedures and interaction analysis is that instead of relying on overall rater observations of behavior, specific behavioral observations are used to predict performance. These behavioral observations address the social context in which each behavior occurs by studying its direct antecedents and consequences. For example, in a leaderless group discussion, interaction analysis can show how specific behaviors by candidates during the AC exercise trigger other candidates' behaviors (or, in case of interviews or role plays, the interviewer's or actor's behaviors). This is relevant as not every conceptually useful behavior will be useful at every point in time; the same behavior may be much more useful at the start of an exercise than at the end when all information has possibly already been shared and discussed. Such a differentiation is usually not considered in traditional observation sheets but can well be taken into account in interaction analyses. Thus, group discussion-based AC exercises and the intricate social dynamics inherent in them reflect the complexity of interpersonal relations and unfolding interaction patterns that characterize real job situations (e.g., Lehmann-Willenbrock & Allen, 2018). For this reason, we expect interaction analysis to allow for a stronger predictor-criterion alignment, and hence more predictive power of the AC (Arthur & Villado, 2008). Furthermore, a focus on actual behavioral expressions embedded in social interactions instead of more abstract traits and competencies might be beneficial for the predictive and incremental validity of ACs, as traits and competencies might be more economically and objectively captured via personality questionnaires and cognitive ability tests (Meriac et al., 2008). Based on these arguments, we formulated the following proposition.
Proposition 1: ACs using behavioral ratings derived from interaction analyses have higher predictive validity than ACs using traditional scoring procedures.
We believe the predictive validity of ACs to especially benefit from using interaction analysis when predicting behavioral criteria (e.g., interpersonal or communication skills, decision-making, citizenship behaviors), as these allow for the strongest predictor-criterion alignment.

Construct Validity
To date, the construct validity of ACs remains somewhat elusive because different dimensions within exercises correlate higher than similar dimensions across exercises (e.g., Wirz, Melchers, Schultheiss, & Kleinmann, 2014; Woehr & Arthur, 2003). Several studies have already demonstrated increases in the construct validity of ACs by improving the observation of the behaviors shown during the exercise. This was accomplished by reducing the number of dimensions to rate (Kolk, Born, & Van der Flier, 2004), by ensuring that the behaviors to be rated will be visible in the exercise (Klehe et al., 2008; Lievens, Chasteen, Day, & Christianson, 2006), by frame-of-reference trainings (Woehr & Huffcutt, 1994), and by using behavioral checklists instead of overall trait ratings (Jackson, Barney, Stillman, & Kirkley, 2007). Although the effects on construct validity tend to be small, these findings suggest that more systematic procedures that enable AC developers to select independent and easily measurable (behavioral) dimensions will help distinguish between these dimensions. Interaction analysis goes one step further than these previously suggested methods as it allows for differentiation based on identifiable and differentiable behaviors as they happen during the AC exercise. In addition, by using interaction analysis the raters can focus on behaviors of interest that can be observed independent of the exercises. For example, behavioral checklists are completed by assessors immediately after an exercise. Behavioral checklists are therefore more cognitively demanding than interaction analysis, as the assessor has to observe, recognize, and recall the behaviors of each of the candidates (Reilly, Henry, & Smither, 1990). In contrast, interaction analysis makes use of recordings that can be played back as often as needed. Interaction analysis also provides additional advantages over frame-of-reference training. Although both methods can reduce rater errors (including the halo effect), a frame-of-reference training does not guarantee changes in assessor behaviors, nor does it make the evaluation procedure less subjective. Interaction analysis, however, forces the assessor to focus on the actual behaviors that are being demonstrated during the exercise. Based on these arguments, we formulated the following proposition.
Proposition 2: Interaction analysis improves the construct validity of AC exercises.

Acceptability
In order for interaction analysis to be a viable measurement approach in ACs and to be accepted by assessors and candidates, the benefits should outweigh the potential costs. Compared to other selection instruments, an AC is already an expensive, complex, and labor intensive procedure. Interaction analysis requires videotaping exercises and coding the behaviors of each candidate, which makes ACs potentially an even more time consuming and expensive procedure. However, these costs might be reduced in the near future, as modern technology such as latent semantic analysis (e.g., Campion, Campion, Campion, & Reider, 2016) and social sensing technology (Schmid Mast, Gatica-Perez, Frauendorfer, Nguyen, & Choudhury, 2015) might allow for automatic scoring of behaviors.
Typically, candidates receive some initial feedback at the end of the day. Using interaction analysis would reduce the speed at which any feedback to candidates can be provided. Furthermore, not every candidate (especially candidates with high-level executive roles or candidates who have been headhunted) may allow videotaping or any other proof of their interest in other jobs until selection decisions are final. For these reasons, we expect there will be some limitations to the application of the interaction approach. However, we do think that such an approach can be used in a broad range of ACs, including ACs that are used in development, coaching, or promotion programs. We believe that in such programs, an interaction approach has a number of other benefits for both assessors and candidates. First of all, interaction analyses extract more information from the same interaction sequences than traditional AC observations. As such, they can generate more in-depth information while using existing AC exercises (i.e., in this case no additional investment costs are incurred for developing new exercises). A more reliable and valid approach, based on such in-depth behavioral observations, has benefits for each party involved. Second, because ACs are often used for developmental purposes rather than selection purposes (Hazucha et al., 2011), an in-depth analysis of a candidate's behaviors and others' responses to those behaviors (i.e., temporal patterns) can offer promising practical implications for development purposes. Showing participants their own behaviors and helping them reflect on how those behaviors helped them succeed within the social process is likely to be more informative than the feedback from more traditional ratings as currently used in AC practice. For these reasons, we believe that, when it is possible to videotape AC exercises and use interaction analyses, the benefits of this approach outweigh the potential costs for both assessors and candidates.
Proposition 3: The acceptability of using interaction analyses, especially the type of feedback they provide, is higher than the acceptability of traditional AC scoring procedures.

Conclusion
The purpose of this paper is to draw attention to interaction analysis as an alternative scoring procedure in ACs and to showcase how this scoring procedure can be implemented in ACs. We have integrated our arguments in terms of three key propositions regarding the validity and acceptability of ACs using interaction analysis, which we hope will inspire future research.