Tell us about your leadership style: A structured interview approach for assessing leadership behavior constructs

Abstract

It is widely recognized that leadership behaviors drive leaders' success. But despite the importance of assessing leadership behavior for selection and development, current measurement practices are limited. This study contributes to the literature by examining the structured interview method as a potential approach to assess leadership behavior. To this end, we developed a structured interview measuring constructs from Yukl's (2012) leadership taxonomy. Supervisors in diverse positions participated in the interview as part of a leadership assessment program. Confirmatory factor analyses supported the assumption that leadership constructs can be assessed as distinct interview dimensions. Results further showed that interview ratings predicted a variety of leadership outcomes (supervisors' annual income, ratings of situational leader effectiveness, subordinates' well-being, and affective organizational commitment) beyond other relevant predictors. Findings offer implications for identifying leaders who have a positive impact on their subordinates, and they inform us about conceptual differences between leadership measures.

Leadership behaviors have been shown to relate to indicators of organizational performance such as leader effectiveness and subordinate outcomes (Harms, Credé, Tynan, Leon, & Jeung, 2017; Jackson, Meyer, & Wang, 2013; Judge & Piccolo, 2004). Given the impact of leadership behaviors on organizational outcomes, a careful assessment of these behaviors is a relevant step in selecting and developing supervisors.
In light of the limitations of current measurement practices, there have been strong calls to explore alternative approaches to measuring leadership behavior constructs (Antonakis, Bastardoz, Jacquart, & Shamir, 2016; DeRue, Nahrgang, Wellman, & Humphrey, 2011; Hunter et al., 2007; Yukl, 1999). The structured interview method may be considered a particularly useful alternative to conventional leadership questionnaires, given that structured interviews are established and feasible assessment instruments that do not rely on potentially biased self-ratings or subordinate ratings (Huffcutt, Conway, Roth, & Stone, 2001; Huffcutt, Weekley, Wiesner, DeGroot, & Jones, 2001; Krajewski, Goffin, McCarthy, Rothstein, & Johnston, 2006). Furthermore, structured interviews ask for interviewees' behavior in specific situations, and therefore offer the opportunity for a more context-sensitive assessment of leadership (similar to situational judgment tests; see also Liden & Antonakis, 2009; Peus, Braun, & Frey, 2013). Despite these advantages, research has yet to employ interview methodology for assessing established leadership behavior constructs.
https://doi.org/10.1016/j.leaqua.2019.101364 Received 3 December 2017; Received in revised form 19 November 2019; Accepted 21 November 2019

The present study is the first to connect structured interviews with leadership research by assessing leadership behavior constructs as dimensions in a structured interview, and therefore offers several contributions. First, this study investigates the potential of a novel approach to assess leadership behavior constructs that is based on an independent rating source (i.e., interviewer ratings), and that considers the situation in which leadership is exhibited. Second, we address the antagonistic relationship between effectiveness and well-being in organizational research (Kozlowski, 2012) by incorporating both outcomes regarding leaders' performance and outcomes regarding subordinates' well-being when examining the interviews' validity. Third, to deepen our understanding of the interview method, this study explores how interview ratings of leadership behavior constructs correspond to a variety of different approaches to assess leadership (i.e., self-ratings, subordinate ratings, and behavioral codings).

Assessing leadership behaviors
While the leadership literature has produced a broad range of valuable constructs to describe leadership behaviors (e.g., Antonakis & House, 2014; Avolio, Bass, & Jung, 1999; Conger & Kanungo, 1998; Stogdill & Coons, 1957), there has been a growing demand for more conceptual clarity in leadership frameworks. For instance, various authors have argued that there is substantial conceptual and empirical overlap between different leadership behavior constructs both within and across frameworks (e.g., Piccolo et al., 2012; Shaffer, DeGeest, & Li, 2016; Yukl, 1999).
To address this lack of conceptual clarity, Yukl, Gordon, and Taber (2002) clustered different leadership behaviors from existing frameworks into three broad meta-categories: task-oriented leadership (e.g., explaining tasks and responsibilities, planning and prioritizing activities), relations-oriented leadership (e.g., providing individual support and encouragement, recognizing achievements), and change-oriented leadership (e.g., communicating a vision of what can be accomplished, explaining why changes are needed). This taxonomy only includes leadership behaviors from existing frameworks that are (a) directly observable, (b) potentially relevant to all types of supervisors within and across organizations, and (c) clearly assigned to only one category of leadership behaviors (Yukl, 2012; Yukl et al., 2002).
Although Yukl's taxonomy provides a common ground to conceptualize leadership constructs, a remaining question is how to obtain meaningful ratings of these constructs in the context of selection and development. Self-ratings of leadership behaviors, for example, may need to be interpreted with caution, given that a number of factors can produce systematic bias in the perception of one's own typical behavior (i.e., personality and demographic characteristics; see Fleenor et al., 2010; Gentry, Hannum, Ekelund, & de Jong, 2007; Sala, 2003). Accordingly, previous research suggests that self-ratings do not reflect actual leadership behaviors, but rather provide insight into the characteristics of the self-rater (e.g., supervisors' self-esteem and self-awareness; Atwater & Yammarino, 1992; Goffin & Anderson, 2007).
Similarly, subordinate ratings of leadership behaviors are prone to different types of biases (for an overview see Fleenor et al., 2010). For example, subordinates tend to evaluate their supervisors' everyday behavior more favorably when supervisors' attributes correspond to subordinates' implicit assumptions about what a supervisor should be like (i.e., leader prototypicality; see Junker & van Dick, 2014) and when they personally like their supervisor (e.g., Rowold & Borgmann, 2014). This indicates that subordinate ratings "may have little to do with actual leader behavior" (Hansbrough et al., 2015, p. 221), but rather reflect relationships of supervisors with their subordinates.

Structured interviews for assessing leadership behavior
Given the issues involved in current leadership measurement practices, there is a need to explore the potential of alternative approaches to assessing leadership behaviors (e.g., DeRue et al., 2011; Hunter et al., 2007). Although there exists a variety of assessment methods (i.e., structured interviews, situational judgment tests, business games), these instruments have rarely been used for assessing constructs from leadership research (for an exception see Peus et al., 2013). Among these instruments, the structured interview method possesses particular potential as an alternative to conventional leadership questionnaires for several reasons.
To begin with, structured interviews usually rely on independent rating sources, namely the interviewers. As compared to supervisors or subordinates, interviewers should have less motivation to provide socially desirable ratings because they are usually not involved with the rated supervisor at a personal level. Furthermore, interviewers are often trained on how to avoid biases and on how to systematically evaluate the constructs that are to be assessed (e.g., Roch, Woehr, Mishra, & Kieszczynska, 2012).
In addition, structured interview questions allow for a more situation-specific assessment of leadership behaviors as compared to conventional leadership questionnaires. This is because the most common structured interview question types typically ask how interviewees behaved or would behave in actually experienced situations in the past (i.e., past-oriented interview questions; Janz, 1982) or in hypothetical situations (i.e., future-oriented interview questions; Latham, Saari, Pursell, & Campion, 1980). Thus, structured interview questions provide interviewees with a more detailed context as compared to conventional questionnaires that consist of generic items and do not present context information (e.g., "My supervisor leads by example" or "My supervisor will not settle for second best"; Podsakoff, MacKenzie, Moorman, & Fetter, 1990).
Given their high level of contextualization, structured interviews may function in a similar way as other situation-specific assessment tools such as situational judgment tests. The underlying rationale of situation-specific assessments is that the assessed person is first required to identify the demands of a given situation, and then has to present an adequate reaction to this situation (Jansen et al., 2013; Kleinmann et al., 2011). Hence, structured interviews capture (a) interviewees' cognitive understanding of the presented (past or future) situations (see also Fan, Stuhlman, Chen, & Weng, 2016; Melchers & Kleinmann, 2016), and (b) interviewees' past experiences, specific knowledge, and general beliefs about which behaviors are effective in these situations (Lievens & Motowidlo, 2016; Motowidlo & Beier, 2010).
Despite these advantages of the interview methodology, only a few studies have examined structured interviews specifically for assessing leadership behavior (Huffcutt, Weekley, et al., 2001; Krajewski et al., 2006). Huffcutt, Weekley, et al. (2001) found that interview questions that are designed to assess various leadership behaviors have the potential to predict supervisors' performance as rated by superiors (i.e., supporting the interviews' criterion-related validity). However, they did not find evidence that the internal data structure of interview ratings represented the intended interview dimensions. In particular, correlations were low (i.e., r = 0.09 in Sample 1 and r = 0.05 in Sample 2) between past-oriented and future-oriented interview questions that were designed to assess the same leadership behaviors. This raises questions as to whether the interview questions adequately assessed the leadership behaviors that they were intended to measure. Krajewski et al. (2006) replicated this study; they also used superiors' ratings as the sole criterion measure and found results similar to those of Huffcutt, Weekley, et al. (2001).

In sum, the main limitations of previous research are (a) that interviews were not designed to assess constructs from existing leadership frameworks, (b) that there is little evidence that existing leadership interviews measure the leadership behaviors that they intend to measure, and (c) that only ratings from supervisors' superiors have been used as criterion measures, even though leadership behaviors may primarily affect subordinates and, thus, subordinate outcomes have been regarded as important criterion measures in leadership research (Hiller, DeChurch, Murase, & Doty, 2011).

The present study
A.L. Heimann, et al. / The Leadership Quarterly 31 (2020) 101364

Drawing from a comprehensive leadership taxonomy (Yukl, 2012; Yukl et al., 2002), the present study explores the potential of the structured interview method to assess leadership behavior constructs. To this end, it examines whether the internal data structure of interview ratings reflects the leadership behavior constructs that the interview is designed to measure; whether interview ratings can predict different types of leadership outcomes over and above other relevant predictors; and to what extent interview ratings relate to other leadership measures.

Internal structure of interview ratings
Generally, previous research has found little evidence that the internal data structure of interview ratings represents the constructs or interview dimensions that the structured interviews were actually designed to measure (Huffcutt, Weekley, et al., 2001; Macan, 2009; Van Iddekinge, Raymark, Eidson Jr., & Attenweiler, 2004). Researchers often find that the interviews' data structure is best represented by factor models that do not specify different interview dimensions in confirmatory factor analyses (CFAs; see for example Klehe, König, Richter, Kleinmann, & Melchers, 2008; Krajewski et al., 2006). A reason for this consistent finding might be that structured interviews often do not assess well-defined leadership constructs but unspecific and ad-hoc labeled interview dimensions such as "leadership behaviors" (Melchers et al., 2009).
Despite these findings, we posit that it is possible to assess task-, relations-, and change-oriented leadership as separate interview dimensions for two reasons: First, they are conceptually distinct and clearly defined constructs indicated by specific, observable behaviors (Yukl, 2012; Yukl et al., 2002). Second, there is initial empirical evidence that these leadership behavior constructs can be measured as different dimensions: Previous studies conducting CFAs found that factor models including task-, relations-, and change-oriented leadership as separate factors fitted the data structure of questionnaire-based leadership ratings (Borgmann, Rowold, & Bormann, 2016; Yukl et al., 2002). Accordingly, we hypothesize:

Hypothesis 1. A factor model specifying task-, relations-, and change-oriented leadership as distinct factors will best represent the internal data structure of interview ratings.

Interview ratings and leadership outcomes
The utility of the structured interview approach for leader selection and development ultimately depends on whether interview ratings can predict relevant leadership outcomes (i.e., show criterion-related validity). When studying outcomes of leadership, it has been recommended to "include a variety of criteria" (Yukl, 2013, p. 26). In their integrative meta-analytic review of the leadership literature, DeRue et al. (2011) propose to distinguish between three content dimensions of leadership outcomes: (a) the overall judgment of leaders' success; (b) performance-focused outcomes; and (c) affective outcomes. To cover this variety of leadership outcomes, this study examines variables indicative of each category: supervisors' reported annual income as an indicator of overall success; ratings of supervisors' situational effectiveness in behavioral leadership tasks as a performance-focused outcome; and ratings of subordinates' general well-being and affective organizational commitment as affective outcomes. We included two affective outcomes to cover different foci: Subordinates' general well-being allows us to study the overall impact of supervisors on their subordinates' health and psychological functioning without reference to the work context (e.g., Montano, Reeske, Franke, & Hüffmeier, 2017), whereas subordinates' affective organizational commitment is a more specific outcome indicating the extent to which supervisors shape subordinates' attitudes toward, and attachment to, their specific job (e.g., Jackson et al., 2013).
In determining the utility of the interview method, the final question is whether interview ratings explain incremental variance in leadership outcomes over and above other leadership predictors. Given that incremental validity evidence needs to be grounded in theory (Antonakis & Dietz, 2011), we drew from DeRue et al.'s (2011) integrative framework that distinguishes between two common classes of predictors of effective leadership: Leader traits and leadership behaviors.
Leader traits are stable person characteristics that can be grouped into three categories (DeRue et al., 2011): (a) characteristics that refer to demographics, (b) characteristics associated with task competence, and (c) characteristics describing interpersonal competencies. The present study includes predictors from each of these categories: age, gender, and leadership experience as demographic characteristics; supervisors' core self-evaluations as a competence-related leader trait; and supervisors' emotional intelligence as an interpersonal leader trait. All these leader characteristics have been shown to be related to leadership outcomes (Ahn, Lee, & Yun, 2018; Hoffman, Woehr, Maldagen-Youngjohn, & Lyons, 2011; Judge & Kammeyer-Mueller, 2011; Miao, Humphrey, & Qian, 2016; Paustian-Underdahl, Walker, & Woehr, 2014; Walter & Scheibe, 2013). Yet, we expect interview ratings of leadership behaviors to outperform these traits given that behaviors are considered more proximal predictors of leadership outcomes than traits (DeRue et al., 2011).
Leadership behavior constructs describe how supervisors act towards their subordinates (Yukl et al., 2002). Given the feasibility of assessing leadership behavior constructs by asking supervisors or subordinates to fill out a leadership questionnaire (e.g., Hunter et al., 2007), it seems relevant to examine whether interview ratings are stronger predictors of leadership outcomes than questionnaires when both types of measures are designed to assess the same leadership constructs. As discussed earlier, we expect interview ratings to be more predictive of leadership outcomes than questionnaire-based self-ratings and subordinate ratings of leadership behaviors, given that interview ratings rely on independent and trained rating sources (i.e., interviewers; see also Roch et al., 2012) and allow for a more context-specific assessment of leadership behaviors (similar to situational judgment tests; Peus et al., 2013). Taken together, we explore the incremental criterion-related validity of the interview method by examining whether the overall interview rating score of leadership behaviors predicts different leadership outcomes over and above leader traits as well as self-ratings and subordinate ratings of leadership behavior.
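The incremental-validity logic described above amounts to a hierarchical regression: a block of control predictors is entered first, the interview score is added second, and the change in explained variance (ΔR²) is examined. The following sketch illustrates that computation in Python with NumPy; the variable names and simulated data are our own illustration, not the study's data or analysis code:

```python
import numpy as np

def r_squared(y, X):
    """R-squared from an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    return 1.0 - residuals.var() / y.var()

def delta_r_squared(y, controls, predictor):
    """Incremental variance explained when `predictor` is added to `controls`."""
    return r_squared(y, np.column_stack([controls, predictor])) - r_squared(y, controls)

# Illustrative simulated data: an outcome driven by both a control
# variable and an "interview score" (all names are hypothetical).
rng = np.random.default_rng(42)
n = 152
control = rng.normal(size=(n, 1))        # e.g., leadership experience
interview = rng.normal(size=(n, 1))      # overall interview score
outcome = 0.3 * control[:, 0] + 0.5 * interview[:, 0] + rng.normal(size=n)

print(round(delta_r_squared(outcome, control, interview), 3))
```

In practice, the ΔR² is tested for significance with an F-test; the sketch only shows where the incremental variance comes from.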

Interview ratings and supervisors' annual income
Generally, supervisors engage in leadership behaviors with the objective of being successful at their jobs. Thereby, they aim to advance their organization, as well as their own career and income. In support of this assumption, previous research has demonstrated (a) that higher scores on leadership behavior constructs relate to higher levels of individual and organizational success (e.g., Ceri-Booms, Curşeu, & Oerlemans, 2017; Geyer & Steyrer, 1998; Wilderom, van den Berg, & Wiersma, 2012), and (b) that self-reported annual income can be regarded as an indicator of individual success (e.g., Judge, Klinger, & Simon, 2010; Ng, Eby, Sorensen, & Feldman, 2005; Rode, Arthaud-Day, Ramaswami, & Howes, 2017). Accordingly, we expect supervisors who demonstrate higher behavioral leadership competencies in the interview to be more successful at their jobs and therefore to earn a higher annual income. Considering that annual income is a relatively broad indicator of success that is potentially confounded with a number of variables (e.g., age, leadership experience, gender, industry, size of the organization), we control for these variables when examining the validity of the overall interview in predicting annual income:

Hypothesis 2. The overall interview score explains a significant level of variance in supervisors' reported annual income over and above the size of the organization, industry, demographic variables, core self-evaluations, emotional intelligence, and the overall scores of self-rated and subordinate-rated leadership behaviors.

Interview ratings and ratings of situational leader effectiveness
Leader effectiveness is a frequently studied outcome and describes how well supervisors perform their job (i.e., their leadership tasks; see Hiller et al., 2011). Per definition, actively engaging in leadership behaviors should help supervisors to master their leadership tasks (Yukl, 2012). In support of this, meta-analyses have consistently found leadership behavior constructs (measured via leadership questionnaires) to relate to leader effectiveness (e.g., DeRue et al., 2011; Judge & Piccolo, 2004).
The present study aims to collect ratings of leader effectiveness from a rating source that can evaluate supervisors' performance more independently than potentially biased subordinates (e.g., Rowold & Borgmann, 2014; Wang, Van Iddekinge, Zhang, & Bishoff, 2019). Specifically, we collect ratings from role-players who evaluate supervisors' situational effectiveness in standardized behavioral leadership tasks (i.e., in simulated situations). In this context, role-players' effectiveness ratings serve as an indicator of supervisors' performance in the presented tasks. The advantage of role-players is that they should not have any motive to provide biased ratings (i.e., they have no long-term commitment towards the person they rate). In addition, role-players evaluate the situational effectiveness of all supervisors in the same standardized leadership tasks, thereby increasing the comparability of ratings across supervisors. We posit the following hypothesis:

Hypothesis 3. The overall interview score explains a significant level of variance in ratings of situational leader effectiveness over and above demographic variables, core self-evaluations, emotional intelligence, and the overall scores of self-rated and subordinate-rated leadership behaviors.

Interview ratings and subordinates' general well-being
More recently, subordinate well-being has received considerable attention in the leadership literature (Kurtessis et al., 2017; Montano et al., 2017). It is defined as a state of positive mental health in which an individual often feels cheerful, active, fresh, and rested (Topp, Østergaard, Søndergaard, & Bech, 2015; WHO, 2001). Previous research suggests that supervisors whose leadership behaviors support their subordinates at work (e.g., by providing sufficient resources to accomplish work objectives, recognizing accomplishments, or explaining why subordinates' work is relevant; Yukl, 2012) can serve as a work resource for subordinates and thereby foster subordinate well-being (see Nielsen et al., 2017). Accordingly, a meta-analysis showed that leadership behavior constructs (measured via questionnaires) significantly relate to subordinates' well-being (Montano et al., 2017). Hence, we posit that supervisors who are rated higher on leadership behavior constructs in the interview will have subordinates with higher levels of general well-being:

Hypothesis 4. The overall interview score explains a significant level of variance in subordinates' well-being over and above supervisors' demographic variables, core self-evaluations, emotional intelligence, and the overall scores of self-rated and subordinate-rated leadership behaviors.

Interview ratings and subordinates' affective organizational commitment
Affective organizational commitment refers to subordinates' emotional attachment to or involvement in an organization (Allen & Meyer, 1990). Leadership behaviors should foster this attachment because supervisors whose behaviors support their subordinates create favorable working conditions (see also Kurtessis et al., 2017). From a social exchange perspective (see Cropanzano & Mitchell, 2005), supervisors creating a positive work environment should help subordinates to develop favorable attitudes towards their job and organization, including the desire to stay with the organization (Jackson et al., 2013). In support of this, meta-analyses found leadership behavior constructs to relate substantially to subordinates' affective organizational commitment (Borgmann et al., 2016; Jackson et al., 2013). Thus, we expect that supervisors who demonstrate higher behavioral leadership competencies in the interview will have subordinates who report higher levels of affective organizational commitment:

Hypothesis 5. The overall interview score explains a significant level of variance in subordinates' affective commitment over and above supervisors' demographic variables, core self-evaluations, emotional intelligence, and the overall scores of self-rated and subordinate-rated leadership behaviors.

Interview ratings and other measures of leadership
Finally, to deepen our understanding of the interview methodology, we further explore relationships between interview ratings and other measures of leadership (i.e., questionnaire-based self-ratings, subordinate ratings, and behavioral codings of leadership). Previous research generally indicates little convergence between interview ratings and other types of measures (Atkins & Wood, 2002; Van Iddekinge et al., 2004; Van Iddekinge, Raymark, & Roth, 2005), which raises the question of what interview ratings actually capture. No study has yet examined the extent to which interview ratings correspond to other measures that were designed to assess the same leadership constructs (i.e., to what extent structured interviews are either redundant to or complement other measures). We therefore pose the following research question:

Research Question. To what extent do interview ratings of task-, relations-, and change-oriented leadership relate to other measures of leadership behavior (i.e., self-ratings, subordinate ratings, or behavioral codings)?

Sample

Supervisors
The sample consisted of 152 supervisors (62 women, 90 men) who participated in a leadership interview as part of a leadership assessment program. The assessment program was delivered at a center for continuing education in Europe and was open to supervisors from all kinds of organizations. Supervisors participated in the assessment program to receive extensive feedback on their leadership behaviors. Participants were recruited via local print advertising, social media, and by contacting the human resource departments of local organizations. To be included in the present study, supervisors had to allow us to collect data from at least two of their subordinates (a) who they supervised directly, (b) with whom they had been working together for at least six months, and (c) with whom they interacted several times a week. Supervisors were on average 44.95 (SD = 7.41) years old, worked 41.45 (SD = 5.06) hours per week, and had 9.32 (SD = 6.54) years of leadership experience.
Supervisors worked in different sectors. About 5% of them worked in finance and insurance, 9% in health care and social services, 16% in manufacturing, 5% in media and communication, 6% in non-government organizations, 14% in public services and administration, 20% in research and education, 6% in sales and marketing, 14% as private service providers, 2% in transport and logistics, and 3% did not indicate any of these categories.

Subordinates
We collected data from 450 eligible subordinates (224 women, 226 men). For 95 supervisors, we obtained data from two subordinates, and for 57 supervisors, we obtained data from three or more subordinates. Subordinates were on average 41.36 (SD = 10.55) years old, worked 36.74 (SD = 10.04) hours per week, and had been working together with their respective supervisor for 3.59 (SD = 3.17) years.

Interviewers
Interviewers were 54 (41 women, 13 men) advanced psychology students who had been trained as interviewers in a one-day frame-of-reference training (see Bernardin & Buckley, 1981; Roch et al., 2012). During this training, interviewers learned about the definitions of task-, relations-, and change-oriented leadership behaviors, and how to evaluate them. Interviewers were on average 29.94 (SD = 7.52) years old, had studied psychology as a major for 6.93 (SD = 3.07) semesters, and about 62% of interviewers held a Bachelor's degree. About 21% of interviewers had more than ten years of work experience, 59% had between one and ten years of work experience, and 20% had less than one year of work experience.

Role-players
Role-players were 52 (43 women, 9 men) advanced psychology students who rated supervisors' situational leader effectiveness in two behavioral leadership tasks as part of the leadership assessment program. Role-players had previously participated in a role-player training, in which they received a script for their role and practiced the role-play several times. Role-players were on average 29.45 (SD = 8.06) years old, had studied psychology as a major for 6.76 (SD = 2.60) semesters, and about 65% of them held a Bachelor's degree. About 16% of role-players had more than ten years of work experience, 61% had between one and ten years of work experience, and 23% had less than one year of work experience.

Procedure
Before participating, supervisors provided informed consent that their data could be used for research purposes. At the beginning of the assessment program, supervisors completed an online questionnaire assessing their own leadership behaviors, and their subordinates filled in an online questionnaire which included questions pertaining to their supervisor's leadership behaviors, their own well-being, and their affective organizational commitment. Subsequently, supervisors participated in an on-site assessment in which they (a) participated in a structured face-to-face interview assessing different leadership behavior constructs, (b) completed two behavioral leadership tasks, in which role-players rated their situational leader effectiveness, and (c) filled in paper-and-pencil questionnaires including items on their core self-evaluations and emotional intelligence. All assessment instruments were presented in randomized order.
In the structured interview, supervisors responded to seven past-behavior and seven future-behavior questions referring to specific leadership situations. Each interview was administered by a panel of two interviewers and took about 35 min. Interviewers were equipped with an interview guide that provided detailed instructions on how to conduct the interview and that contained all interview questions. After each interview question, interviewers took notes, and then individually rated supervisors' responses. After all interviews were completed, the two interviewers compared and discussed their individual ratings. Interviewers did not have access to supervisors' self-ratings, subordinate ratings, or role-players' ratings.
In the two behavioral leadership tasks, supervisors had to take a leadership role and present on a leadership-related topic (i.e., introduce a new project in a team meeting, or inform subordinates about a corporate reorganization) while two role-players, who took the role of subordinates, asked critical standardized questions. Role-players were instructed to rate how effectively supervisors performed in these tasks. In each behavioral leadership task, supervisors were rated by a different pair of role-players. Thus, in total, each supervisor was rated by four role-players to allow for a reliable assessment of situational leader effectiveness. Supervisors' behavior in the two behavioral leadership tasks was videotaped.
At the end of the on-site assessment, supervisors provided demographic and additional information (e.g., years of leadership experience, and their annual income including salary and any other forms of monetary compensation). Afterwards, they were debriefed regarding the leadership behavior constructs that had been assessed during the assessment program, and they received extensive feedback and suggestions to further improve their leadership skills.
After all other data had been collected, we obtained behavioral codings of leadership behaviors. To this end, an independent sample of coders watched video recordings of the two behavioral leadership tasks and coded the amount of time that each supervisor engaged in each type of leadership behavior.

Measures
Supervisor self-ratings

Supervisors rated their own task-oriented leadership behaviors on a 12-item scale, relations-oriented behaviors on a 15-item scale, and change-oriented behaviors on a 12-item scale, all from the Managerial Practices Survey (MPS G16-3; Yukl, 2012; Yukl et al., 2002). Example items for leadership behaviors are "As a supervisor, I clearly explain task assignments and member responsibilities" (task-orientation), "As a supervisor, I show concern for the needs and feelings of individual team members" (relations-orientation), and "As a supervisor, I describe a proposed change or new initiative with enthusiasm and optimism" (change-orientation). Items were rated on a scale from 1 (not at all) to 5 (to a very great extent).
Core self-evaluations were measured with a 12-item scale from Judge, Bono, Erez, and Locke (2005), and emotional intelligence was measured with 32 items from a scale from Schutte et al. (1998). Example items are "I am confident I get the success I deserve in life" (core self-evaluations) and "I easily recognize my emotions as I experience them" (emotional intelligence). Items were rated on a scale from 1 (strongly disagree) to 5 (strongly agree). We excluded one item ("I expect that I will do well on most things I try") from the original scale for emotional intelligence because it substantially reduced reliability. Cronbach's alphas for all scales used in this study are provided in Table 1.

Subordinate ratings
Subordinates rated their supervisor's task-, relations-, and change-oriented behaviors on the same leadership scales as supervisors rated their own behaviors (MPS G16-3; Yukl, 2012; Yukl et al., 2002). Again, all items were rated on a scale from 1 (not at all) to 5 (to a very great extent). Given that we posited hypotheses at the supervisor level, we aggregated subordinate ratings for each supervisor and assessed indices of within-group agreement (rwg(j); James, Demaree, & Wolf, 1984) and consistency across raters based on the one-way random effects model (ICC[1] and ICC[2]; Bartko, 1976; Bliese, 2000). We assumed slightly skewed null distributions to calculate rwg(j), relying on findings from previous studies that used the same leadership scales (Yukl, Mahsud, Hassan, & Prussia, 2013; Yukl, O'Donnell, & Taber, 2009). Within-group agreement ranged from rwg(j) = 0.80 for change-orientation to rwg(j) = 0.86 for relations-orientation, and rater consistency ranged from ICC(1) = 0.19 and ICC(2) = 0.42 for task-orientation to ICC(1) = 0.29 and ICC(2) = 0.55 for relations-orientation. Hence, agreement and consistency were comparable to the values reported in the literature (Woehr, Loignon, Schmidt, Loughry, & Ohland, 2015).
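For readers less familiar with these indices, the computations can be sketched as follows. This is an illustrative Python sketch, not the authors' code; the formulas follow James, Demaree, and Wolf (1984) and Bartko (1976), and any concrete null-distribution variance passed in (e.g., a value for a slightly skewed null on a 5-point scale, as tabulated by LeBreton and Senter) is an input assumption.

```python
import numpy as np

def rwg_j(group_ratings, null_var):
    """Within-group agreement r_wg(j) for a J-item scale (James et al., 1984).

    group_ratings: (n_raters, n_items) ratings of one target by its raters.
    null_var: expected variance under the chosen null (no-agreement) distribution.
    """
    j = group_ratings.shape[1]
    # Mean observed item variance across raters, relative to the null variance.
    ratio = group_ratings.var(axis=0, ddof=1).mean() / null_var
    return (j * (1 - ratio)) / (j * (1 - ratio) + ratio)

def icc_oneway(ratings):
    """ICC(1) and ICC(2) from a one-way random-effects ANOVA (Bartko, 1976).

    ratings: (n_targets, k_raters); each row holds one supervisor's k ratings.
    """
    n, k = ratings.shape
    row_means = ratings.mean(axis=1)
    msb = k * ((row_means - ratings.mean()) ** 2).sum() / (n - 1)      # between targets
    msw = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))  # within targets
    icc1 = (msb - msw) / (msb + (k - 1) * msw)   # reliability of a single rater
    icc2 = (msb - msw) / msb                     # reliability of the k-rater mean
    return icc1, icc2
```

As a sanity check, raters with identical item profiles yield rwg(j) = 1, and targets rated identically by all raters yield ICC(1) = ICC(2) = 1.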
Subordinates rated their general well-being on a 5-item scale from the World Health Organization (WHO-5; Topp et al., 2015), and their affective organizational commitment on the revised 6-item scale from Meyer and Allen (1997). Example items are "During the last two weeks, I have felt active and vigorous" (general well-being), and "This organization has a great deal of personal meaning for me" (affective commitment). Items for general well-being were rated on a scale from 1 (never) to 6 (all the time), and items for affective commitment were rated on a scale from 1 (strongly disagree) to 7 (strongly agree). Again, we examined agreement within and consistency across subordinates before averaging their ratings. Based on previous research using these variables, we assumed slightly skewed null distributions to calculate rwg(j) (Bernecker, Herrmann, Brandstätter, & Job, 2017; Meyer, Allen, & Smith, 1993). Regarding general well-being, within-group agreement was rwg(j) = 0.71 and, thus, comparable to the mean values for within-group agreement as reported in the literature (Woehr et al., 2015). As expected, rater consistency was rather low with ICC(1) = 0.06 and ICC(2) = 0.16, given that subordinates evaluated their general well-being, which can be influenced by many factors (e.g., Nielsen et al., 2017). For affective organizational commitment, within-group agreement (rwg(j) = 0.67) and rater consistency (ICC[1] = 0.19 and ICC[2] = 0.41) were comparable to the levels of agreement and consistency reported in the literature (Woehr et al., 2015).

Interview ratings
Interview development proceeded in six steps. First, interview items were developed based on the critical incident technique (Flanagan, 1954). We used 14 out of 32 critical incidents that had been collected in a study by Peus et al. (2013) via focus group discussions with supervisors. We selected those critical incidents from Peus et al. (2013) that referred to (a) situations in which supervisors interacted with their subordinates, (b) challenging situations in which more than one type of leadership behavior can lead to a desired outcome, and (c) situations that are likely to be a common experience for most supervisors.
Second, these 14 identified situations were enriched so that each situation contained cues that can activate task-, relations-, and change-oriented leadership behaviors. Thus, each situation was designed to capture three leadership behavior constructs simultaneously. This is because situations that are indicative of effective leadership are complex and thus offer a wide array of behavioral reactions that are associated with different kinds of leadership behaviors (e.g., Mumford, Zaccaro, Harding, Jacobs, & Fleishman, 2000).
Third, out of the 14 identified situations, half of the situations were used to write past-oriented interview items (i.e., they were framed to ask for supervisors' past experiences), and the other half were used to write future-oriented interview items (i.e., they were framed into hypothetical situations). Past-oriented and future-oriented questions were comparable in length and in complexity, and they were administered by the same interviewers and in the same manner.
Fourth, we developed rating scales with behavioral examples for task-, relations-, and change-oriented leadership behaviors for each interview item. Thus, each interview item together with its three rating scales formed one interview question. Behavioral examples for the rating scales were adapted items from established leadership questionnaires, specifically the Leadership Behavior Description Questionnaire (LBDQ; R. M. Stogdill, Goode, & Day, 1962), the Transformational Leadership Inventory (TLI; Podsakoff et al., 1990), and the Managerial Practices Survey (MPS G16-3; Yukl, 2012; Yukl et al., 2002). For each interview item, interviewers evaluated each leadership behavior construct on five-point rating scales. Given that interviewers provided three ratings (i.e., one rating of task-, relations-, and change-oriented behavior respectively) for each of the 14 interview items (i.e., seven past-behavior and seven future-behavior items), the interview comprised 42 rating scales in total.
Fifth, we thoroughly revised the rating scales by adapting them more closely to the specific contexts of the respective interview item. Examples of past-oriented and future-oriented interview questions with their respective rating scales can be found in the appendix.
Sixth, we asked seven subject matter experts (SMEs) to provide ratings on the content validity of the developed interview. SMEs were researchers in the field of I/O psychology with a focus on leadership and/or assessment methods. SMEs were provided with the definitions of the leadership behavior constructs and received the full leadership interview, but the labels of the rating scales (i.e., the names of the leadership constructs) were blinded. First, SMEs were asked to identify the respective leadership construct from Yukl's taxonomy for each of the rating scales. Results show that all SMEs accurately identified the intended leadership construct for every single rating scale. Second, SMEs rated the extent to which each of the rating scales (whose labels were still blinded) was suited to measure each of the three leadership constructs; scales ranged from 1 (strongly disagree) to 5 (strongly agree). On average, SMEs indicated that the rating scales were well suited to assess the leadership construct that the respective rating scale intended to measure (M = 4.86, SD = 0.17), while rating scales were less suited to assess the two leadership constructs that the respective rating scale did not intend to measure (M = 1.33, SD = 0.24). Third, SMEs were asked to rate the overall interview, specifically whether interview questions represented relevant leadership situations and were suited to assess the intended leadership constructs. Again, SMEs made ratings on a scale ranging from 1 (strongly disagree) to 5 (strongly agree). Their ratings illustrated that the interview questions generally represented relevant leadership situations (M = 4.71, SD = 0.45), and that the interview questions were well suited to assess the three leadership constructs (M = 4.86, SD = 0.35).
Finally, each interview was administered by two interviewers. Thus, for one supervisor, the same panel of two interviewers rated all 14 interview questions. We examined agreement within and consistency across interviewers before averaging their ratings. Both within-group agreement (rwg(j) = 0.98) and rater consistency (ICC[1] = 0.94 and ICC[2] = 0.97) were high (Woehr et al., 2015). In addition, the mean correlation between interviewers (r = 0.66) was comparable to the interviewer reliability reported for previous leadership interviews (Huffcutt, Weekley, et al., 2001). We averaged interviewers' combined ratings across interview questions to obtain one score for each leadership behavior construct and one overall interview score across the three leadership behavior constructs.

Role-player ratings
Role-players rated supervisors' situational leader effectiveness with a 4-item scale from Van Knippenberg and Van Knippenberg (2005) in two behavioral leadership tasks. An example item is "I think that this supervisor works effectively". Items were rated on a scale from 1 (strongly disagree) to 7 (strongly agree). To ensure the reliability of these ratings, multiple role-players (i.e., two role-players in each of the two leadership tasks) rated each supervisor. Both within-group agreement (rwg(j) = 0.85) and rater consistency (ICC[1] = 0.49 and ICC[2] = 0.79) corresponded to values reported in the literature (Woehr et al., 2015). Thus, we averaged ratings across role-players and items to form one score of situational leader effectiveness.

Behavioral codings
An independent sample of coders recorded supervisors' task-, relations-, and change-oriented behaviors based on videos of the two simulated leadership tasks. Coders used the INTERACT video coding software (Mangold International, 2010), similar to other behavioral research (e.g., Kauffeld & Meyers, 2009; Lehmann-Willenbrock & Allen, 2014). Using this software, coders watched the videos, and each time they identified a relevant behavior, they pressed a computer key that was programmed to represent the respective leadership category. At the same time, coders marked the time period (i.e., how long supervisors engaged in each behavior). Hence, behavioral codings obtained in this study represent the amount of time that each supervisor engaged in each leadership behavior.
Coders were three graduate students in psychology who were blind to all other data collected in this study. Before coding the videos, they had to complete 10 h of training. To assess the reliability of behavioral codings, the first 30 videos of the sample were independently coded by all three coders. We calculated ICCs for absolute agreement using the two-way random effects model (McGraw & Wong, 1996). Coder consistency ranged from ICC(1) = 0.69 and ICC(2) = 0.87 for change-orientation to ICC(1) = 0.92 and ICC(2) = 0.97 for task-orientation. Given that coder consistency was good and in line with previous research, coders then proceeded to code different videos so that each video was coded by one coder.
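The absolute-agreement ICCs for the coders follow the two-way random-effects model rather than the one-way model used for the rating sources. A hypothetical Python sketch of the single-coder and averaged-coder versions (ICC[A,1] and ICC[A,k] in McGraw and Wong's notation); this is illustrative only, not the software the authors used:

```python
import numpy as np

def icc_twoway_absolute(x):
    """ICC(A,1) and ICC(A,k): absolute agreement under the two-way
    random-effects model (McGraw & Wong, 1996).

    x: (n_targets, k_coders) matrix of codings.
    """
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-target means
    col_means = x.mean(axis=0)   # per-coder means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # rows (targets)
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # columns (coders)
    # Residual after removing target and coder main effects.
    sse = ((x - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))
    icc_a1 = (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))
    icc_ak = (msr - mse) / (msr + (msc - mse) / n)
    return icc_a1, icc_ak
```

When all coders produce identical codings, both indices equal 1; systematic coder mean differences lower the absolute-agreement indices even when rank orders agree.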

Results
Hypothesis 1 stated that a factor model specifying task-, relations-, and change-oriented leadership as distinct factors would best represent the internal data structure of interview ratings. To test this hypothesis, we conducted a set of CFAs and tested four potential models: Model I was a common multitrait-multimethod model with three correlated dimension factors (i.e., the leadership behavior constructs) and two correlated method factors (i.e., past-behavior questions and future-behavior questions). We chose to start with this model because it most adequately reflects how the interview was designed. Model II included only the three correlated dimension factors and was nested in Model I. Model III included only the two correlated method factors and was also nested in Model I. Model IV included the two correlated method factors and one overall dimension factor. Thus, Model IV was not nested in Model I. We chose to also test this model given that interview ratings are often averaged across different (leadership) dimensions to form one overall score (e.g., Krajewski et al., 2006).
Before testing these models, we parceled interview ratings in order to reduce model complexity and to achieve an acceptable ratio of sample size to estimated parameters (Bentler & Chou, 1987). The interview consisted of 42 ratings (i.e., interview ratings of three leadership behavior constructs, each rated in seven past-behavior and seven future-behavior questions). Thus, we had seven ratings of each leadership construct for each interview type. We assigned these seven ratings per leadership construct and interview type to three parcels.1 Therefore, the analyses of all factor models were based on 18 parcels in total: Each dimension factor (i.e., latent leadership behavior construct) was measured with six parcels and each method factor (i.e., latent type of interview question) was measured with nine parcels.
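The parceling step can be sketched as follows. This is an illustration only: the round-robin item-to-parcel assignment and the function name are assumptions, not the authors' actual assignment scheme.

```python
import numpy as np

def make_parcels(item_ratings, n_parcels=3):
    """Average item-level ratings into parcels.

    item_ratings: (n_supervisors, n_items) ratings of one leadership construct
    for one question type (here, n_items = 7).
    The round-robin assignment of items to parcels is illustrative only.
    """
    n_items = item_ratings.shape[1]
    groups = [list(range(i, n_items, n_parcels)) for i in range(n_parcels)]
    # Each parcel is the mean of its assigned items.
    return np.column_stack([item_ratings[:, g].mean(axis=1) for g in groups])
```

Applied per construct and question type, this yields 3 constructs x 2 question types x 3 parcels = 18 parcels, matching the indicator count reported for the factor models.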
We computed factor models with Mplus version 7.11 (Muthén & Muthén, 2013). Model I fit the data well, χ2(113) = 136.65, p = .064, TLI = 0.97, CFI = 0.98, RMSEA = 0.04, SRMR = 0.04, according to standards for model evaluation (e.g., Antonakis, Bendahan, Jacquart, & Lalive, 2010; Schermelleh-Engel, Moosbrugger, & Müller, 2003). Model comparisons using the chi-squared test further revealed that Model I fit the data significantly better than Model II, Δχ2(19) = 72.29, p < .001, and that Model I fit the data significantly better than Model III, Δχ2(21) = 296.59, p < .001. We could not test directly whether Model I fit the data better than Model IV given that Model IV was not nested in Model I. However, Model IV did not show good fit, χ2(116) = 250.11, p < .001, and the AIC was lower for Model I (5068.37) than for Model IV (5175.83). Therefore, the model that provided the best fit to the data was Model I, which incorporated the three leadership behavior constructs as correlated dimension factors, and the two types of interview questions as correlated method factors. Consequently, Hypothesis 1 was supported. Fig. 1 shows the best fitting model. Similar to previous interview studies (e.g., Ingold, Kleinmann, König, & Melchers, 2015; Morgeson, Reider, & Campion, 2005), we averaged ratings across question types for each leadership behavior construct before testing further hypotheses and research questions. We decided on this approach given that ratings of past-oriented and future-oriented interview questions showed high correlations. Specifically, observed correlations between question types were r = 0.62 (for task-orientation), r = 0.62 (for relations-orientation), and r = 0.63 (for change-orientation), which exceeded findings from previous interview studies (i.e., the meta-analytic correlation between construct-matched past-oriented and future-oriented interview questions is r = 0.40; Culbertson et al., 2017).
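The nested-model comparisons rest on the chi-squared difference (likelihood-ratio) test. A minimal Python sketch (the authors used Mplus); in the example call, Model II's absolute χ2 is reconstructed from the reported difference (136.65 + 72.29, df 113 + 19):

```python
from scipy.stats import chi2

def chisq_diff_test(chisq_nested, df_nested, chisq_full, df_full):
    """Chi-squared difference test for nested CFA models.

    The more constrained (nested) model has the larger chi-squared and df.
    Returns the difference, its df, and the upper-tail p-value.
    """
    delta = chisq_nested - chisq_full
    ddf = df_nested - df_full
    return delta, ddf, chi2.sf(delta, ddf)

# Model I vs. Model II (Model II chi-squared reconstructed as 208.94, df 132).
delta, ddf, p = chisq_diff_test(208.94, 132, 136.65, 113)
```

A significant result means the constrained model fits significantly worse; for non-nested models such as Model IV, the test does not apply and information criteria such as the AIC are compared instead.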
Hypothesis 2 posited that supervisors who achieved a higher overall interview score (i.e., interview ratings averaged across the three leadership behavior constructs) would have a higher reported annual income. We further expected the overall interview score to show incremental validity over and above characteristics of the organization, leader traits, and questionnaire-based self-ratings and subordinate ratings of leadership behaviors. Given that skewness and kurtosis scores (4.02 and 19.91) of annual income indicated a positively skewed distribution, we log-transformed the income variable (e.g., Tabachnick & Fidell, 2007). The overall interview score correlated positively with supervisors' log-transformed reported annual income (r = 0.23, p = .004), as shown in Table 1. The overall interview score also correlated positively with annual income when the variable was not log-transformed (r = 0.20, p = .038). As shown in Table 2 (see results for Model 2), hierarchical regression analysis further indicated that the overall interview score explained a significant proportion of variance in supervisors' log-transformed reported annual income beyond characteristics of the organization, leader traits, and questionnaire-based self-ratings and subordinate ratings of leadership behaviors, ΔR2 = 0.04, F(1, 104) = 6.26, p = .014. Hence, Hypothesis 2 was supported. To further determine the relative contribution of all predictors in explaining variance in annual income, we conducted relative weights analyses with the relaimpo package for the R environment (Grömping, 2006). This type of analysis is recommended when predictors are intercorrelated (Johnson, 2000), which is often the case when analyzing leadership variables.
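The incremental-validity test amounts to an F test on the change in R2 between nested regression models. A generic sketch of this computation (illustrative only; the study's actual model specifications are in Table 2):

```python
def incremental_f(r2_reduced, r2_full, n, k_full, q=1):
    """F statistic for the increase in R^2 when q predictors are added.

    r2_reduced / r2_full: R^2 of the models without and with the new block.
    n: sample size; k_full: number of predictors in the full model.
    The statistic has (q, n - k_full - 1) degrees of freedom.
    """
    if r2_full <= r2_reduced:
        return 0.0
    return ((r2_full - r2_reduced) / q) / ((1 - r2_full) / (n - k_full - 1))
```

For instance, adding one predictor that raises R2 from 0 to 0.5 with n = 12 and one predictor in the full model gives F(1, 10) = 10.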
As can also be seen in Table 2, relative weights analysis revealed that the overall interview score accounted for 20.7% of the variance explained in annual income, while the other predictors explained between 1.0% (emotional intelligence) and 24.7% (leadership experience) of variance.
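A relative weights decomposition in the spirit of Johnson (2000) can be sketched as follows. The authors used the relaimpo package in R, so this Python version is only an illustration of the underlying idea: the predictors are mapped onto an orthogonal approximation of themselves, and each predictor's weight is its share of the model R2.

```python
import numpy as np

def relative_weights(X, y):
    """Epsilon relative weights in the spirit of Johnson (2000).

    X: (n, p) predictor matrix; y: (n,) criterion.
    Returns one weight per predictor; the weights sum to the model R^2.
    """
    # Standardize predictors and criterion.
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    y = (y - y.mean()) / y.std(ddof=1)
    n = len(y)
    rxx = np.corrcoef(X, rowvar=False)   # predictor intercorrelations
    rxy = X.T @ y / (n - 1)              # predictor-criterion correlations
    # Symmetric square root of rxx via eigendecomposition.
    evals, vecs = np.linalg.eigh(rxx)
    root = vecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ vecs.T
    # Regression weights of y on the orthogonalized predictors.
    beta = np.linalg.solve(root, rxy)
    return (root ** 2) @ (beta ** 2)
```

With exactly uncorrelated predictors, each weight reduces to the squared zero-order correlation, so the decomposition agrees with the familiar orthogonal case.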
Hypothesis 3 assumed that supervisors who achieved a higher overall interview score would obtain higher ratings of situational leader effectiveness in behavioral leadership tasks. As can be seen in Table 1, the overall interview score correlated significantly with ratings of situational leader effectiveness (r = 0.23, p = .004). Hierarchical regression analysis further indicated that the overall interview score explained a significant proportion of variance in ratings of situational leader effectiveness beyond leader traits (including demographic variables) and questionnaire-based self-ratings and subordinate ratings of leadership behaviors, ΔR2 = 0.04, F(1, 143) = 7.38, p = .007, see Table 2 (results for Model 4). As such, Hypothesis 3 was supported. Furthermore, relative weights analysis demonstrated that 45.4% of the variance explained in ratings of situational leader effectiveness could be attributed to the overall interview score, while the other predictors explained between 0.8% (age) and 20.6% (core self-evaluations) of the variance.
Hypothesis 4 stated that supervisors with a higher overall interview score would have subordinates reporting a higher level of general well-being. In support of this, the interview score correlated significantly with subordinates' general well-being (r = 0.25, p = .002). Further analyses demonstrated that the overall interview score explained a significant proportion of variance in subordinates' well-being beyond leader traits (including demographic variables) and questionnaire-based self-ratings and subordinate ratings of leadership behaviors, ΔR2 = 0.03, F(1, 143) = 5.806, p = .017, as shown in Table 2 (see results for Model 6). Accordingly, Hypothesis 4 was supported. In addition, relative weights analysis demonstrated that 20.1% of the variance explained in subordinates' well-being could be attributed to the overall interview score, while the other predictors explained between 0.1% (gender) and 53.8% (overall score of subordinate ratings of leadership behaviors) of the variance.2 Hypothesis 5 expected supervisors with a higher overall interview score to have subordinates with higher levels of affective organizational commitment. In line with this, the interview score correlated significantly with subordinates' affective commitment (r = 0.21, p = .009). As shown in Table 2 (see results for Model 8), analyses demonstrated that the overall interview score explained a significant proportion of variance in subordinates' affective commitment beyond leader traits (including demographic variables) and questionnaire-based self-ratings and subordinate ratings of leadership behaviors, ΔR2 = 0.02, F(1, 143) = 4.233, p = .040. Hence, Hypothesis 5 was supported. Relative weights analysis further revealed that 15% of the variance explained in subordinates' affective commitment could be attributed to the overall interview score, while the other predictors explained between 1.2% (core self-evaluations) and 66.7% (overall score of subordinate ratings of leadership behaviors) of the variance.
2 Following best practice recommendations on handling control variables in leadership research (e.g., Bernerth, Cole, Taylor, & Walker, 2018), we additionally tested Hypotheses 2 to 5 (a) without demographic variables (age, gender, leadership experience) and (b) without demographic variables and self-rated leader traits (core self-evaluations, emotional intelligence). In each case, the interview overall score remained a significant predictor of all examined leadership outcomes.

Finally, as a research question, we explored whether interview ratings of leadership behavior constructs relate to other measures of leadership behavior (i.e., self-ratings, subordinate ratings, and behavioral codings of the same constructs; see Table 1). Interview ratings showed small, non-significant correlations with questionnaire-based self-ratings for task-orientation, r = 0.02, p = .818, relations-orientation, r = 0.14, p = .079, and change-orientation, r = 0.12, p = .157. These results correspond to previous research, which found little or no convergence between self-ratings and other measures of leadership behavior.

[Fig. 1. Best-fitting factor model (Model I): task-, relations-, and change-orientation as correlated dimension factors, each measured by six parcels (Parcels 1-6).]

Discussion
Leadership research has called for new approaches to measuring leadership behaviors (Antonakis et al., 2016;DeRue et al., 2011;Hunter et al., 2007). The present study addressed this call by thoroughly investigating the potential of the structured interview method for assessing leadership behavior constructs, thereby providing initial validity evidence.
As a first major contribution, the present study showed that the overall score of an interview designed to assess leadership behavior constructs predicts a variety of outcomes including supervisors' annual income and role-players' ratings of situational leader effectiveness, as well as subordinates' well-being and affective organizational commitment. This study thereby expands previous research on the validity of structured interviews for leadership positions, which so far has predominantly focused on supervisors' performance as the outcome measure (Huffcutt, Weekley, et al., 2001; Krajewski et al., 2006) while neglecting subordinate outcomes. The present finding is of particular practical interest because it implies that a structured leadership interview assessing established leadership constructs can help to identify those leaders who will have a positive impact on their subordinates.
In predicting these different outcomes, interview ratings further explained variance over and above questionnaire-based self-ratings and subordinate ratings of leadership behaviors. This highlights the benefits of a context-specific assessment of leadership behaviors. Both previous research on situational judgment tests (SJTs; Peus et al., 2013) and this study support the assumption that situational assessments (i.e., providing a specific situation as a framing for rating leadership behaviors) explain variance in leadership outcomes over traditional assessments of leadership that do not refer to a specific context (i.e., conventional leadership questionnaires).
As a second major contribution, this study is among the first to find that the data structure of interview ratings reflects the interview dimensions (i.e., constructs) that the interview was designed to measure. The main reason for this finding may be that the present interview was designed to assess broad and well-defined constructs from leadership research (Yukl, 2012). Consequently, this finding provides new opportunities for developing further structured interviews to assess other relevant leadership constructs (e.g., instrumental leadership, ethical leadership, or charismatic leadership; Antonakis et al., 2016; Brown, Treviño, & Harrison, 2005). Broadly speaking, the structured interview can be an appealing option in all contexts where conventional leadership questionnaires cannot be used, such as leader selection (due to self-raters' motivational biases and given that ratings from subordinates are often simply not available) or in certain research settings (where common source variance between the assessed leadership behaviors and leadership outcomes may limit predictive inference; see for example Antonakis et al., 2010).
As a third and final contribution, the present findings improve our understanding of the interview method by shedding light on interrelations between interview ratings and other leadership measures. Specifically, this study is the first to examine how interview ratings of leadership behavior constructs relate to self-ratings, subordinate ratings and behavioral codings of the same leadership constructs. Empirically, we found that interview ratings correspond to behavioral codings, but hardly to self-ratings and subordinate ratings of leadership.
A conceptual explanation for these findings, which may seem counterintuitive at first sight, draws on the distinction between maximum and typical performance (Sackett, Zedeck, & Fogli, 1988). Applying the definitions of these performance types to leadership behavior, maximum performance is how supervisors interact with their subordinates when they devote full attention and effort to their leadership role, while typical performance is how supervisors interact with their subordinates on a regular basis (see also Klehe & Anderson, 2007; Sackett, 2007). Following Sackett et al. (1988), individuals are likely to demonstrate maximum performance when they are aware that they are being evaluated and instructed to maximize their effort, and when the evaluation occurs over a short time period so that the individual remains fully focused on their performance. These criteria seem to apply to both interview ratings and behavioral codings of leadership behaviors in simulated situations: In both contexts, supervisors were explicitly evaluated by interviewers/role-players who were physically present in the situation. In addition, both measures referred to leadership behaviors within a specific time frame. On the other hand, self-ratings and subordinate ratings of leadership behaviors seem to tap more into typical performance, given that respondents refer to long-term behavioral tendencies and are not instructed to evaluate leadership behavior in a specific situation. Consequently, interview ratings (and also behavioral codings) may be more likely to capture how supervisors act when they are on their best behavior, and may therefore only modestly correspond to self-ratings and subordinate ratings, which describe more typical behaviors.
When it comes to predicting leadership outcomes comprehensively, it may be of interest to assess both maximum and typical leadership behavior. For example, the present findings show that more distal leadership outcomes such as subordinate well-being are predicted by both interview ratings (as maximum performance measure) and subordinate ratings (as typical performance measure). Conceptually, this makes sense when considering that subordinates are affected by how the supervisor behaves in critical situations (when maximum effort is needed) and by how the supervisor behaves in everyday work life situations (when maximum effort is not needed).

Practical implications
Overall, this study provides encouraging evidence for the use of the structured interview method in the field of leader selection and development. Regarding leader selection, the finding that interview ratings demonstrate incremental validity beyond leader traits, self-ratings, and subordinate ratings of leadership behaviors suggests that the costs of conducting an interview (e.g., training interviewers, time required for administering interviews as compared to administering questionnaires) may be outweighed by its benefits to validity.
Regarding implications for leader development, the most important finding is that interview ratings are not redundant with other leadership measures but may meaningfully complement them. In a developmental program, supervisors often need to be provided with differentiated feedback on their behaviors. In this context, interview ratings of leadership behaviors could offer an additional perspective to self-ratings and ratings from other sources (i.e., ratings from subordinates, peers, and superiors as part of a 360° feedback intervention).

Limitations
This study is not without limitations. While the interview score predicts a wide range of leadership outcomes, we cannot infer from the results how interview ratings relate to supervisors' behavior in their everyday work life. Results illustrated that those supervisors who obtained higher ratings in the leadership interview had a higher income (with or without considering demographic controls and characteristics of the organization), engaged actively in more leadership behaviors in simulated leadership tasks, were perceived as more effective in these leadership tasks by role-players, and had subordinates who reported higher levels of general well-being and affective organizational commitment. At the same time, interview ratings did not reflect how supervisors perceived themselves or their subordinates' perceptions of supervisors' leadership behaviors. Additionally, the study design did not allow us to record and code how supervisors interact with their actual subordinates at the workplace. Hence, research is warranted to understand how interview ratings relate to how leaders actually behave in their specific jobs.
In addition, we would like to note that measurement error (i.e., using variables that are not perfectly reliable) might have biased the results to some extent (e.g., Ree & Carretta, 2006). In our analyses, we did not account for error variance when we parceled interview ratings in CFAs and when we predicted leadership outcomes without modeling predictors as latent variables. Thus, some caution is warranted, as the present results could slightly overestimate or underestimate the true relationship of interview ratings with leadership outcomes (e.g., Bollen, 1989). Still, we decided on this more practical analytic approach due to the complexity of the interview data structure and due to a limited sample size (e.g., Bentler & Chou, 1987).
Finally, self-selection bias could potentially threaten the generalizability of our findings. It is possible that only those supervisors who already invest many resources in being "good leaders" might have participated in the leadership assessment program. However, we primarily recruited supervisors through their organizations (e.g., by contacting the HR departments of local organizations). These organizations often decided to have all their supervisors participate in the program, or the HR department decided (together with the respective supervisors) who would benefit most from the program and should therefore participate. Consequently, supervisors took part in the present study for various reasons, which lessens concerns about self-selection biases.

Future research
We see the present study as a promising starting point for intertwining the research avenues of interview and leadership research. In general, more diversity in the assessment of leadership behavior constructs may allow for a more differentiated understanding of how leaders behave and why they do so. The present study has shown that structured interviews are a fruitful complement to traditional leadership questionnaires. We suggest that future research adapt and explore the potential of further assessment instruments from personnel selection and development (e.g., virtual interviews, assessment center exercises, video-based SJTs) for measuring constructs from leadership theory that are typically assessed via questionnaires. This would equip researchers with a broader choice of tools for assessing leaders.
Along these lines, we encourage future research to systematically compare different approaches to measuring leadership behavior to deepen our understanding of how leadership measures function. In particular, comparing measures that vary with regard to only one specific methodological factor, such as the information source (e.g., whether the same interview questions are rated by interviewers, by supervisors, or by subordinates), may allow for examining which features of the respective measurement approach drive (a) convergence between measures and (b) the validity of different measures. In other words, such research would generate more detailed knowledge on why certain measures hardly correspond to each other (e.g., interview ratings and self-ratings; see also Van Iddekinge et al., 2004) and why certain measures are more predictive of leadership outcomes than others.

Conclusion
Evidence from this study illustrates that structured interviews help to identify leaders who have a positive impact on their subordinates. In addition, interview ratings of leadership behavior constructs predicted criteria beyond a number of other relevant predictors, including self-ratings and subordinate ratings of the same leadership constructs. Thus, the interview method shows potential to meaningfully complement existing leadership measures in research and practice.

Appendix: Sample interview questions

"Think of a situation in which it was very important that employees performed well for the success of your organization, but you had the impression that certain employees were dissatisfied, and your team did not work as efficiently or as quickly as you expected. Please describe both the situation and your exact response in detail."

Notes: _________________________________________________________________________________________________________________________________

Please rate the extent of the respective leadership behavior construct by ticking the appropriate number.

"Imagine you vowed to make a case for hiring an additional employee to reduce your team's workload. Due to the generally poor economic situation, the necessary financial funds were not granted and you are unable to deliver on your promise to the team. Because of a new assignment that your team has received, the workload is likely to increase even more. Please describe exactly how you would proceed in this situation."