An Item-Response Theory Approach to Safety Climate Measurement: The Liberty Mutual Safety Climate Short Scales

Zohar and Luria's (2005) safety climate (SC) scale, measuring organization- and group-level SC each with 16 items, is widely used in research and practice. To improve the utility of the SC scale, we shortened the original full-length SC scales. Item response theory (IRT) analysis was conducted using a sample of 29,179 frontline workers from various industries. Based on graded response models, we shortened the original scales in two ways: (1) selecting items with above-average discriminating ability (i.e., offering more than 6.25% of the original total scale information), resulting in 8-item organization-level and 11-item group-level SC scales; and (2) selecting the most informative items that together retain at least 30% of original scale information, resulting in 4-item organization-level and 4-item group-level SC scales. All four shortened scales had acceptable reliability (≥ 0.89) and high correlations (≥ 0.95) with the original scale scores. The shortened scales will be valuable for academic research and practical survey implementation in improving occupational safety.


Safety climate
Safety climate research has been ongoing for more than 35 years, since Zohar published his seminal work in 1980 defining this construct as workers' shared perceptions regarding their organization's policies, procedures, and practices in relation to the value and importance of safety within that organization (Zohar, 1980; Griffin and Neal, 2000; Zohar, 2000, 2002, 2003). The study of safety climate is based on perceptions of workers, with the major factors relating to (a) management commitment to safety and (b) communication pertaining to safety as a true priority from top management and direct supervisors (Dejoy et al., 2004). Prior research has stated that safety climate is a multilevel construct encompassing two managerial levels: (1) organization-level safety climate, which refers to employees' perceptions of the company's or top management's commitment to and prioritization of safety, and (2) group-level safety climate, meaning employees' perceptions of their direct supervisors' commitment to and prioritization of safety (e.g., Zohar and Luria, 2005; Huang et al., 2013a,b). Several meta-analyses have provided robust evidence that safety climate is one of the best leading indicators of organizational safety outcomes, such as frequency or severity of injury incidents (Christian et al., 2009; Beus et al., 2010; Nahrgang et al., 2011). Overall, safety climate influences employees' motivation and knowledge to act in a safe manner, which in turn lead to safer behaviors and fewer accidents and injuries (Griffin and Neal, 2000; Christian et al., 2009).
Since the inception of safety climate research, many safety climate scales have been developed and validated in the scientific literature. One of the most widely used safety climate scales published in the field, which has robust evidence of reliability and validity, is a generic safety climate scale developed by Zohar and Luria (2005). Their scale includes 32 total items: 16 items to measure organization-level safety climate and 16 items to measure group-level safety climate. In Zohar and Luria's (2005) study, the Cronbach's alpha of the scale was 0.92 for organizational-level safety climate (OSC) and 0.95 for group-level safety climate (GSC). In terms of criterion-related validity, OSC was correlated with safety audit/observation scores at 0.46, and GSC was correlated with safety behavior observations at 0.38. According to Google Scholar (retrieved January, 2017), their paper has been cited by nearly 800 publications, many of which use their measure. For example, one of the heavily cited papers (Johnson, 2007) found that GSC was significantly correlated with injury frequency at −0.50 and safety behaviors at 0.78. Examining OSC, Martínez-Córcoles et al. (2011) found a correlation with safety behaviors at 0.43, while Brondino et al. (2012) found correlations with safety compliance and safety participation ranging from 0.27 to 0.36. Due to its increasingly high usage in research and practice, the current study focuses on increasing the utility of this scale by shortening the number of items required while maximizing information provided.

Length of safety climate scales
Safety researchers are frequently faced with a dilemma in field research: whether to use brief measures or longer, more exhaustive and thorough measures. A longer measure can capture a fuller range of construct content and variance of interest, whereas a brief measure can boost both participant engagement and the efficiency of data collection. There are times when a longer scale is preferable, but shorter scales may be more effective in other cases.
Overall, a survey instrument should not overwhelm respondents with too many questions. Previous research has demonstrated that survey length can negatively impact response rates (e.g., Crawford et al., 2001). By shortening the length of a survey, individuals may be more likely to perceive that they have time to participate in survey research, even when they do not feel participation will directly benefit themselves (Woods and Hampson, 2005). Furthermore, in cases where measures contain many items focused on a very similar topic, many participants may interpret items as redundant and may have negative reactions toward the overall survey assessment (Wanous et al., 1997).
An additional issue with longer measures is that their use can limit the nature of models that can be tested to explore relations among various constructs (Fisher et al., 2016). Zohar and Luria's (2005) generic safety climate scale includes 32 items, which is a fairly long measurement scale. Despite the existence of this psychometrically solid and widely accepted scale, Zohar (2010) stated that more work is needed to explore how safety climate emerges and how safety climate is influenced or changed (i.e., which factors contribute to the development of safety climate perceptions). In order to fill this gap, researchers need to collect additional data on many other variables simultaneously with safety climate. With the current length of the safety climate scale, it is challenging to achieve this goal within realistic limitations that researchers face. In order to further explore potential factors influencing safety climate, a shorter and valid generic safety climate scale is needed.

Item response theory (IRT)
We propose an Item Response Theory (IRT) approach because it assesses multiple psychometric features of individual scale items. In comparison, Classical Test Theory (CTT) places more emphasis on the scale's composite score. IRT is a probabilistic non-linear modeling technique for developing and evaluating psychological measurement scales. For example, it can be posited that items of a scale are designed to assess a certain psychological attribute (e.g., safety perception) such that endorsing higher values on the items suggests a stronger underlying psychological attribute (e.g., stronger safety perception). If respondents give undiscriminating endorsements to an item when they indeed differ in terms of the underlying psychological attribute, the item should be deemed improper as a measure of the psychological attribute. To this end, IRT calculates the respondents' probability of endorsing particular response options of each scale item and estimates each item's ability to differentiate respondents, which can be used for strategic tailoring of lengthy psychological scales.
It should be noted that even though IRT has most often been used with educational and psychological tests that have correct or incorrect answers, it can also be applied to Likert scale-based measures (i.e., items with ordered categorical, polytomous response options) of psychological traits/attributes such as perceived job security (Probst, 2003) and personality (e.g., Reise and Henson, 2000). Likewise, higher levels of the underlying trait/attribute are assumed to lead to higher probabilities of stronger endorsement (e.g., choosing the category 'strongly agree' on a 5-point Likert scale). IRT is free from limitations faced by conventional linear regression-based development and validation techniques such as circular sample dependency of item/person statistics (Fan, 1998). Furthermore, IRT considers the differentiating/discrimination ability and difficulty of each item as information to be incorporated in the scale. It allows researchers to more efficiently assemble the items that offer the most information for measuring the targeted underlying trait/attribute.
The unique parameters offered by IRT, such as slope and difficulty parameters, can be derived based on the probability of responses, which is illustrated by the item option response functions (ORFs). For a five-point Likert scale, each item has five response options. In the polytomous IRT model, ORFs are used to describe participants' response patterns. Each option has an ORF curve, with the x-axis representing the trait being measured (θ) and the y-axis representing the probability of endorsing this particular option; an ORF thus depicts the relationship between the participants' trait and their responses to an item.
The slope, discrimination, or differentiation parameter determines the slope of the option response functions (ORF) for each item. Every item will have one slope parameter. If all other difficulty parameters are equal, items with high slope parameters will have smaller overlap of θ values between the option response functions, representing better differentiation. In the current study, the slope parameter represents each item's sensitivity to the overall level of safety climate.
The difficulty parameter determines the location of the ORF along the θ axis and indicates on which part of the range of θ the item is most informative, or the θ value at which people have a 50% chance of selecting specified responses (i.e., the cutoff points that separate the response option categories). In the current study, each item was rated on a 5-point Likert scale. Therefore, each item has four difficulty parameters (i.e., the cutoff points that separate response 1 from responses 2-5, responses 1-2 from 3-5, responses 1-3 from 4-5, and, finally, responses 1-4 from 5). These four difficulty parameters jointly indicate the overall difficulty of an item. In the current study, the item's difficulty represents whether an item is more informative (i.e., sensitive in differentiating the level/strength of estimated target trait) at lower or higher ranges of safety climate scores.
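To illustrate how one slope and four difficulty parameters jointly determine the response probabilities of a five-option item, the following minimal Python sketch computes the option probabilities under a logistic graded response model. The function name and the parameter values are hypothetical; they are not estimates from the current study.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Return P(X = k | theta) for k = 1..K under a logistic graded response model.

    a          : slope (discrimination) parameter of the item
    thresholds : ordered difficulty parameters b_1 < ... < b_{K-1}
    """
    # Boundary curves P(X >= k | theta) for k = 2..K.
    boundary = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    # Pad with P(X >= 1) = 1 and P(X >= K + 1) = 0.
    cumulative = [1.0] + boundary + [0.0]
    # Each option probability is the difference of two adjacent boundary curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical item: slope 2.5 and four illustrative cutoff points.
probs = grm_category_probs(theta=0.0, a=2.5, thresholds=[-2.5, -1.5, -0.5, 0.8])
```

At a θ value equal to the first difficulty parameter, the probability of choosing option 1 is exactly 0.5, which matches the 50% interpretation of the cutoff points given above.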
The item information curve (IIC) for each item is a function of both the slope and difficulty parameters. The amount of information that a particular item provides depends on both the size of the slope parameter and the spread of the category thresholds. An IIC represents the amount of information provided by a specific item across the entire continuum of the latent construct of interest. The area of the IIC above the x-axis (θ) equals the item information. If an item has a larger amount of item information, the item has higher discriminating ability to differentiate respondents along the θ axis. Depending on the slope and difficulty parameters, the amounts of information offered by items will differ. By aggregating the IICs of items in a measure, the test information function (TIF) for a scale can be generated. Similar to IICs, the area of the TIF above the θ axis equals the total test information. If a scale has a larger amount of total test information, the scale score has higher discriminating ability along the latent θ value.
Overall, the current study aims to utilize IRT to shorten Zohar and Luria's (2005) 32-item safety climate scale. Both slope and difficulty parameters for each item in the existing scale were calculated, and all information available was carefully considered to decide on the best items to include in the final shortened scales and the ideal number of items to include. The new, shortened scales are expected to benefit future safety climate research and practice by allowing for more diverse data collection opportunities and addressing concerns that organizations and participants may have with implementation of a longer scale, while maintaining the usefulness of the existing measure.

Participants and data collection procedure
Safety climate survey data were collected online as part of an evaluation package for customers of a safety consulting group. The service consultants invited their corporate customers to participate in the survey. After an organization agreed to participate, all employees of the company were invited to participate in the online safety climate survey administered by the research team. Example items include: "Top management at this company tries to continually improve safety levels in each department," and "My direct supervisor discusses how to improve safety with us." The items were all on a 5-point Likert scale (1 = strongly disagree to 5 = strongly agree). Raw data were handled by only the research team, and the lead consultant received only a report with analyzed, aggregated data to share with the customer. No identifiable personal information was collected from participants.
Survey data were collected from 29,185 frontline employees of 46 companies from various industries (e.g., manufacturing, construction, and transportation). Six respondents did not answer more than 50% of the scale questions, so they were excluded from the analysis, leaving a final sample for analysis of 29,179 participants. Company size ranged from 45 to 12,000, with an average of 1,274 employees. The within-company response rate ranged from 30.16% to 98.83%, with an average of 62.39%.
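The exclusion rule can be expressed as a simple filter. The Python sketch below is a hypothetical illustration, not the authors' procedure: it assumes that unanswered items are coded as None and that a respondent is dropped when more than half of the 32 items are unanswered.

```python
def keep_respondent(answers, max_missing_share=0.5):
    """Keep a respondent unless more than max_missing_share of items are unanswered."""
    missing = sum(1 for a in answers if a is None)
    return missing / len(answers) <= max_missing_share

# Hypothetical respondents on a 32-item survey (None = unanswered).
mostly_complete = [4] * 16 + [None] * 16   # exactly 50% missing: kept
mostly_missing = [4] * 15 + [None] * 17    # more than 50% missing: excluded
kept = [r for r in (mostly_complete, mostly_missing) if keep_respondent(r)]

# Sample size bookkeeping reported in the text.
final_n = 29185 - 6
```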

IRT analysis
IRT analyses were performed with the R open source package LTM (Latent Trait Modeling) developed by Rizopoulos (2006). IRT assumes the scale items are measuring a single construct, representing the target trait. Hence, unidimensionality of the OSC and GSC scales was individually examined before running IRT analyses. Both discrimination and difficulty parameters for every item of the safety climate scale were calculated. The discrimination (or differentiating) parameter represents the slope of the ORFs that capture the relationship between the latent construct (i.e., overall safety climate perception) and the probability of endorsing a particular response option of each item. The standardized discrimination parameter (i.e., z-score) can be used to judge the statistical significance of the item's trait-differentiating capacity such that if it is greater than 1.96, it is significant at p < 0.05. The difficulty parameters determine the location of the ORF along the axis of θ (i.e., latent trait; representing overall level of safety climate perception).
Based on the discrimination and difficulty parameters, the Item Information Curve (IIC) for each of the 32 items can be generated. The IIC shows the distribution of information an item provides on a continuum of the estimated level of the latent trait, θ. The area of IIC above the θ axis represents the amount of information provided by a specific item across the entire continuum of the latent trait of interest. An item typically offers a larger amount of item information if it has a greater discriminating parameter (i.e., steeper slopes) and a broader range of difficulty parameters along the θ axis.
The Test Information Function (TIF) for a scale can be generated by aggregating all the IICs of the items included in the scale. Similar to IIC, the area of TIF above the θ axis equals the total test information. Our aim was to shorten the original scales by selecting the items that provided the most information, while also ensuring the TIFs of the shortened scales maintained a shape that was similar to those of the original scales. Two approaches were used to determine how many items should be included in the shortened scales, as described below.
2.2.1.1. Shortening via item information criteria. We first shortened the original OSC and GSC scales by selecting items that offered above-average information because an item with more information can more precisely differentiate the overall level of OSC or GSC based on respondents' ratings on the item. The amount of the information is indicated by the area under the item information function curve across the θ axis. For a 16-item scale, if each item is assumed to differentiate the level of OSC or GSC by an equal amount, each item should provide 6.25% of the total test information (i.e., 100% divided by 16 items). In reality, some items have better discriminating ability than others. In other words, they provide more than 6.25% of the total test information. Therefore, we shortened the original OSC and GSC scales by selecting items that had better than average discriminating ability (i.e., providing more than 6.25% of total test information).
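The above-average criterion amounts to comparing each item's share of the total information with 1 divided by the number of items. A minimal Python sketch follows; the item labels and information values are hypothetical and use a 5-item scale (cutoff 20%) rather than the 16-item scales (cutoff 6.25%) for brevity.

```python
def select_above_average(item_info):
    """Return the items whose share of total information exceeds 1 / n_items.

    item_info maps an item label to the area under its item information curve.
    Selected labels are returned in descending order of information share.
    """
    total = sum(item_info.values())
    cutoff = 1.0 / len(item_info)  # 0.0625 for a 16-item scale
    shares = {name: info / total for name, info in item_info.items()}
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    selected = [name for name, share in ranked if share > cutoff]
    return selected, shares

# Hypothetical information values for a 5-item scale.
example_info = {"q1": 6.0, "q2": 3.5, "q3": 3.5, "q4": 5.0, "q5": 2.0}
selected, shares = select_above_average(example_info)
```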
2.2.1.2. Shortening via total test information. At the same time, in order to give companies more flexibility in the scale length they select, the original OSC and GSC scales were further shortened and made more concise by selecting the most discriminating items that, in total, retained at least 30% of the original total scale information (cf. 100% information provided by the entire 16 items of the OSC and GSC scales, respectively). Put differently, we retained items with the highest percentages of information until the sum of item information was equal to or greater than 30%. It should be noted that the 30% criterion was chosen in consideration of the minimum number of items (i.e., over three; Kenny, 2016) needed to ensure model identification in a confirmatory factor analysis (CFA) and acceptable reliability of the scale (Cortina, 1993). We tested the correlations between scores of these more concise scales and the original scales to examine the representativeness of the shorter versions and to justify the appropriateness of using the criterion of retaining at least 30% of total scale information (see Section 2.2.3).
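The rule of retaining the most informative items until a target share of information is reached can be sketched as a greedy selection. The Python code below is illustrative only; the item labels, information values, and the 50% target used in the example are hypothetical (the study used a 30% target on 16-item scales).

```python
def select_until_retained(item_info, target=0.30):
    """Add items in descending order of information until the target share is retained."""
    total = sum(item_info.values())
    ranked = sorted(item_info.items(), key=lambda kv: kv[1], reverse=True)
    selected, retained = [], 0.0
    for name, info in ranked:
        if retained >= target:
            break
        selected.append(name)
        retained += info / total
    return selected, retained

# Hypothetical information values for a 5-item scale.
example_info = {"q1": 6.0, "q2": 3.5, "q3": 3.5, "q4": 5.0, "q5": 2.0}
concise_items, retained_share = select_until_retained(example_info, target=0.50)
```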

Reliability test
We calculated the Cronbach's alpha of all shortened scales to determine the reliability of the shortened versions of the safety climate scales. The generally accepted criterion for good internal consistency (i.e., Cronbach's alpha ≥ 0.70) was used (Nunnally and Bernstein, 1994).
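For reference, Cronbach's alpha can be computed from the item variances and the variance of the summed score. The Python sketch below uses a small set of hypothetical 5-point responses (rows are respondents, columns are items); it is not the study data.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items table of complete responses."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)

# Hypothetical responses of five respondents to a 4-item scale.
responses = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
]
alpha = cronbach_alpha(responses)
meets_criterion = alpha >= 0.70
```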

Validity test
After we created and calculated the mean scores of the shortened versions of the safety climate scales, we then examined the convergent validity of the shortened and original scales by calculating the correlation between the scales' mean scores. Generally, a correlation between two variables of greater than 0.80 (Brown, 2006) or 0.85 (Kenny, 1979) indicates the two variables are measuring the same construct. Because the validity of the original Zohar and Luria (2005) safety climate scale has been demonstrated in various previously-published scientific articles (e.g., Zohar and Luria, 2005), if these two scales are demonstrated to measure the same construct (i.e., correlation coefficients between scores on the shortened and original versions fall above the recommended values), we are able to infer the validity of the IRT-based shortened version of the safety climate scale.
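The convergent validity check described here reduces to a Pearson correlation between two mean scores per respondent. The Python sketch below illustrates the computation with hypothetical responses and a hypothetical choice of shortened-scale items (the first two of four).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_score(row, item_idx):
    """Mean of the responses at the given item positions."""
    return sum(row[i] for i in item_idx) / len(item_idx)

# Hypothetical responses of five respondents to a 4-item full scale.
responses = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
]
full_scores = [mean_score(r, range(4)) for r in responses]
short_scores = [mean_score(r, [0, 1]) for r in responses]
r_short_full = pearson_r(short_scores, full_scores)
```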

Supplemental test for robustness
We further cross-validated the results by running analyses with 50% of the dataset (Davison and Hinkley, 1997) to examine the consistency of results regarding which items are most discriminating. We randomly selected 50% of respondents in each company to create two company-level stratified split-half samples. We ran the IRT analyses using the two split-half samples and compared the discrimination and difficulty parameters. When results are consistent and robust across the split-half samples, we report the results using only the whole sample.
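A company-level stratified split can be sketched as follows. This Python code is an illustrative assumption about how such a split might be drawn (shuffle within each company, then cut at the midpoint); the respondent identifiers, company labels, and seed are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split_half(respondents, seed=0):
    """Split (respondent_id, company) pairs into two halves within each company."""
    rng = random.Random(seed)
    by_company = defaultdict(list)
    for rid, company in respondents:
        by_company[company].append(rid)
    sample_a, sample_b = [], []
    for ids in by_company.values():
        shuffled = ids[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2  # with an odd count, the extra respondent goes to B
        sample_a.extend(shuffled[:half])
        sample_b.extend(shuffled[half:])
    return sample_a, sample_b

# Hypothetical respondents from two companies.
people = [(i, "company_1") for i in range(5)] + [(i, "company_2") for i in range(5, 9)]
sample_a, sample_b = stratified_split_half(people, seed=1)
```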

Basic descriptive
The mean OSC and GSC scores for the original, full-length scales were 3.95 (SD = 0.76) and 3.97 (SD = 0.79), respectively. Tables 1 and 2 list the option endorsement percentages (percentage of respondents who endorsed specified options 1-5 on a 5-point Likert scale for each item), mean score, and standard deviation for each item of the OSC and GSC scales, respectively.

Unidimensionality
We tested the unidimensionality of the OSC and GSC scales using Mplus 6.1 (Muthén and Muthén, 2010). Results of a confirmatory factor analysis (CFA) showed good model fit for a one-factor model of the OSC scale,

IRT model testing
We fit the items using graded response models (GRM; Samejima, 1997) because the OSC and GSC were all based on polytomous responses (i.e., five response options). GRM estimates one slope parameter and four difficulty parameters for each five-option item of the original scales. Two GRM models were estimated and compared for each scale: (1) a parsimonious GRM that specified an equal discrimination parameter for all of the items; and (2) a full GRM that freely estimated a discrimination parameter for each item. The first model was nested within the second model. Therefore, the change in −2*loglikelihood (−2*LL) between the two models, which follows a Chi-square distribution, can be used to evaluate which model fit better.
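The nested-model comparison can be sketched as a likelihood ratio test. In the Python code below, the −2*LL values are hypothetical placeholders, not the values obtained in this study. The degrees of freedom equal the number of additional slope parameters in the full model (16 − 1 = 15 for a 16-item scale), and 24.996 is the Chi-square critical value for 15 degrees of freedom at α = 0.05.

```python
# Chi-square critical value for 15 degrees of freedom at alpha = 0.05.
CHI2_CRIT_DF15_ALPHA05 = 24.996

def likelihood_ratio_test(neg2ll_constrained, neg2ll_full, n_items):
    """Return the change in -2*LL and its degrees of freedom.

    The constrained model estimates one common slope; the full model estimates
    one slope per item, so it has n_items - 1 additional parameters.
    """
    statistic = neg2ll_constrained - neg2ll_full
    df = n_items - 1
    return statistic, df

# Hypothetical -2*LL values for a 16-item scale.
stat, df = likelihood_ratio_test(1_000_500.0, 1_000_000.0, n_items=16)
full_model_preferred = df == 15 and stat > CHI2_CRIT_DF15_ALPHA05
```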
For the OSC scale, the parsimonious GRM yielded a −2*LL value of
For the two full GRM models, we also examined the model-data fit. The values of χ²/df for all possible item pairs and item triples of both OSC and GSC scales were less than 1, which indicates the two full GRM models fit the data well (Chernyshenko et al., 2001).

IRT parameters and information
Tables 3 and 4 list the parameter and information results of the full GRM models for OSC and GSC items, respectively. Fig. 1a and b depict the item information curve for each item of the OSC and GSC scales, respectively. Fig. 2a and b solid lines show the total test information function for the OSC and GSC scales, respectively.
For the 16 OSC items, the discrimination parameters ranged from 1.98 to 3.35, and the percentage of total test information each item provided ranged from 4.28% to 8.81%. This is consistent with previous model comparison results and indicates considerable variation in the OSC items' discrimination ability. The difficulty parameters reflected a sizeable range of the underlying construct, OSC (−2.74 to 0.92), indicating that the OSC scale was generally more useful in identifying companies with poor to average OSC safety climate scores than very high OSC scores (i.e., approximately 1SD+ mean range).
Results for the 16 GSC items were quite similar. The discrimination parameters ranged from 1.70 to 3.77, and the percentage of total test information each item provided ranged from 2.74% to 8.31%. This is consistent with previous model comparison results and indicates considerable variation in the GSC items' discrimination ability. The difficulty parameters reflected a sizeable range of the underlying construct, GSC (−2.86 to 0.86), indicating that the GSC scale was generally more useful in identifying companies with poor and average GSC scores than very high level of safety climate (i.e., approximately 1SD + mean range).

3.3.3. Item selection for the shortened scales
3.3.3.1. Item information criteria method. First, we shortened the scales by selecting items that had above-average discriminating ability (i.e., provided more than 6.25% of total test information), as described above. The shortened OSC scale included eight items: items 11, 3, 9, 14, 16, 12, 6, and 13 (descending order of information provided). This shortened OSC scale retained 56.94% of the total test information of the original scale. Reliability of the shortened 8-item OSC scale was 0.94. The difficulty parameters ranged from −2.65 to 0.85.
The shortened GSC scale included 11 items: items 10, 4, 3, 9, 5, 13, 6, 2, 14, 11, and 15 (descending order of information provided). This shortened GSC scale retained 77.71% of the total test information of the original scale. Reliability of the shortened 11-item GSC was 0.97. The difficulty parameters ranged from −2.70 to 0.80.
The dashed lines in Fig. 2a and b demonstrate the test information functions of the shortened OSC and GSC scales, respectively. More specifically, they show how well the ratings on given sets of safety climate scale items are capable of precisely differentiating respondents with different levels of overall safety climate perceptions. According to the figures, the test information function curves of the two shortened scales are similar to those of the original scales in both shape and coverage across the safety climate continuum, which indicates that they are representative of the original scales. Although the shortening of the scale inevitably results in the shrinkage of area under the curves, which is the amount of scale information, the shrinkage was relatively modest considering the sizeable number of items that were removed. Also, general trends of the estimated safety climate level and scale information relationship were similar (see 3.3.4), suggesting that item reduction did not distort the original scales.

Total test information method.
Because the shortened OSC and GSC scales together have 19 items, which may still be too long for some applications, we further shortened the original OSC and GSC scales. To provide more scale length options, we selected the most discriminating items that, in total, retained at least 30% of the original total test information. Based on this criterion, the more concise OSC scale included four items: items 11, 3, 9, and 14, which together retained 30.29% of the total test information of the original scale. Reliability of the four-item OSC scale was 0.89. The difficulty parameters ranged from −2.65 to 0.63.
The more concise GSC scale included four items: items 10, 4, 3, and 9, which together retained 30.88% of the total test information of the original scale. Reliability of the four-item GSC scale was 0.92. The difficulty parameters ranged from −2.57 to 0.63.
The dotted lines in Fig. 2a and b depict the test information functions of these more concise OSC and GSC scales, respectively. The figures show that the test information function curves of the two more concise scales had shapes and coverage across the safety climate continuum similar to the original scales.
Note to Table 3: Bold indicates that the item was selected for the shortened 8-item scale; italics (rank 1-4 in Value column) indicate that the item was selected for the more concise 4-item scale. GRM = graded response models; OSC = organization-level safety climate; OSC1-OSC16 refer to the original 16 items in Zohar and Luria (2005).

Preliminary validity evidence of the shortened scales
Results of the bivariate Pearson correlations between the original full-length scales and shortened scales, using their mean scores, are listed in Table 5. All the correlations were greater than 0.95 and significant (p < 0.01). Given that Zohar's original scales were predictive of important safety outcomes, the shortened scale scores should also be significantly related to those outcomes.

Supplemental analyses − split-half test for robustness
We further cross-validated the results by comparing IRT results of two split-half samples. Split-half sample A consisted of a randomly selected 50% of the respondents from each company, for a total of 14,589 respondents. Sample B consisted of the unselected 50% of respondents from each company, for a total of 14,590 respondents. Results of the IRT analyses were consistent and robust across the two samples: the slope and difficulty parameters of the items were all significantly correlated between sample A and sample B (p < 0.05). Furthermore, when using the two shortening methods described, the selected items remained the same for the two split-half samples. Therefore, we report the results using only the whole sample.

Discussion
Y.-h. Huang et al. Accident Analysis and Prevention 103 (2017) 96-104
The primary goal of the current study was to shorten Zohar and Luria's (2005) 32-item safety climate scale, which includes 16 items for organization-level safety climate (OSC) and 16 items for group-level safety climate (GSC), using an item response theory (IRT) analytical approach. We expect that a shortened safety climate scale will increase the practical utility of safety climate assessments by reducing respondent burden and increasing face validity, especially for users who are concerned with the amount of time needed for survey administration and the measurement integrity (e.g., reliability and validity). Moreover, a shortened safety climate scale would more likely allow researchers and practitioners to incorporate additional constructs into their survey assessment to advance the literature by, for example, examining and expanding the nomological network of safety climate. Based on a series of IRT analyses using survey responses gathered from nearly 30,000 employees representing 46 companies in various industries, the discrimination parameters revealed that all OSC and GSC items in Zohar and Luria's (2005) original scale were able to effectively discriminate (or differentiate) between high and low levels of safety climate. However, the difficulty parameters indicated that, overall, the OSC and GSC items were more useful in identifying companies with poor and average safety climate scores than those with high safety climate scores.
Item information for each item was then computed as a function of both discrimination and difficulty parameters. We adopted two different procedures in shortening the OSC and GSC by 1) identifying items with above-average discriminating ability (i.e., items providing more than 6.25% of total test information) and 2) developing more concise scales that in total retained at least 30% of the original total test information, thus creating two shortened versions of the OSC scale and two shortened versions of the GSC scale.
The first procedure resulted in eight OSC items and eleven GSC items that each had above-average discriminating ability (i.e., over 6.25%) and, respectively, retained 56.94% and 77.71% of total test information (see Tables 3 and 4). In addition, these 8-item OSC and 11-item GSC scales both had acceptable Cronbach's alpha estimates (0.94 and 0.97, respectively) and significant correlations with the original scale scores, thus supporting the reliability of these shortened OSC and GSC scales.
The second procedure identified four OSC and four GSC items from the original scale that are needed to retain at least 30% of the original total test information. These 4-item OSC and 4-item GSC scales also had acceptable reliability estimates (0.89 and 0.92, respectively) and significant correlations with the original scale scores.
Depending on measurement needs and objectives, some users may prefer the 8-item OSC and 11-item GSC shortened versions, while others may prefer the 4-item OSC and 4-item GSC shortened versions. It is important to note that we are not arguing that one length is superior to the other; we adopted two lengths to provide researchers and practitioners with two different shortened scale options that they can choose from based on measurement purposes/objectives, study design, and available resources (e.g., time).
The current study makes important contributions to the literature, organizations, and safety professional communities in several ways. First, Zohar (2010) highlighted that gaps exist in our understanding of how safety climate emerges and how it is influenced. The shortened versions of OSC and GSC scales identified in the current study would allow researchers and practitioners to incorporate additional constructs into their survey instruments which could potentially explain the emergence or changes in safety climate. In other words, the use of shortened safety climate scales has the potential to increase the chances of expanding our understanding of the relationships between safety climate and other constructs.
Second, the shortened OSC and GSC scales identified in the current study are expected to broaden the usage of safety climate assessment in field settings while retaining acceptable levels of scale information. For example, a company would more likely be able to incorporate the shortened OSC and GSC scales into their existing employee assessments (e.g., employee opinion surveys) and, thus, increase understanding of safety climate in their organization.
The current study also has limitations that highlight directions for future research. First, even though we used a relatively large sample representing a number of companies, biases may exist in the survey responses because it is typically more common for organizations that prioritize safety to participate in safety climate assessments. For example, our results showed that the study participants' scores were around 4 (mean OSC = 3.95, mean GSC = 3.97) on a 5-point Likert scale, suggesting the possibility that the sample was biased toward people who perceived a positive SC. However, as mentioned earlier, IRT parameters are not dependent on the level of target trait (i.e., SC) of the sample (Fan, 1998; Baker, 2001). In other words, the sample does not impact the estimate of the IRT parameters. Second, we were not able to collect data on safety outcomes to validate our shortened scales. However, Zohar and Luria's (2005) original scale has been demonstrated to have good validity with quite a few safety outcomes. Because our four shortened OSC and GSC scale scores were strongly related to the original scale scores (r > .95), we believe that our shortened scales have good validity and can be used to predict safety outcomes. Future studies can consider collecting responses on safety outcomes (e.g., self-reported safety behaviors and objective workers' compensation data) in order to establish criterion-related validity of the shortened scales. Third, the range of the difficulty parameters of our shortened scales focuses on the low end of safety climate, which is similar to the range of difficulty parameters from the original scale items (see Tables 3 and 4).
The low end difficulty range shows that our selected items are more useful in differentiating companies with poor, average, and better than average safety climate (less than +1 standard deviation), which is where safety improvement is most needed. However, these items were less efficient in differentiating the companies with highest safety climate (top 20%). Although this might be a minor issue, given safety climate assessment is commonly used for identifying companies with low safety climate for safety promotion, future studies may consider adding items with difficulty parameters in the higher end.
In conclusion, using an IRT analytical approach, the current study developed shortened versions of Zohar and Luria's (2005) 16-item OSC and 16-item GSC scales. Specifically, we identified 8 OSC and 11 GSC items with above-average discriminating ability, and further selected 4 OSC and 4 GSC items that retained at least 30% of the original total test information. It is our expectation that these shortened safety climate scales will increase the utility of safety climate assessments in both research and practice.
Note: The original 16 items are from Zohar and Luria (2005). The 11-item and 4-item shortened scales are referred to as the Liberty Mutual Safety Climate Short Scales.