Abstract

Introduction. Student evaluation of teachers’ effectiveness is one of the most common tools used to measure teaching performance and accountability by universities across the globe. The major purpose of this study was to evaluate the validity and underlying structure of the students’ evaluation of higher education teaching effectiveness scale used by all public universities in Ethiopia. Methodology. Data collected from 1397 students at Debre Markos University were used for this analysis. Cronbach’s alpha values and average interitem correlations were used to study the internal consistency reliability of the scale. Composite reliability, average variance extracted, the heterotrait-monotrait ratio, maximum shared variance, average shared variance, and interconstruct correlations were used to assess the construct validity of the scale, and exploratory factor analysis and confirmatory factor analysis were performed with 20 items to test the hypothesis that the teachers’ evaluation scale has a four-dimensional construct. We used different goodness-of-fit indices to measure the fit of the models. Results. The scale was shown to have good internal consistency and convergent validity but lacked discriminant validity. Furthermore, confirmatory factor analysis indicated that the four-factor model produced inadequate fit indices, revealing that the original factor structure of the scale did not hold. Conclusions. The results showed that the Student Evaluation of Teaching Effectiveness scale did not measure what it was intended to measure. Moreover, the exploratory and confirmatory factor analysis results indicate that a two-dimensional model explains the data structure better than the four-dimensional model, which places limitations on the scale’s use.

1. Introduction

Reliability and validity, jointly called the “psychometric properties” of measurement scales, are the two most important and fundamental features in the evaluation of any measurement scale [1–4]. Evidence of validity and reliability is a prerequisite for assuring the integrity and quality of a measurement scale [5].

Student evaluation of teachers’ effectiveness (SETE) is commonly used to measure teaching performance and accountability by various universities across the globe [6,7]. If carefully developed and systematically used, teacher evaluation is believed to have the potential to enhance teachers’ professional development, thereby improving students’ achievement [8]. Hence, the scales used to assess teacher performance should be accurate and exhaustive, allowing the results to provide useful information about teachers’ teaching effectiveness.

The effectiveness of an education system largely depends on the effectiveness of its teachers, which in turn has a large influence on student learning [9]. As a result, measuring teachers’ effectiveness is an important vehicle for promoting educational quality [9–11], which in turn enhances the quality of graduates [12].

Currently, student evaluation of teaching effectiveness is a common practice in almost every institution of higher education globally [13]. Over the years, however, different SETE scales have been proposed and developed. Consequently, numerous well-designed and validated instruments are available to measure higher education teachers’ teaching effectiveness [7], such as the Students’ Evaluation of Teaching Effectiveness Rating Scale [14], the Student Course Experience Questionnaire [15], the questionnaire for student evaluation of teaching [16], and the Teaching Proficiency Item Pool [17]. Developing SETE scales remains an ongoing effort to produce a psychometrically sound instrument that measures teacher effectiveness in higher education while taking into account the evolving characteristics of effective teaching.

A number of studies have addressed the various elements of SETE scales [7,18,19]. Researchers claim that a SETE scale should capture multiple aspects (dimensions) of good teaching practice [7]. Some studies have asserted that SETE scales need to be one-dimensional [18,20], whereas others hold that they are multidimensional [7, 19–31].

The variations in content and in the number of dimensions are attributed to the absence of agreement concerning the number and nature of these dimensions, which should be based on both theory and empirical testing [7]. The difficulty of identifying the characteristics of effective teaching, a prerequisite for constructing SETE scales, is another possible source of variation. Moreover, different institutions have different educational visions and policies and therefore develop SETE scales that are consistent with their preferences.

In the search for educational quality in Ethiopia, various attempts have been undertaken to generate meaningful and accurate indices of teacher effectiveness [32]. To this end, the Ministry of Science and Higher Education (formerly the Ministry of Education) in Ethiopia identified four competencies of teacher effectiveness that served as the conceptual basis for this study: subject matter knowledge (core competency), professional competency, ethical competency, and time management [33]. The first two indices relate to the instructional effectiveness of teachers, whereas the last two relate to the teacher’s personal qualities. Each of these dimensions focuses on a key aspect of a teacher’s professional qualification or responsibilities. As a result, it is critical to determine whether the scale measures the intended competencies or constructs accurately and consistently [14].

Nevertheless, no study has been conducted on the psychometric properties and validity of the SETE scale used by Ethiopian higher education institutions, despite many faculty members questioning the validity and reliability of SETE results for many years. The scale was not evaluated by independent experts or by the target population (students) to verify that the items adequately measure the domain of interest. No pretesting was carried out to assess the extent to which the items reflect the constructs of interest, and the scale was not evaluated for dimensionality, reliability, or validity. Rather, to the researchers’ best knowledge, the factor structure of the scale was constructed merely via discourse among subject matter experts, and no previous study has investigated the factor structure (dimensionality), internal consistency, or validity of the SETE scale. The scale in use has therefore been claimed to suffer from many problems concerning reliability, validity, dimensionality, and potential bias. Thus, this study was carried out with the major purpose of evaluating the reliability, validity, and underlying factor structure of the SETE scale. More specifically, the study aimed to determine whether the scale measures the unobservable constructs/domains it is supposed to measure, to check whether the scale reveals a factor structure equivalent to the one established by the experts, and to test the convergent and discriminant validity of the SETE scale.

Specifically, this study sought to answer the following questions: (1) Does the SETE scale demonstrate adequate reliability at the scale and item levels? (2) To what extent does the SETE scale show construct and discriminant validity? (3) Is the factor dimensionality of the SETE scale appropriate for measuring higher education teaching effectiveness?

2. Methodology

2.1. Population and Context of the Research

Debre Markos University is one of the public universities founded by the Ethiopian federal government in 2007. The university is located in East Gojjam, Amhara National Regional State, 300 km northwest of the capital, Addis Ababa. Currently, the university runs 51 bachelor’s, 47 master’s, and 2 Ph.D. programs in regular, continuing, and distance education streams. The university has more than 1556 academic staff and 1600 administrative staff serving over 30,000 students and the community at large.

In Ethiopian higher education institutions, the SETE is administered at the end of the semester, before the final exams are given and before students know their final grades. All teachers are evaluated by the students in the same semester.

2.2. Local Context of SETE Development Process

The Students’ Evaluation of Teaching Effectiveness (SETE) scale (Table 1) is one of the three harmonized scales used to measure teachers’ effectiveness. These scales were developed by the Ethiopian Ministry of Science and Higher Education. A group of subject matter experts developed the SETE scale, which comprises 20 items, and judged it to have four dimensions: subject matter knowledge, professional skills, ethical quality, and time management [34]. Of the four constructs, knowledge of the subject matter was considered the core competency.

Three bodies are involved in assessing teachers’ competencies: students, peers, and immediate supervisors [34]. The students’ evaluation of teaching effectiveness accounts for 50% of the total evaluation, and the remaining 30% and 20% are accounted for by immediate supervisors and colleagues, respectively. The SETE items are rated on a five-point Likert scale, of which only one alternative may be chosen. Scores range from 1 to 5, where 1 = “strongly disagree,” 2 = “disagree,” 3 = “neutral,” 4 = “agree,” and 5 = “strongly agree.” The 20 items that make up the SETE scale were broadly structured to reflect two teaching effectiveness factors or constructs: 14 items for the core and professional competency constructs, and the remaining six items for the ethical and time management constructs.

To ensure the relevance of the items to the general principles of teaching in higher education settings, the development of the scale went through several steps to receive feedback from different stakeholders. To do so, different focus group discussions were held with department heads, students, and college deans.

2.3. The Data and Study Participants

This study is a secondary analysis of data from the teachers’ assessment survey, which is undertaken at Debre Markos University at the end of every semester to monitor the performance of teachers with respect to teaching and research activities. The following steps were followed to collect (extract) the data for this study. First, the teachers to be included in the sample were randomly selected. Then, an Excel data abstraction tool was prepared to record and manage the teachers’ assessment scores. To assist with data abstraction and data entry, a total of 10 data collectors, one from each sampled department, were selected. The data collection process was supervised by the quality assurance office of the university, and the data were collected anonymously. The evaluation records of 1397 students were randomly selected from a population of 5257 regular students who were active in the 2018/2019 academic year. For lower costs and smaller prediction errors, multistage stratified random sampling was employed to select teachers’ evaluation records. In the first stage, the population of teachers was divided into homogeneous, mutually exclusive subgroups (colleges/faculties). In the second stage, a sample of departments was randomly taken. In the third stage, teachers were stratified by sex. Finally, a sample of teachers was randomly selected within each sex category, and their evaluation records were extracted.

The probability proportional to size sampling method was used to select teachers from each department. Accordingly, 92 (78.6%) of the sampled teachers were male and the remaining 25 (21.4%) were female.

2.4. Data Analysis
2.4.1. Reliability and Validity of the Scale

Descriptive measures such as Cronbach’s alpha and average interitem correlations were used to assess internal consistency reliability.
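For illustration, these descriptive reliability statistics can be obtained in R with the psych package. The sketch below assumes that the item responses are stored in a data frame named sete whose columns follow the item labels used in the tables (e.g., Core1–Core6 for the core competency construct); all object names are illustrative, not the study’s exact code.

```r
# Sketch (assumed data frame "sete" and item names): internal consistency per construct
library(psych)

core_items <- sete[, c("Core1", "Core2", "Core3", "Core4", "Core5", "Core6")]

alpha_core <- psych::alpha(core_items)
alpha_core$total$raw_alpha        # Cronbach's alpha for the construct
alpha_core$item.stats$r.drop      # corrected item-total correlations
alpha_core$alpha.drop             # "alpha if item dropped" for each item

# Average interitem correlation (AIIC) from the interitem correlation matrix
r <- cor(core_items, use = "pairwise.complete.obs")
aiic_core <- mean(r[lower.tri(r)])
aiic_core
```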

We used the CFA method to test the convergent validity, discriminant validity, and nomological validity of the measurement model [35]. Convergent validity measures the extent to which different measures of the same construct converge or strongly correlate with one another, whereas discriminant validity is the extent to which measures of different constructs diverge or minimally correlate with one another [36]. Convergent validity was assessed using composite reliability (CR) and the average variance extracted (AVE). CR, which indicates the shared variance among the observed variables of a latent construct, was applied to test the degree to which the indicator variables converge and share their proportion of variance [35]. It is calculated as
$$\mathrm{CR} = \frac{\left(\sum_{i=1}^{p}\lambda_{i}\right)^{2}}{\left(\sum_{i=1}^{p}\lambda_{i}\right)^{2} + \sum_{i=1}^{p}\delta_{i}},\qquad (1)$$
where $\lambda_i$ is the completely standardized loading for the $i$th indicator, $\delta_i$ is the variance of the error term for the $i$th indicator, and $p$ is the number of indicators.

Moreover, the average variance extracted (AVE) represents the average amount of variance of a construct that is explained by its indicator variables relative to the overall variance of its indicators. This is similar to the explained variance in EFA, as it measures the average variance in the items that a construct manages to explain [37]. A higher AVE value indicates lower error variance. The AVE for the $j$th construct, denoted by $C_j$, is defined as
$$\mathrm{AVE}_{C_j} = \frac{\sum_{k=1}^{K_j}\lambda_{jk}^{2}}{\sum_{k=1}^{K_j}\lambda_{jk}^{2} + \sum_{k=1}^{K_j}\theta_{jk}},\qquad (2)$$
where $\lambda_{jk}$ is the indicator loading and $\theta_{jk}$ is the error variance of the $k$th indicator ($k = 1,\ldots,K_j$) of the $j$th construct score ($C_j$), and $K_j$ is the number of indicators of construct $C_j$. If all indicators are standardized (i.e., having a mean of 0 and a variance of 1), equation (2) simplifies to
$$\mathrm{AVE}_{C_j} = \frac{1}{K_j}\sum_{k=1}^{K_j}\lambda_{jk}^{2}.\qquad (3)$$

In this case, the AVE is the same as the average squared standardized loading and is equivalent to the mean value of the indicator reliabilities. Now, let $r_{ij}$ be the correlation coefficient between the construct scores of constructs $C_i$ and $C_j$. The squared interconstruct correlation $r_{ij}^{2}$ indicates the proportion of variance that constructs $C_i$ and $C_j$ share.

According to the Fornell–Larcker criterion [38], discriminant validity is established if the condition in equation (4) holds:
$$\sqrt{\mathrm{AVE}_{C_i}} > r_{ij}\quad\text{for all } j \neq i.\qquad (4)$$

That is, the square root of AVE should be greater than the interconstruct correlations for all constructs. Discriminant validity can also be evaluated using the maximum shared variance (MSV) and average shared variance (ASV), which measure the maximum variance and average variance among constructs, respectively. Both measures should be lower than the AVE for all constructs to confirm discriminant validity [39].
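As an illustrative sketch, equations (1)–(4) can be computed directly from the standardized solution of a fitted lavaan model. Here fit4 denotes the fitted four-factor measurement model (see the sketch in Section 2.4.4); this is an assumption for illustration, not the study’s exact code.

```r
# Sketch: CR, AVE, and the Fornell-Larcker check from a fitted lavaan model "fit4"
library(lavaan)

std    <- lavInspect(fit4, what = "std")   # completely standardized solution
lambda <- std$lambda                       # item loadings (rows = items, cols = constructs)
theta  <- diag(std$theta)                  # standardized error variances

cr <- apply(lambda, 2, function(l) {
  idx <- l != 0                            # items belonging to this construct
  sum(l[idx])^2 / (sum(l[idx])^2 + sum(theta[idx]))      # equation (1)
})
ave <- apply(lambda, 2, function(l) mean(l[l != 0]^2))   # equation (3)

phi   <- std$psi                           # interconstruct correlation matrix
max_r <- apply(abs(phi - diag(diag(phi))), 2, max)       # largest correlation per construct
sqrt(ave) > max_r                          # Fornell-Larcker condition, equation (4)
```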

2.4.2. The Heterotrait-Monotrait Ratio Approach

Henseler et al. [39] suggested using the heterotrait-monotrait ratio (HTMT) of correlations, which is the average of the heterotrait-heteromethod correlations (i.e., the correlations of indicators across constructs measuring different phenomena) relative to the average of the monotrait-heteromethod correlations (i.e., the correlations of indicators within the same construct). Because there are two monotrait-heteromethod submatrices, we take the geometric mean of their average correlations. Consequently, the HTMT of constructs $C_i$ and $C_j$ with $K_i$ and $K_j$ indicators can be formulated as
$$\mathrm{HTMT}_{ij} = \frac{\dfrac{1}{K_i K_j}\sum_{g=1}^{K_i}\sum_{h=1}^{K_j} r_{ig,jh}}{\sqrt{\dfrac{2}{K_i(K_i-1)}\sum_{g=1}^{K_i-1}\sum_{h=g+1}^{K_i} r_{ig,ih}\;\cdot\;\dfrac{2}{K_j(K_j-1)}\sum_{g=1}^{K_j-1}\sum_{h=g+1}^{K_j} r_{jg,jh}}},\qquad (5)$$
where the numerator of equation (5) is the average heterotrait-heteromethod correlation, and the denominator is the geometric mean of the average monotrait-heteromethod correlation of construct $C_i$ and the average monotrait-heteromethod correlation of construct $C_j$.
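The HTMT ratios in equation (5) can be obtained, for example, with the htmt() helper in the semTools package. The sketch below is illustrative only; the model string model4 and the item-to-construct assignment follow the item labels that appear in the tables and are assumptions rather than the study’s exact code.

```r
# Sketch: HTMT ratios for the four-construct SETE model (item names are assumptions)
library(semTools)

model4 <- '
  CC =~ Core1 + Core2 + Core3 + Core4 + Core5 + Core6
  PC =~ Profe7 + Profe8 + Profe9 + Profe10 + Profe11 + Profe12 + Profe13 + Profe14
  EC =~ Ethic15 + Ethic16 + Ethic17 + Ethic18
  TM =~ TM19 + TM20
'
# Values above the 0.85 threshold suggest a lack of discriminant validity
htmt(model4, data = sete)
```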

2.4.3. Exploratory Factor Analysis

Exploratory factor analysis (EFA) is appropriate when the goal of research is to create a measurement scale that reflects a meaningful underlying construct(s) represented in the observed variables [40]. It is a popular approach to test whether item-level discriminant validity is established by assessing cross-loading [39].

In EFA, the challenge is determining the required number of factors to retain a sufficient amount of variance and, at the same time, to achieve a substantial reduction in dimensionality [41,42]. Several methods are available for determining the number of components or factors for EFA, but they do not always lead to the same or even similar results. Despite the importance of factor retention decisions and extensive research on methods for making retention decisions, there is no consensus on the appropriate criteria to use [43].

2.4.4. Confirmatory Factor Analysis

A confirmatory factor analysis (CFA), which has wide applications in scale development and construct validation [35], was used to test the validity of the factor structure of the teaching effectiveness assessment scale used by students. Confirmatory factor analysis is a popular structural equation model that provides an explicit account of how observed variables are related to assumed latent variables [44]. CFA provides a framework for confirming prior notions about the factor structure of scales [45]. It has two components. The first is a measurement model that relates a set of observed variables, also called manifest variables (items in our case), to a usually smaller set of latent variables (factors or constructs). The second is a structural model that explores the relationships between latent variables through a series of recursive and nonrecursive relationships. In this study, a four-factor measurement model was specified to test the validity and reliability of the observed indicator items measuring the knowledge of the subject matter (core competency), professional skills (competency), ethical quality, and time management constructs. Professional competency here refers to the degree to which teachers utilize their knowledge, skills, and good judgment related to their teaching activities to render tasks of acceptable quality.

Confirmatory factor analysis was carried out using the lavaan package version 0.6-7 [46] in the R statistical software for Windows [47]. By examining three critical sets of results (parameter estimates, fit indices, and potential modification indices), researchers can formally test the measurement hypotheses and modify them to be more consistent with the actual structure of participants’ responses to the scale.
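A minimal lavaan sketch of the four-factor measurement model is given below, repeating the illustrative model string model4 from Section 2.4.2 for completeness. The estimator, item names, and data object (the testing half described in Section 3.4) are assumptions consistent with the description in Sections 3.1 and 3.4, not the study’s exact code.

```r
# Sketch: four-factor CFA for the SETE items with DWLS estimation
library(lavaan)

model4 <- '
  CC =~ Core1 + Core2 + Core3 + Core4 + Core5 + Core6
  PC =~ Profe7 + Profe8 + Profe9 + Profe10 + Profe11 + Profe12 + Profe13 + Profe14
  EC =~ Ethic15 + Ethic16 + Ethic17 + Ethic18
  TM =~ TM19 + TM20
'

fit4 <- cfa(model4,
            data      = testing,          # hold-out half described in Section 3.4
            ordered   = names(testing),   # treat the Likert items as ordinal
            estimator = "DWLS",           # robust to non-normal, ordinal indicators
            std.lv    = TRUE)             # standardize the latent constructs

summary(fit4, fit.measures = TRUE, standardized = TRUE)
modindices(fit4)                          # potential modification indices
```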

3. Results

3.1. Preliminary Data Analysis

Prior to the analysis, we examined missing values and outliers. The missing values of the corresponding variables were imputed by their median values. Figure 1 shows a graphical visualization of missing values, produced with the visdat package [48]. The figure provides the pattern and percentages of missing values and shows where the missingness occurred in the data. From Figure 1, there were 3.2% missing values and 96.8% present values in the dataset. Missing data at the item level were low, except for items Core5 (6.9%), Core (9.3%), and Ethic15 (7.3%). From the figure, it is apparent that the pattern of missingness is random.
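The missingness inspection and median imputation described above can be reproduced along the following lines (a sketch; visdat is the package cited in [48], and the data frame name sete is an assumption):

```r
# Sketch: visualize missing values and impute them with the item medians
library(visdat)

vis_miss(sete)   # percentage and pattern of missing values per item (cf. Figure 1)

sete_imputed <- as.data.frame(lapply(sete, function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)   # median imputation per item
  x
}))
```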

For variables measured on an ordinal scale, neither the assumption of normality nor the continuity property is met [49]. The results presented in Table 2 show that the skewness measures are significantly negative for all items, indicating that high ratings are more common than low ratings. Kurtosis exceeds the reference value of the normal distribution (equal to 3) for the majority of competency components, suggesting the existence of heavy tails compared to the Gaussian distribution; that is, the distributions are leptokurtic, with fatter tails than the normal distribution. Because the assumption of normality was severely violated, the diagonally weighted least squares (DWLS) method, a robust WLS method [50], was used, as it provides more accurate parameter estimates [49–51].

3.2. Reliability of the SETE Scale

Table 2 presents the means and standard deviations of the items and the internal reliability coefficients of the factors/constructs. All alpha estimates except that of time management skill are above the traditional cutoff of 0.70, revealing that the other three teaching competency dimensions/factors have sufficient internal consistency: subject matter knowledge (core competency) (Cronbach’s alpha = 0.88), professional competency (Cronbach’s alpha = 0.89), and ethical competency (Cronbach’s alpha = 0.83).

Furthermore, the corrected item-total correlation ranged from 0.44 to 0.85, which exceeded the accepted cutoff of 0.40 proposed by Nunnally [52], indicating that each item was related to the corresponding components of the SETE scale.

In addition, the “alpha if item dropped” values are lower than or equal to the overall alpha for all items of the three factors, indicating that every item contributes positively to the internal consistency of its factor [53]. Table 2 also shows that the item means ranged from 3.707 to 4.719, while the item standard deviations ranged from 0.70 to 1.46, indicating a fairly narrow spread around the mean.

The average interitem correlation (AIIC) was computed from the interitem correlation matrix presented in Figure 2. The ideal range for the AIIC value is between 0.20 and 0.40 [54]. Piedmont (2014) claimed that an AIIC score of less than 0.20 indicates that the items are not well correlated and do not measure the same construct or factor, whereas an AIIC score greater than 0.40 suggests that the items in the same construct are redundant.

Accordingly, the average interitem correlations (AIIC) for the core competency, professional competency, ethical quality, and time management skill constructs were 0.58, 0.52, 0.59, and 0.48, respectively, suggesting that the items within each construct of the SETE scale measure the same underlying construct. However, because all values exceed 0.40, some of the items in each competency component appear to be redundant.

3.3. Validity of the SETE Scale

The discriminant and convergent validity of the scale were tested using various techniques. The interitem correlation matrix presented in Figure 2 was used for a first visual diagnosis of the items and the scale structure. The figure shows that the items within each factor or construct were highly correlated, implying that the items in each construct were related and suggesting convergent validity.

Composite reliability (CR) and the average variance extracted (AVE) were used to test the extent to which the indicator variables converge and share their proportion of variance. According to Adedeji et al. (2017), a cutoff point of 0.7 or above for CR is required to establish that the indicator items are reliable, and a minimum value of 0.5 is required for AVE. Furthermore, CR values higher than the AVE are required to establish convergent validity. Tables 3 and 4 present the convergent and discriminant validity assessment results. The CR values for CC (0.88), PC (0.88), and EC (0.85) are above the 0.7 cutoff point, whereas TM (0.67) falls slightly below it. Moreover, convergent validity is established when CR is higher than AVE and the AVE is higher than 0.5 [45,55]. These conditions were met in this study; consequently, the convergent validity of the scale was verified. Furthermore, from the lavaan output presented in Table 5, all items appeared to be significantly associated with their respective constructs, which provides additional evidence of convergent validity.

Discriminant validity analyzes how well the constructs are distinct and uncorrelated. The scale faces a discriminant validity problem if the items correlate more highly with variables outside their parent factor than with the variables within their parent factor; that is, the latent factor is better explained by some other variables (from a different factor) than by its own observed variables. We used the Fornell–Larcker criterion [38], which compares the square root of the average variance extracted (AVE) with the correlations of the latent constructs, to assess discriminant validity. The interconstruct correlations among the four constructs are shown in Table 3; a strong correlation between constructs is evidence of their dependence on one another. Based on the estimates presented in Table 3, the square root of AVE is less than the interconstruct correlations. Furthermore, the results presented in Table 4 indicate that the maximum shared variance is greater than the average shared variance, which in turn is greater than the average variance extracted (i.e., MSV > ASV > AVE). Both results therefore point to a lack of discriminant validity. The highlighted cells in Table 4 show the HTMT ratios of the correlations between pairs of constructs, calculated using equation (5), as proposed by Henseler et al. [39]. The HTMT values are above the suggested threshold of 0.85 [56], revealing that discriminant validity does not exist between the reflective constructs, which supports the above finding. In conclusion, the scale faces a discriminant validity problem.

Discriminant validity was also checked by comparing the loading of each item across the different constructs. If all items load more highly on the construct they are intended to measure than on any other construct in the model, discriminant validity is met [57]. According to the EFA output presented in Table 4, considerable cross-loadings were observed. The nomological validity of the scale was checked by examining the significance of the correlations between constructs (interconstruct correlations) in the model [35]. Accordingly, the 95% confidence intervals for the interconstruct correlations in Table 3 do not contain 1, implying statistically significant interconstruct correlations. This shows the poor nomological validity of the SETE scale.

3.4. The Factor Structure of the Scale

In this analysis, we used twofold cross-validation (CV), such that the data were divided into two random samples. The first half of the dataset, with 699 observations (the training data), was used to find the possible factor structure of the SETE scale using exploratory factor analysis, and the second half, with 698 observations (the testing data), was used to verify the factor structure of the scale.
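The split-sample procedure can be reproduced with a simple random partition, as in the following sketch (the seed and object names are illustrative; sete_imputed denotes the imputed item data from Section 3.1):

```r
# Sketch: split the data into a training half (for EFA) and a testing half (for CFA)
set.seed(123)                                    # illustrative seed
n         <- nrow(sete_imputed)
train_idx <- sample(seq_len(n), size = floor(n / 2))

training <- sete_imputed[train_idx, ]            # observations used for EFA
testing  <- sete_imputed[-train_idx, ]           # observations used for CFA
```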

3.4.1. Exploratory Factor Analysis

An exploratory factor analysis (EFA) with varimax rotation was performed on the training data to check whether the item grouping was consistent with the proposed theory, that is, to test the structural validity. Before conducting the factor analysis, the item-to-item correlations were examined using the Kaiser–Meyer–Olkin (KMO) test and Bartlett’s test of sphericity, to see whether there is redundancy between the variables that can be summarized by the factors. The KMO value was 0.96 and Bartlett’s test of sphericity was statistically significant, indicating excellent sampling adequacy. Thus, all variables could be considered for EFA [45,58].

We applied the scree plot test [59] and parallel analysis [60] to determine the required number of factors to retain. The rule for scree plots is to retain the factors above the point where the curve starts to level off (the inflection point) and eliminate any factor below it [61]. From the scree plot (left panel in Figure 3), the first two factors of the scale have eigenvalues greater than one. Parallel analysis offers a more objective way to assess the appropriate number of factors, where factors with adjusted eigenvalues greater than one are retained [62]. Both methods suggest retaining two factors.
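The sampling adequacy tests and factor-retention diagnostics described in this and the preceding paragraph can be carried out with the psych package, as sketched below (the function choices and object names are assumptions, not necessarily the study’s exact code):

```r
# Sketch: sampling adequacy and factor-retention diagnostics on the training half
library(psych)

KMO(training)                                        # Kaiser-Meyer-Olkin measure
cortest.bartlett(cor(training), n = nrow(training))  # Bartlett's test of sphericity

# Scree plot and parallel analysis to decide how many factors to retain
fa.parallel(training, fa = "both")
```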

The preliminary exploratory factor analysis (EFA) results, described in Table 6, revealed that the items are not grouped under their respective factors as theoretically defined. Hence, the factor structure of the scale is not consistent with the intention of the experts who devised the scale, indicating that the results from the EFA did not support the theoretical factor structure. The h2 column in the table represents the communality, which should be higher than 0.3. The root mean square of the residuals (RMSR = 0.05) is less than 0.1, verifying that the retained factors are appropriate for describing the correlation structure. From the results presented in Table 6, all items demonstrated high loadings, ranging from 0.48 to 0.85, implying that all items are important. Items in italics load on the second factor. The factor loadings of the first factor ranged from 0.48 to 0.77, while those of the second factor ranged from 0.59 to 0.85. The first factor explained 31% of the total variance and the second factor explained 25%, so the two-factor solution explains 56% of the total variance. The output also reports the interfactor correlation. Based on our qualitative judgment of item content, the relatively minor cross-loadings, and the expected association of the factors, we decided to keep the cross-loading items and assign each to the factor on which it showed the stronger loading and was most relevant. However, the two items “core 5” and “profe7” loaded almost equally on the two factors; hence, we decided to remove them.
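For illustration, a two-factor EFA solution of the kind reported in Table 6 can be produced along these lines (the extraction method and object names are assumptions):

```r
# Sketch: two-factor EFA with varimax rotation on the training half
library(psych)

efa2 <- fa(training, nfactors = 2, rotate = "varimax", fm = "ml")

print(efa2$loadings, cutoff = 0.30)   # loadings, cross-loadings, and explained variance
efa2$communality                      # h2 values (communalities, should exceed 0.3)
efa2$rms                              # root mean square of the residuals (RMSR)
```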

3.4.2. Results of CFA for the Underlying Structure

The models analyzed were identified; that is, the number of known (nonredundant) elements in the observed covariance matrix exceeded the number of free parameters to be estimated [63].

Confirmatory factor analysis was carried out to check whether the number of factors (or constructs) and the indicator variables conform to what is expected based on theory. Multiple fit indices were used to evaluate whether the models adequately reflected the observed data. Moreover, the two models were compared to assess whether they fit the data equally well. We used the testing data for this purpose.

Figure 4 presents the path diagram of the confirmatory factor analysis for two-factor (left panel) and four-factor (right panel) models, where a single-headed arrow is used to imply a direction of the assumed causal influence, and double-headed arrows are used to represent the covariance between two latent variables (factors).

From the path diagram for the two-factor model, the measurement errors ranged between 0.36 (Profe10) and 0.61 (Profe13). Similarly, the four-factor model produced measurement errors ranging between 0.30 (Ethic17) and 0.59 (TM19). The increase in measurement error for the two-factor model is due to specifying fewer factors than expected [43].

For the two-factor model, the squared multiple correlations, that is, the amount of variance explained by the latent variable, ranged between 0.48 and 0.75, and all factor loadings were equal to or greater than 0.61 (p13). The correlations between the latent constructs ranged between 0.73 and 0.87. The interconstruct correlations of core competency with professional competency, ethical quality, and time management were 0.87, 0.8, and 0.7, respectively. Similarly, the interconstruct correlations of professional competency with ethical quality and time management were 0.78 and 0.87, respectively, whereas the interconstruct correlation between ethical quality and time management was 0.73.

Table 5 shows that the standardized coefficients for the two-factor model are significant at the 0.001 level, implying that all items are significantly correlated with their respective constructs. Because the latent construct is standardized (mean = 0, SD = 1), each coefficient is interpreted as the increase (or decrease) in the score of an item for every standard deviation increase in the factor/construct. For example, β = 0.69 means that for every standard deviation increase in core competency, “Core1” increases by 0.69. In addition, in the SETE scale, the “profe12” item had the highest association with its construct (β = 1). The values in Table 7 can be interpreted similarly.

3.4.3. Model Fit and Comparison

The fit of the measurement model to the data was examined first. The best model should have a relative chi-square (χ²/df) value close to 1.

We also used the comparative fit index (CFI) and the Tucker–Lewis index (TLI), which measure whether the model fits the data better than a more restricted baseline model, together with the root mean square error of approximation (RMSEA). However, the cutoff values for these indices are somewhat arbitrary, and the meaning of a “good” fit and its relationship with fit indices are not well understood [64].

The absolute and comparative fit indices for the two-factor and four-factor CFA models are presented in Table 8. The comparative fit indices for the four-factor model, CFI (0.89) and TLI (0.87), are below the acceptable cutoff point of 0.90, indicating a relatively poor fit [65]. The absolute fit indices of the four-factor model, RMSEA (0.088) and SRMR (0.06), are considered an indication of only fair fit [66].

For the two-factor model, however, the comparative fit indices CFI (0.999) and TLI (0.999) were greater than the 0.90 threshold, indicating an improvement over the four-factor model. We also found that the SRMR (0.056) indicated good fit (<0.06) and the RMSEA (0.008) indicated good fit (<0.05), showing that the two-factor model fits the data well [67].

In conclusion, the two-dimensional model provided better goodness-of-fit indices than the four-dimensional model, implying that the two-factor model fits the data better than the four-factor model.

The model comparison test for explaining the factor structure of the scale was nonsignificant, showing that the four-factor model did not do a better job than the two-factor model. Moreover, the AIC was 29661.43 for the two-factor model and 29677.60 for the four-factor model. Thus, the two-factor model, with the smaller AIC, should be preferred.
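The fit indices and the model comparison reported above can be extracted as sketched below, where fit2 and fit4 denote the fitted two- and four-factor lavaan models. Note that AIC and the likelihood-based comparison test require a likelihood-based estimator such as ML, so this is a sketch under that assumption rather than the study’s exact code.

```r
# Sketch: fit indices and comparison of the two-factor and four-factor models
library(lavaan)

fitMeasures(fit4, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))
fitMeasures(fit2, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))

anova(fit2, fit4)   # model comparison test
AIC(fit2)           # smaller AIC indicates the preferred model
AIC(fit4)
```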

4. Discussion

In educational institutions, evaluating teachers’ effectiveness is similar to evaluating students’ learning [31]. Student evaluation of teachers’ effectiveness is a current and controversial topic in higher education and research. Many stakeholders, including teachers, are doubtful of SETE’s effectiveness and validity for both formative and summative purposes [7,68]. Thus, the primary goal of this study was to examine the psychometric properties of the SETE scale used by Ethiopian higher education institutions.

From the results, the SETE scale was shown to have good internal consistency and good convergent validity. This result complements the findings of [7, 19–30, 69], although those scales differ in dimensionality and number of items. However, unlike the student evaluation of higher education teachers’ effectiveness scales developed by [18,19,21,22,25,31,70], the SETE scale used by Ethiopian higher education institutions faced a validity problem. Moreover, the CFA results showed poor fit indices, revealing that the underlying four-factor structure of the SETE scale is insufficient to explain the data structure. A likely reason is that the SETE scale was developed on theoretical grounds alone; its development should also have gone through quantitative exploration in addition to the experts’ theoretical evaluation, which by itself only helps to ensure content validity. Scale development is not a straightforward endeavor [71]. Hinkin [72] pointed out three phases of creating a rigorous scale: item development (identification of the domains, item generation, and content validity or theoretical analysis), scale development (pretesting the items, survey administration and sample size, item reduction analysis, and extraction of factors), and scale evaluation (tests of dimensionality, reliability, and validity). In the researchers’ experience, however, the development of the SETE scale did not follow the procedures recommended by Hinkin [72].

5. Conclusion

The construction of valid and reliable scales requires systematic research, in which both theoretical knowledge and empirical data play an important role. This study is the first attempt to assess the validity of the SETE scale that Ethiopian public higher education institutions use to evaluate their teachers’ performance, providing evidence on its convergent, discriminant, and nomological validity. Accordingly, the scale lacks both discriminant and nomological validity despite its convergent validity, revealing that the SETE scale does not discriminate well among the constructs it measures.

Although further research is needed to confirm these results with multicenter data, the two-factor model with 18 items yielded a better representation of the factor structure of the SETE scale. This may be because the dimensionality of the scale was based on the opinion of experts only and did not necessarily capture the important competency components of teachers. Overall, the findings indicate that the SETE scale cannot be used to effectively assess teachers’ teaching effectiveness unless further improvements are made to the scale and its development process.

This work has practical, theoretical, and policy implications for a variety of stakeholders at various levels. In practice, this research can assist higher education institutions and the Ministry of Science and Higher Education in identifying the SETE scale’s psychometric gaps. It can thus serve as a framework for improving the instrument’s reliability and validity so that it clearly measures teachers’ effectiveness and, in turn, for proposing interventions to increase teachers’ performance and motivation. The findings of this study can also offer new knowledge and concepts on the assessment of teachers’ performance and pedagogical competencies in higher education, especially in Ethiopia. As previously stated, no investigation of the psychometric features of the SETE scale had been done in Ethiopia.

6. Research Limitations and Future Directions

The results of this study should be considered in light of the following limitations. First, although the scale is harmonized and used by all public universities, this analysis used data from a single university, which may not be generalizable to the remaining public universities across the country. Hence, this study emphasizes the need to obtain large amounts of data from multiple universities to further strengthen the outcomes of the study. The study also assumed that students rated their teachers without bias or prejudice. However, it is well perceived from experience that students who receive higher grades in a course rate teachers more favorably, whereas low achievers may retaliate against their teachers with low ratings. Other factors, such as the time of evaluation, physical attractiveness of the teacher, course difficulty, age, and the teacher’s personality, also influence student ratings [28,73]. Despite its convenience, the current study used one dataset for both PCA and CFA; hence, further studies are needed to validate both the SETE scale framework and its measures. Careful planning of the validation process should be carried out with larger datasets to obtain stronger evidence for the findings and to develop a scale that measures teaching effectiveness appropriately. Furthermore, analysis at a different point in time is needed to test the test-retest reliability of the scale. Although the maximum number of items per scale will depend on the complexity of the variable being measured, increasing the number of items per scale improves the scale’s richness in capturing information [74]. However, the “Time management” subscale has only two items, which is another limitation of this study.

Appendix

Student Evaluations of Teacher’s Effectiveness Scale

Dear student, please check the boxes indicating how you evaluate your teachers for this semester overall.

Date: _______ Your field of study: _______Year: ______ Your gender: ______ Course: _______

Based on the evaluation point, rate by placing a circle on any of the ranks indicated, ranging from very low to very high. Note: VL = very low; L = low; M = medium; H = high; VH = very high; NA = not applicable.

Abbreviations

CFA: Confirmatory factor analysis
CR: Composite reliability
EFA: Exploratory factor analysis
Ev: Eigenvalue
HTMT: Heterotrait-monotrait
PCA: Principal component analysis
SETE: Student evaluation of teaching effectiveness.

Data Availability

The datasets analyzed in this study are available from the corresponding author on reasonable request.

Ethical Approval

The researchers obtained permission from the Debre Markos University Quality Assurance Office to use the data without fabrication or falsification. As the study was based on secondary data, informed consent was not obtained from the study participants, but the anonymity and confidentiality of the data were assured.

Conflicts of Interest

The authors declare that none of them has a financial or personal relationship with other people or organizations that could inappropriately influence or bias the content of the paper.

Authors’ Contributions

MGA contributed to the study concept and design of the statistical methodology, performed the analysis and interpretation of the data, and wrote the first draft of the manuscript. AAD and DMF contributed to the study by critically revising the manuscript. All the authors read and approved the final manuscript.

Acknowledgments

The authors are grateful to the Debre Markos University Quality Assurance Office for permission to use the data. This work was financially supported by Debre Markos University.

Supplementary Materials

R codes used for assessing scale validity (doc 354 KB) are included. (Supplementary Materials)