Scoring Divergent Thinking Tests by Computer With a Semantics-Based Algorithm

Divergent thinking (DT) tests are useful for the assessment of creative potential. This article reports the semantics-based algorithmic (SBA) method for assessing DT. This algorithm is fully automated: Examinees receive DT questions on a computer or mobile device and their ideas are immediately compared with norms and semantic networks. This investigation compared the scores generated by the SBA method with the traditional methods of scoring DT (i.e., fluency, originality, and flexibility). Data were collected from 250 examinees using the “Many Uses Test” of DT. The most important finding involved the flexibility scores from both scoring methods. This was critical because semantic networks are based on conceptual structures, and thus a high SBA score should be highly correlated with the traditional flexibility score from DT tests. Results confirmed this correlation (r = .74). This supports the use of algorithmic scoring of DT. The nearly immediate computation time required by the SBA method may make it the method of choice, especially when it comes to moderate- and large-scale DT assessment investigations. Correlations between SBA scores and GPA were non-significant, providing evidence of the discriminant and construct validity of SBA scores. Limitations of the present study and directions for future research are offered.

would be especially clear when there are large samples of examinees and when the DT testing is online (e.g., mTurk). The present study is the first to compare a large SBA with the traditional method for scoring DT.

Method

Participants and Data Collection
The sample consisted of 250 participants (107 female, 141 male, 1 preferred not to disclose, 1 did not answer). The mean age of participants was 33.65 years (SD = 10.97), and the average grade point average (GPA) was 3.36 on a 4-point scale (SD = 0.47). A computerized online system (http://mturk.com) was employed to randomly select participants, who were then prompted to complete the provided tasks and the survey. Participants were paid $1.50. Only participants with English as a primary language were accepted.

Task Description and Testing Procedure
The Many Uses test was used to assess divergent thinking (www.creativitytestingservices). This is very much like the Alternate Uses test of Guilford (1968) and the Uses test from Wallach and Kogan (1965). It contained three items (i.e., "toothbrush," "tire," and "spoon"), which were presented one at a time. It was given to participants without any limitations on response time or output. Participants were told to type in as many ideas as they could and to take their time. Instructions were paraphrased from earlier research on DT. A computerized online system for creativity assessment called Creativity Index Testing (http://cit.sparcit.com) was employed to administer the DT test. After finishing the DT test, participants were given a survey asking about age, gender, and GPA.

DT Scoring
Responses to the three Many Uses items were scored using the semantics-based algorithmic (SBA) scoring method, described earlier, and the standard DT scoring method from the Runco Creativity Assessment Battery (rCAB; 2011). Descriptions of the scores generated by both methods are provided below.

Traditional DT Scores
The fluency score was computed as the number of answers given to each DT task. The standard DT flexibility score was computed following the method of Runco (1985): Each idea was assigned to an a priori conceptual category (one set of categories for each task) and the score was calculated as the number of categories used by the individual. This method has demonstrated good inter-rater and inter-item reliability in numerous investigations (see Runco, 2013). The traditional DT originality score was computed from the statistical infrequency of an answer within the pool of answers. If an idea was unique, it received 100 points. If an idea was given three times, it received 100/3 = 33.3 points. If an idea was given a hundred times, it received 100/100 = 1 point.
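The three traditional scores described above can be illustrated with a minimal sketch. The function names, the example answers, and the category labels are hypothetical and are not taken from the rCAB; the sketch also assumes that identical ideas have already been normalized to identical strings. How idea-level points are aggregated into a participant-level originality score is not stated here, so no aggregation is shown.

```python
from collections import Counter


def fluency(ideas):
    """Number of answers a participant gave to one task."""
    return len(ideas)


def flexibility(ideas, category_of):
    """Number of distinct a priori categories the participant used.

    `category_of` is a hypothetical lookup from idea to category label.
    """
    return len({category_of[idea] for idea in ideas})


def originality_points(pool):
    """Map each idea to 100 / (number of times it occurs in the pool)."""
    counts = Counter(pool)
    return {idea: 100 / n for idea, n in counts.items()}


# Hypothetical pool of answers to the "tire" item from a whole sample.
pool = ["swing"] * 3 + ["planter"] + ["sandal sole"] * 100
points = originality_points(pool)

# Hypothetical category assignments for one participant's ideas.
category_of = {"swing": "play", "seesaw": "play", "planter": "garden"}
participant_ideas = ["swing", "seesaw", "planter"]
```

Here a unique idea ("planter") receives 100 points, an idea given three times receives about 33.3 points, and an idea given a hundred times receives 1 point, matching the examples in the text.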

SBA Fluency Score
The SBA fluency score was computed from the number of answers given by the individual. This coincides with the standardized DT fluency score. For example, if a participant gave five uses of a "tire," his or her SBA fluency score was five. As is typical, the quality of the ideas is ignored when scoring fluency. Fluency is just a measure of ideational productivity.

SBA Flexibility Score
The SBA flexibility score was computed in the following manner. For each item response, which may consist of one or more discrete ideas, the number of categories into which these answers fall is determined. Each response is analyzed in its entirety, but associations with particular discrete parts of each (e.g., single words) are recognized and processed. The number of flexibility categories is computed entirely by machine, which is why this is an algorithmic method. The semantic statistic for any pair of ideas is an algorithmic statistic that reflects the semantic similarity between two answers. This method has been used successfully by Acar and Runco (2014), though they relied on three semantic networks when calculating semantic statistics. The system used in the present research utilized 12 semantic networks.
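The general idea of counting categories from pairwise similarity can be sketched as follows. This is not the SBA implementation, whose details are not given here: the Jaccard overlap of hand-made feature sets stands in for the semantic statistic, the greedy grouping rule and the threshold are assumptions, and all names and data are hypothetical.

```python
def jaccard(a, b):
    """Toy similarity between two feature sets (stand-in for a semantic statistic)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def count_categories(ideas, features, threshold=0.5):
    """Greedy grouping: an idea joins an existing group if it resembles
    that group's representative at or above `threshold`; otherwise it
    starts a new group. Returns the number of groups found.
    """
    representatives = []
    for idea in ideas:
        similar_to_existing = any(
            jaccard(features[idea], features[rep]) >= threshold
            for rep in representatives
        )
        if not similar_to_existing:
            representatives.append(idea)
    return len(representatives)


# Hypothetical feature sets for four uses of a tire.
features = {
    "swing": {"play", "hang", "child"},
    "seesaw": {"play", "child", "balance"},
    "planter": {"garden", "soil", "grow"},
    "raised bed": {"garden", "soil", "grow", "border"},
}
ideas = ["swing", "seesaw", "planter", "raised bed"]
n_categories = count_categories(ideas, features)
```

With the default threshold the four ideas fall into two groups (play and garden). A stricter threshold separates "swing" from "seesaw," which shows how sensitive a category count of this kind can be to the similarity cutoff.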

SBA Item Originality Score
The SBA originality (SBAIRO) score was computed as an average of all semantic association statistics. These statistics capture how far apart ideas given by any one individual are in the semantic networks. The algorithmic originality score was adjusted by the idea association frequency rate (i.e., frequency of usage in Wikipedia [www.wikipedia.com, 2014]). This adjustment is much like traditional DT scoring where an idea is scored based on frequency of occurrence, which is why the SBA originality index represents a kind of originality. The SBAIRO score cannot be computed if the item is nonverbal because it requires a verbal starting point as well as a verbal response, and nonverbal tests do not have verbal starting points. They have figural or visual starting points. This is why the Many Uses test was chosen for the present research. It has very clear verbal starting points (i.e., the stimulus object, such as "toothbrush," "tire," or "spoon"). The Many Uses test (MUT) is also quite similar to the Alternate Uses test (AUT) and other uses tests from other DT batteries, including those of Guilford (1968) and Torrance (1995). The MUT differs from the AUT only in the objects named in the instructions. Here tire, toothbrush, and spoon were used instead of Guilford's (1968) "list uses for a brick" or Wallach and Kogan's (1965) "list uses for a shoe." Thus very similar results would be expected from the AUT and the MUT, though of course that is an empirical question (for future research).
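One possible reading of "an average of all semantic association statistics" is the mean pairwise distance among the ideas a single participant gave. The sketch below illustrates only that reading. The distance values are invented, the function names are hypothetical, and the Wikipedia-based frequency adjustment is omitted because its formula is not specified here.

```python
from itertools import combinations
from statistics import mean


def mean_pairwise_distance(ideas, distance):
    """Average semantic distance over all unordered pairs of ideas."""
    pairs = list(combinations(ideas, 2))
    if not pairs:
        return 0.0  # fewer than two ideas: no pair to compare
    return mean(distance(a, b) for a, b in pairs)


# Hypothetical distances (0 = identical meaning, 1 = unrelated).
TOY_DISTANCE = {
    frozenset(["swing", "planter"]): 0.75,
    frozenset(["swing", "boat fender"]): 0.5,
    frozenset(["planter", "boat fender"]): 0.25,
}


def toy_distance(a, b):
    """Look up the invented distance for an unordered pair of ideas."""
    return TOY_DISTANCE[frozenset([a, b])]


score = mean_pairwise_distance(["swing", "planter", "boat fender"], toy_distance)
```

In a real system the lookup table would be replaced by a statistic derived from semantic networks, and the result would then be adjusted for how frequently each association occurs.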

Analyses
First, the reliability of all computed scores was analyzed. Next, associations between SBA and traditional DT scoring methods were analyzed with correlational methods. Then, analyses were conducted to understand correlations of SBA scores with GPA score and response times.

Reliability
Descriptive statistics for all scores are shown in Table 1. Standardized DT and SBA scores yielded satisfactory reliability alpha coefficients, as shown in Table 2. Alphas of SBA scores were slightly lower than those of standardized DT scores, but still at adequate levels. As more data are processed over time, the underlying semantic statistics of the SBA method are expected to become more robust, which should improve the reliability of the SBA scores.
The standardized DT originality score was less reliable than the other scores, perhaps reflecting its sensitivity to sample size. Indeed, this score is based on infrequency (a participant's performance relative to others in the given sample); hence it can vary greatly depending on the composition of the group. Still, the alphas indicate an acceptable level of inter-item reliability, even for originality. Also, the use of such "local norms" (that is, only comparing ideas to others who took exactly the same tests relying on exactly the same procedure) has many advantages, the most important of which is that ideas from one participant are not compared to much older responses from very different normative samples. The use of local norms is very common in DT testing (Runco, 1991, 2013; Wallach & Kogan, 1965).
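For readers unfamiliar with the alpha coefficient, the following minimal sketch computes Cronbach's alpha across the three Many Uses items using the standard formula, alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores). The scores below are invented for illustration and are not the data summarized in Table 2.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for k items.

    `item_scores` is a list of k lists, each holding one score per
    participant, with participants in the same order in every list.
    """
    k = len(item_scores)
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical fluency scores of five participants on the three items.
toothbrush = [4, 6, 3, 8, 5]
tire = [5, 7, 2, 9, 5]
spoon = [3, 6, 4, 7, 6]
alpha = cronbach_alpha([toothbrush, tire, spoon])
```

Alpha approaches 1 when participants who score high on one item also score high on the others, which is the sense of inter-item reliability used in this section.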

SBA Scores vs DT Scores
The Pearson product-moment correlation coefficients between the SBA scores and standardized DT scores are provided in Table 3. The first important observation from Table 3 is a significant correlation between fluency and both the traditional DT flexibility and SBA flexibility scores (rs = .79, .86, respectively, ps < .0001). Indeed, the more a person ideates, the more likely he or she will diverge and find new angles, thus increasing the diversity of generated ideas. Turning to the hypotheses, the analyses confirmed that the SBA flexibility score was correlated with the DT flexibility score at r = .74 (p < .0001), providing strong evidence for the concurrent validity of the SBA flexibility score. This result provides the confidence needed to use SBA flexibility scores instead of standardized DT flexibility scores, specifically in mid- and large-scale studies, where the manual computation of DT flexibility scores is not feasible.
Also interesting was that the other SBA scores showed small, but noticeable correlations (rs = .33 and .36, respectively, ps < .0001) with the traditional DT originality score. One explanation of this is the statistical nature of all three scores. It can be hypothesized that if one had an opportunity to collect an infinitely large number of responses, the DT originality score would have converged to account for a significant portion of information carried by semantic association characteristics. This presents a very interesting direction for future research, but it is beyond the scope of the present investigation.
The SBA item-response originality score demonstrated an independent nature, as the only noticeable correlation it had was with DT originality score (r = .36, p < .0001), which was already discussed. The correlation between SBA flexibility and SBA originality scores was non-significant (r = -.01), which indicates that the two SBA scores seem to measure two independent aspects of divergent thinking.
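The coefficients reported in this section are Pearson product-moment correlations. A minimal sketch of that computation is given below; the paired scores are invented for illustration and do not reproduce Table 3.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical flexibility scores for five participants.
traditional_flexibility = [3, 5, 2, 6, 4]
sba_flexibility = [4, 5, 3, 7, 4]
r = pearson_r(traditional_flexibility, sba_flexibility)
```

A value near 1 indicates that the two scoring methods rank participants in nearly the same order, which is the pattern the flexibility comparison was designed to test.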

SBA Scores vs GPA
The product-moment correlation between SBA flexibility and GPA scores for all participants was non-significant (r = .20). However, it was observed that for participants with a GPA score less than 3.0, the correlation increased to r = .26 (p = .001). All other sufficiently large groupings failed to uncover any significant correlations. The correlation between SBA originality and GPA scores was also non-significant (r = -.06), and no sufficiently large groupings (by GPA level) yielded significant correlations. The correlations of GPA with traditional DT flexibility and DT originality were r = .08 and r = -.06, respectively, both non-significant.

SBA Scores vs Response Time
The product-moment correlation between SBA flexibility scores and the amount of time each participant spent completing the DT test (which consisted of three Many Uses items) was significant (r = .46, p < .0001). However, further analysis revealed that for participants who spent at least 178 seconds completing the DT test, the correlation diminished (r = .24) and remained non-significant for other time ranges above that threshold. For SBA originality scores, the correlation with the corresponding response times was non-significant (r = .07), and remained non-significant for all sufficiently large time ranges, suggesting that the originality of the responses was independent of response time. [...] is required to produce ideas on this particular DT test. Quite a bit of previous research has used the same method as used here, with GPA used as an estimate of general knowledge (e.g., Runco & Smith, 1992). Earlier research has also shown that online testing, such as that used here, provides results that are quite similar to face-to-face testing (Hass, 2015).
The moderate correlations among DT indices are noteworthy given the debate over the use of fluency alone when assessing creative potential. Various investigations have found high correlations between fluency and originality and flexibility, and one interpretation was that this could be used to justify relying on fluency alone. The correlations of the present investigation support the use of a DT profile rather than fluency alone. This is especially true of the fluency-originality correlations, which were quite low (.3 and -.04). Admittedly the suggestion of relying on fluency alone was already questionable, given (a) theories of DT, which include various dimensions and not just fluency (Guilford, 1968; Torrance, 1995), (b) psychometric evidence that the variance attributable to originality or flexibility remains reliable even when the overlap with fluency is covaried (Runco & Albert, 1985), (c) additional data showing that explicit instructions can alter originality and flexibility without changing fluency (Runco & Okuda, 1991), and (d) theories of creativity which emphasize originality (and not fluency).
The correlation between SBA flexibility and SBA originality scores was non-significant (r = -.01), which indicates that the two SBA scores seem to measure two independent aspects of divergent thinking. That might come as a surprise, at least given that earlier investigations often find them more highly correlated. Then again, theory suggests that they should be relatively independent, as uncovered here. The fact that the present findings are entirely in line with creativity theory is laudable, even if the flexibility-originality correlation reported here was different (and better) than that reported in previous research.
Interestingly, the two flexibility estimates, one from the computer scoring system and one from the traditional scoring system, were in good agreement (i.e., correlated), but the two originality estimates (computer and traditional) were not related very much at all. This is actually what was expected. That was because both the computer and traditional scoring of flexibility rely on semantic categories, so they certainly should be correlated. Even the traditional system for flexibility uses semantic categories, though the assignment of ideas to particular categories is a human decision, while in the case of the algorithmic system the determination of categories is computerized (i.e., based on comparisons with semantic networks). The good agreement of computer and traditional flexibility was thus expected because both systems rely on semantic categories. The computer and traditional originality indices, on the other hand, were each based on different norms and different logic. The computer originality score was based on semantic distance. This is really all a computer can do: compare ideas given by examinees with existing data, and in particular with semantic networks (e.g., WordNet or Word Association Network). The traditional originality score, on the other hand, does not compare ideas given by examinees with any existing norms. It compares ideas given by examinees with ideas given by other examinees. Thus the low agreement between the computer and the traditional originality indices is not at all surprising.
It does leave us with a question, namely, which is the better originality score: computerized/algorithmic or traditional/human? Very likely, the traditional one is the better choice. That is because it is tied to theories of creativity (Guilford, 1968; Runco & Acar, 2012; Torrance, 1995) where originality is defined as thinking in a novel or unique fashion (i.e., unlike other people). That being said, the ideal may be to use both originality scores, though it might be a good idea to interpret the computer originality index as something like semantic distance. The semantic distance index may not replace the traditional originality score, but it may provide useful additional information. And like the computer flexibility score, the computer "originality score" (i.e., semantic distance) is cost-efficient. Scoring requires little or no human time to be invested.
There is a larger issue, in addition to the cost-efficiency of scoring methods. This larger issue arose when questions were directed at the traditional originality score, the crux being that originality scores are "sample dependent" because points are given by comparing one person's ideas only with other examinees who took the test(s) at the same time. This objection is easily answered. In actuality, the so-called sample dependency of traditional originality is a virtue or strength rather than a problem. That is because divergent thinking tests have only moderate generalizability (Runco, Abdulla, & Paek, in press) and they are much more sensitive to testing conditions than academic and intelligence tests (Runco, 2013; Wallach & Kogan, 1965). For these reasons it is not fair to compare one person's ideas to people who took different DT tests or people who took them under different testing conditions.
The only fair comparison, for calculating uniqueness and originality, is between one examinee and everyone else in the same sample (i.e., the others who took the test under the same conditions). In short, there are advantages to using "local norms" for calculating the traditional originality scores. Originality could be based on judgments instead of objective novelty, but differences among judges (e.g., Runco, McCarthy, & Svensen, 1994; [...] rigidity and conceptual ruts that so often derail problem solving. All of this makes the findings presented here about flexibility that much more important.
The correlations of response time to traditional DT flexibility and DT originality scores were mildly disconcerting. Then again, even the largest of these indicated that approximately 20% of the variance was explained by time on task. This is not unreasonable if, as previous research suggested, time on task represents the contribution of intrinsic motivation (Plucker, Runco, & Lim, 2006). The logic here is that, given a choice, an examinee will invest more time only when interested in the tasks at hand. Still, additional research should investigate the relationship of time with DT. The present results suggested that there might be a curvilinear relationship, which in turn implies that there is an optimal amount of time for DT. Research on this question would be very practical, given that it would indicate whether there is an optimal amount of time for testing. Other practical implications include cost-efficiency. As noted above, the algorithmic method using semantic networks is highly cost-efficient in that it is immediate and requires no manual computation. It appears that it will lead to decisions similar to those reached with a traditional scoring method. The present results were based on one moderately sized sample and one test of DT, but the results are encouraging and indicate that additional research is warranted.
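The "variance explained" figure mentioned in this paragraph follows from squaring a correlation coefficient. The short sketch below shows that arithmetic in both directions; the function names are hypothetical.

```python
from math import sqrt


def variance_explained(r):
    """Proportion of variance shared by two variables correlated at r."""
    return r ** 2


def r_for_variance(share):
    """Magnitude of the correlation that corresponds to a shared-variance proportion."""
    return sqrt(share)


# The r = .46 reported above for SBA flexibility and response time.
shared = variance_explained(0.46)

# The correlation magnitude that corresponds to 20% shared variance.
r_at_20_percent = r_for_variance(0.20)
```

A correlation of .46 corresponds to about 21% shared variance, and 20% shared variance corresponds to a correlation of roughly .45, so about four fifths of the variance in such scores is left unexplained by time on task.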

Funding
The authors have no funding to report.

Competing Interests
Author KB is co-founder and Chief Technology Officer of Sparcit, LLC. Sparcit developed the semantic-based algorithm used in this research. Author MAR is owner of Creativity Testing Services, LLC, which publishes the Many Uses test which was examined in this research.