Enhancing user creativity: Semantic measures for idea generation

Human creativity generates novel ideas to solve real-world problems. This grants us the power to transform the surrounding world and extend our human attributes beyond what is currently possible. Creative ideas are not just new and unexpected, but are also successful in providing solutions that are useful, efficient and valuable. Thus, creativity optimizes the use of available resources and increases wealth. The origin of human creativity, however, is poorly understood, and semantic measures that could predict the success of generated ideas are currently unknown. Here, we analyze a dataset of design problem-solving conversations in real-world settings by using 49 semantic measures based on WordNet 3.1 and demonstrate that a divergence of semantic similarity, an increased information content, and a decreased polysemy predict the success of generated ideas. The first feedback from clients also enhances information content and leads to a divergence of successful ideas in creative problem solving. These results advance cognitive science by identifying real-world processes in human problem solving that are relevant to the success of produced solutions and provide tools for real-time monitoring of problem solving, student training and skill acquisition. A selected subset of information content (IC Sánchez-Batet) and semantic similarity (Lin/Sánchez-Batet) measures, which are both statistically powerful and computationally fast, could support the development of technologies for computer-assisted enhancements of human creativity or for the implementation of creativity in machines endowed with general artificial intelligence.


Introduction
Creativity is the intellectual ability to create, invent, and discover, which brings novel relations, entities, and/or unexpected solutions into existence [1]. Creative thinking involves cognition (the mental act of acquiring knowledge and understanding through thought, experience, and senses), production, and evaluation [2]. We first become aware of the problems with which we are confronted, then produce solutions to those problems, and finally evaluate how good our solutions are. Each act of creation involves all three processes: cognition, production, and evaluation [2]. According to J. P. Guilford, who first introduced the terms convergence and divergence in the context of creative thinking, productive thinking can be divided into convergent and divergent thinking; the former generates one correct answer, whereas the latter goes off in different directions without producing a unique answer [2]. Although currently there is no general consensus on the definition of convergent and divergent thinking, modern theories of creativity tend to have the following perspectives. Convergent thinking is regarded as analytical and conducive to disregarding causal relationships between items already thought to be related, whereas divergent thinking is viewed as associative and conducive to unearthing similarities or correlations between items that were not thought to be related previously [3][4][5].
Both convergent and divergent thinking are used to model the structure of intellect [6]. With regard to the nature of intelligence and originality, two general problem-solving behaviors were identified, those of the converger and those of the diverger, who exhibit convergent and divergent styles of reasoning/thinking, respectively [7]. The distinction between convergent and divergent thinkers is based on scoring high on closed-ended intelligence tests versus scoring high on open-ended tests of word meanings or object uses [7]. The converger/diverger distinction also applies in cognitive styles and learning strategies [8]. Dual-processing accounts of human thinking see convergent and divergent styles as reflective/analytic and reflexive/intuitive, respectively [9], which is in line with current theories of creative cognition involving generation and exploration phases [10]. The convergent thinking style is assumed to induce a systematic, focused processing mode, whereas divergent thinking is suspected to induce a holistic, flexible task processing mode [11].
Psychological accounts that consider convergent and divergent production as separate and independent dimensions of human cognitive ability allow one to think of creative problem solvers as divergers rather than convergers [12], and to associate creativity with divergent thought that combines distant concepts together [13]. Focusing only on either convergent or divergent thinking, however, may inhibit the full understanding of creativity [14]. Viewing convergent production as a rational and logical process, and divergent production as an intuitive and imaginative process, creates the danger of oversimplification and confusion between intelligence and creativity. Instead, it should be recognized that there are parallel aspects or lines of thought that come together toward the end of the design process, making the design a matter of integration [14]. Since convergent and divergent thinking frequently occur together in a total act of problem solving [2], creativity may demand not only divergent thinking, but also convergent thinking [15,16]. For example, deliberate techniques to activate human imagination rely on the elimination of criticism in favor of the divergent generation of a higher number of ideas. The process of deferred judgment in problem solving defers the evaluation of ideas and options until a maximum number of ideas are produced, thereby separating divergent thinking from subsequent convergent thinking [17]. This sequence of divergent and convergent thinking is classified as ideation-evaluation, where ideation refers to nonjudgmental imaginative thinking and evaluation to an application of judgment to the generated options during ideation [17]. Such accounts of creativity treat divergence and convergence as subsequent and iterated processes [18], particularly in that order. More recent accounts of creativity, however, highlight the interwoven role of both convergent and divergent thinking [15,19,20].
This interweaving has been identified in two ways. The analytic approach to creative problem solving based on linkography showed that convergent and divergent thinking are so frequent at the cognitive scale that they occur concurrently in the ideation phase of creative design [15] . The computational approach demonstrated that a computer program (comRAT-C), which uses consecutive divergence and convergence, generates results on a common creativity test comparable to the results obtained with humans [20] . Hence, the creative problem solver or designer may need to learn, articulate, and use both convergent and divergent skills in equal proportions [14] .
The concurrent occurrence of convergent and divergent thinking in creative problem solving raises several important questions. Is it possible to evaluate convergence and divergence in problem-solving conversations in an objective manner? How do convergence and divergence relate to different participants in a problem-solving activity? Are there particular moments in the process of real-world problem solving where a definitive change from convergence to divergence, or vice versa, occurs? How do convergence and divergence relate to the success of different ideas that are generated and developed in the process of problem solving? Could semantic measures predict the future success of generated ideas, and can they be reverse-engineered to steer generated ideas toward success in technological applications, such as in computer-assisted enhancements of human creativity or implementations of creativity in machines endowed with artificial intelligence?
We hypothesized that (i) semantic measures can be used to evaluate convergence and divergence in creative thinking, (ii) changes in convergence/divergence can be detected in regard to different features of the problem-solving process, including participant roles, successfulness of ideas, first feedback from client, or first evaluation by client or instructor, and (iii) semantic measures can be identified whose dynamics reliably predict the success or failure of generated ideas. To test our hypotheses we analyzed the transcripts of design review conversations recorded in real-world educational settings at Purdue University, West Lafayette, Indiana, in 2013 [21]. The conversations between design students, instructors, and real clients, with regard to a given design task, consisted of up to 5 sessions (Table 1) that included the generation of ideas by the student, external feedback from the client, first evaluation by the client or instructor, and evaluation of the ideas by the client. The problem-solving conversations were analyzed in terms of participant role, successfulness of ideas, first feedback from client, or first evaluation by client or instructor using the average values of 49 semantic measures quantifying the level of abstraction (1 measure), polysemy (1 measure) or information content (7 measures) of each noun, or the semantic similarity (40 measures) between any two nouns in the constructed semantic networks based on WordNet 3.1 [22].

Design review conversations
Real-world conversations are an outstanding source to gain insights into the constructs of problem solving and decision making. To study human reasoning and problem solving, we focused on design review conversation sessions in real-world educational settings. The conversation sessions were between students and experienced instructors, and each session was used to teach and assess the student's reasoning and problem solving with regard to a given design task for a real client. The experimental dataset of design review conversations employed in this study was provided as a part of the 10th Design Thinking Research Symposium [21] . Here, we analyzed two subsets, with participants (students) majoring in Industrial Design: Junior (J): 1 instructor, 7 students (indicated with J1-J7), and 10 other stakeholders (4 clients and 6 experts) and Graduate (G): 1 instructor, 6 students (indicated with G1-G6), and 6 other stakeholders (2 clients and 4 other students).
The experimental dataset included data collected either from the same students and teams over time (although not always possible) or from multiple students and teams [21]. In addition, efforts were made to be gender inclusive. All data were collected in situ in natural environments rather than controlled environments. In some cases, the design reviews were conducted in environments well insulated from disruptive noises, surrounding activities, and lighting changes; in other cases, these conditions were not possible to achieve. When disruptions occurred, most were less than a minute in duration. Because English was a second language for a number of the participants, there were some light accents in the digital recordings [21]. The purpose of the conversations was for the instructor to notice both promising and problematic aspects in the student work and to help the students deal with possible challenges encountered [21]. At the end of these conversations, the students developed a solution (design concept for a product or service) that answered the problem posed in the task given initially.
Computational quantification of the results was based on the digital recordings and the corresponding written transcripts of the conversations. Because our main focus was on studying ideas in creative problem solving, we explicitly defined the term idea as a formulated creative solution (product concept) to the given design problem (including product name, drawings of the product, principle of action, target group, etc.) [23,24]. As an example, on the graduate project "Outside the Laundry Room," some of the generated ideas were "Laundry Rocker," "Clothes Cube," "Drying Rack," "Tree Breeze," "Washer Bicycle," etc. Our criterion for a minimal conversation was a conversation containing at least 15 nouns. Since on average 13.4% of the words in the conversation were nouns, an average minimal conversation contained ≈ 110 words. The reported results were per student and solution (idea).
Table 1. Students and design review conversations in the Industrial Design Junior (J) and Graduate (G) subsets. Division of conversations (C1-C5) for comparative analyses (1-4) into groups is indicated as follows: 1a, student; 1b, instructor; 2a, successful; 2b, unsuccessful; 3a, before first feedback; 3b, after first feedback; 4a, before first evaluation; 4b, after first evaluation. For empty cells no video data or transcripts were provided in the dataset.

Comparison between student thinking and instructor thinking
On the basis of the participant roles, the speech in the conversations was divided into speech by students or speech by instructors. Instructors were defined as those giving feedback or critique: not only persons directly appointed as instructors in the particular setting, but also clients, other students who sometimes acted or criticized as instructors, and other stakeholders present at the intermediate or final meetings. If there were several instructors in a conversation, their speech was pooled. For this comparison, the J and G subsets contained 7 and 6 subject cases, respectively, for a total of 13 cases. For students and for instructors, 39 conversation transcripts each were analyzed (Table 1).

Comparison between successful ideas and unsuccessful ideas
Conversations were divided into 2 groups: those related to unsuccessful ideas and those related to successful ideas. Unsuccessful ideas were those that had not been developed to the end or had been disregarded in the problem-solving process, whereas successful ideas were those that had been developed to the end. The final evaluation of successful ideas was performed by the clients. For each student, only one of the generated ideas was the successful one. The same conversation was divided into a part or parts that concerned one or more unsuccessful ideas, and a part that concerned the successful idea. These divisions were made at sentence breaks. When two ideas were compared in one sentence, the sentence was assigned to the idea that was the main subject of the comparison. In rare cases, if the main idea could not be identified, the sentence was not included in the analysis. The division of the text in the conversation transcripts between different ideas was assisted by the available slides in the dataset containing drawings of the generated ideas (product concepts), product names, principle of action, etc. For this comparison, the J and G subsets contained 7 and 5 subject cases, respectively, for a total of 12 cases. One case in the G subset was omitted because of missing data (slides with design sketches for client review) pertaining to unsuccessful ideas. For the 12 subject cases, the J subset contained conversations pertaining to 22 unsuccessful and 7 successful ideas; the G subset contained conversations pertaining to 19 unsuccessful and 5 successful ideas. In total, conversations pertaining to 41 unsuccessful ideas and 12 successful ideas were analyzed (Table 1).

Comparison of ideas before and after first feedback
Conversations were divided into 2 groups: containing ideas before and after first feedback. The division was based on a predefined point, which was the first feedback from the client (a stakeholder that was not a student or appointed as an instructor). For this comparison, the J and G subsets contained 7 and 5 subject cases, respectively, for a total of 12 cases. One case in the G subset was omitted due to missing data for ideas after the first feedback. For the 12 subject cases, the before first feedback group contained 25 conversation transcripts, whereas the after first feedback group contained 24 conversation transcripts ( Table 1 ). The effect of first feedback on the time dynamics of successful ideas was assessed on 7 successful ideas (G4, G5, G6, J3, J5, J6, and J7) that had sufficiently long conversations to allow for the division into 6 time points comprising 2 sets of 3 time points before and after the first feedback.

Comparison of ideas before and after first evaluation
Conversations were divided into 2 groups: containing ideas before and after the first evaluation. The division was based on a predefined point, which was the first evaluation performed by the instructor (for the J subset) or the client (for the G subset). At the time of first evaluation, some of the generated ideas were discarded as unsuccessful. Those ideas that passed the first evaluation were developed further, mainly with a focus on details rather than on change of the main characteristics. In the G subset, two or more ideas passed the first evaluation, whereas in the J subset, only the successful idea passed the first evaluation. For this comparison, the J and G subsets contained 6 and 5 subject cases, respectively, for a total of 11 cases. One case of the J subset and one case of the G subset were omitted because of missing data for ideas after the first evaluation. For the 11 subject cases, the before first evaluation group contained 22 conversation transcripts, whereas the after first evaluation group contained 13 conversation transcripts ( Table 1 ). The effect of first evaluation on the time dynamics of successful ideas was assessed on 8 successful ideas (G4, G5, G6, J1, J3, J4, J6, and J7) that had sufficiently long conversations to allow for the division into 6 time points comprising 2 sets of 3 time points before and after the first evaluation.

Modeling with semantic networks
In psychology, semantic networks depict human memory as an associative system wherein each concept can lead to many other relevant concepts [25] . In artificial intelligence, the semantic networks are computational structures that represent meaning in a simplified way within a certain region of conceptual space. The semantic networks consist of nodes and links. Each node stands for a specific concept, and each link, whereby one concept is accessed from another, indicates a type of semantic connection [25] . Semantic networks can be used to computationally model conceptual associations and structures [26,27] . In this study, to construct semantic networks of nouns used in the conversations, we first cleaned the transcripts of the conversations for any indications of non-verbal expressions, such as "[Laughter]," speaker names and all the time stamps. As a second step, we processed the textual data using part-of-speech tagging performed by the Natural Language Toolkit (NLTK) [28] with the TextBlob library [29] . Then, we extracted only the nouns, both singular and plural. With the use of Python scripts, we processed all the nouns by converting the plural forms to singular and by removing nouns that were not listed in WordNet. In total, only 8 nouns were removed, which comprised 0.2% of all nouns that were analyzed. Finally, we analyzed the constructed semantic networks using WordNet 3.1.
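The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the scripts used in the study: the transcript layout (`[hh:mm:ss] Speaker: utterance`), the hand-written part-of-speech tags, the `singular_of` lookup, and the small `vocabulary` set standing in for WordNet are all assumptions made for the example. Restricting the extraction to the Penn Treebank tags `NN` and `NNS` is likewise an assumption about what "nouns, both singular and plural" covers.

```python
import re

# Hypothetical transcript layout: "[hh:mm:ss] Speaker: utterance [Laughter]".
BRACKETED = re.compile(r"\[[^\]]*\]")  # time stamps and non-verbal notes
SPEAKER = re.compile(r"^[ \t]*[A-Z][A-Za-z0-9 ]{0,30}:[ \t]*", re.MULTILINE)


def clean_transcript(text):
    """Strip bracketed annotations and leading speaker labels."""
    text = BRACKETED.sub(" ", text)
    text = SPEAKER.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_nouns(tagged_tokens, singular_of, vocabulary):
    """Keep NN/NNS tokens, map plurals to singular, drop unlisted nouns.

    tagged_tokens: (word, Penn Treebank tag) pairs from any POS tagger.
    singular_of:   hypothetical plural -> singular lookup table.
    vocabulary:    set of nouns standing in for the WordNet noun list.
    """
    nouns = []
    for word, tag in tagged_tokens:
        if tag not in ("NN", "NNS"):
            continue
        noun = word.lower()
        if tag == "NNS":
            noun = singular_of.get(noun, noun)
        if noun in vocabulary:
            nouns.append(noun)
    return nouns


raw = ("[00:01:15] Student: I like the drying racks. [Laughter]\n"
       "[00:01:20] Instructor: The cube folds.")
cleaned = clean_transcript(raw)

# Tags written by hand here, in the form a Penn Treebank tagger would return.
tagged = [("I", "PRP"), ("like", "VBP"), ("the", "DT"), ("drying", "VBG"),
          ("racks", "NNS"), ("The", "DT"), ("cube", "NN"), ("folds", "VBZ")]
nouns = extract_nouns(tagged, {"racks": "rack"}, {"rack", "cube"})
print(cleaned)
print(nouns)
```

In practice the tagged tokens would come from the part-of-speech tagger rather than being typed in, and the vocabulary check would query WordNet itself.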

Analysis of time dynamics of semantic measures
For graph analysis, we used Wolfram Mathematica, a mathematical symbolic computation program developed by Wolfram Research (Champaign, Illinois). The average level of abstraction, polysemy, information content and semantic similarity in the semantic network were computed using WordGraph 3.1, a toolset that implements the WordNet 3.1 is-a hierarchy of nouns as a directed acyclic graph, allowing for efficient computation of various graph-theoretic measures in Wolfram Mathematica. The is-a relationship between noun synsets (sets of synonyms) organizes WordNet 3.1 into a hierarchical structure wherein if synset A is a kind of synset B, then A is the hyponym of B, and B is the hypernym of A. As an example, the synset {cognition, knowledge, noesis} is a kind of {psychological_feature}.
The level of abstraction is negatively related to the depth of the noun in the taxonomy in a way that the root noun "entity" is the most abstract, whereas the deepest nouns in the taxonomy are least abstract [30] . The complement of the level of abstraction to unity is a measure of word concreteness.
The polysemy counts the number of meanings of each word, and its log-transformed value measures the bits of missing information that are needed by the listener to correctly understand the intended meaning of a given word.
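The log transform of polysemy can be illustrated with a short Python sketch. Base 2 is used because the text speaks of bits. The meaning counts below are invented for the example and are not WordNet values, and averaging the log-transformed values over the nouns is an assumption about how the per-conversation figure is formed.

```python
import math


def polysemy_bits(n_meanings):
    """Bits a listener is missing when a word has n_meanings senses."""
    if n_meanings < 1:
        raise ValueError("a listed word has at least one meaning")
    return math.log2(n_meanings)


def average_polysemy_bits(meaning_counts):
    """Mean of the log-transformed polysemy over a collection of nouns."""
    counts = list(meaning_counts)
    return sum(polysemy_bits(n) for n in counts) / len(counts)


# Hypothetical meaning counts, chosen only to make the arithmetic visible.
counts = {"rack": 8, "laundry": 4, "cube": 2}
bits = {word: polysemy_bits(n) for word, n in counts.items()}
avg_bits = average_polysemy_bits(counts.values())
print(bits, avg_bits)
```

A word with a single meaning contributes 0 bits, so a drop in average polysemy over time means the speakers are moving toward less ambiguous nouns.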
The semantic similarity of pairs of nouns was calculated using five path-based similarity formulas by Al-Mubaid-Nguyen [38] , Leacock-Chodorow [39] , Li et al. [40] , Rada et al. [41] , or Wu-Palmer [42] and five IC-based similarity formulas by Jiang-Conrath [43] , Lin [44] , Meng et al. [45] , Resnik [46] , or Zhou et al. [47] , each of which could be combined with any of the seven IC formulas, thereby generating 35 IC-based similarity measures. Because WordNet 3.1 as a database is much richer than a mathematical graph, we created and employed WordGraph 3.1, a custom toolset for Wolfram Mathematica that allows for fast and efficient computation of all graph-theoretic measures related to the is-a hierarchy of nouns.
To test whether convergent or divergent thinking could be quantified through convergence or divergence of semantic similarity, we assessed the change of the average semantic similarity in time. Convergence in the semantic networks was defined as an increase in the average semantic similarity in time (positive slope of the trend line), whereas divergence was defined as a decrease in the average semantic similarity in time (negative slope of the trend line). To obtain 3 time points for analysis of time dynamics for each subject (Table 1), we joined the conversation transcripts pertaining to each group or idea and then divided the resulting conjoined conversations into 3 equal parts based on word count. This division was made at sentence boundaries in such a way that no time point of the conversation contained fewer than 5 nouns. Then, we assessed the time dynamics using linear trend lines. Because only nouns in the conversations were used for the construction of semantic networks, each time point had to contain at least 5 nouns to obtain a proper average semantic similarity.
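The trend-line criterion can be sketched as follows. This is an illustrative Python version rather than the Mathematica/SPSS workflow of the study: `average_similarity` assumes the average is taken over all unordered pairs of nouns at a time point, and the toy similarity table is invented. The slope is the ordinary least-squares slope k of y = kt + b at t = 1, 2, 3.

```python
from itertools import combinations


def average_similarity(nouns, sim):
    """Mean of sim(a, b) over all unordered pairs of nouns (assumed pairing)."""
    pairs = list(combinations(nouns, 2))
    if not pairs:
        raise ValueError("need at least two nouns")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def trend_slope(values):
    """Least-squares slope k of y = k*t + b for t = 1, 2, ..., len(values)."""
    n = len(values)
    times = range(1, n + 1)
    t_mean = sum(times) / n
    y_mean = sum(values) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, values))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


def classify(values):
    """Positive slope: convergence; negative slope: divergence."""
    k = trend_slope(values)
    if k > 0:
        return "convergence"
    if k < 0:
        return "divergence"
    return "no trend"


# Toy similarity table (hypothetical values, not WordNet-derived).
TABLE = {frozenset(("rack", "cube")): 0.6,
         frozenset(("rack", "tree")): 0.2,
         frozenset(("cube", "tree")): 0.1}
toy_avg = average_similarity(["rack", "cube", "tree"],
                             lambda a, b: TABLE[frozenset((a, b))])

print(classify([0.50, 0.45, 0.40]))  # average similarity falls over time
print(classify([0.30, 0.40, 0.60]))  # average similarity rises over time
```

With exactly 3 time points the least-squares slope reduces to half the difference between the last and the first value, so the middle time point affects the intercept but not the sign of the trend.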

Semantic measures based on WordNet 3.1
The calculation of semantic measures based on WordNet 3.1 (https://wordnet.princeton.edu/) was performed with the WordGraph 3.1 custom toolset for Wolfram Mathematica. The structure of WordGraph 3.1 is isomorphic to the is-a hierarchy of nouns in WordNet 3.1, implying that all mathematical expressions defined in WordGraph 3.1 also hold for WordNet 3.1. The nouns in WordGraph 3.1 were represented by 158,441 case-sensitive word vertices (including spelling variations, abbreviations, acronyms, and loanwords from other languages) and 82,192 meaning vertices, in which each word could have more than one meaning (polysemy) and each meaning could be expressed by more than one word (synset). WordGraph 3.1 consists of two subgraphs, subgraph M, which contains 84,505 hypernym → hyponym edges between meaning vertices, and subgraph W, which contains 189,555 word → meaning edges between word vertices and each of their meaning vertices.
Several graph-theoretic functions were used as follows:
Subvertices(G, x): the subvertices of a vertex x in a directed graph G are all vertices in G that have a finite directed path from x. Thus, every vertex is a subvertex of itself.
Subsumers(G, x): the subsumers of a vertex x in a directed graph G are all vertices in G that have a finite directed path to x. Thus, every vertex is a subsumer of itself.
Leaves(G, x): a leaf in a directed graph G is a vertex with a vertex out-degree of zero. In other words, the leaf does not have outgoing edges. The leaves of a vertex x in a directed graph G are all subvertices of x with a vertex out-degree of zero. Because every vertex is a subvertex of itself, it follows that the number of leaves of each leaf in G is 1.
ShortestPathDistance(G, x, y): the shortest path distance between a vertex x and a vertex y in a directed graph G is the minimal number of edges needed for a trip from x to y. The shortest path distance is infinite (∞) if there is no path from x to y. In general, the shortest path distance from x to y is not the same as the shortest path distance from y to x; these distances are equal in undirected (bidirectional) graphs.
Depth(G, x): the depth of a vertex x in a rooted directed graph G is 1 + the shortest path distance from the root vertex r to x. Thus, the depth of the root vertex is 1.
VertexEccentricity(G, x): the vertex eccentricity of a vertex x in a directed graph G is the length of the longest of all the shortest paths from the vertex x to every other vertex in the graph G.
MaxDepth(G): the maximal depth of a rooted directed graph G is 1 + the vertex eccentricity of the root vertex r.
IncidenceList(G, x): gives a list of all edges (incoming, outgoing, or undirected) incident to a vertex x in a graph G.
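The graph-theoretic definitions above can be made concrete on a toy taxonomy. The following Python sketch is an illustration only and is not WordGraph 3.1: the six-vertex graph and all function names are hypothetical, and the graph is stored as a plain adjacency dictionary with hypernym → hyponym edges.

```python
from collections import deque
import math

# Toy rooted directed acyclic graph (hypernym -> hyponym); NOT WordNet data.
EDGES = {
    "entity": ["object", "abstraction"],
    "object": ["artifact"],
    "abstraction": [],
    "artifact": ["rack", "cube"],
    "rack": [],
    "cube": [],
}


def reverse(graph):
    """Graph with the direction of every edge reversed."""
    rev = {v: [] for v in graph}
    for v, children in graph.items():
        for c in children:
            rev[c].append(v)
    return rev


def distances_from(graph, x):
    """Breadth-first search: edge counts of shortest directed paths from x."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        v = queue.popleft()
        for c in graph[v]:
            if c not in dist:
                dist[c] = dist[v] + 1
                queue.append(c)
    return dist


def subvertices(graph, x):
    """All vertices reachable from x, including x itself."""
    return set(distances_from(graph, x))


def subsumers(graph, x):
    """All vertices from which x is reachable, including x itself."""
    return set(distances_from(reverse(graph), x))


def leaves(graph, x):
    """Subvertices of x that have no outgoing edges."""
    return {v for v in subvertices(graph, x) if not graph[v]}


def shortest_path_distance(graph, x, y):
    """Minimal number of edges from x to y, or infinity if unreachable."""
    return distances_from(graph, x).get(y, math.inf)


def depth(graph, root, x):
    """1 + shortest path distance from the root, so the root has depth 1."""
    return 1 + shortest_path_distance(graph, root, x)


def max_depth(graph, root):
    """1 + eccentricity of the root over the vertices reachable from it."""
    return 1 + max(distances_from(graph, root).values())


print(sorted(subsumers(EDGES, "rack")))
print(sorted(leaves(EDGES, "entity")))
print(depth(EDGES, "entity", "rack"), max_depth(EDGES, "entity"))
```

Computing subsumers as the subvertices of the reversed graph mirrors the role of the edge-reversal operator: the same search routine serves both directions.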
With the use of the above graph-theoretic functions, semantic functions were constructed that take words as arguments and return values that depend only on the relationship between the word arguments and the meanings subgraph M ( Fig. 1 ). Two graph operators were used: R ( G ) reverses the direction of all directed edges in the graph G , and U ( G ) converts all directed edges in the graph G into undirected (bidirectional) edges.
| f ( x )|: gives the number of elements contained by the list f ( x ).
Polysemy(x) = |IncidenceList(W, x)|: gives the number of all the meaning vertices that are 1 edge away from a given word x (Fig. 1(A)).
Depth(x): gives the shortest path distance between the root meaning vertex corresponding to the word "entity" and a word x in the graph M ∪ IncidenceList[R(W), x] (Fig. 1(A, C)). Thus, the depth of the word "entity" is 1.
Subsumers(x): gives a list of the meaning subsumers of the word x in the graph […], excluding x itself since it is a word subsumer (Fig. 1(A)).
Subvertices(x): gives a list of the meaning subvertices of the word x in the graph M ∪ IncidenceList(W, x), excluding x itself since it is a word subvertex (Fig. 1(B)).
Leaves(x): gives a list of the leaves of the word x in the graph M ∪ IncidenceList(W, x) (Fig. 1(B)).
Commonness(x): the commonness of a word x in the graph G = […] (Fig. 1(B)).
LCS(x, y): for x ≠ y gives the lowest common subsumer of a word x and a word y in the graph […] (Fig. 1(C)). The lowest common subsumer is a meaning vertex with maximal depth in the taxonomy among all vertices z that minimize the sum […]. If there is a tie between two or more common subsumers of x and y, which are equally deep in the taxonomy, the uniqueness of LCS(x, y) is ensured by taking the meaning vertex with the lowest entry number in WordNet 3.1.
Depth[LCS(x, y)]: gives the shortest path distance between the root word "entity" and a meaning vertex LCS(x, y) in the graph M ∪ IncidenceList(W, "entity") (Fig. 1(C)).
Distance(x, y): for x ≠ y gives the shortest path distance between a word x and a word y in the graph […] minus 2 edges to subtract edge contribution outside of the meanings subgraph M (Fig. 1(D)).
For the calculation of the intrinsic information content of nouns, several constants specific to WordNet 3.1 were used:
Max_vertices: total number of meaning vertices is 82,192.
Max_leaves: total number of leaves is 65,031.
Max_depth: maximal depth of the taxonomy is 19.
Min_commonness: minimal commonness of the word "Saint Ambrose" is 1/35.
Max_commonness: maximal commonness of the root word "entity" is 6863.6.

Information content (IC) measures
The intrinsic information content (IC) of a word x in WordNet 3.1 was computed using seven different formulas:
IC by Blanchard et al. [31], normalized in the interval [0,1], is IC(x) = […]
IC by Meng et al. [32]: IC(x) = […]
IC by Sánchez et al. [33], normalized in the interval [0,1], is IC(x) = […]
[…]
IC by Yuan et al. [36]: IC(x) = […]
IC by Zhou et al. [37]: IC(x) = […]

Path-based similarity measures
The semantic similarity between a pair of words x and y such that x ≠ y was computed using five different path-based similarity formulas:
Al-Mubaid-Nguyen similarity [38], normalized in the interval [0,1], is […]
Leacock-Chodorow similarity [39], normalized in the interval [0,1], is […]
Li et al. similarity [40], normalized in the interval […]
[…]
Wu-Palmer similarity [42], normalized in the interval [0,1], is […]
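To give a feel for what a path-based measure computes, the following Python sketch evaluates two textbook quantities on a toy tree: the edge-counting path length that underlies Rada-style distance, and the Wu-Palmer ratio in its commonly published form 2·depth(LCS)/(depth(x) + depth(y)). These are assumed standard forms, not necessarily the normalized word-level variants used in this study, and the taxonomy and function names are hypothetical.

```python
# Toy tree given by parent pointers (NOT WordNet data); the root has depth 1.
PARENT = {
    "entity": None,
    "object": "entity",
    "abstraction": "entity",
    "artifact": "object",
    "rack": "artifact",
    "cube": "artifact",
}


def ancestors(x):
    """Path from x up to the root, starting with x itself."""
    path = []
    while x is not None:
        path.append(x)
        x = PARENT[x]
    return path


def depth(x):
    """Number of vertices on the path from the root to x."""
    return len(ancestors(x))


def lcs(x, y):
    """Deepest vertex that subsumes both x and y (unique in a tree)."""
    seen = set(ancestors(x))
    for v in ancestors(y):  # walks upward, so the first hit is the deepest
        if v in seen:
            return v
    raise ValueError("vertices are not in the same tree")


def path_distance(x, y):
    """Number of edges on the path between x and y through their LCS."""
    z = lcs(x, y)
    return (depth(x) - depth(z)) + (depth(y) - depth(z))


def wu_palmer(x, y):
    """Textbook Wu-Palmer ratio; equals 1 when x and y coincide."""
    return 2 * depth(lcs(x, y)) / (depth(x) + depth(y))


print(lcs("rack", "cube"), path_distance("rack", "cube"), wu_palmer("rack", "cube"))
print(lcs("rack", "abstraction"), path_distance("rack", "abstraction"))
```

Two siblings deep in the tree score higher than two vertices that only meet at the root, which is the behavior a path-based similarity is meant to capture. In a graph with multiple inheritance the lowest common subsumer needs the tie-breaking rule described earlier, which this tree-only sketch does not implement.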

IC-based similarity measures
The semantic similarity between a pair of words x and y such that x ≠ y was computed using five different IC-based similarity formulas, each of which was combined with each of the seven IC formulas, thereby generating a total of 35 different IC-based similarity measures:
Jiang-Conrath similarity [43]: sim(x, y) = […]
Lin similarity [44]: sim(x, y) = […]
Meng similarity [45]: sim(x, y) = […]
Resnik similarity [46]: sim(x, y) = IC[LCS(x, y)]
Zhou similarity [47]: sim(x, y) = 1 − […]
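The way an IC-based similarity formula is paired with an IC formula can be sketched in Python. The Resnik measure below follows the expression sim(x, y) = IC[LCS(x, y)] given above. The Lin measure uses its commonly published form 2·IC[LCS(x, y)] / (IC(x) + IC(y)), which is an assumption here and may differ in detail from the variant used in this study. The IC values and the LCS lookup are invented for the example.

```python
# Hypothetical IC values in [0, 1] and a hypothetical LCS table (not WordNet).
TOY_IC = {"entity": 0.0, "artifact": 0.25, "cube": 0.5, "rack": 0.75}
TOY_LCS = {frozenset(("rack", "cube")): "artifact"}


def toy_ic(word):
    """Stand-in for any one of the intrinsic IC formulas."""
    return TOY_IC[word]


def toy_lcs(x, y):
    """Stand-in for the lowest common subsumer lookup."""
    return TOY_LCS[frozenset((x, y))]


def resnik(x, y, ic, lcs):
    """Resnik similarity: the IC of the lowest common subsumer."""
    return ic(lcs(x, y))


def lin(x, y, ic, lcs):
    """Lin similarity in its commonly published form (assumed here)."""
    return 2 * ic(lcs(x, y)) / (ic(x) + ic(y))


resnik_value = resnik("rack", "cube", toy_ic, toy_lcs)
lin_value = lin("rack", "cube", toy_ic, toy_lcs)
print(resnik_value, lin_value)
```

Passing `ic` as an argument is what makes the combinatorics work: the same similarity function can be evaluated with each IC formula in turn, which is how 5 similarity formulas and 7 IC formulas yield 35 measures.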

Statistics
Statistical analyses of the constructed semantic networks were performed using SPSS ver. 23 (IBM Corporation, New York, USA). To reduce type I errors, the time dynamics of semantic measures were analyzed with only two a priori planned linear contrasts [48] for the idea type (sensitive to vertical shifts of the trend lines) or the interaction between idea type and time (sensitive to differences in the slopes of the trend lines). Because semantic similarity was calculated with 40 different formulas and information content with 7 different formulas, possible differences in semantic similarity or information content were analyzed with three-factor repeated-measures analysis of variance (rANOVA), where the idea type was set as a factor with 2 levels, the time was set as a factor with 3 levels, and the formula type was set as a factor with 40 or 7 levels, respectively. Differences in the average level of abstraction, polysemy, or individual measures of information content or semantic similarity were analyzed with two-factor rANOVAs, where the idea type and time were the only two factors. The implementation of the repeated-measures experimental design controlled for factors that cause variability between subjects, thereby simplifying the effects of the primary factors (ideas and time) and enhancing the power of the performed statistical tests. Pearson correlation analyses and hierarchical clustering of semantic similarity and IC measures were performed in R ver. 3.3.2 (R Foundation for Statistical Computing, Vienna, Austria). For all tests, the significance threshold was set at 0.05.

Student and instructor thinking are similar in terms of semantic measures
With regard to creative thinking, our primary interest was focused on semantic similarity because, as a two-argument function, it is able to evaluate the relationship between pairs of vertices in the constructed semantic networks. In addition, the average of semantic similarity is more informative than is the average of single-argument functions, such as information content. A comparison between the student and instructor speech in the problem-solving conversations did not show significant differences in semantic similarity (three-factor rANOVA: F(1,12) < 0.3, P > 0.58; Fig. 2(A)), information content (three-factor rANOVA: F(1,12) < 0.2, P > 0.65; Fig. 2(B)), polysemy (F(1,12) < 0.6, P > 0.46; Fig. 2(C)), or level of abstraction (F(1,12) < 0.9, P > 0.38; Fig. 2(D)); this could be because all of the ideas originating from the student or the instructor were commented upon by both participants. To reduce type II errors, we also confirmed that the linear contrasts in individual two-factor rANOVAs were not significant for each of the 40 semantic similarity measures (F(1,12) < 0.9, P > 0.37) and each of the 7 information content measures (F(1,12) < 0.8, P > 0.40). These results justify our decision to further analyze both student and instructor speech jointly with regard to different types of ideas contained in the conversations.

Divergence of semantic similarity predicts the success of ideas
Creative ideas should be novel, unexpected, or surprising, and provide solutions that are useful, efficient, and valuable [49,50] . The success of generated ideas in creative problem solving depends not only on the final judgment by the client who decides which idea is the most creative, but also on the prior decisions made by the designer not to drop the idea in the face of constraints on available physical resources. Thus, while success and creativity are not the same, the ultimate goal of design practice is to find solutions that are both creative and successful. To determine whether different types of thinking are responsible for the success of some of the generated ideas and the failure of others, we have compared the time dynamics of semantic measures in the conversations pertaining to successful or unsuccessful ideas. Three-factor rANOVA detected a significant crossover interaction between idea type and time ( F 1,11 = 11.4, P = 0.006), where successful ideas exhibited divergence and unsuccessful ideas exhibited convergence of semantic similarity ( Fig. 3 (A)). The information content manifested a trend toward significant crossover interaction ( F 1,11 = 4.0, P = 0.072), where successful ideas increased and unsuccessful ideas decreased their information content in time ( Fig. 3 (B)). The polysemy exhibited crossover interaction decreasing in time for successful ideas ( F 1,11 = 12.8, P = 0.004; Fig. 3 (C)), whereas the average level of abstraction decreased in time but with only a trend toward significance ( F 1,11 = 4.6, P = 0.055; Fig. 3 (D)). Because design practice usually generates both successful and unsuccessful ideas, these results support models of concurrent divergent ideation and convergent evaluation in creative problem solving.
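As an illustration of how such time dynamics can be summarized, the sketch below fits a trend line y = kt + b (the parametrization reported in Table 2) to one measure at the three time points t = {1,2,3} by ordinary least squares. The function names are hypothetical. The sign convention is an assumption: a falling average similarity is labeled divergence, in line with the worked example later in the text in which the candidate that lowers the average similarity the most is called the most divergent.

```python
def fit_trend_line(values, times=(1, 2, 3)):
    """Ordinary least-squares fit of y = k*t + b; returns (k, b)."""
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, values))
    k = sxy / sxx
    b = y_mean - k * t_mean
    return k, b


def classify_similarity_dynamics(values, tolerance=1e-9):
    """Label a similarity time course by the sign of its slope.

    Assumption: decreasing average similarity means divergence,
    increasing average similarity means convergence.
    """
    k, _ = fit_trend_line(values)
    if k < -tolerance:
        return "divergence"
    if k > tolerance:
        return "convergence"
    return "flat"


# Hypothetical average similarities at t = 1, 2, 3 (illustration only).
label = classify_similarity_dynamics([0.39, 0.35, 0.29])
```

With only three equally spaced time points the slope reduces to half the difference between the last and the first value, so the middle point affects the intercept but not the slope.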

IC-based semantic similarity measures outperform path-based ones
The majority of 40 different semantic similarity formulas generated highly correlated outputs, which segregated them into clusters of purely IC-based, hybrid path/IC-based, and path-based similarity measures ( Fig. 4 ). Motivated by the significant difference detected in the time dynamics of semantic similarity between successful and unsuccessful ideas, we performed post hoc linear contrasts in individual two-factor rANOVAs and ranked the 40 semantic similarity measures by the observed statistical power ( Fig. 5 ; Table 2 ). The best performance was achieved by purely IC-based similarity measures using the formulas by Lin ( F 1,11 > 10.6, P < 0.008, power > 0.84),
Table 2. Statistics from the post hoc two-factor rANOVAs (linear contrasts of idea * time interaction) used to rank the 40 semantic similarity measures and trend line parameters ( y = kt + b ) for successful ideas ( k 1 , b 1 ) and unsuccessful ideas ( k 2 , b 2 ) at 3 time points t = {1,2,3} in the conversations.
cally significant. Among the IC formulas, the best overall performance was achieved by the cluster of Sánchez-Batet, Blanchard and Seco, which exhibited highly correlated IC values ( r > 0.93, P < 0.001; Fig. 6 ).
Having ranked the IC formulas ( Fig. 5 ), we also performed individual two-factor rANOVAs for each of the 7 IC measures. The information content of nouns increased/decreased in time for successful/unsuccessful ideas exhibiting a crossover interaction as shown by IC Sánchez-Batet ( F 1,11 = 6.2, P = 0.03), with 4 other IC measures by Blanchard, Meng, Seco and Zhou manifesting a trend toward significance ( F 1,11 > 3.8, P < 0.076). Because the first-ranked IC measure by Sánchez-Batet was significantly changed in the post hoc tests, we interpreted the trend-like significance from the corresponding three-factor rANOVA as a Type II error due to inclusion in the analysis of IC measures that compound the word information content with path-based information (such as the depth of the word in the taxonomy).
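For readers unfamiliar with purely IC-based measures, the following sketch shows the Lin formula, which scores two concepts by the information content of their lowest common subsumer relative to their own information content. Any IC formula can supply the three inputs, which is how a pairing such as Lin/Sánchez-Batet arises. The IC values below are hypothetical and are not taken from WordNet; returning 0.0 when both concepts carry zero information is a convention chosen for this sketch.

```python
def lin_similarity(ic_a, ic_b, ic_lcs):
    """Lin similarity: 2 * IC(lcs) / (IC(a) + IC(b)).

    ic_a, ic_b: information content of the two concepts.
    ic_lcs: information content of their lowest common subsumer.
    """
    denominator = ic_a + ic_b
    if denominator == 0:
        # Assumption for this sketch: two zero-information concepts
        # (e.g. the taxonomy root) are given similarity 0.0.
        return 0.0
    return 2.0 * ic_lcs / denominator


# Hypothetical IC values (illustration only).
similarity = lin_similarity(ic_a=4.0, ic_b=6.0, ic_lcs=2.0)
```

Because the formula uses only IC values, it carries no explicit path-length or depth term, in contrast to the path-based and hybrid measures.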

Effect of first evaluation on creative problem solving
Ideas before first evaluation are subject to change, with new features added and initial features omitted, whereas ideas after first evaluation do not change their main features, only their details. Considering this, we also tested the effects of first evaluation by client or instructor upon problem solving. Conversations containing both successful and unsuccessful ideas before and after first evaluation did not exhibit different time dynamics in any of the 40 semantic similarity measures (two-factor rANOVAs: F 1,10 < 2.7, P > 0.14; Fig. 9 (A)), in any of the 7 information content measures (two-factor rANOVAs: F 1,10 < 0.9, P > 0.38; Fig. 9 (B)), polysemy ( F 1,10 < 3.8, P > 0.08; Fig. 9 (C)), or the average level of abstraction ( F 1,10 < 0.1, P > 0.76; Fig. 9 (D)). Analyzing the time dynamics of only successful ideas also showed a lack of effect upon 39 of 40 semantic similarity measures (three-factor rANOVA: F 1,7 = 2.9, P = 0.131; two-factor rANOVAs: F 1,7 < 4.9, P > 0.063; Fig. 10 (A)), 7 information content measures (three-factor rANOVA: F 1,7 = 3.1, P = 0.124; two-factor rANOVAs: F 1,7 < 4.7, P > 0.067; Fig. 10 (B)), polysemy ( F 1,7 = 3.8, P = 0.093; Fig. 10 (C)), and the average level of abstraction ( F 1,7 = 5.0, P = 0.06; Fig. 10 (D)). Only the semantic similarity measure by Rada showed an enhanced divergence after first evaluation ( F 1,7 = 6.0, P = 0.044), but we interpreted this as a Type I error since the path-based similarity measures were the weakest in terms of statistical power ( Fig. 5 ). These results suggest that the first evaluation had a minimal effect upon those ideas that were not dropped but developed further.

Implications for cognitive science of creativity
The presented findings advance cognitive science by showing that convergence and divergence of semantic similarity, as well as time dynamics of information content, polysemy, and level of abstraction, could be evaluated objectively for problem-solving conversations in academic settings and be used to monitor the probability of success of different ideas that are generated and developed in the process of problem solving in view of improving student training, creative thinking and skill acquisition. The observed convergence of semantic similarity for unsuccessful ideas and divergence for successful ideas parallel the psychological definitions of convergent/divergent thinking that associate creativity with divergent thought [3][4][5] . Thus, the convergence or divergence of semantic similarity in verbalized thoughts could be interpreted as a faithful reflection of the underlying cognitive processes, including convergent (analytical) or divergent (associative) thinking. Given the correspondence between convergence/divergence of semantic similarity and convergent/divergent thinking, our results, with regard to successful/unsuccessful ideas, provide extra support to recent accounts of concurrent occurrence of convergent and divergent thinking in creative problem solving [12,15,19] .
Psychological accounts of creative thinking and problem solving describe divergent generation of novelty and convergent exploration, evaluation or elimination of the introduced novelty [19] . The opposite trend line slopes for successful and unsuccessful ideas found in the studied design review conversations can be well explained by a difference in the rates of divergent production and convergent elimination of novelty. Thus, convergent (analytical) and divergent (associative) cognitive processes, quantified through time dynamics of semantic similarity, appear to be the main factors that shape the evolution and determine the outcome of generated ideas.
Language is a powerful data source for the analysis of mental processes, such as design and creative problem solving. Extracting meaningful results about the cognitive processes underlying human creativity from recorded design conversations, however, is a challenging task because not all aspects of human creative skills are verbalized or represented at a consciously accessible level [25] . Semantic networks address the latter problem by providing a structured representation of not only the explicitly verbalized concepts contained in the conversations [26,27] , but also of the inexplicitly imaged virtual concepts (connecting the verbalized concepts), which are extracted from available lexical databases that are independent of the designer's background [51] . In our methods, we have used WordNet 3.1 as a lexical database and have constructed semantic networks containing only nouns. Working with a single lexical category (nouns) was necessitated by the fact that WordNet consists of four subnets, one each for nouns, verbs, adjectives, and adverbs, with only a few cross-subnet pointers [22] . Besides nouns being the largest and deepest hierarchical taxonomy in WordNet, our choice to construct semantic networks of nouns was motivated by previous findings that showed how: noun phrases are useful surrogates for measuring early phases of the mechanical design process in educational settings [52] , networks of nouns act as stimuli for idea generation in creative problem solving [53] , noun-noun combinations and noun-noun relations play an essential role in designing [54] , and similarity/dissimilarity of noun-noun combinations is related to creativity through yielding emergent properties of generated ideas [55] .
Notably, disambiguation of noun senses is not done for the construction of semantic networks because nouns used to describe creative design ideas may acquire new senses different from dictionary-defined ones and polysemy may be responsible for the association of ideas previously thought to be unrelated [56,57] . The effectiveness of semantic networks of nouns for constructive simulation of difficult-to-observe design-thinking processes and investigation of creativity in conceptual design was validated in previous studies using different sets of experimental data [26,27,51,58] .
The temporal factor is not a prerequisite for applying semantic network analysis to text data; however, determining the slope of convergence/divergence is essential if the objective is to understand dynamic processes or to achieve dynamic control of artificial intelligence applications. The temporal resolution of the method for studying cognitive processes in humans is limited by the speed of verbalization and the sparsity of nouns in the sentences. A possible inclusion of more lexical categories in the semantic analysis would increase the temporal resolution by allowing verbal reports to be divided into smaller pieces of text, but for practical realization this will require further extensive information theoretic research on how semantic similarity could be meaningfully defined for combinations of lexical categories, such as verbs and nouns, which form separate hierarchical taxonomies in WordNet.

Implications for artificial intelligence research
Implementing creativity in machines endowed with artificial intelligence requires mechanisms for generation of conceptual space within which creative activity occurs and algorithms for exploration or transformation of the conceptual space [59] . The most serious challenge, however, is considered not the production of novel ideas, but their automated evaluation [50] . For example, machines could explore structured conceptual spaces and combine or transform ideas in new ways, but then arrive at solutions that are of no interest or value to humans. Since creativity requires both novelty and a positive evaluation of the product, the engineering of creative machines is conditional on the availability of algorithms that could compute the poor quality of newly generated ideas, thereby allowing ideas to be dropped or amended accordingly [50] .
Linkography is a method for analyzing decisions and activities that occur during a design work session by parsing the design conversations into a large number of small steps called design moves, some of which are then interrelated through backlinks to previous moves or forelinks to future moves. The most significant elements in a linkograph are critical moves, which are particularly rich in links. The percentage of critical moves and the link index (the ratio between the number of links and the number of moves) are positively correlated with creativity. The ideas considered most meaningful (successful ideas) have a significantly higher number of links than other ideas [60] . An information theoretic approach to measuring creativity in linkography has further shown that the Shannon entropy H of the linkograph is not directly correlated to the design outcome; however, the slope of the rate of change in entropy (second derivative in time of the entropy curve, $d^2H/dt^2$) for high-scoring design sessions (successful ideas) is positive, whereas for low-scoring design sessions (unsuccessful ideas) it is negative [61] .
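The two linkography quantities mentioned above can be sketched in a few lines of Python. The link index follows the definition given in the text. The second function is only a discrete analogue of the second time derivative for equally spaced entropy samples; how the entropy of a linkograph is computed in [61] is not described here, so the entropy values are treated as given inputs. All names and numbers are hypothetical.

```python
def link_index(num_links, num_moves):
    """Ratio between the number of links and the number of design moves."""
    return num_links / num_moves


def second_differences(entropy):
    """Discrete analogue of d^2H/dt^2 for equally spaced entropy samples.

    Returns H[i+1] - 2*H[i] + H[i-1] for every interior sample i.
    """
    return [
        entropy[i + 1] - 2 * entropy[i] + entropy[i - 1]
        for i in range(1, len(entropy) - 1)
    ]


# Hypothetical session: 30 links over 20 moves, entropy sampled 4 times.
index = link_index(30, 20)
curvature = second_differences([0.0, 1.0, 4.0, 9.0])
```

Positive second differences indicate an entropy curve whose rate of change is increasing, which is the pattern the text associates with high-scoring sessions.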
Here, we have analyzed design review conversations at the level of individual words and extracted nouns from the corresponding text transcripts through computer automated natural language processing. With the use of semantic networks of nouns constructed at different times, we studied the time dynamics of 49 semantic measures that quantitatively evaluated the content of generated ideas in creative problem solving. We found that the creative ideas, which are judged as successful by the client, exhibit distinct dynamics including divergence of semantic similarity, increased information content and decreased polysemy in time. These findings are susceptible to reverse-engineering and could be useful for the development of machines endowed with general artificial intelligence that are capable of using language (words) and abstract concepts (meanings) to assist in solving problems that are currently reserved only for humans [62] . A foreseeable application would be to use divergence of Lin/Sánchez-Batet semantic similarity in computer-assisted enhancement of human creativity wherein software proposes a set of possible solutions or transformations of generated ideas and the human designer chooses which of the proposed ideas to drop and which to transform further. As an example, consider a design task described by the set of nouns {bird, crayon, desk, hand, paper} whose average semantic similarity is 0.39. The software computes four possible solutions that change the average similarity of the set when added to it, namely, drawing (0.40), sketch (0.39), greeting_card (0.35), origami (0.29), and proposes origami as the most creative solution as it is the most divergent. If the designer rejects the idea, the software proposes greeting_card as the second best choice, and so on. Divergence of semantic similarity could be monitored and used to supplement existing systems for support of user creativity [63][64][65] .
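The proposal loop in this example can be sketched as follows. The ranking uses the four average similarities quoted above. The helper that averages a similarity function over all unordered noun pairs is an assumption about how the set average is formed, and both function names are hypothetical; no actual WordNet similarity is computed here.

```python
from itertools import combinations


def average_similarity(nouns, similarity):
    """Mean of similarity(a, b) over all unordered pairs of nouns.

    Assumption: the set average is taken over unordered pairs;
    `similarity` is any two-argument function, e.g. a Lin-type measure.
    """
    pairs = list(combinations(nouns, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def rank_by_divergence(candidate_averages):
    """Order candidates from most to least divergent.

    candidate_averages maps each candidate noun to the average similarity
    of the set after adding it; a lower average means more divergent.
    """
    return sorted(candidate_averages, key=candidate_averages.get)


# Average similarities after adding each candidate, as quoted in the text.
reported = {"drawing": 0.40, "sketch": 0.39,
            "greeting_card": 0.35, "origami": 0.29}
proposals = rank_by_divergence(reported)
# proposals == ['origami', 'greeting_card', 'sketch', 'drawing']
```

The software would offer the candidates in this order, moving to the next one each time the designer rejects a proposal.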
Accumulated experience with software that enhances human creativity could help optimize the evaluation function for dynamic transformation of semantic similarity and information content of generated ideas up to the point wherein the computer-assisted design products are invariably more successful than products designed without computer aid. If such an optimized evaluation function is arrived at, creative machines could be able to evaluate their generated solutions at different stages without human help, and steer a selected design solution toward success through consecutive transformations; human designers would then act as clients who run design tasks with slightly different initial constraints on the design problem and at the end choose the computer product that best satisfies their personal preference.

Future work
Having established a method for the quantitative evaluation of convergence/divergence in creative problem solving and design, we are planning to utilize it for the development of artificial intelligence applications, the most promising of which are software for the computer-assisted enhancement of human creativity and bot-automated design education in massive open online courses (MOOCs), wherein a few instructors are assisted by artificial agents that provide feedback on the design work for thousands of students. We are also interested in cross-validating our results with the use of conversation transcripts from the design process of professional design teams in which the instructor-student paradigm is not applicable, and testing whether semantic measure analysis of online texts in social media or social networks could predict future human behavior.

Ethics statement
The authors have signed Data-Use Agreements to Dr. Robin Adams (Purdue University) for accessing the Purdue DTRS Design Review Conversations Database, thereby agreeing not to reveal personal identifiers in published results and not to create any commercial products.