Updating verbal fluency analysis for the 21st century: Applications for psychiatry

Evaluating patients' verbal fluency by counting the number of unique words (e.g., animals) produced in a short period (e.g., 1-3 min) is one of the most widely employed cognitive tests in psychiatric research. We introduce new methods to analyze fluency output that leverage modern computational language technology. These methods make it possible to move beyond simple word counts to charting the temporal dynamics of speech and objectively quantifying the semantic relationships between utterances. Such metrics can greatly expand the current psychiatric research toolkit and can help refine clinical theories regarding the nature of putative language differences in patients.


Introduction
Language is affected by a large number of cortical disorders, and in psychiatric disorders specifically it is the medium through which many symptoms are expressed and thus measured. Skilled clinicians can detect and assess such symptoms intuitively in their daily practice, but numerous tests have also been developed to be easily administered. Although these tests were not designed to understand serious mental illness per se, due to the ease of administration they are frequently employed to shed additional light on the nature of the presenting language anomalies. We investigated how emerging technology could be leveraged for new opportunities in both the administration and analysis of the verbal fluency task, and how descriptions of high temporal resolution could inspire new insights into the dynamical nature of verbal fluency task performance.
The category verbal fluency task is one of the most widely used language tasks in psychiatric research. In this task, participants are asked to produce as many exemplars as possible in response to a few noun cues (e.g., animals), with a specified duration (e.g., one minute) for each category. The experimenter then typically writes down all the exemplars and assigns a point for each unique exemplar produced. Such an operationalization ignores the fundamental fact that even on such a simple task there is a remarkable amount of structure and temporal information (Bousfield and Sedgewick, 1944; Bousfield et al., 1954). However, the exact timing between utterances has rarely been formally examined.

Moving beyond stopwatches and pencils to automatic speech transcription with accurate timing measurements
By collecting speech output digitally, it is possible to use automatic transcription methods, which work because of the statistical properties of language (e.g., word frequencies) derived from large-scale language corpora. Although currently available tools are not designed specifically to analyze the verbal fluency task, we have found that calibrating the statistical model on relevant task words (i.e., words that are likely to occur, such as "elephant", "squirrel" and "giraffe") significantly improves automatic speech-to-text transcription accuracy (to a word error rate of only 6%; Holmlund et al., 2019). Additionally, it is possible to time-stamp each word utterance (using forced alignment tools). Although the temporal resolution of publicly available speech recognition services is currently on the order of ± 100 ms, and thus not really adequate for charting the flow of thought, a higher temporal resolution on the order of ± 10 ms can be developed in-house for specific tasks. Therefore, the use of automatic speech recognition combined with accurate temporal markers of the utterances can radically transform the manner in which verbal fluency data are analyzed.
Traditionally, psychiatry has concerned itself with analyzing the meaning of what people say, using a variety of theories and (arguably subjective) hand-coding methods. This is motivated by the notion that examining associations will provide some insight into the connection of ideas. The underlying theory is that well-organized and closely associated ideas and concepts will be generated faster (i.e., that meaning and temporal dynamics are intertwined). Today such conceptualizations of semantic associations can be formally examined by the use of large-scale language corpora. These corpora can be leveraged to create vector representations of individual words and build semantic spaces, which can then be used as tools to compute semantic distances between words and concepts. In psychiatry there has been a growing use of such computational methods to derive a variety of complex natural language processing assays; for example, it has been shown that thought disorder can be associated with lower word-to-word coherence (Elvevåg et al., 2007). Critically, it is possible to explore whether such metrics are predictive of clinical state changes in a manner that is similar to human clinicians, or perhaps even more sensitive (Bedi et al., 2015; Rosenstein et al., 2015; Foltz et al., 2016; Corcoran et al., 2018).

Methods
We sought to chart the temporal dynamics of speech production and objectively quantify the semantic relationship of response words in a category fluency task, collected as part of a large project that developed and implemented a mobile software application for remote, frequent and self-administered psychological assessment to monitor mental states using smart devices (Holmlund et al., 2019). Participants were given the following vocal prompt from the smart device: "Name as many animals as you can. Any kind of animal, as many different animals as you can think of. You have up to 1 min; start now". The task was administered between one and three times to a subset of participants. We analyzed responses from 24 male patients recruited from an inpatient population with substance use disorders and major depressive disorder comorbidity (mean age = 39.1 years, SD = 10.7). For comparison, we analyzed responses from 35 presumed healthy volunteer participants, who were undergraduate students recruited from a large public university, substantially younger (mean age = 19.5 years, SD = 1.5), and predominantly female (87.9%).
One-minute recordings of responses were made via the microphone in smart devices at a sample rate of 16,000 Hz and saved in FLAC (.flac) format. These recordings were transcribed, and the response words were time-stamped with a forced temporal alignment procedure using the Kaldi speech recognition toolkit (Povey et al., 2011). Non-animal words (e.g., "Let's", "see", "what", "else") were removed, and the remaining words were lemmatized (i.e., converted to their base form, e.g., "cats" to "cat") with the Natural Language Toolkit (Loper and Bird, 2002).
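The step from time-stamped words to inter-word delays can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the `Utterance` record, the helper names, the toy lexicon and the timestamps are all hypothetical, and the delay is assumed here to be the silent gap between the offset of one response and the onset of the next (an onset-to-onset definition would be equally plausible). Lemmatization is omitted for brevity.

```python
from typing import List, NamedTuple, Set


class Utterance(NamedTuple):
    """One force-aligned word (hypothetical record layout)."""
    word: str
    onset: float   # seconds from the start of the trial
    offset: float  # seconds from the start of the trial


def keep_animal_words(utterances: List[Utterance], lexicon: Set[str]) -> List[Utterance]:
    """Drop filler words such as "let's" or "see", keeping the original order."""
    return [u for u in utterances if u.word.lower() in lexicon]


def inter_word_delays(utterances: List[Utterance]) -> List[float]:
    """Gap between the offset of each response and the onset of the next one.

    Assumption: delay is measured offset-to-onset.
    """
    return [nxt.onset - cur.offset for cur, nxt in zip(utterances, utterances[1:])]


# Made-up timestamps, for illustration only
trial = [
    Utterance("let's", 0.20, 0.45),
    Utterance("see", 0.45, 0.70),
    Utterance("dog", 1.10, 1.40),
    Utterance("cat", 1.65, 1.95),
    Utterance("armadillo", 4.50, 5.30),
]
toy_lexicon = {"dog", "cat", "armadillo"}

animals = keep_animal_words(trial, toy_lexicon)
delays = inter_word_delays(animals)
```

With the toy trial above, the two filler words are discarded and two delays remain, one for each pair of successive animal responses.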
We derived an index of semantic associations between word pairs using a set of publicly available GloVe word vectors (Pennington et al., 2014), computing relationships between vector representations of the individual responses. This method provides a quantified measure of the degree of semantic association between words, based on how they co-occur in similar contexts in a given corpus. To base the analysis on a corpus with a wide variety of animal-word sources, we used a set of pretrained word vectors calculated from approximately 42 billion tokens of web text collected by the Common Crawl project (Pennington et al., 2014). The GloVe word vectors were imported in word2vec format (Mikolov et al., 2013), and word-pair cosine similarity was derived using the Gensim Python package (Řehůřek and Sojka, 2010). The range of the measure was 0-1, such that words that often co-occur, such as "lion and tiger", got a score closer to one, while words that seldom occur together, such as "giraffe and slug", got a score closer to zero.
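The word-pair measure described above is the cosine of the angle between two word vectors. The sketch below computes it for successive responses in plain Python. It is only an illustration: the three-dimensional vectors are invented stand-ins for the pretrained GloVe vectors, and the function names are hypothetical rather than taken from Gensim.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def successive_similarities(words: List[str],
                            vectors: Dict[str, Sequence[float]]) -> List[float]:
    """Similarity between each response and the response that follows it."""
    return [cosine_similarity(vectors[a], vectors[b])
            for a, b in zip(words, words[1:])]


# Invented 3-dimensional vectors; real GloVe vectors have hundreds of dimensions
toy_vectors = {
    "lion": [0.9, 0.8, 0.1],
    "tiger": [0.8, 0.9, 0.2],
    "slug": [0.1, 0.2, 0.9],
}

similarities = successive_similarities(["lion", "tiger", "slug"], toy_vectors)
```

In this toy example the "lion and tiger" pair scores higher than the "tiger and slug" pair, mirroring the pattern described in the text.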

Results
Overall, healthy participants generated a mean of 25 words (range = 8-36), and patients with substance use disorder and major depressive disorder comorbidity generated fewer words on average (mean = 19 words; range = 11-36; t(57) = −4.0; p < 0.001). To illustrate how the temporal trajectories can differ between individuals, we chose two noteworthy examples, namely a healthy person who generated 31 words versus a patient who generated only 12 words in the one-minute period (Fig. 1A). We also illustrate how much variability there can be in individual data as compared to group means (Fig. 1B). Over the course of a minute, the word-to-word similarity can fluctuate considerably (Fig. 1C), depending on whether successive word pairs are highly similar or not. Not surprisingly, the semantic coherence between two successive words was related to the speed of speech (Fig. 1D). We found a significant negative correlation between similarity and inter-word delay (entire sample: r = −0.36, p < 0.001; patients: r = −0.38, p < 0.001), indicating a tendency for longer pauses between semantically dissimilar words. Thus, word pairs with a high similarity index, such as "cat and dog", were spoken in quick succession and seldom had inter-word delays longer than three seconds.
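A correlation of this kind between inter-word delay and word-pair similarity can be computed as sketched below. The sketch assumes a Pearson product-moment coefficient, which the reported "r" suggests but the text does not state explicitly; the function name and the data values are invented for illustration and are not the study data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length with at least two values")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sample has zero variance")
    return sxy / math.sqrt(sxx * syy)


# Invented values: one inter-word delay (s) and one similarity score per word pair
toy_delays = [0.3, 0.5, 1.2, 2.8, 4.1]
toy_similarities = [0.82, 0.74, 0.55, 0.40, 0.21]

r = pearson_r(toy_delays, toy_similarities)
```

A negative value of `r` corresponds to the pattern reported above, in which longer pauses tend to separate less similar words.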

Discussion
We have introduced preliminary analytic methods that showcase how the currently untapped temporal and semantic information in a simple category fluency task can be formally extracted. While the results are encouraging, much remains to be explored and established before widespread implementation. As an increasing number of statistical and mathematical approaches to language emerge that are applied within psychiatry (Elvevåg et al., 2017), it remains extremely important to validate and calibrate these methods (Foltz et al., 2016). Most notably, it is critical to establish that these assays and analytic methods are consistent across data and analysis platforms, that the subtle nuances of the various natural language processing techniques are not affecting the results in a manner that is unexpected, and that the value of the extra technical effort is established to be worthwhile.

1.2. Making sense of the semantic associations in free speech: measuring distances in semantic space

Fig. 1. The temporal sequence of utterances in example trials from the one-minute category fluency task. Panel A: Each individual response utterance is plotted on the timeline from the start of the trial (i.e., "0") to the end of the trial (i.e., "60"). The left and right margins of the colored boxes represent the onsets and offsets of the responses, respectively, demonstrating how a verbal response of "pig" is shorter than a response of "armadillo". The vertical axis represents the word count, and periods with a quick succession of words result in a steeper trajectory. Panel B: The two cases (Case #1 and Case #2) are plotted alongside the individual sequences from the other participants in the two groups (healthy, patients). The distribution of total word counts suggests the expected pattern where healthy participants produced more words (green, mean = 25) as compared to patients with substance use disorders and major depressive disorder comorbidity (SUD-MDD, blue, mean = 19). Panel C: The similarity between successive responses can fluctuate over time. The solid blue line shows how the similarity between responses from Case #2 decreases over the course of the trial (e.g., "chicken and giraffe", "giraffe and slug"), but ends with words that more commonly occur together in sentences ("bug and rabbit"). The dotted line shows a trial from another participant with more consistently similar response words. Panel D: The semantic similarity between two successive words is related to the inter-word delay. Each word pair produced by the patients is presented as a point in the scatterplot, where word pairs that are more similar are higher up the vertical axis, and pairs produced with longer inter-word delays are further to the right on the horizontal axis. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)