Towards a temporospatial framework for measurements of disorganization in speech using semantic vectors

information - namely the situation of where something is spoken - has often only been inferred and never explicitly modeled. Addressing this situational issue opens up new possibilities for models with increased temporal resolution and contextual relevance. In this paper, direct visualizations of semantic distances are used to enable the inspection of examples of incoherent speech. Some common operationalizations of incoherence are illustrated, and suggestions are made for how temporal and spatial contextual information can be integrated in future implementations of measures of incoherence.


Introduction
"'The train of thought' can initially remain ordered, but later in many cases displays leaps, becomes disjointed, sometimes reaches complete incoherence". (Kraepelin, 1921, p59) 'Coherence' -and thus incoherence -is a multidimensional concept of discourse. Coherence can be considered at multiple levels of linguistic complexity (e.g., semantic, syntactic) and operationalizations of the concept makes assumptions about prior contexts. The methodology by which speech coherence is assessed has important clinical and research implications, and indeed the Diagnostic and Statistical Manual of Mental Disorders notes that the key symptom of disorganized thinking is "inferred from the individual's speech" (American Psychiatric Association, 2022). However, this evaluation is a complex process. The framework and terminology a field employs to capture the phenomenon of interest are often a legacy of the metaphors that are prevalent in society at the time (Lakoff and Johnson, 1980), and this is indeed the case with (modern) psychiatry which had its origins at around the birth of the steam train and method of communication by telegraph. The lingering clues of this remain to date in our use of terms such as derailment and pressured or telegraphic speech. Not surprisingly then, the conceptualization by Kraepelin a century ago of incoherence in speech can appear to us today more as a figurative literary description than as an objective quantification of a symptom. It is the premise of this paper that to reliably operationalize coherence of speech it will be crucial to move beyond the use of metaphors to the point where we have tools that are comparable and replicable (see Holmlund et al., 2021). The promise of this approach is replicability of measurements that are unambiguous of these putative in vivo complex thought processes, and generalizability of measures to diverse populations and communities.
The "leaps" in thoughts described by Kraepelin reveal an intuition about how language and the underlying cognitive processes can be conceptualized within a framework of time and space. A premise here is that thoughts are represented by the meaning of words, expressed via the medium of speech or writing at a specific time and generated within a particular physical location. There is a long tradition of conceptualizing the distance between thoughts from notions of a "cognitive map" (e.g., Tolman, 1948) as well as a "psychological space" (Shepard, 1987), which may well have a neurobiological basis (see e.g., Bellmund et al., 2018; Viganò and Piazza, 2020). If one accepts the premise that words represent the thoughts of the speaker, then the intuition of a "leap" in a psychological space becomes quantifiable as distances in these semantic spaces. Indeed, time and space can be quantified for scientific purposes with seconds and meters, but what about the meaning of a word? The language philosopher Ludwig Wittgenstein famously stated that "the meaning of a word is its use in the language" (Wittgenstein, 1953, section 43), hinting that systematic observation of communicative behavior provides a framework to examine meaning. Around the same time, the linguist John Rupert Firth stated that "you shall know a word by the company it keeps" (Firth, 1957). Together, these two position statements provide a philosophical foundation to the scientific study of meaning: by examining how words co-occur in language it is possible to quantify aspects of meaning, or semantics (and therefore the content of thoughts). Put differently, words that tend to occur in similar contexts are semantically related and thus should be close to one another in a derived word vector space. This has become known as the distributional hypothesis of semantics and leverages the distributions of words across large amounts of text (millions of written words) to derive semantic vector spaces.
Distances in semantic vector spaces provide a useful analog to coherence in discourse (Foltz, 2007) in that leaps from one part of the space to another can provide an indication of how much change there is in the overall semantic content from one part of the discourse to the next. There are of course alternatives to the philosophical tradition of Wittgenstein (e.g., Sellars, 1963), and there may be compelling arguments that thought and language are separate systems (e.g., see Jackendoff, 1996). Nonetheless, viewing incoherence of speech through the lens of the above-mentioned philosophical framework has led to ideas and testable hypotheses on how word co-occurrence statistics can differ between the speech of patients with schizophrenia and that of healthy individuals.
Traditionally, such co-occurrence statistics have been derived from text corpora, and by employing mathematical techniques it is possible to obtain numerical (vector) representations of words, where the meanings of words are expressed as locations in high-dimensional "semantic spaces". Many mathematical techniques for the creation of word representations exist, and these options are increasing rapidly with the fast paced tempo of progress in the field of natural language processing. The first methods were presented in the late 1990s, notably with the "Hyperspace Analog to Language" (HAL; Burgess et al., 1998) and Latent Semantic Analysis (LSA; Foltz, 1996;Landauer and Dumais, 1997). More recent methods use deep neural networks and include word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), Embeddings from Language Models (ELMo; Peters et al., 2018), and Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2019). LSA performs a singular value decomposition on the word type by document matrix to obtain lower dimensional vectors of each of the types. Word2vec is a neural network-based word embedding model trained on a large corpus of text to predict either a word given its context (continuous bag of words; CBOW) or the context surrounding a given word (skip gram). ELMo and BERT are deep neural language models that are built on Long Short Term Memory (LSTM) and transformer architectures, respectively. Generally, coherence metrics are computed as the cosine distance between consecutive vector representations of words, windows of words, phrases, or sentences. From a practical perspective, the goal is to have something that is both useful (e.g., provides predictions that are informative to a clinician or researcher) and explainable (e.g., that the operationalizations use constructs tied to underlying neurocognitive functions (see Foltz et al., 2022)).
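The cosine-distance computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in any of the cited studies: the three-dimensional vectors in `toy_space` are invented for the example (real semantic spaces have hundreds of dimensions), and the function names are hypothetical.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def consecutive_distances(vectors):
    """Cosine distance between each vector and the one that follows it."""
    return [cosine_distance(a, b) for a, b in zip(vectors, vectors[1:])]

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_space = {
    "farm":   [0.9, 0.1, 0.0],
    "cattle": [0.8, 0.2, 0.1],
    "soap":   [0.0, 0.1, 0.9],
}

sequence = ["farm", "cattle", "soap"]
distances = consecutive_distances([toy_space[w] for w in sequence])
# distances[0] is the farm-to-cattle step, distances[1] the cattle-to-soap step.
```

With these toy vectors the step from "cattle" to "soap" is much larger than the step from "farm" to "cattle", which is the pattern a real semantic space would be expected to show for these words.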
There is no 'one size fits all' approach to choosing the right operationalization of disorganization in speech. Some operationalizations are suited for short timescales, some for long, and others yet for cases where connectedness to contextual cues is needed. What these aforementioned methods have in common is that words that share related meanings are 'close' to one another in these spaces, such as "cattle" and "farm", whereas the words "cattle" and "soap" will have longer distances between them. These words come from the now classic example of incoherence in Andreasen's (1986; p. 477) Thought, Language and Communication Scale, where, when asked what they thought about current political issues such as the energy crisis, the patient is quoted as responding: "They're destroying too many cattle and oil just to make soap. If we need soap when you can jump into a pool of water" [quote abbreviated by us]. A language model (LSA) illustrates the distances between these three words ("farm", "cattle" and "soap") in Fig. 1 Panel A. To borrow from Kraepelin's quote cited in our introduction, the "leaps" between some words or concepts are quantifiably larger, here illustrated by the cascades of such arcs, with the higher values indicating less coherent speech. It can also be useful to compute the conceptual relatedness between sequences of words or sentences as illustrated in Fig. 1 Panel B. These semantic spaces therefore provide a domain where the "leaps" of words and thoughts can be variously quantified for scientific investigations, enabling us to compute distance measurements in time, physical space and semantic space. All three of these domains must be properly operationalized and understood for a scientific analysis framework of language disorganization to have true clinical translation value.
In this paper, three critical problems -and potential solutions -are discussed: First, the variability of operationalizations of the concept of coherence and of word meanings can create a confusing landscape of methods. Put differently, it is not easy to know how different measurements of distance in the semantic domain translate to the concept of "incoherence". Indeed, the field risks a "black-box" situation, where methods seem to work as intended (e.g., for classification of research participants as patients or non-patients), but that may be for unintended or frankly wrong reasons. Luckily, the extension of our measurements into "conceptual space" lends itself to precise visualizations. The "leaps" are measured as distances, and here we present illustrations of how these distances are captured in current computational methods (see Fig. 1). Second, current computational tools are based on methods from natural language processing of text, and lack crucial temporal information about when words are spoken. Unlike language in text format, spoken language varies in its temporal delivery. This temporal pattern contains critical information about how words are connected. Hence, language models based on text may miss this critical information. Third, if information about the situation in which speech behavior happens is missing, crucial contextual information will be lost. By accounting for the context where speech is happening in physical space it is possible to create more nuanced models of whether what is said is coherent or not, similar to how clinicians intuitively account for contextual factors. We initiate this account by discussing the most basic notion of contextual information, namely spatial location, but ultimately contextual clues such as speaker demographics, speaker motivation or purpose, and previous conversational topics can and should be utilized. 
Addressing these three problems regarding measurements of incoherence and disorganization in speech has the objective of carving a path towards more robust and universal methods of operationalizing incoherence. We will conclude that (1) interpretability of methods can be improved by explicit visualizations, (2) timestamping uttered words enables new and more nuanced information regarding the temporal aspect of coherence and (3) contextually anchored language models that incorporate situational information will allow more fine-grained information about whether or not speech is coherent within the limits of the local context (e.g., speech at a family gathering versus in an academic lecture hall).

Problem 1: incoherence in speech is conceptualized and computationally quantified in many different ways
Since the word "incoherence" can mean so many different things, this creates a problem for the specificity of our analyses. For example, to a clinician interviewing a patient with schizophrenia - informed by Andreasen's definition in her Thought, Language and Communication Scale (Andreasen, 1986, p. 477: 'A pattern of speech which is essentially incomprehensible at times') - incoherence refers to the comprehensibility of what is spoken. To a computational linguist examining incoherence at a discourse level, the term lexical cohesion (i.e., the opposite phenomenon) is often used, traditionally taken to mean that there is a sharing of semantically related or identical words in neighboring sentences (Halliday and Hasan, 1976) as well as syntactic markers indicating causal connections. Coherence, thus defined, is then examined with lexical and syntactic constraints, logical relations between concepts and events, and overall agreement with "world knowledge". To a neuroscientist, coherence in the brain can mean several different things depending upon whether their focus is cellular, circuitry or systems. Further, why coherence emerges and what causes it is a function of a variety of brain systems (e.g., Dapretto et al., 2005). Coherence measured in discourse using word embedding spaces - as is the focus of this paper - can also mean a variety of things.
Previously we suggested four approaches to compute semantic coherence using word embedding methods, namely using the semantic distance between one word and another; using the distances between larger units of language within a discourse; estimating how a person's answer relates to a question asked; and estimating how answers relate to another person's answer to the same question (Elvevåg et al., 2007). These approaches remain relevant today and can - with some generalizations - serve as the overarching categories within which the plethora of possible methods could fall. The first approach, word-to-word similarity (Fig. 1, panel A), has been used in several different ways to quantify connectedness between adjacent word responses in verbal semantic fluency tests (e.g., Holmlund et al., 2019; Kim et al., 2019; Pauselli et al., 2018). A notable variant of single word-to-word coherence measurements was used by Corcoran et al. (2018), where the semantic distances between words with inter-word distances of 5 to 8 were used to predict psychosis onset in clinical high-risk youths. The second general approach involves examining distances within and between larger units of speech, such as sentence-to-sentence similarity, phrase-to-phrase similarity or variants of using "windows" of text of various lengths (e.g., 6 words) with single-window (Fig. 1, panel C) or dual-window (Fig. 1, panel D) variants. Measuring coherence in longer units of connected discourse such as story recalls, free speech, and answering process questions has also uncovered differences between patients with schizophrenia and healthy volunteers (e.g., Bedi et al., 2015; Tang et al., 2021). The third general approach (combining the last two approaches mentioned earlier) involves comparing the semantic content of speech to some outside contextual information, such as the content of a preceding question (in discourse) or common speech in the same situation (e.g., other answers to the same question).
This approach measures the topical coherence as to whether a response is related to a posed topic, as well as how much the discourse may deviate tangentially from the topic.
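Two of the variants named above can be sketched as follows. The sketch rests on stated assumptions: the "inter-word distance" variant is read as the distance between each word and the word a fixed number of positions later, and the dual-window variant is read as the distance between the mean vectors of two adjacent windows. The exact definitions are conveyed graphically in Fig. 1, so these readings, the function names, and the random stand-in vectors are illustrative only.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def skip_distances(vectors, gap):
    """Distance between each word vector and the one `gap` positions later."""
    return [cosine_distance(vectors[i], vectors[i + gap])
            for i in range(len(vectors) - gap)]

def dual_window_distances(vectors, size):
    """Distance between the mean vectors of two adjacent windows of `size` words."""
    out = []
    for i in range(len(vectors) - 2 * size + 1):
        left = np.mean(vectors[i:i + size], axis=0)
        right = np.mean(vectors[i + size:i + 2 * size], axis=0)
        out.append(cosine_distance(left, right))
    return out

# Random stand-ins for the word vectors of a 12-word utterance (not real embeddings).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(12, 8))

gap_5 = skip_distances(vectors, gap=5)            # one value per word pair 5 positions apart
dual_3 = dual_window_distances(vectors, size=3)   # one value per pair of adjacent 3-word windows
```

Both functions return one distance per position of the moving comparison, so the output can be plotted directly against word position in a transcript.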
In addition to these main approaches, new variations of methods have been developed. The types of word vector spaces used have been updated over time, with the use of word2vec, GloVe (Iter et al., 2018), ELMo (Sarzynska-Wawer et al., 2021) and BERT (Tang et al., 2021). Also, the methods for computing semantic distance have seen innovation with explainability investigations into optimal moving-window sizes (Voppel et al., 2021), and the use of vector centroids (Xu et al., 2021) (for further studies see e.g., Iter et al., 2018; Just et al., 2019, 2020). In essence, while varied, all approaches assess aspects of coherence. However, to operationalize a measure, it is necessary to understand the link between the output of the computational method and the neuropsychological phenomena being investigated (e.g., Foltz et al., 2022). A critical tool for clarifying the definition of incoherence being assessed is to make current verbal definitions visual. With improved visual representations of the methodology, it is possible to understand how coherence is computed, and therefore provide the user with a guide to decide if the specific way of operationalization is what they intended (i.e., if one is really measuring what one conceptualizes as incoherence). There are other studies that have illustrations to explain the methods beyond using just words and numbers (e.g., Hoffman et al., 2018). However, they often present the abstract principles behind the methods without providing illustrations of practical examples with real analyzed data (but notable exceptions include single-word analyses in verbal fluency studies, see e.g., Kim et al., 2019). The field will certainly benefit from coherence visualization software that can reliably and effectively demonstrate the resulting metrics "in situ" on transcripts or recordings. Such software efforts can generate a coherence plot for each and every datapoint in a study, ensuring complete transparency of the methods.
Such transparency will nurture trust from clinicians provided the metrics are valid representations of incoherence. In a similar manner to the way a radiograph of a traumatized bone can reveal a fracture, a true "coherograph" can demonstrate where coherence breaks down in an utterance or a discourse. Whether or not it is possible to reach such a level of specificity and sensitivity remains to be seen. The methods have proven effective at enabling detection of group differences (i.e., patients versus nonpatients) in speech coherence, but it is possible that pinpointing incoherent parts of speech will be much more challenging.
In summary, proper visualizations can enable researchers and clinicians to understand where the source of (in)coherence is in a given segment of language. This understanding is critical in the choice of which computational approach to harness for a given language analysis as the varied operationalizations measure different facets of potential incoherence. In the future, visualizations can assist researchers and clinicians in understanding the methods they employ for specific segments of speech and aligning various implementations with the construct they are seeking to analyze. For clinicians, a dashboard-like output that pinpoints sections of incoherent speech, with metrics on how the patient's output relates to clinical reference materials, can then aid in diagnosis and be a useful component of monitoring clinical states (see Fig. 28.2, p. 676, in Holmlund et al., 2020). The user interface of future clinical tools based on coherence metrics should therefore be created using established design principles and validated through interaction experiments (see Rundo et al., 2020). Clearly, alignment of methods for coherence metrics would be useful, although it might limit innovation. While there is increasing consensus on the need to create standardized practices in this relatively new field, a simple agreement on methodological nomenclature and principles (e.g., word-level, sentence-level, dual/single windows, window sizes) is presently missing and is an imperative first step towards a useful framework.

Problem 2: the temporal dimension - operationalizations of language coherence have been developed on transcriptions of speech and therefore are missing crucial temporal information
Wittgenstein (1953, section 108) noted broadly that "The 'use' of words is extended in time", and this extension in the temporal domain can provide important clues for how to improve computational metrics of incoherence. It has consequences for modeling speech, but importantly it means that transcripts of speech from a clinical setting are dissimilar to the reference material (i.e., training data for language models) since speech is not an identical process to writing. Indeed, spontaneous speech is typically quite different from written language in many critical respects. From a conceptual level, writing is typically the result of some advanced planning; writers are afforded the opportunity to formulate and revise their thoughts in a structured and coherent manner. Additionally, sentences and paragraphs represent easily delimited units of thought in writing. In contrast, speech occurs in "real time" where the timing of utterances reflects an unfolding thought process. Units of thoughts may flow from one to the next without obvious delimiters. While it is unclear exactly how this impacts coherence metrics, it is notable that "sentences" as defined by utterance length in spontaneous face-to-face conversations have been found to be much shorter (median of 5 words) compared to sentences in news broadcasts and political debates (median 12 and 16 respectively; Wiggers and Rothkrantz, 2007). Moreover, attempts to translate spontaneous speech into text format are generally quite difficult, with challenges from an excess of filler sounds such as "uh" and utterances abruptly terminated before full sentences are formed. For example, text relies heavily on standardized punctuation for segmentation.
In free speech, on the other hand, defining sentences can be difficult, and our own experience with dependency parsers (e.g., OpenIE; Angeli et al., 2015) is that they can fail in spectacular ways on real clinical language data. Models are known to have a hard time getting nested dependencies correct, particularly when they are long (Lakretz et al., 2021). The "gold standard" parsing, namely in human transcription, is not without challenges. This is because punctuation is a tool for increasing readability and "sentences" are, as such, a product of a subjective evaluation by the writer or transcriptionist. Indeed, with an informal browsing of transcription instructions one can get the impression that punctuation is often left up to the discretion of the transcriber's sense of style, although methods for automated punctuation do exist (e.g., Tilk and Alumäe, 2016).
Evaluating the temporal relationships between words can provide critical information about how thought processes are occurring. To illustrate this point with reference to Fig. 1, consider Panel A. Despite the high arcs illustrating less coherence between "cattle" and "soap", it is not clear when these words were uttered and how close they were to each other within the conversation. Fig. 2, on the other hand, shows these two words as part of the original utterance, spoken (in this case by the first author) and recorded. As a first step, the visualizations using timestamped words make explicit the temporal spans over which the semantic distance measurements (i.e., a 6-word window coherence metric) are computed. This elucidates the timescale within which the comparison method is relevant (e.g., are the items compared 1 s apart, or 20 s?), and how words and semantic concepts are inter-related. Temporal information can allow for segmentation approaches that avoid the limitations of forcing punctuation-based solutions onto spoken language. The timestamping of words further allows for new types of time series data, which can be analyzed with established signal processing methods (for a notable recent example of time-series analysis, see Xu et al., 2022). An important limitation of traditional moving window techniques defined by n number of words is demonstrated in Fig. 2, namely that a window size of, for example, 6 words can span dramatically different distances in the temporal domain. Such issues can be detected, visualized and understood only by adding in temporal information. Future methods might consider using temporally defined windows (e.g., words within a 2 s window, a relevant time-scale to capture delta-band electrophysiological activity related to language understanding, see e.g., Lo et al., 2022), and thus ground the "use of words" within a specific temporal framework.
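The contrast between word-defined and time-defined windows can be made concrete with a short sketch. The word onsets and offsets below are invented for illustration (they are not the timestamps of the recording shown in Fig. 2), and the two helper functions are hypothetical names; in practice the timestamps would come from an automatic speech recognition tool.

```python
def word_window_spans(words, size):
    """Duration in seconds covered by each moving window of `size` words."""
    return [words[i + size - 1][2] - words[i][1]
            for i in range(len(words) - size + 1)]

def time_windows(words, width):
    """For each word, the tokens whose onsets fall within `width` seconds of its onset."""
    windows = []
    for _, start, _ in words:
        windows.append([tok for tok, onset, _ in words
                        if start <= onset < start + width])
    return windows

# (token, onset in s, offset in s); invented timings with two pauses.
words = [
    ("they're", 0.00, 0.20), ("destroying", 0.20, 0.70), ("too", 0.70, 0.85),
    ("many", 0.85, 1.05), ("cattle", 1.05, 1.50), ("and", 1.50, 1.60),
    ("oil", 2.60, 2.90), ("just", 2.90, 3.10), ("to", 3.10, 3.20),
    ("make", 3.20, 3.40), ("soap", 4.40, 4.80),
]

spans = word_window_spans(words, size=6)   # seconds covered by each 6-word window
by_time = time_windows(words, width=2.0)   # words falling inside each 2 s window
```

In this toy utterance the 6-word windows cover anywhere from well under 2 s to more than 3 s, whereas the 2 s windows hold a varying number of words; which of the two should be held constant is a design choice that depends on the process the measure is meant to reflect.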
Fig. 2. The exact time that a word is uttered can be detected using automatic speech recognition tools, and when the same sentence is time stamped it is clear that words are not equally spaced in time, unlike their depiction in Fig. 1. This means that windows of a size of six words can have completely different temporal extents in a moving window procedure (see Supplementary material, Fig. S2, for an animated version of this process). The difference in sizes is well illustrated in the moving-window procedure, where both windows are 6 words, but the first window spans 1.5 s and another window spans 3 s. This is a challenge if procedures are to be connected to putative underlying physiological processes where neural activity is integrated over certain timespans.
While most language models process language as a sequence of words or parts of words, some recent developments in modeling hold the promise of true integration of a fuller range of temporal aspects of speech. Based on the architecture of the previously mentioned BERT model (Devlin et al., 2019), the HuBERT model directly processes audio waveform information rather than lexical information (Hsu et al., 2021). Such methods may in the future be able to capture the unique patterns and signals found in speech and massively improve the way quantitative tools can "listen" to speech in clinical settings. The new models may also aid in improving our ability to define when incoherent speech occurs, with output that alerts to temporally defined sections of recordings (e.g., in a 3 s window) that are incoherent with the previous utterances in a conversation. Interestingly, the approach taken with HuBERT has also been expanded to include both audio and video data in the same model (Shi et al., 2022), demonstrating that these powerful approaches can integrate information about various aspects of human behavior through time. As with all data-driven models, it will be crucial to have suitable training material. While the current HuBERT model is based on audiobooks (Hsu et al., 2021), it is plausible that models trained on spontaneous speech recordings will have more relevance for detecting patterns of incoherent speech in clinical settings. This, of course, is a matter to be evaluated by future experimental designs.

Problem 3: situational or contextual information is necessary to improve the sensitivity of coherence measurements in clinical settings
In conversation, a clinician has a clear sense of where the conversation is taking place (e.g., in a hospital ward versus an encounter in the street) and what the situation is (e.g., a serious admission interview versus a casual chat about the weather), and based on this information is able to form clear expectations of the language that will be produced in the current context. Recalling the philosophical standpoint that is foundational to the methods for quantification of word meaning, namely that "the meaning of a word is its use in the language" (Wittgenstein, 1953, section 43), there are important implications that contextual information has with respect to language usage and coherence. In short, people use words differently in different physical and sociocultural contexts. Indeed, by using a range of different coherence measures to examine cohorts of schizophrenia patients from three different countries, Parola et al. (2022) found generalizability to be limited across the languages, samples and measures. Obviously this has consequences for how language models should be built for clinical purposes, and for how those models should be utilized for increased sensitivity to signs of pathology in the speech of patients. If clinicians are to rely on and trust computational measurements of speech incoherence, then the ability to account for context will be crucial (see Fig. 3).
Currently, language models used for analyzing speech from psychiatric patients in the published literature are built on text from various sources, which may or may not represent the diversity of how language is used in varied situations, and this is a problem for the generalizability to clinical applications. The semantic space that is generated in a popular "off-the-shelf" pre-trained implementation of word2vec (available here: https://code.google.com/archive/p/word2vec/) is based on the "Google News dataset", which can serve as a powerful example. Words from news articles are typically used to convey information about issues of public interest, often with a broad geographic coverage, such as international politics. In this context, the word "oil" would most commonly be used in proximity to words such as "economy" and "pipeline". In contrast, in a psychiatric clinical setting, the words used are most often describing personal matters, such as symptoms or the history of an individual. In such a setting, the word "oil" would be less likely to be used, but if it was, it would be more likely to occur in a conversation about nutrition and co-occur with words like "soy" or "pasta". The result of such differences can serve to reduce the relevance of measurements of semantic distances, potentially leading to both over- and underestimation of speech coherence. Further, in the construction of semantic spaces from language material from a broad cross-section of society, cultural biases unfortunately are also included in the models. These data-driven approaches can ideally be considered "neutral" representations of how language is used, but it turns out that they may include undesirable biases when it comes to certain groups (e.g., increasing barriers for people with disabilities; Hutchinson et al., 2020). Indeed, it has been demonstrated that coherence measurements are sensitive to cultural biases in the datasets, and can end up perpetuating such biases (Hitczenko et al., 2021).
Fig. 3. Semantic distance can also be measured between an utterance and some external context. In this example, words within a window (size = 5) are examined for their distance to the question posed before the incoherent speech examined in Figs. 1 and 2. Here, instead of direct comparisons between individual words, each word's distance (orange arcs and line) to a vectorized "context" (orange box and ball), namely the question, is visualized. Higher arcs mean words are less connected to the conceptual content of the question, indicating incoherence in the form of tangentiality [Note: these two concepts are not fully overlapping and are separated by Andreasen, 1986]. Choosing a small window size increases the resolution with which one can assess tangentiality: A small window can, in theory, pinpoint the sections of speech that are not connected to the question. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
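The context-anchored measurement illustrated in Fig. 3 can be sketched as the distance between each moving window of a response and a single vector representing the question. The sketch makes several assumptions that are not taken from the paper: the question is vectorized as the mean of its word vectors, each window is likewise summarized by its mean, and the vectors themselves are random stand-ins rather than real embeddings. The function name `context_distances` is hypothetical.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def context_distances(response_vectors, question_vectors, size=5):
    """Distance from each moving window of the response to the question's mean vector."""
    context = np.mean(question_vectors, axis=0)
    return [cosine_distance(np.mean(response_vectors[i:i + size], axis=0), context)
            for i in range(len(response_vectors) - size + 1)]

# Random stand-ins: a 7-word question and an 11-word response in an 8-dimensional space.
rng = np.random.default_rng(1)
question = rng.normal(size=(7, 8))
response = rng.normal(size=(11, 8))

tangentiality = context_distances(response, question, size=5)
```

Swapping the semantic space that supplies the vectors (for example, a news-based space versus one adapted to clinical conversations) changes these distances without any change to the code, which is one way to examine how strongly a measurement depends on the reference corpus.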
Of course, models built on language with generic or wide-ranging topics do have their advantages. Importantly, they can cover a vast number of different words and topics. More context-specific models would be vulnerable to the occurrence of out-of-lexicon words, triggering results that are uninterpretable (or at least should not be interpreted). Such a problem can be counteracted by building hybrid models, where the "backbone" is built on a broad corpus, but the model for clinical application is fine-tuned to the specific place and situation in which it is to be applied. Such a fine-tuned approach holds promise of increased sensitivity in detecting incoherent speech, but it does come at the cost of decreased generalizability of methods. For example, a method developed and used at a rural clinic in one country will not be directly transferable and applicable in a more populous area, even among speakers of the same language within the same country.
Fig. 4. Merging temporal, spatial and semantic information is possible and opens up new opportunities for both research tools and clinical applications. This figure represents sound recordings (black audio waveforms) and illustrates how speech unfolds over a temporal axis, with temporal boundaries of individual words (derived from automatic speech recognition) marked as red boxes. On the vertical axis, information about the degree of semantic incoherence is exemplified with data from within-channel dual-window distances (black lines). Notable values in this domain can inform clinical decisions if properly operationalized. On the left-facing horizontal axis temporal information about word vectors can be expressed, and if such information is combined with physiological data (e.g., electrophysiological or magnetic resonance imaging data), it may increase our understanding of what unfolds in the brain at the time incoherent utterances are made. Real-time processing of speech can also allow for biofeedback approaches that alert to an impending breakdown of communication. On the right-facing horizontal axis the different speaker channels are placed on a spatial axis, indicating that the location (and ultimately the situation) of an utterance can be quantitatively determined in future systems. Combined, the temporospatial context can inform measurements of semantic coherence by making sure that the evaluation is relevant to the actual situation, not based on language from other contexts (e.g., what is common language in written news reports). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Methods that effectively and reliably incorporate contextual information for coherence measurements will need to fulfill several requirements. A first requirement would be extensive localized data collection, enabling language models tailored to the place and situation where they are to be used. A concrete example of clinical relevance could be recording speech from all clinical consultations conducted across psychiatric wards in a single city, and using the transcriptions to build novel models or improve existing models (e.g., by fine-tuning). Even if such models cannot be generalized to and used in other locations, the path towards robust clinical applications will depend on a uniform, consensus-based way of building the models from appropriate localized data. Establishing consensus around this will take substantial effort, but there are already consortium approaches to standardize speech data collection for psychosis research (e.g., the 'Discourse in Psychosis' consortium, https://discourseinpsychosis.org/; see also Palaniyappan et al., 2022). A second requirement would be a carefully constructed procedure for how to define and quantify the most important characteristics of the temporospatial context within the situation. That is to say, it will be critical to ascertain which aspects of a situation (e.g., location, time of day, weather, current events, and so on) are necessary to consider as potential influences on a patient's language output. Incorporating those aspects into subsequent quantitative analyses will require new types of multimodal models. Important steps have recently been taken towards that end, for example with Google's Pathways Language Model (PaLM; Chowdhery et al., 2022), which has been trained on a very broad dataset of language production in digital form across languages and contexts (e.g., from conversations, books or computer code).
Even more inclusive in terms of domains included in the training material, DeepMind's recent large transformer model, Gato, includes data from both language and images, and even certain quantifications of actions of robotic arms or items in computer games, all represented and modeled in a combined fashion across domains (Reed et al., 2022). Such a multimodal approach has obvious appeal for modeling human behavior and clinical assessments, where visual appearance, movement patterns and sounds are all important (Holler and Levinson, 2019). In short, to be able to capture how speech is incoherent in a clinical context a "world model" is needed, not just a language model. For these new powerful models to have applications in psychiatry, the crux will be to find ways for the resulting word (or audio, visual) embeddings to quantify and express the conceptual leaps or abnormal behavioral signs that best capture a clinical disorder.
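To make the combination of temporal, spatial and semantic information in Fig. 4 more concrete, the following Python sketch attaches a time stamp and a speaker channel to each word vector and computes a within-channel dual-window distance at each word boundary. It is an illustration under stated assumptions, not the procedure used to produce the figure: the window size, the use of mean-pooled vectors, the cosine distance, and the names `TimedWord`, `dual_window_distances` and `by_channel` are all choices made here for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class TimedWord:
    word: str
    start: float   # word onset in seconds, e.g. from speech recognition output
    end: float     # word offset in seconds
    channel: str   # speaker channel label (stands in for the spatial axis)
    vector: list   # word embedding; toy values in any example data


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def dual_window_distances(words, window=2):
    """Compare the `window` words before each position with the `window`
    words starting at it; return (onset time, distance) pairs."""
    out = []
    for i in range(window, len(words) - window + 1):
        before = mean_vector([w.vector for w in words[i - window:i]])
        after = mean_vector([w.vector for w in words[i:i + window]])
        out.append((words[i].start, cosine_distance(before, after)))
    return out


def by_channel(words, window=2):
    """Group words by speaker channel, order them in time, and compute
    the dual-window distances separately within each channel."""
    channels = {}
    for w in sorted(words, key=lambda item: item.start):
        channels.setdefault(w.channel, []).append(w)
    return {name: dual_window_distances(ws, window) for name, ws in channels.items()}
```

Each returned pair places a semantic distance at a point in time for one speaker channel, which is the kind of record that could later be aligned with physiological recordings. A channel with fewer than two full windows of words simply yields an empty list.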

Concluding remarks
The problems with defining incoherence presented in this paper showcase how future study designs and clinical tools will necessarily need exact definitions of coherence that optimally capture the interests of an investigator, whether they are a clinician, a computational linguist, a neuroscientist or a computer scientist. Importantly, it needs to become clear why a coherence measurement is conducted and what neuropsychological constructs or physiological processes the coherence measurements are representing (Foltz et al., 2022). Beyond study design, it will also be necessary and pragmatic to build language models that are based on temporally and contextually appropriate methods. For example, using models based on text from news feeds to define coherence in conversational speech during a psychiatric interview will be problematic. Transparency of how methods are producing results will be key in this regard, and information should be explicit for all users and stakeholders. Clinical relevance will be improved when methods can localize where and when coherence breaks down in speech, and this will increase understanding of disease in brain processes and the nature of language production. Ultimately these coherence metrics can be a part of larger systems for monitoring mental states in patients with schizophrenia. By unifying temporal, spatial and semantic information into the same framework (Fig. 4), and by exploring and defining the "distances" that are most relevant to describe pathological conditions in humans, progress is possible. This is in line with another, less famous quote from the philosophical foundation of computational semantic analysis: "We are talking about the spatial and temporal phenomenon of language, not about some non-spatial, non-temporal phantasm." (Wittgenstein, 1953, section 108).
Looking ahead, a possible futuristic scenario might be to leverage these temporally precise and contextually relevant measures of semantic incoherence for clinical intervention purposes. For example, it is conceivable that fast computation and real-time estimations of coherence could be a core component in biofeedback alerting speakers to increased disorganization in conversations, akin to more established audio-visual biofeedback solutions used in speech development and misarticulation (e.g., Byun and Hitchcock, 2012). Techniques for interventions aimed at short-timescale events such as phoneme utterances may not be directly transferable to longer and more complex events such as the entire discourse itself, but this technological approach could help pinpoint when and where communication is likely to break down. This may well prove to be useful for both patients and clinicians if developed carefully. Other variations of this approach may also be useful. However, whether it will prove useful, or even harmful, is a matter for rigorous examination in controlled clinical trials.
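As a purely illustrative sketch of what real-time estimation could look like computationally, the Python class below keeps a rolling buffer of the most recent word vectors and reports a distance between the older and newer half of the buffer each time a word arrives. The class name `CoherenceMonitor`, the window size and the alert threshold are hypothetical; in particular, the threshold is an arbitrary placeholder and has no clinical validation.

```python
import math
from collections import deque


def _mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def _cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


class CoherenceMonitor:
    """Rolling comparison of the older and newer half of recent word vectors."""

    def __init__(self, window=2, threshold=0.8):
        self.window = window
        self.threshold = threshold  # placeholder value, not clinically validated
        self.recent = deque(maxlen=2 * window)  # oldest vectors drop out automatically

    def push(self, vector):
        """Add the newest word vector. Return None while the buffer is filling,
        otherwise (distance, alert) where alert is True above the threshold."""
        self.recent.append(vector)
        if len(self.recent) < 2 * self.window:
            return None
        items = list(self.recent)
        older = _mean(items[:self.window])
        newer = _mean(items[self.window:])
        distance = _cosine_distance(older, newer)
        return distance, distance > self.threshold
```

Because the buffer has a fixed length, the cost per incoming word stays constant, which is what makes this kind of computation plausible in real time. Whether and how such an alert should ever be shown to a speaker is exactly the question that the clinical trials mentioned above would need to address.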

Role of the funding source
The funding source had no role in this publication.

Declaration of competing interest
None of the authors report conflicts of interest.