Text mining to improve screening for trauma-related symptoms in a global sample



Introduction
After experiencing a stressful event such as a serious injury, sexual violence, or a life-threatening situation, including an intensive care admission due to coronavirus disease 2019 (COVID-19), some people develop persistent mental health problems. Common problems include posttraumatic stress disorder (PTSD; de Vries and Olff, 2009), depression (Breslau et al., 2000), and sleeping problems (Milanak et al., 2019). Trauma-related disorders tend to persist for years when untreated (Kessler et al., 2017) and are related to many adverse outcomes such as unemployment, reduced quality of life, and suicidal behavior (Kessler et al., 2010, Pagotto et al., 2015). Trauma and trauma-related disorders also affect long-term physical health, including cardiovascular disease and obesity (Kuhar and Kocjan, 2021). When accurately identified, many trauma-related disorders are effectively treatable with psychotherapy within a short timespan (Cuijpers et al., 2013, Mavranezouli et al., 2020, Trauer et al., 2015). Numerous studies have shown that the recent COVID-19 pandemic is related to increased prevalence rates of trauma-related disorders in healthcare workers (Marvaldi et al., 2021) and the general population (Arora et al., 2020). Since the pandemic (in)directly affects a large proportion of the population, many are at risk of developing trauma-related disorders. Hence, efficiently and accurately identifying those who (are likely to) suffer from trauma-related disorders is crucial to provide treatment and prevent other adverse outcomes. Considering trauma as a public health issue (Denckla et al., 2020, Magruder et al., 2017) emphasizes the need for accurate and easy screening.
Short screening instruments have been developed, such as the Global Psychotrauma Screen (GPS; Olff et al., 2020), which screens for many trauma-related disorders and provides a total trauma-related symptom score. Although the GPS is more efficient than a clinician-administered interview, developing even shorter screening instruments that do not require a symptom inventory might increase screening efficiency and lower the burden for respondents, especially when confronted with situations such as COVID-19. One possibility to improve the screening process for trauma-related disorders that showed promise in previous studies is the use of text mining techniques (He et al., 2012, 2019). The idea of these techniques is to use the textual information respondents provide to predict their trauma-related symptoms and disorders. Textual information provided by respondents can be processed to extract language features, which may consist of the proportion of words referring to specific topics naturally emerging from the data, counts of specific words (or combinations of consecutive words, i.e., n-grams), as well as features reflecting affective, social, cognitive, and perceptual processes as obtained using existing lexicons (e.g., the LIWC lexicon; Pennebaker et al., 2007). By design, many questionnaires assessing trauma-related disorders already include an open-ended question that asks participants to describe the stressful event currently affecting them the most.
Usually, this question is only used to establish whether the stressful event meets the 'A-criterion' for PTSD (APA, 2013). However, the language respondents use to describe the stressful experience may provide essential information predictive of their symptoms. For instance, many trauma-related symptoms directly refer back to this event (e.g., avoidance of triggers related to the event or emotional reactions related to reminders of the event). These symptoms might therefore be reflected in the response of participants when asked to describe this event (APA, 2013).
Previous studies have shown that textual information from self-statements can be used to screen for trauma-related disorders such as PTSD (He et al., 2012, 2019). When combined with a few other relevant predictors of trauma-related symptoms such as item-based measures, this method might lead to efficient screening methods (He et al., 2019). In the current study, we use this method in a novel way by applying it to the text respondents used to describe the stressful event they experienced.

Current study
In the present study, we sought to determine the relevance of a set of language features extracted from textual descriptions of stressful events to predict trauma-related symptoms in a large global population. We also compared the relevance of language features for the purpose of prediction with that of self-reported demographics, event characteristics, and risk factors for trauma-related symptoms. Textual and self-reported information were assessed by administering an online survey including a validated instrument screening for a range of trauma-related psychological problems, namely the Global Psychotrauma Screen (GPS symptom score; Olff et al., 2020). We predicted trauma-related symptoms using a machine learning approach, consisting of the cross-validation of predictive models performed using three different algorithms: Elastic-Net, Random Forest, and a stacking ensemble of these algorithms (i.e., we use a "super learner" approach; van der Laan et al., 2007). We chose a machine learning approach instead of a standard parametric method because machine learning methods can help to identify hidden interactions and non-linearity among features in predicting the outcome, and because they can help reduce the risk of overestimating prediction performance (i.e., overfitting; for a review of applications of machine learning in studying PTSD and other trauma-related disorders, see Ramos-Lima et al., 2020). Using these methods, we examined the predictive power of the following sets of features, either alone or in combination with each other: (1) language features extracted from the open-ended section of the screener, (2) demographics, (3) event characteristics, and (4) risk factors for trauma-related symptoms. Finally, we evaluated the contribution of these features for the purpose of creating a screening procedure for detecting a probable PTSD diagnosis, assessed by splitting the sample according to an existing cut-off for PTSD. We also checked the role of the length of the description of the events provided by participants in improving or worsening the prediction accuracy.

Sample and procedure
We used data from the Global Psychotrauma Screen - Cross Cultural responses to COVID-19 study (GPS-CCC; www.global-psychotrauma.net/covid-19-projects). The main aim of the GPS-CCC study was to compare COVID-19-related stressful events with other types of stressful events around the globe (see main publication by Olff et al., 2021). For this study, 6114 participants from 88 countries were recruited during the COVID-19 pandemic between the 21st of March and the 3rd of September in 2020 through online advertisements from members of the Global Collaboration on Traumatic Stress (www.global-psychotrauma.net). Participants were included in the present study if they were years old or older (n excluded = 16) and if they provided an interpretable written text (n excluded = 1050), resulting in a final sample of participants (Mean age = 37.82 ± 13.3; Gender: 3766 female, male, 30 other). Currently, there are no ethical guidelines on where global web-based research should be reviewed and whether ethical approval is needed from all countries represented in the study (Looijmans et al., 2022). For studies involving interventions or physical measurements, it is recommended to apply for ethical review in all participating countries, while this is less obvious for online questionnaire studies (Looijmans et al., 2022). Since the current study only involved a brief anonymous online survey and did not include any intervention or physical measurements, we submitted our proposal to the Medical Ethical Review Committee of the Academic Medical Center Amsterdam, which exempted this study from formal review (W19_481 # 19.556).
The survey was provided online in 21 languages after consent of participants. The country of the participant's IP address was used to show the survey in the correct language automatically, but this could also be changed manually by the participants. The survey started with demographic questions about the gender, age, and country of residence of participants. Then, an open text field was presented where respondents were asked to describe the stressful event that currently affected them the most. This was followed by several questions about event characteristics (COVID-19-relatedness, work-relatedness, time since the event took place, whether the event was a repeated/prolonged event and whether it involved physical violence, sexual violence, emotional abuse, serious injury, life-threatening situations, sudden death of a loved one, and/or the respondents causing harm to others). Please note that only information about work-relatedness (Coded as: 0 = missing; 1 = not work-related event; 2 = work-related event), time since the event took place (Coded as: 0 = missing; 1 = within past month; 2 = one month to half a year ago; 3 = half a year to a year ago; 4 = longer than one year ago), and whether the event was a repeated/prolonged event (Coded as: 0 = missing; 1 = single event; 2 = multiple events/prolonged event) was considered in the present study. Finally, participants filled out the Global Psychotrauma Screen (GPS) questionnaire about the stressful event they had previously indicated as currently affecting them the most.

Global Psychotrauma Screen (GPS)
The GPS consists of 22 items about trauma-related symptoms and risk factors with a binary (yes versus no) answer format (Olff et al., 2019). Trauma-related symptoms in the past month are assessed with 17 GPS items and can be summed in a GPS symptom score (range 0-17), with higher scores indicating more severe symptoms. These items include symptoms of PTSD (5 items), disturbances in self-organization (2 items), generalized anxiety disorder (2 items), depressive disorder (2 items), insomnia (1 item), self-harm (1 item), dissociation (2 items), substance abuse (1 item), and other stress-related problems (1 item). A GPS symptom score cut-off of 8 provided optimal sensitivity and specificity for probable PTSD in previous validation studies (Frewen et al., 2021). The risk factors of the GPS include the occurrence of other stressful events in the past month, lack of social support in the past month, childhood trauma, psychiatric history, and lack of psychological resilience. Previous validation studies indicated that the GPS is a valid and reliable instrument for trauma-related symptoms (Frewen et al., 2021, Oe et al., 2020, Rossi et al., 2021). In the present sample, Cronbach's alpha indicated adequate internal consistency of the GPS symptom score (α = .88). Descriptive statistics for the GPS symptom score and risk factor variables are provided in Table S1 in the supplementary material.
D. Marengo et al.

Statistical analyses

Language data quality
As noted above, participants taking the survey were asked to provide a written description of the stressful event that currently affected them the most. Participants' written responses were then analyzed to extract information that could predict trauma-related symptoms. For example, we might expect the use of the term "abuse" to show a positive association with participants' current trauma-related symptoms, regardless of the language used, and thus be useful for prediction purposes. However, the automated methods we used to extract language features can only provide meaningful results when texts are presented in one common language (e.g., the word "abuso", which is the Italian equivalent of the English word "abuse", would be detected and counted as a different word if not translated). Because the majority of collected texts were written in English, we chose English as the common language framework, thus requiring non-English texts to be translated into English. For all non-English texts, automatic translations were performed using the googletranslate function available in the paid Google Workspace professional suite linked to an institutional account belonging to one of the authors. Please note that the Google Workspace platform is compliant with the EU's General Data Protection Regulation (GDPR), meaning that data confidentiality is enforced on data processed through its translation services. Google makes it clear that "they will not make the content of the text that you send available to the public, or share it with anyone else, except as necessary to provide the Cloud Translation API service"; furthermore, they will not use it for model training purposes (https://workspace.google.com/learn-more/security/security-whitepaper/page-6.html#limited-data). Then, authors with the original language as primary language (n = 19) checked and rated all translations, corrected the translations when needed, and rated the corrected translations. Translations were rated based on the global evaluation parameters scale (Toledo Báez, 2010). This scale includes a score from 1 (transmission is mostly incoherent) to 5 (transmission is equal to one of an experienced professional translator). The automatic translations by Google Translate were, on average, scored 2.34 (SD = .75), which indicated that manual supervision was needed. After corrections, the translations were scored 4.37 (SD = .83), which corresponds to satisfactory translations. After these corrections, the text fields included on average 9.21 words (SD = 17.21; Range = 1-320; Median = 4).

Feature extraction
In order to extract information that could predict trauma-related symptoms from the text participants provided in the open text field, we extracted two types of features: closed- and open-vocabulary features. Closed-vocabulary analysis was performed on the participants' textual description of stressful events with the English version of the Linguistic Inquiry and Word Count (LIWC) dictionary. For the purpose of this study, we employed the 2007 LIWC dictionary (Pennebaker et al., 2007), consisting of 64 theory-based dictionary categories allowing for the scoring of documents based on the number of words reflecting affective, social, cognitive, and perceptual processes, as well as use of function words such as pronouns, articles, conjunctions, verbs, and adverbs (for the complete list, see Pennebaker et al., 2007). For each category, a score is computed as the ratio of the number of words belonging to the specific category to the total number of words in the document.
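The category scoring described above can be illustrated with a minimal sketch. It assumes a hypothetical two-category toy lexicon (the actual LIWC 2007 dictionary is proprietary, much larger, and supports word-stem matching, which is not reproduced here); the function name and tokenization rule are illustrative assumptions rather than the procedure used in the study.

```python
import re

# Hypothetical mini-lexicon for illustration only; not the LIWC 2007 dictionary.
TOY_LEXICON = {
    "negemo": {"afraid", "hurt", "abuse", "scared"},
    "negate": {"not", "no", "never"},
}

def category_scores(text, lexicon=TOY_LEXICON):
    """Return, per category, the number of matched words divided by total words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        # An empty description gets a score of zero in every category.
        return {name: 0.0 for name in lexicon}
    return {
        name: sum(tok in words for tok in tokens) / len(tokens)
        for name, words in lexicon.items()
    }

# Seven words, one match per category, so both scores equal 1/7.
scores = category_scores("I was afraid and did not sleep")
```

Because scores are proportions, a one-word description that matches a category receives the maximum score of 1 for it, which is worth keeping in mind given that the median description in this sample was four words long.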
Open-vocabulary analyses were performed on the textual descriptions of the events and included both topic model analyses and the extraction of n-gram features. First, we implemented a topic model analysis using a Latent Dirichlet Allocation (LDA) model. Before running the analyses, we converted all text to lowercase and removed English 'stopwords' (i.e., very frequent words with low specificity), punctuation, and numbers. No stemming was applied to the original text (i.e., words were left unchanged as opposed to being reduced to their word stem, or root form). In order to identify the optimal number of topics, we trained a set of competing LDA models with the following k numbers of topics: 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50. Model training was performed on a random split including 90% of the sample, while 10% were used for model validation. The performance of the competing LDA models was compared by computing the perplexity statistic on the validation set (Wallach et al., 2009); the optimal number of topics was selected using the heuristic approach proposed by Zhao et al. (2015), which is based on examination of the rate of perplexity change (RPC) across LDA models. The coherence of LDA-derived topic-word associations was also examined visually using word clouds. Eventually, based on the RPC heuristic procedure and semantic coherence (as assessed using word clouds), we selected k = 30 as the optimal number of topics for this dataset. Word clouds for the extracted topics are shown in the supplementary material (Figs. S1-S30). As the last step, the selected model was applied to all available documents to generate the topic proportion scores. LDA analyses were performed using the Mallet software, version 2.08 (McCallum, 2002).
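As a rough illustration of the RPC heuristic, the sketch below computes the absolute change in held-out perplexity per added topic between consecutive candidate models and returns the first candidate whose RPC is smaller than that of the next one. This is our reading of the rule in Zhao et al. (2015) and should be treated as an assumption; the perplexity values are made up, and the function names are hypothetical. In the study, the final choice also relied on visual inspection of word clouds, which no such function captures.

```python
def rate_of_perplexity_change(topic_counts, perplexities):
    """RPC_i = |(P_i - P_{i-1}) / (k_i - k_{i-1})| for consecutive candidates."""
    rpc = []
    for i in range(1, len(topic_counts)):
        delta_perplexity = perplexities[i] - perplexities[i - 1]
        delta_topics = topic_counts[i] - topic_counts[i - 1]
        rpc.append(abs(delta_perplexity / delta_topics))
    return rpc

def select_num_topics(topic_counts, perplexities):
    """Return the first k whose RPC is smaller than the RPC of the next candidate."""
    rpc = rate_of_perplexity_change(topic_counts, perplexities)
    # rpc[j] belongs to topic_counts[j + 1]
    for j in range(len(rpc) - 1):
        if rpc[j] < rpc[j + 1]:
            return topic_counts[j + 1]
    return topic_counts[-1]  # no change point found: fall back to the largest k

# Made-up held-out perplexities, for illustration only (not the study's values).
ks = [5, 10, 15, 20, 25]
ppl = [900.0, 800.0, 750.0, 745.0, 725.0]
best_k = select_num_topics(ks, ppl)
```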
Extraction of n-grams was performed using the StringToWordVector function available in the WEKA software (Hall et al., 2009). Before running the analyses, we converted all text to lowercase and removed numbers, while punctuation marks were treated as word delimiters. We extracted n-grams ranging from 1 to 3 words that were present in at least 1% of event descriptions (about 51 descriptions). For each event description, counts were obtained for 186 n-grams. For all features extracted from the text description of the stressful event, descriptive statistics are reported in Table S1 in the supplementary material.
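The n-gram step can be sketched in plain Python as follows. This is not the WEKA StringToWordVector filter itself but a simplified stand-in that follows the preprocessing described above (lowercasing, removing numbers, treating punctuation as delimiters, keeping 1- to 3-grams above a document-frequency threshold); all function names and the example texts are hypothetical.

```python
import re
from collections import Counter

def ngrams(text, n_max=3):
    """Lowercase, drop digits, split on non-letters, return all 1..n_max-grams."""
    text = re.sub(r"\d+", " ", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def build_vocabulary(docs, min_doc_share=0.01):
    """Keep n-grams that occur in at least min_doc_share of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(ngrams(doc)))  # count each n-gram once per document
    threshold = min_doc_share * len(docs)
    return sorted(g for g, df in doc_freq.items() if df >= threshold)

def count_vector(doc, vocabulary):
    """Count how often each vocabulary n-gram occurs in one document."""
    counts = Counter(ngrams(doc))
    return [counts[g] for g in vocabulary]

# Toy corpus; a 50% threshold is used only because the corpus is tiny.
docs = ["I did not have help", "did not sleep", "car accident"]
vocab = build_vocabulary(docs, min_doc_share=0.5)
vector = count_vector("did not, did not", vocab)
```

With the study's settings, `min_doc_share=0.01` on roughly five thousand descriptions corresponds to the threshold of about 51 descriptions mentioned in the text.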

Associations of self-report and language features with the GPS symptom score
Pearson's correlation coefficient was used to examine associations between trauma-related symptoms (indicated by the GPS symptom score) and features extracted from the textual description of the event and from the self-report section of the GPS screener. P-values were adjusted using a Bonferroni correction (based on the total number of examined features) to correct for multiple testing. Please note that categorical predictors (i.e., indicators of gender and event characteristics) were recoded in the form of dummy variables before performing quantitative analyses.
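A minimal sketch of this step is given below. The Bonferroni adjustment multiplies each p-value by the number of tests and caps the result at 1. The p-value itself is obtained here through the large-sample Fisher z approximation, which is an assumption made to keep the example dependency-free; the study does not state how p-values were computed, and a t-based test would be the more usual choice in small samples. Names and data are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation computed from deviations around the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def bonferroni_p(x, y, n_tests):
    """Return (r, Bonferroni-adjusted two-sided p) using a Fisher z approximation."""
    r = pearson_r(x, y)
    r_clamped = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = atanh(r_clamped) * sqrt(len(x) - 3)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return r, min(1.0, p * n_tests)

# Made-up feature values and symptom scores.
feature = [1, 2, 3, 4, 5, 6]
symptoms = [2, 1, 4, 3, 6, 5]
r, p_single = bonferroni_p(feature, symptoms, n_tests=1)
_, p_adjusted = bonferroni_p(feature, symptoms, n_tests=100)
```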

Using machine learning to predict the GPS symptom score
The relationship between language features and the GPS symptom score was examined separately for each specific subset of language features, namely LIWC, n-grams, topics, and open vocabulary (topics + n-grams), and in a model combining all language features. In order to determine the relative contribution of language features, demographics, event-related characteristics, and risk factors in predicting the GPS symptom score, we tested the models including demographics, event-related characteristics, and risk factors either alone or in combination with all the language features. First, we examined five models assessing the predictive power of the following language features: (1) LIWC; (2) n-grams; (3) topics; (4) open vocabulary (topics + n-grams); and (5) all language features. Next, for each set of self-report variables (i.e., demographics, event-related characteristics, risk factors, and all self-report information combined), we examined the following models: the model including the specific set of self-report data as features, and the models combining the specific set of self-report data with the five models based on (sub)sets of language features.
The analyses were performed using iterative splitting of the dataset into training and test sets, including respectively 90% and 10% of observations. The training sets were used to train the predictive models, while the test sets were used to evaluate the accuracy of predictions on unseen observations. In order to improve generalizability of results to unseen observations, the splitting procedure and all analyses were repeated 10 times on different train/test splits, and then results were averaged across the 10 splits. Each of the 10 splits was obtained using the partition() function of the splitTools library in R (Mayer, 2020), which allowed us to generate samples for the train and test sets that were stratified according to the distribution of the GPS symptom score in the overall sample.
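The repeated stratified splitting can be sketched as follows. This is a plain-Python stand-in, not the splitTools partition() function: it stratifies on each distinct score value, which is one simple way to preserve the outcome distribution and is an assumption on our part. The scores are made up.

```python
import random
from collections import defaultdict

def stratified_split(scores, test_share=0.10, seed=0):
    """Return (train_idx, test_idx), drawing the test share within each score value."""
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for idx, score in enumerate(scores):
        by_score[score].append(idx)
    train, test = [], []
    for score in sorted(by_score):
        idxs = by_score[score][:]
        rng.shuffle(idxs)
        n_test = round(test_share * len(idxs))
        test += idxs[:n_test]
        train += idxs[n_test:]
    return sorted(train), sorted(test)

# Made-up GPS-like scores (0-17), 100 observations per score value.
scores = [i % 18 for i in range(1800)]
# Ten different 90/10 splits, one per random seed.
splits = [stratified_split(scores, seed=s) for s in range(10)]
```

Each split would then be used to fit the models on the training indices and evaluate them on the test indices, with the resulting metrics averaged over the ten repetitions.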
Predictive analyses were implemented in the training set using Elastic-Net regression (Zou and Hastie, 2005), Random Forest (Breiman, 2001) regression, and a combination of these models using a stacking ensemble (for a review, see Džeroski and Ženko 2004). The Elastic-Net algorithm is a generalization of linear regression that allows for both parameter regularization and variable selection, thus reducing the risk of overfitting. The Elastic-Net algorithm can be seen as a combination of L1 regularization (i.e., LASSO regression) and L2 regularization (i.e., ridge regression), in which the alpha parameter can be used to weight the contribution of the L1 and L2 regularization, while the lambda parameter controls the amount of parameter shrinkage. For the purpose of this study the alpha parameter was set at 0.001 (i.e., the lowest alpha value allowed in the employed implementation of the Elastic-Net algorithm), while an automated, 10-fold cross-validated search was implemented for the lambda parameter based on an automatically generated sequence of 100 values. By fixing the alpha value to .001, the Elastic-Net model was forced to behave similarly to ridge regression. In this way, the algorithm performed both parameter shrinkage and feature selection, but the latter was kept to a minimum. Our choice is related to the need to avoid the major pitfall linked with using large alpha values, which is the risk of underfitting due to the discarding of most features. Additionally, the uncontrolled removal of many features across models would have been problematic because we were interested in comparing the relative importance of different sets of features, or combinations of them, for the prediction of the GPS symptom score.
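For reference, the role of the two tuning parameters can be made explicit by writing out the Elastic-Net objective. The form below assumes the widely used glmnet-style parameterization; the exact scaling constants of the implementation employed in the study may differ, so this should be read as an illustration rather than a specification.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  + \lambda\Bigl[\alpha\,\lVert\beta\rVert_1
  + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr]
```

Under this parameterization, alpha = 1 gives LASSO regression and alpha = 0 gives ridge regression. With alpha = 0.001, the L1 term receives a weight of 0.001 lambda and the L2 term a weight of 0.4995 lambda, which is why the fitted model behaves almost like ridge regression and discards very few features.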
The Random Forest algorithm combines multiple decision trees for the prediction of both continuous and categorical outcomes, and uses averaging to improve the predictive accuracy and control overfitting. Each decision tree is created using randomly generated samples of the training data. Being based on decision trees, the Random Forest algorithm goes beyond linear regression algorithms by allowing for the automated modelling of potential interactions and non-linearity among features in predicting the outcome. For the purpose of the present study, the Random Forest algorithm was used to perform regression of the GPS continuous score based on 100 trees, with unlimited depth of trees, and based on a random selection of features.
Finally, the stacking ensemble was implemented by combining the Elastic-Net and Random Forest algorithms using a linear regression algorithm with no parameter regularization as a meta-learning algorithm. Using this approach, the meta-learning algorithm combines the predictions from the two machine learning algorithms (i.e., Elastic-Net and Random Forest), possibly improving over the performance of the single algorithms in the ensemble.
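The meta-learning step can be sketched as an ordinary least-squares regression of the observed score on the base-model predictions. The example below is a dependency-free illustration with made-up numbers, not the WEKA Stacking implementation; the two prediction lists merely stand in for held-out predictions from the Elastic-Net and Random Forest models, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_meta_learner(base_preds, y):
    """Unregularized least squares of y on an intercept plus the base predictions."""
    rows = [[1.0] + list(p) for p in zip(*base_preds)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def stack_predict(weights, base_preds):
    """Combine base predictions using the fitted intercept and weights."""
    return [weights[0] + sum(w * p for w, p in zip(weights[1:], ps))
            for ps in zip(*base_preds)]

# Made-up held-out predictions from two base models; the target is constructed
# as their average plus 1, so the meta-learner should recover (1, 0.5, 0.5).
enet = [2.0, 5.0, 7.0, 9.0, 12.0]
forest = [3.0, 4.0, 8.0, 8.0, 13.0]
y = [0.5 * a + 0.5 * b + 1.0 for a, b in zip(enet, forest)]
w = fit_meta_learner([enet, forest], y)
stacked = stack_predict(w, [enet, forest])
```

In practice the meta-learner should be fitted on predictions for observations that the base models did not see during training; otherwise the weights tend to favor whichever base model overfits the most.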
The performance of the trained models was evaluated on the test sets. Accuracy of predictions in the test sets was determined with the correlation between observed and predicted GPS symptom scores and the mean absolute error (MAE) in the predictions, computed as the mean of the absolute differences between observed and predicted scores. The average correlation and MAE values as computed across the test sets are reported as an indication of overall accuracy of the algorithms in predicting the GPS symptom score. Prediction analyses were performed using the Elastic-Net, Random Forest, and Stacking algorithm implementations available through the RWeka package (Hornik et al., 2021) for R.
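The MAE and its averaging over test sets amount to the following; the observed and predicted values are made up for illustration, and the function name is hypothetical.

```python
def mean_absolute_error(observed, predicted):
    """Mean of the absolute differences between observed and predicted scores."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

# Made-up (observed, predicted) GPS symptom scores for two test sets.
test_sets = [
    ([3, 8, 12], [4.0, 7.0, 10.0]),
    ([0, 5, 17], [1.0, 5.0, 14.0]),
]
per_split = [mean_absolute_error(obs, pred) for obs, pred in test_sets]
average_mae = sum(per_split) / len(per_split)
```

Since the GPS symptom score ranges from 0 to 17, an MAE of about 3 means that predictions are off by roughly three symptoms on average.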

Testing the feasibility of a procedure for the classification of clinically relevant symptoms
As a final analytical step, we evaluated the contribution of the extracted features for the purpose of creating a screening procedure for detecting clinically relevant trauma-related symptoms, defined as a GPS symptom score of 8 or higher, indicating probable PTSD (Frewen et al., 2021). More specifically, we examined whether the predictive models based on language features alone or in combination with self-report information (i.e., demographics, event-related characteristics, risk factors, and all self-report information combined) could accurately identify those with probable PTSD. Hence, we evaluated the following models: (1) the model including all language features; (2) the model based on self-report demographics and all language features; (3) the model based on event-related characteristics and all language features; (4) the model based on risk factor variables and all language features; (5) the model including all self-report measures and language features. To achieve this aim, in each generated sample fold and for each examined model, we computed the area under the receiver operating characteristic curve (AUROC) as a measure of association between the predicted GPS (continuous) scores and the binary classification of individuals in PTSD risk groups, obtained by recoding the observed GPS (continuous) scores using the aforementioned cut-off. To visualize this association, we also generated the ROC curve plots for the folds reporting the best and worst overall classification performance.
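The AUROC computation can be sketched through its rank interpretation: the probability that a randomly chosen participant with probable PTSD receives a higher predicted score than a randomly chosen participant without, with ties counted as one half. The pairwise implementation below is a simple illustration, not the routine used in the study; only the cut-off of 8 comes from the text, and the scores are made up.

```python
def auroc(predicted_scores, is_case):
    """Share of case/non-case pairs in which the case has the higher predicted score."""
    cases = [s for s, c in zip(predicted_scores, is_case) if c]
    controls = [s for s, c in zip(predicted_scores, is_case) if not c]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

CUTOFF = 8  # GPS symptom score cut-off for probable PTSD, as stated in the text

observed = [2, 9, 11, 5, 8, 1]               # made-up observed GPS scores
predicted = [3.1, 7.4, 9.8, 7.4, 6.0, 2.2]   # made-up continuous predictions
labels = [score >= CUTOFF for score in observed]
auc = auroc(predicted, labels)
```

Note that the predicted scores are never dichotomized: the AUROC summarizes performance over all possible thresholds on the predicted score, which is why no classification threshold has to be chosen at this stage.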
As an additional aim, we evaluated the performance of these models by also comparing their accuracy among groups of participants reporting different numbers of words when describing the event. For this aim, we used a median split (Median = 4 words) to distinguish the group of participants reporting short descriptions (number of words ≤ 4) from participants reporting longer descriptions (number of words ≥ 5).
In accordance with Youngstrom (2014), we considered AUROC values ≥ .90 as showing excellent accuracy, values ranging between .80 and .89 as showing good accuracy, values ranging between .70 and .79 as fair accuracy, and values ranging between .60 and .69 as poor accuracy.

Bivariate associations of self-report and language features with the GPS symptom score
Language features significantly related to GPS symptom scores with a minimal correlation coefficient of |r| = .10 are shown in Fig. 1 using a word-cloud visualization, while correlations between the GPS symptom scores and all the study measures are reported in Table S2 in the supplementary material. For ease of interpretation, in Fig. 1 features reporting a positive correlation are rendered in red, while those showing negative correlations are rendered in blue. The language features most strongly positively associated with GPS symptom scores were topic 23 (r = .17, p < .01; top words: abuse, violence, assault, sexual, emotional, childhood), followed by the Negative emotions LIWC category (r = .17, p < .01), the Affect LIWC category (r = .16, p < .01), and the "abuse" n-gram (r = .16, p < .01). Other features showing positive correlations with the GPS symptom score were the Anger LIWC category (r = .13, p < .01), the Sexual LIWC category (r = .10, p < .01), and the sexual, and, and violence n-grams (all showing r = .10, p < .01). In turn, the language features most strongly negatively associated with GPS symptom scores were the "did not have" n-gram (r = -.21, p < .01), the "not have" n-gram (r = -.20, p < .01), the Exclusive (r = -.18, p < .01) and Auxiliary verb (r = -.19, p < .01) LIWC categories, the Past LIWC category (r = -.18, p < .01), and the "did not" n-gram (r = -.17, p < .01). Other features showing negative associations with the GPS symptom score were the Negate LIWC category (r = -.17, p < .01), the Verb LIWC category (r = -.16, p < .01), the did 2 n-grams (r = -.15, p < .01), the Function words LIWC category (r = -.14, p < .01), and the Present LIWC category (r = -.12, p < .01). Only one topic (i.e., topic 5) showed a significant negative correlation with the GPS symptom score (r = -.10, p < .01; top words: accident, car, crash, injury, road, people).
Finally, regarding self-report features, we found that female gender was also associated with higher GPS symptom scores (r = .17, p < .01), while being male was associated with lower GPS symptom scores (r = -.19, p < .01); in turn, indicating a non-binary gender (i.e., other option) showed no association with the GPS symptom score. Older age was related to lower GPS symptom scores (r = -.11, p < .01). Experiencing multiple stressful events was associated with higher GPS symptom scores (r = .27, p < .01), while experiencing a single event was associated with lower GPS symptom scores (r = -.18, p < .01); in turn, failing to report information about how long ago the event happened was negatively related to the GPS symptom score (r = -.23, p < .01). All risk factors were related to higher GPS symptom scores with small-medium effect sizes (GPS 17 - Stressful events: r = .43, p < .01; GPS 19 - Lack of support: r = .40, p < .01; GPS 20 - Child trauma: r = .27, p < .01; GPS 21 - Psychiatric history: r = .33, p < .01; GPS 22 - Resilience: r = .10, p < .01).

Machine learning for prediction of GPS symptom score (number of symptoms)
Next, we assessed the relevance of the language features and self-report information in predicting trauma-related symptoms (as indicated by the GPS symptom score). We first present results of the language features only, followed by the self-report information only, and finally the models combining all information.
Results of the tested models are shown in Figs. 2 and 3. For all subsets of features, the stacking ensemble algorithm combining both Elastic-Net and Random Forest algorithms resulted in more accurate predictions compared to models employing these algorithms separately. For this reason, here we only comment on the results of models using the stacking algorithm. Regarding models based on language features, predictive performance was best when all language features were combined (average r = .37; average MAE = 3.62). Next, sorted in descending order of overall predictive power, were the model based on the open-vocabulary features (average r = .35; average MAE = 3.65), the model including only the n-gram features (average r = .35; average MAE = 3.66), and the model including only the LIWC features (average r = .34; average MAE = 3.67). Finally, the model including only the topic features showed the worst predictive power over the GPS symptom score (average r = .27; average MAE = 3.77).
Among models including only self-report data, the model including all self-report variables provided the strongest predictive power (average r = .61; average MAE = 2.99), followed by the model including only the risk factors (average r = .57; average MAE = 3.14) and the model including only event-related characteristics (average r = .36; average MAE = 3.63); the model including only demographics showed the worst predictive power (average r = .22; average MAE = 3.84).
Finally, among the models combining language and self-report features, the models including all self-report and language variables showed the best performance (average r ranging from .62 to .63; average MAE ranging from 2.97 to 2.94), with the model combining all self-report data and n-grams showing the strongest predictive power (average r = .63; average MAE = 2.94). All the models combining risk factors and language variables showed a strong performance (average r ranging from .58 to .61; average MAE ranging from 3.11 to 3.02), with the model combining risk factors and all language variables showing the highest accuracy (average r = .61; average MAE = 3.02). Combining event-related characteristics and language features provided moderate results (average r ranging from .38 to .42; average MAE ranging from 3.61 to 3.54), with the model combining event-related characteristics and all language variables showing the strongest performance (average r = .42; average MAE = 3.54). Finally, models combining demographics and language data showed moderate predictive power (average r ranging from .31 to .39; average MAE ranging from 3.72 to 3.58), with the model including demographics and all language features showing the best performance (average r = .39; average MAE = 3.58).

Testing the feasibility of a procedure for the classification of clinically relevant symptoms
We also assessed whether language features, self-report information, and both combined led to accurate predictions of probable PTSD. Table 1 reports the mean, minimum, and maximum AUROC values computed in the randomly generated sample folds, representing the association between predicted GPS symptom scores and the binary classification of individuals in the probable PTSD versus not probable PTSD group. When examined in the overall sample, none of the models achieved good accuracy in predicting the classification (AUROC ≥ .80), while the models based on a combination of language and all self-report features, and a combination of language and the risk factor features, achieved fair classification accuracy (average AUROC between .70 and .79). The rest of the models achieved only poor classification accuracy (AUROC between .60 and .69). See Fig. S31 in the supplementary material for visualization of the ROC curves as generated in each fold based on different sets of predictors.
Notably, when the sample was split according to the median number of words used by participants in describing the stressful event, performance was slightly better among participants writing shorter, more concise descriptions than among participants providing longer descriptions (see Table 1). Among participants writing descriptions shorter than 5 words, the models combining language features with either risk factors or all self-report variables achieved good accuracy in predicting probable PTSD diagnosis (AUROC ≥ .80).
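The classification analysis described above can be sketched as follows. AUROC is computed here as the proportion of (probable PTSD, not probable PTSD) pairs in which the probable PTSD case received the higher predicted score, with ties counted as one half, and it is computed separately for short and longer descriptions. The record layout, the function names, the example texts and scores, and the 4-word cut-off parameter are assumptions made for illustration only.

```python
def auroc(scores, labels):
    """AUROC as the share of positive-negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one case in each group")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def auroc_by_length(records, max_short_words=4):
    """Compute AUROC separately for short and longer descriptions."""
    groups = {"short": [], "long": []}
    for text, score, label in records:
        key = "short" if len(text.split()) <= max_short_words else "long"
        groups[key].append((score, label))
    return {
        key: auroc([s for s, _ in rows], [y for _, y in rows])
        for key, rows in groups.items()
    }


# Hypothetical records: (event description, predicted score, probable PTSD).
records = [
    ("car accident", 3.0, 0),
    ("physical assault", 9.0, 1),
    ("death of my father", 5.0, 0),
    ("robbery", 8.0, 1),
    ("I was in a serious car crash last year", 6.0, 1),
    ("my mother was ill for a long time", 7.0, 0),
    ("we lost our home during the flood", 4.0, 0),
    ("I was attacked and beaten by a stranger", 8.0, 1),
]
result = auroc_by_length(records)
print(result)
```

Because AUROC depends only on the ranking of predicted scores, no cut-off on the predicted symptom score has to be chosen to evaluate the models in this way.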

Discussion
In the current study, we explored the feasibility of using short text descriptions provided by participants to describe the stressful event that currently affects them the most for the prediction of trauma-related symptoms and probable PTSD diagnosis. Moreover, we evaluated the feasibility of using a combination of the short text descriptions with a few demographic characteristics and risk factors for predicting trauma-related symptoms. We found that the combination of language features and self-report information resulted in good predictions of trauma-related symptoms and allowed for detection of probable PTSD with moderate accuracy. This indicates that short text descriptions are useful for predicting trauma-related symptoms and probable PTSD diagnosis, but only when used alongside some self-report information.
We found that the stacking ensemble algorithm, combining both Random Forest and Elastic-Net regression, resulted in optimal predictive power, improving over the separate application of Random Forest and Elastic-Net. This is in line with literature indicating that the use of a "super learner" combining multiple algorithms is expected to outperform the use of single algorithms (van der Laan et al., 2007). Furthermore, we found that words, categories, and topics related to (sexual) violence, abuse, and strong (negative) emotions were related to more trauma-related symptoms. In contrast, the topic related to (car) accidents, passive language, and negations were related to fewer trauma-related symptoms. This is in line with previous literature indicating that interpersonal traumatic events are more strongly related to trauma-related symptoms than non-interpersonal traumatic events (Charuvastra and Cloitre, 2008). A previous study using text mining in patients with and without PTSD also found strong emotions and words referring to interpersonal trauma, such as rape, to be predictive of PTSD (He et al., 2012).
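To make the idea of stacking concrete, the following minimal Python sketch combines two base learners through a weight estimated on out-of-fold predictions. Several simplifications are assumed here: a one-predictor least-squares line stands in for Elastic-Net regression, a nearest-neighbour average stands in for Random Forest, the meta-learner is a single weight constrained to lie between 0 and 1, and the data are made up. It illustrates the general principle only and is not the implementation used in the study.

```python
def fit_linear(xs, ys):
    """Fit y = a + b*x by least squares; return a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=3):
    """Return a function predicting the mean outcome of the k nearest cases."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def out_of_fold_predictions(fit, xs, ys, n_folds=5):
    """Predict each case with a model trained on the other folds only."""
    preds = [0.0] * len(xs)
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        test = [i for i in range(len(xs)) if i % n_folds == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = model(xs[i])
    return preds


def stacking_weight(p1, p2, ys):
    """Weight w in [0, 1] minimising the squared error of w*p1 + (1-w)*p2."""
    denom = sum((a - b) ** 2 for a, b in zip(p1, p2))
    if denom == 0:
        return 0.5  # identical predictions: any weight gives the same result
    w = sum((a - b) * (y - b) for a, b, y in zip(p1, p2, ys)) / denom
    return min(1.0, max(0.0, w))


def mse(preds, ys):
    """Mean squared error between predictions and outcomes."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Made-up data with a curved relation between one predictor and the outcome.
xs = [i / 10 for i in range(-20, 21)]
ys = [x ** 2 for x in xs]

oof_linear = out_of_fold_predictions(fit_linear, xs, ys)
oof_knn = out_of_fold_predictions(fit_knn, xs, ys)
w = stacking_weight(oof_linear, oof_knn, ys)
oof_stacked = [w * a + (1 - w) * b for a, b in zip(oof_linear, oof_knn)]
print(w, mse(oof_linear, ys), mse(oof_knn, ys), mse(oof_stacked, ys))
```

Because the weight is estimated on out-of-fold predictions and both single learners are special cases of the combination (w = 1 or w = 0), the stacked predictions cannot have a larger out-of-fold squared error than either learner alone in this sketch.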
The predictive power of the short text descriptions for trauma-related symptoms was optimal when combining all language features (topics, LIWC, and n-grams). This was comparable to the predictive power of self-reported event characteristics and considerably higher than that of demographic variables. However, since the average correlation between predicted and observed trauma-related symptoms when using only language features was only moderate, language features alone are not sufficient for reliable predictions of trauma-related symptoms at the individual level. In turn, the correlation between observed and predicted trauma-related symptoms was strong in a model using the language features combined with self-report information.
We also tested the performance of the aforementioned models in detecting probable PTSD. Based on AUROC values, only the model including language features combined with demographic information, event-related characteristics, and risk factors resulted in detection of probable PTSD with moderate accuracy (accuracy = .79). Notably, it has been argued that in the field of mental health, reaching an AUROC value of 1 might be impossible because the criterion diagnosis is imperfect, thereby constraining AUROC values. Therefore, an AUROC value between .70 and .80 might actually represent good accuracy (Youngstrom, 2014). Interestingly, a previous study using textual information found an even higher accuracy in detecting PTSD (accuracy = .84; He et al., 2019). In that study, participants were recommended to write at least 150 words about their traumatic event and their symptoms (He et al., 2019). Hence, the texts in that study were likely to contain much more information. In the present study, we found evidence of higher accuracy among participants reporting very short descriptions (≤ 4 words) compared with participants providing longer descriptions. Among participants reporting very short descriptions, the models combining language features with either all self-report features or just risk factors showed improved accuracy (accuracy ≥ .80). This finding highlights the potential of the automated coding of free response texts, including those with very short answers, for the detection of symptoms of distress related to stressful events. Indeed, the ability of respondents to express themselves freely about their personal experience may help in extending the spectrum of difficulties assessed by self-report questionnaires based on predetermined rating-scale questions, ultimately improving the overall validity of assessment (Kjell et al., 2019). One limitation related to the use of open-ended questions lies in their coding procedure, which typically requires additional resources compared to that of closed-ended questions; however, recent advances in the use of automated coding approaches, such as those used in the present study, point toward the possibility of a wider use of this type of open-ended question (Kjell et al., 2019). Future studies might try to find the optimal trade-off between the length of the text response and the burden of time and emotional discomfort for participants.
The results of the current study have several implications for efforts to improve accurate screening for trauma-related disorders. We found that for accurate predictions, only the open text field and 10 items were needed, including very general information such as gender, whether the event was single or repeated, and whether someone experienced social support. Thus, the burden for participants to fill out these items was minimal, especially compared to reporting a range of mental health symptoms or undergoing a clinical interview. The open text field did not require much from participants, as we did not instruct participants to write a large number of words, nor did we limit the number of words. In fact, the way the participants decided to answer the question (number of words, active or passive form, emotional or not emotional, etc.) provided relevant information for the prediction of their trauma-related symptoms. This makes the screening process very efficient without posing any burden or requiring a lot of effort from participants. The combination of some self-report information and textual information might be used to construct an online screening instrument that provides people with personal feedback about their trauma-related symptoms and the likelihood that they meet criteria for a trauma-related disorder. The current results suggest that this is feasible for PTSD. It might be worthwhile to investigate whether we can further improve the accuracy, to avoid false positives and false negatives, by adding a few more open-ended questions. For example, one might ask participants to describe the symptoms that affect them most. It might also be interesting to add a question about treatment expectancies and preferences, especially since this might also be relevant for personalized treatment indications. Future studies might investigate whether such questions can be used to also predict other trauma-related disorders, as open-text fields might be less disorder-specific compared to standard screening questionnaires. This might further optimize efficient screening for trauma-related disorders.
In the current study, we used data from 21 languages, and we tried to translate these data into English automatically. When native speakers systematically rated the translations, the automatic translations were rated as mediocre and had to be manually supervised. This indicates that it is not possible to rely on automatic translations using Google Translate at the moment. Future studies might investigate other automated translation options, as this would greatly reduce the amount of manual labor. Alternatively, it would be interesting to confirm our findings using non-English language texts.
This study has several limitations. Firstly, we did not establish PTSD diagnoses using a clinician-administered interview. This would have allowed for more reliable information about the potential of the models to detect PTSD diagnoses. Secondly, we included a limited set of predictors in this study. Although this resulted in good predictions of trauma-related symptoms, we do not know whether and how many additional predictors (e.g., peritraumatic distress; Vance et al., 2018) would have increased the performance. Thirdly, although the large, diverse global sample is a strength of the current study and increases the generalizability to diverse countries and languages, this sample was recruited via the internet and is therefore a convenience sample. Finally, the translations of the texts into English might have slightly changed the context and meaning of sentences and changed the number of words.
Despite these limitations, this is the first study that evaluated whether short text descriptions about the traumatic event might be useful in predicting trauma-related symptoms and probable PTSD in a large cross-national sample. We found that combining information from these text descriptions with demographic information, event-related characteristics, and risk factors led to good predictions of trauma-related symptoms and moderate prediction of probable PTSD diagnosis. We conclude that further research into the use of short text descriptions about the traumatic event is worthwhile, especially in combination with other predictors.

Declaration of Competing Interest
There are no conflicts of interest to declare that are relevant to the content of this article.

Fig. 1. Language features showing the strongest positive (red) and negative (blue) correlations with the GPS symptom score.
D. Marengo et al.

Fig. 2. Average correlation between observed and predicted GPS symptom score by algorithm and combination of features.

Table 1
Association between predicted GPS symptom scores and GPS symptoms classification: mean and range of AUROC values in the overall sample and by length of description.