Deep Temporal Modelling of Clinical Depression through Social Media Text

We describe the development of a model to detect user-level clinical depression based on a user's temporal social media posts. Our model uses a Depression Symptoms Detection (DSD) classifier, which is trained on the largest existing sample of clinician-annotated tweets for clinical depression symptoms. We subsequently use our DSD model to extract clinically relevant features, e.g., depression scores and their consequent temporal patterns, as well as user posting activity patterns, e.g., quantifying their "no activity" or "silence." Furthermore, to evaluate the efficacy of these extracted features, we create three kinds of datasets, including a test dataset, from two existing well-known benchmark datasets for user-level depression detection. We then provide accuracy measures based on single features, baseline features and feature ablation tests, at several different levels of temporal granularity. The relevant data distributions and clinical depression detection related settings can be exploited to draw a complete picture of the impact of different features across our created datasets. Finally, we show that, in general, only semantically oriented representation models perform well. However, clinical features may enhance overall performance provided that the training and testing distributions are similar and there is more data in a user's timeline. Consequently, the predictive capability of depression scores increases significantly when used in more sensitive clinical depression detection settings.


Introduction
Most of the earlier research in the area of user-level depression modelling through social media posts does not attempt to align with the clinical framework of depression detection. By clinical framework, we mean conforming to the definition of clinical depression as defined in DSM-5, i.e., looking for signs of depression in at least a two-week episode of a user. Developing such a model is very challenging because it requires a Depression Symptoms Detection (DSD) model and a framework to calculate depression scores over the temporal episodes in a user's social media timeline. In this work, we mainly focus on using our learned DSD model and clinical insights to extract depression scores for depression detection. We subsequently represent a user's timeline as a temporal series of depression scores and then use that representation for our deep Temporal model of User-level clinical Depression (TUD).
According to earlier research, social media posting activity patterns and language-specific clues are very important for user-level depression modelling. Most current research has focused on these features in a non-temporal manner, i.e., on digests of tweets [1][2][3][4][5]. Very recently, a closely related work was carried out by Nguyen et al. [5], who inferred the presence of depression symptoms from individual Reddit posts of a user. They extracted summaries of symptoms from an arbitrary number of posts through different kernel sizes of a CNN classifier and used those as non-temporal feature representations for user-level depression detection. Their depression presence calculation is based on looking for hand-crafted text patterns of depression symptoms in a Reddit post. Relatively little research has considered temporal modelling, and that work has had a different focus from depression detection, e.g., finding correlations between patients' depression scores from depression rating scales and their underlying mood patterns in social media text [6], or tracking changes in the same before the date of a depression diagnosis [7]. Recently, Zogan et al. [8] proposed a multi-modal social media depression detection algorithm which uses a hierarchical attention layer that leverages each tweet to learn word-level and tweet-level compositions. The main criticism of most of this research concerns the value of the extracted features: they fail to follow clinical depression modelling criteria and are primarily based on topical and lexical representations, which are not as clinically useful as clinical representations, i.e., depression scores. In addition, temporal pattern analysis is missing.
Unlike earlier research, we extract depression scores for each of the two-week depressive episodes in a user's timeline and provide them to the temporal deep learning model, thus enabling temporal modelling of user-level clinical depression. We also integrate user posting activity patterns through the proportion of days with posting activity out of all the days in an episode. This helps us distinguish between an episode without any signs of depression and the same period with no activity. Earlier research was also not concerned with varying levels of granularity in a user's timeline. In our approach, we provide our model with a sliding two-week time window of possible depression episodes with various slide lengths over a user's social media timeline, e.g., slide lengths of 1, 7 and 14 days. In addition, and absent in earlier research, we consider two different kinds of important depression modelling strategies: one strictly follows the clinical definition of depression, i.e., there must be social media posts that carry signs of either "Anhedonia" or "Low mood" in an episode to qualify it as an episode of depression; the other does not. Since depression scoring depends on the thresholds used by clinicians to determine whether a depression symptom is expressed "not at all," "for several days," "more than half of the days," or "nearly every day," we experiment with a more sensitive threshold, which qualifies an episode as expressing depression even when a user has exhibited a symptom for at least one day in that episode. Therefore, the main motivation of this work comes from user-level clinical depression modelling, i.e., following the clinical criteria of depression detection as laid out in DSM-5 and in clinical practice.

Methodology
We begin with an extensive analysis of our datasets. First, we report distributions of different user-specific statistics related to social media usage behavior and demography, along with a linguistic components analysis based on a well-known psycholinguistic lexicon named LIWC [9]. Next, we describe different clinical features based on depression scores and the social media usage behavior of the users, and how we extract them from our datasets. We then describe these feature distributions across our datasets. Finally, we describe our deep learning model followed by the experimental setups, where we describe our sets of feature-ablated models and single-feature models compared to the all-feature model and relevant baselines.
We experiment with three types of depression episode analysis, from most granular to least granular. To do this, we slide a two-week temporal window over a user's timeline from their earliest post to the latest, with slide lengths of 1, 7 and 14 days. Slide length = 1 provides the most granular temporal analysis, while slide length = 14 is the least granular setting. We keep the temporal window at two weeks to conform with the DSM-5 criteria for depression, which define a depressive episode as two weeks long. Moreover, temporal mood patterns have been found to be best captured through a two-week time window [6], and weekly windows are better than per-day analysis [7]. The window enumeration is sketched below.
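The following is a minimal sketch of how such overlapping two-week windows can be enumerated; the function name and the example dates are illustrative, not taken from the paper's code.

```python
from datetime import date, timedelta

def episode_windows(first_day: date, last_day: date, slide_days: int):
    """Enumerate candidate two-week depressive episodes over a timeline.

    Yields (start, end) date pairs for every 14-day window, advanced by
    `slide_days` (1, 7, or 14 in the experiments). The 14-day constant
    follows DSM-5's two-week episode definition.
    """
    window = timedelta(days=13)          # start day + 13 more days = 14 days
    start = first_day
    while start + window <= last_day:
        yield start, start + window
        start += timedelta(days=slide_days)

# Example: a weekly slide over a 6-week timeline yields 5 overlapping episodes.
eps = list(episode_windows(date(2015, 1, 1), date(2015, 2, 11), slide_days=7))
```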
We experiment with two kinds of clinical depression detection settings: one strictly follows the clinical definition of depression and the other does not. We also experiment with two kinds of clinical analyses based on two different depression scoring strategies: one reflects the traditional clinical scoring approach, the other a more sensitive approach to depression detection. We create three main datasets for training purposes and separate a portion of each for testing the performance of the model. We also create a separate test set from one of the datasets, annotated for users with ongoing depression, and evaluate all the models on that set.
Finally, we provide a detailed analysis of how different clinical features contribute to the user-level depression detection task in each of those datasets, across various levels of granularity and clinical settings.

Datasets
We have created balanced data subsets from the CLPsych-2015 and IJCAI-2017 datasets [2,10]. Both of these datasets are from Twitter users who self-disclosed their diagnosis of depression through a self-disclosing statement. In both datasets, depressed users are identified from Twitter users' self-disclosures, and control users are users without such disclosures. We use balanced depressed and control subsets of users for our experiments, as this was found to be the most effective strategy for building a robust user-level depression detection model by Shen et al. [10], the curators of the largest benchmark dataset (the IJCAI-2017 dataset) for this task.
CLPsych-2015 users have a markedly longer tweet history than IJCAI-2017 users. Moreover, IJCAI-2017 users have data from only the month preceding their self-disclosure. Analyzing the IJCAI-2017 data in contrast to CLPsych-2015 therefore gives a clear idea of whether the recency of self-disclosure has any effect on temporal user-level depression detection. In addition, our experiments are based on social media posts from Twitter rather than Reddit or similar depression forums. The reason is that we would like to use an unbiased representative of social media text, as opposed to datasets with a strong self-reporting bias, such as depression forums.

Experiment datasets creation
We run experiments on three datasets. As described earlier, these datasets are extracted from two publicly available datasets, CLPsych-2015 and IJCAI-2017, which are similar to most previously reported datasets: they use public social media posts from users (i.e., Twitter users) who self-disclose their depression condition. We describe the curation of these datasets as follows:

CLPsych-2015-Users dataset
The CLPsych-2015-Users dataset is a balanced subset of the CLPsych-2015 dataset. We ensure each user has a minimum of 50 posts and 30 days of Twitter history. This dataset does not include any self-disclosing statements. The original dataset was created from Twitter users with the disclosure statement "I was just diagnosed with depression". Further, the original dataset curators employed human annotators to verify the authenticity of these self-disclosing statements for most of the users in that dataset. In addition, for a control population, random users were selected without such disclosing statements. The collection timeline for this dataset is between 2008 and 2013.

IJCAI-2017-Users dataset
We use a subset of the IJCAI-2017 dataset with users who have a minimum of 50 posts and 30 days of Twitter history. Note that this is a multi-lingual dataset, with users producing Tweets in different languages. To avoid the need for multi-lingual analysis, we discard users who have more than 20% non-English tweets. Even with this filter, we still find close to 1000 users. This dataset does include the self-disclosure statements of the users; for this dataset, a self-disclosure looks like the following: "I (am/was/have been) diagnosed with depression." Many of these disclosures also include the exact time of the diagnosis. Control users are Twitter users who do not have any tweets containing the character string "depress." Because the Twitter API could return a huge number of tweets, the curators of this dataset restricted their collection of control tweets to December 2016. The timeline for collecting IJCAI-2017 depressed users is between 2009 and 2016. Note further that this dataset contains the most recent one month of Tweets before the disclosure for depressed users; for control users, it is simply the most recent one month of posts.
Since for this dataset we have the self-disclosure statements and the timeline of depression diagnosis, by analyzing each user's self-disclosure we identify genuine users and create two types of user datasets based on the recency of their diagnosis: 1. IJCAI-2017-Ongoing-Users: users who declared that their depression diagnosis is recent. 2. IJCAI-2017-Today-Users: users who declared that they were diagnosed with depression on the exact day of the disclosure.
We identify genuine users based on the criterion that the user is talking about their own depression and not using sarcasm, lyrics or any other text that does not directly indicate the user's depression diagnosis. Whenever a user expresses any doubt about their depression diagnosis, we also consider them not genuine. Details of the annotation task for identifying users with current/ongoing depression are provided in the work by McAvaney et al. [11]. We find that only 20% of our IJCAI-2017 users are genuine ongoing depression candidates. Moreover, only 9% of those users disclosed the exact date of their depression diagnosis.

Mixed-Users dataset
The Mixed-Users dataset is a derived dataset created by combining both of the above datasets. It is created to see whether combining both datasets helps depression detection on our test set, described in the next sections. We do not separately report the feature and linguistic analysis for this dataset because it is the aggregate of our two main training datasets, i.e., CLPsych-2015-Users and IJCAI-2017-Ongoing-Users.
The choice of the minimum number of posts and days and the maximum proportion of non-English tweets used to curate the above datasets is largely influenced by earlier research that curated one of the best-known benchmark datasets for user-level depression detection from Twitter timelines [2]. The authors of that paper used users with at most 25% non-English tweets and a minimum of 25 posts, whereas we adopt a stricter non-English tweet proportion (20%) and a higher minimum number of posts (50) to provide more data per user for our deep learning model and thereby learn better models.

Dataset Statistics
Here we provide user-level social media behavior statistics, including users' demographic profiles. We also provide a linguistic component distribution analysis for the above-mentioned datasets. For this linguistic analysis, we use a well-known psycholinguistic lexicon named LIWC [9], with which we analyze user-level mood fluctuation, emotion and sentiment in temporal social media data.
We do not have demographic information for the IJCAI-2017 dataset, nor any information on the geographic location of the users; demographic information is available only for the CLPsych-2015 dataset.

User specific statistics:
In our dataset statistics tables, we provide the following user-specific statistics:
1. #Users: Total number of users.
2. Avg. Frequency of Posting (AFP): Time difference between two consecutive user activities, where activity means a Tweet posted by the user. AFP is the average of these differences; the lower the number, the higher the user's posting frequency. A sketch of this calculation appears after this list.
3. Fluctuation of Posting Frequency (FPF): The standard deviation of the inter-post gaps underlying AFP, which approximates the irregularity of a user's posting frequency.
4. #Tweets: Total number of tweets in a user's profile.
5. #Proper-Tweets: Total number of proper Tweets, i.e., Tweets remaining after preprocessing all Tweets in a user's timeline.
6. #Days: Total number of days of Twitter history a user has.
7. Age: Age of a user. Only available for the CLPsych-2015 dataset, inferred by a third-party machine learning model for detecting age [2].
8. Gender: Gender of a user. Only available for the CLPsych-2015 dataset, inferred by a third-party machine learning model for detecting gender [2].
9. Avg. Tweets Length: The average length of Tweets, i.e., the average number of tokens across all Tweets in a user's timeline.
10. Avg. Sents: The average number of sentences in a Tweet, computed by simply splitting a tweet on periods, question marks and exclamation marks.
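A minimal sketch of the AFP/FPF computation follows; the function name and the POSIX-seconds input format are assumptions for illustration.

```python
import numpy as np

def posting_stats(timestamps):
    """AFP/FPF sketch: `timestamps` is a sorted list of POSIX times (seconds).

    AFP = mean gap between consecutive posts (here in days);
    FPF = standard deviation of those gaps, approximating irregularity.
    """
    gaps = np.diff(np.asarray(timestamps, dtype=float)) / 86400.0  # sec -> days
    return gaps.mean(), gaps.std()

afp, fpf = posting_stats([0, 86400, 4 * 86400])  # gaps of 1 and 3 days
# afp == 2.0 days, fpf == 1.0
```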
For all these statistics except #Users, we report the average and standard deviation across the depressed and control populations (Tables 2 and 3). We use Welch's two-tailed unpaired t-test to assess statistical significance between the means of these features for the depressed vs. control populations (statistically significant means p-value < 0.05). Welch's unpaired t-test is a widely used method for comparing means between two populations [12], as sketched below.
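For reference, Welch's test corresponds to SciPy's unpaired t-test with unequal variances; the arrays here are placeholder data, not values from the paper.

```python
from scipy import stats

# Welch's two-tailed unpaired t-test (equal_var=False), as used for the
# depressed-vs-control comparisons; the sample values are illustrative.
depressed = [2.1, 1.8, 2.5, 2.2, 1.9]
control = [1.2, 1.5, 1.1, 1.6, 1.3]
t_stat, p_value = stats.ttest_ind(depressed, control, equal_var=False)
significant = p_value < 0.05
```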
We observe that the IJCAI-2017-Ongoing/Today-Users datasets are smaller than CLPsych-2015-Users: users in CLPsych-2015-Users have more posts in their timelines than users in IJCAI-2017-Ongoing/Today-Users. However, the average Tweet length and average number of sentences are the same across these datasets (Table 1).
In IJCAI-2017-Ongoing/Today-Users, the number of posts for the control population is higher than for the depressed population; there is no such difference for CLPsych-2015 users. #Tweets and #Proper-Tweets are significantly higher for the control population than for the depressed population in IJCAI-2017-Ongoing/Today-Users; in CLPsych-2015 there is no such difference. In all three datasets, the average Tweet length is significantly higher in the depressed population than in the control population.
For both depressed and control CLPsych-2015-Users, we find there are more females than males and most of them are young adults (Table 2), with the control population significantly older than the depressed population, by 4 years.
The Twitter timeline of CLPsych-2015-Users is significantly longer than that of IJCAI-2017-Ongoing/Test-Users. Moreover, in the CLPsych-2015-Users dataset, control users have significantly longer timelines than depressed users. For the IJCAI-2017-Ongoing/Test-Users they are the same, because the IJCAI-2017-Ongoing/Test-Users datasets are collected over a window of one month only. In CLPsych-2015-Users, depressed users post more frequently and show less fluctuation than control users; the opposite holds for IJCAI-2017-Ongoing/Test-Users. However, in both datasets, both control and depressed users are very active, reflected in an AFP of less than two days (Table 3).

Table 3 User posting related statistics for all datasets (* indicates significantly higher with p-value < 0.05 in Welch's two-tailed unpaired t-test)

Linguistic components distribution:
Here we provide the linguistic component analysis with the help of LIWC. We first create a digest of all Twitter posts from both depressed and control users' Twitter timelines and then apply LIWC to this digest. For a given digest, LIWC finds the percentage of lexicon items under each lexicon component; we call this percentage the "lexicon component intensity (LCI)". We follow the steps below for our linguistic component distribution analysis: 1. We find the deviations LCI_dev = LCI_d - LCI_c between the LCIs for the depressed (LCI_d) and control (LCI_c) populations for each dataset. Positive deviations mean those components have higher LCI in the depressed population than in the control population; negative means the reverse, and zero means equal (Equation 1). 2. Finally, we report LCI_dev for all the common LIWC components where LCI_dev > 0 for the depressed population and for the control population across all three datasets; for the control population, we take the absolute value of the negative deviation. We then report the average and standard deviation of these values in Tables 4 and 5, in descending order of the average LCI_dev across all datasets (see Equations 1, 2 and 3 for the calculation of these measures). This analysis provides us with the LIWC components that are most clearly expressed in the depressed population compared with the control population, and vice versa. A sketch of the deviation computation follows.
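This is a minimal sketch of the per-component deviation (Equation 1); the dict-based API and example component values are illustrative assumptions.

```python
def lci_deviation(lci_depressed: dict, lci_control: dict) -> dict:
    """LCI_dev sketch: per-component deviation LCI_d - LCI_c.

    Inputs map each LIWC component name to its lexicon component
    intensity (a percentage); a positive output means the component is
    more intense in the depressed digest.
    """
    common = lci_depressed.keys() & lci_control.keys()
    return {c: lci_depressed[c] - lci_control[c] for c in common}

dev = lci_deviation({"pronoun": 14.2, "negemo": 3.1},
                    {"pronoun": 11.0, "negemo": 2.2})
depressed_leaning = {c: v for c, v in dev.items() if v > 0}  # reported components
```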
These tables show that the language used by the depressed population has more use of personal pronouns, negative emotion and anxiety related words compared to the control population (bold items in Table 4). This observation aligns with earlier research, such as [1].

Tweets preprocessing
Table 5 Control deviations for all three datasets

We use the following data preprocessing for the Tweets:
1. Lowercase all words.
2. Remove re-tweets and replies.
3. Remove one-character words (except "a", "i" and "u") and digits.
4. Remove tweets that are less than three words long.
5. Expand contracted words in a tweet. For example, "I've" becomes "I have".
6. Convert elongated words to their original form. For example, "Looong" becomes "Long".
7. Remove tweets with self-disclosure, i.e., any tweet containing the word "diagnosed" or "diagnosis".
8. Remove all punctuation except periods, commas, question marks and exclamation marks. Punctuation has been found useful for representing a text with sentence embeddings.
9. Remove URLs.
10. Remove non-ASCII characters from words.
11. Remove hashtags.
12. Remove emojis.
All Tweets excluded by preprocessing still count towards user posting activity, but they do not carry signs of depression. "No posting activity," or absence, is represented differently from absence of depression, so that our model can distinguish between the two. A hedged sketch of these preprocessing steps follows.
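The following is a minimal sketch of steps 1-12 under stated assumptions: the contraction map is a tiny illustrative subset, the retweet/reply heuristics are simplified stand-ins, and excluded tweets are signalled by returning None (so callers can still count them as activity).

```python
import re
import string

KEEP_PUNCT = ".,?!"
CONTRACTIONS = {"i've": "i have", "don't": "do not"}  # partial, illustrative map

def preprocess_tweet(text: str):
    """Sketch of steps 1-12; returns None when the tweet is dropped."""
    text = text.lower()                                        # 1: lowercase
    if text.startswith("rt ") or text.startswith("@"):         # 2: retweets/replies
        return None
    if "diagnos" in text:                                      # 7: self-disclosures
        return None
    text = re.sub(r"https?://\S+", " ", text)                  # 9: URLs
    text = re.sub(r"#\w+", " ", text)                          # 11: hashtags
    text = text.encode("ascii", "ignore").decode()             # 10, 12: non-ASCII/emojis
    for c, full in CONTRACTIONS.items():                       # 5: expand contractions
        text = text.replace(c, full)
    text = re.sub(r"(\w)\1{2,}", r"\1", text)                  # 6: elongations (crude)
    drop = set(string.punctuation) - set(KEEP_PUNCT)
    text = "".join(ch for ch in text if ch not in drop)        # 8: punctuation
    tokens = [t for t in text.split()                          # 3: short tokens/digits
              if not t.isdigit() and (len(t) > 1 or t in {"a", "i", "u"})]
    return " ".join(tokens) if len(tokens) >= 3 else None      # 4: too-short tweets
```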

User level filtering
Our datasets are derived from two widely used benchmark datasets employed in numerous established studies [3,10,13-15] without any user filtering. One reason is that the original data curators already verified the users through human annotation, analysing the genuineness of their disclosures [2,10]. In addition, our own Tweet preprocessing and the minimum-50-posts constraint remove users with excessive gibberish and irregular users. Finally, we also manually reviewed each user's timeline to verify user quality based on the content of their posts, i.e., whether they have at least a few posts regarding their struggles with depression.

Clinically Relevant Features Extraction
Here we describe how we calculate several clinically relevant features for an episode (i.e., for a two-week time window). We later use these features to learn temporal patterns using our Deep Temporal model of User-level clinical Depression (TUD).

Depression Score (DS)
One of the major contributions of our research is to employ the DSD model to guide extraction of depression scores for an episode. We extract such depression scores over all such episodes in a user's Twitter timeline and then use TUD to learn useful temporal patterns of depression.
To enable this feature extraction process we take the following steps:
1. We first sort the posts of a user based on their Twitter post timestamps, in ascending order of recency.
2. We then create day-wise chunks of Tweets.
3. For each day's chunk of Tweets, we calculate a Depression Symptoms Expression Vector (DSEV), where DSEV ∈ {0, 1}^d and d = #depression-symptoms. Each index of this vector corresponds to one of the 10 depression symptoms we are interested in. DSEV is initialized to all 0s, indicating that no symptom is expressed; then, if any of the Tweets in the chunk expresses a symptom, the corresponding index of the DSEV is assigned the value 1, signifying that the depression symptom is expressed on that day (Algorithm 1).
4. Later, in the first layer of TUD, we extract all the DSEVs in an episode, aggregate them and calculate the percentage of days each depression symptom is expressed (Lines 5-12 in Algorithm 2). We calculate this percentage over the number of days the user has activity, i.e., Twitter posts.
5. A user may not have tweets on every day of an episode, so we also keep track of the days on which a user has no activity (i.e., no Tweets), which we use to calculate an Absence-Ratio (AR). This is further discussed in Section 5.3.
6. Finally, the Depression Score (DS) is calculated based on the percentage of days on which each depression symptom appears. Here we consider "Agitation" and "Retardation" as one symptom instead of two separate ones, to conform with PHQ-9 (for the sake of brevity this is not included in the algorithm). If this percentage is within a predefined range of thresholds, as defined in PHQ-9, we assign a corresponding score (symptomScore) for that symptom in the episode. Aggregating these scores over all symptoms provides the final depression score (Lines 5-28 in Algorithm 2).
7. To identify clinical depression, a user must exhibit either "Low Mood" or "Anhedonia," so we adjust our scoring Algorithm 2 (Lines 29-39) so that we can calculate depression scores either fulfilling this clinical criterion or relaxing it. The former option is called Clinical Scoring (CS), the latter Non-Clinical Scoring (NCS). Under CS, a depression score of 0 is assigned to an episode if neither of these core symptoms is expressed in that episode; otherwise, we proceed with the depression score calculation as described above. We report TUD performance for both options.
8. We also consider a much more sensitive version of depression scoring: instead of applying all the thresholds stated in Lines 12-26 of Algorithm 2, we use only one threshold, i.e., whenever there is a Tweet carrying signs of depression, we assign a symptom score (symptomScore) of 1. An episode is then considered a mild depression episode whenever its depression score (depScore) > 0; otherwise it is not considered a depression episode. We call this Minimal Depression Expression (MDE) based Temporal Modelling. A hedged sketch of this scoring follows.
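The following is a minimal sketch of the episode-level scoring, assuming per-day DSEVs are available. The threshold cut-offs, the symptom indices, and the omission of the Agitation/Retardation merge are all assumptions for illustration; the paper's Algorithm 2 defines the actual PHQ-9-based values.

```python
import numpy as np

# Assumed PHQ-9-style cut-offs on the fraction of active days a symptom
# is expressed; the real cut-offs are those defined in Algorithm 2.
THRESHOLDS = [(0.0, 0), (0.07, 1), (0.5, 2), (0.75, 3)]
LOW_MOOD, ANHEDONIA = 0, 1  # assumed symptom indices

def episode_depression_score(dsevs, clinical=True, mde=False):
    """dsevs: list of per-day 0/1 DSEV vectors for days *with* activity."""
    days = np.array(dsevs)                      # shape: (active_days, n_symptoms)
    frac = days.mean(axis=0)                    # fraction of active days per symptom
    if mde:                                     # Minimal Depression Expression
        scores = (frac > 0).astype(int)         # any expression -> symptomScore 1
    else:
        scores = np.zeros(len(frac), dtype=int)
        for cutoff, s in THRESHOLDS:            # ascending: highest match wins
            scores[frac >= cutoff] = s
    if clinical and frac[LOW_MOOD] == 0 and frac[ANHEDONIA] == 0:
        return 0                                # CS: a core symptom is required
    return int(scores.sum())                    # depScore for the episode
```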

Semantic information
To create a representation that captures the semantic information of a depression episode, we first take the average of the sentence embeddings of all the Tweets in a day to represent that day. We call this the Day-Level-Sentence-Embedding-Average (DLSEA). Subsequently, based on this day-level semantic representation, we calculate the episode-level semantic representation by averaging all the DLSEAs in an episode. We call this the Episode-Level-Sentence-Embedding-Average (ELSEA). We also take the average of the sentence embeddings of all the Tweets in a user's timeline, which we call the All-Tweets-Embedding-Average (ATEA). We use a Universal Sentence Encoder (USE) based sentence embedding for all these representations, as USE has been found to be an effective and compact representation for many NLP tasks [16]. A sketch of these averages follows.
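A minimal sketch of DLSEA/ELSEA, assuming the public TF-Hub Universal Sentence Encoder module is available; the helper names are illustrative.

```python
import numpy as np
import tensorflow_hub as hub

# Public TF-Hub Universal Sentence Encoder module (512-d embeddings).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def day_embedding(day_tweets):
    """DLSEA: average USE embedding over one day's (preprocessed) tweets."""
    return np.asarray(embed(day_tweets)).mean(axis=0)       # shape (512,)

def episode_embedding(days_of_tweets):
    """ELSEA: average of the day-level averages across an episode."""
    return np.mean([day_embedding(d) for d in days_of_tweets], axis=0)
```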

User posting activity pattern
As mentioned earlier, we determine posting activity patterns for each episode. To do so, we calculate the number of days on which a user has no activity (no social media posts) out of all the days in an episode, which we call the Absence-Ratio (AR), computed as sketched below.
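A one-function sketch of AR; the function name is illustrative.

```python
def absence_ratio(active_days: int, episode_days: int = 14) -> float:
    """AR: fraction of days in an episode with no posting activity."""
    return (episode_days - active_days) / episode_days

ar = absence_ratio(active_days=9)  # 5 silent days out of 14 -> 0.357...
```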

Temporal depression patterns
We extract two kinds of temporal depression patterns among all the episodes with user activity. These patterns are widely discussed in clinical literature and practice to identify and monitor depression [17]. These are (1) Depression Recurrence Frequency and (2) Inertia.
To calculate these, we first binarize the temporal series of episodic depression scores. We call this series the Binarized Temporal Episodes (BTE). Through binarization, we convert a depression score to 1 if it corresponds to a minimal or higher level of depression; otherwise we convert it to 0 (Algorithm 3). We then take the following steps to calculate the Depression Recurrence Frequency and Inertia scores.

Depression Recurrence Frequency Score (DRFS):
Depression recurrence represents the repetition of depressive mood. Here, we track whether a user's depression shows up in a cyclic manner. To calculate this, we first compress the BTE, i.e., remove consecutive repeated binary scores; we call this series the Compressed Binarized Temporal Episodes (CBTE). We then find cycles of the pattern "1-0-1", which means a user starts with depression, gets better, but falls into depression again. We count all such cycles in the CBTE and normalize the count by the number of items (binary scores) in the CBTE. We call this score the DRFS (Algorithms 4, 5 and 6).

Inertia Score (IS):
Inertia means the tendency to stay in a depressive mood for some extended period of time (e.g., multiple two-week periods). To calculate this, we take the BTE and find how many consecutive episodes have the value 1, i.e., how many consecutive depressive episodes there are in a user's timeline. We then normalize this count by the total episode count of that user. We call this score the Inertia Score (IS) (Algorithm 7).

Algorithm 6: Depression-Recurrence-Frequency-Score Algorithm (DRFSA)
Input: Compressed-Binarized-Temporal-Episodes, CBTE
Output: Depression-Recurrence-Frequency-Score, DRFS
cycles ← CCA(CBTE)
DRFS ← cycles / |CBTE|
return DRFS
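A minimal Python sketch of both temporal pattern scores, assuming BTE is a list of 0/1 episode labels. The IS reading below (counting consecutive depressive episode pairs) is our interpretation of Algorithm 7, which is not fully reproduced in the text.

```python
from itertools import groupby

def drfs(bte):
    """Depression Recurrence Frequency Score over binarized episodes."""
    cbte = [k for k, _ in groupby(bte)]          # compress runs: BTE -> CBTE
    cycles = sum(1 for i in range(len(cbte) - 2)
                 if cbte[i:i + 3] == [1, 0, 1])  # "1-0-1" relapse cycles
    return cycles / len(cbte) if cbte else 0.0

def inertia_score(bte):
    """IS sketch: depressive episodes that continue an ongoing depressive
    run, normalized by the total number of episodes (one reading of Alg. 7)."""
    consec = sum(1 for a, b in zip(bte, bte[1:]) if a == 1 and b == 1)
    return consec / len(bte) if bte else 0.0

print(drfs([1, 1, 0, 0, 1]))          # CBTE = [1, 0, 1] -> 1/3
print(inertia_score([1, 1, 0, 0, 1])) # one consecutive pair -> 1/5
```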

Clinically Relevant Feature Distribution in the Datasets
Here we report the extracted feature distributions, such as depression levels, depression score related temporal patterns (i.e., DRFS and IS) and user-activity patterns, for our three datasets (i.e., CLPsych-2015-Users, IJCAI-2017-Ongoing-Users and IJCAI-2017-Today-Users). To calculate these distributions, we first determine the proportion of episodes of each kind out of all the episodes in a user's Twitter timeline. We then find the average and standard deviation of these measures over all the users in the depressed and control populations. These numbers are reported in Tables 6, 7 and 8. We report differences in these features between the depressed and control populations based on Welch's two-tailed unpaired t-test (statistically significant means p-value < 0.05). We find that, in the CLPsych-2015-Users dataset, depression levels such as "Minimal" and "Mild," and temporal patterns such as "IS" and "DRFS," are significantly higher in the depressed population than in the control population. Conversely, episodes labelled "None" are significantly higher in the control population than in the depressed population. These distributions are expected, based on earlier research [1,17] and the clinical criteria of depression.
In the IJCAI-2017-Ongoing-Users dataset, we note that the depressed population has a significantly higher Absence-Ratio than the control population. However, for all other features, and in both the IJCAI-2017-Ongoing-Users and IJCAI-2017-Today-Users datasets, we do not see any statistically significant difference.

Figure 1 illustrates the overall temporal deep-learning model. The model is provided with day-level aggregate depression scores and a semantic representation of Tweets based on day-level average embeddings. The model then employs a flexible setting for different sliding day lengths and calculates episode-level aggregates of the depression score and semantic representations. This flexibility lets us perform three kinds of granular analysis over a user's depressive episodes, based on sliding lengths of 1, 7 and 14 days. These temporal episode-level feature representations are then concatenated and fed to a BiLSTM encoder to learn the necessary temporal patterns of depression. This step produces an encoder output h_i for each episode, which is combined with the final BiLSTM hidden representation, h_final. This is done over the entire temporal episode sequence to determine an attention weight w_i for each episode (Equations 4 and 5). Each w_i is then normalized with a softmax function, which turns it into an attention score α_i. This attention mechanism was proposed by Bahdanau et al. [18] and is often called "Global Attention" or "Bahdanau Attention." Finally, we calculate a fixed-length attention-score-weighted sum of the encoder outputs (episodes), C (Equation 6), which is fed to a fully connected (dense) layer followed by a sigmoid activation function producing a binary value: "1" indicating the presence and "0" the absence of depression. Hyperparameter settings for training TUD are provided in Appendix A. A minimal sketch of this architecture follows.
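This PyTorch sketch follows the description above (BiLSTM, additive attention over [h_i; h_final], weighted context, sigmoid head); the feature dimension (512-d ELSEA plus a few clinical features) and hidden size are assumptions, and the last-step output stands in for h_final.

```python
import torch
import torch.nn as nn

class TUDSketch(nn.Module):
    """Sketch of the TUD encoder: per-episode features -> BiLSTM ->
    Bahdanau-style additive attention -> sigmoid. Sizes are assumptions."""

    def __init__(self, feat_dim=516, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.att = nn.Linear(4 * hidden, 1)    # w_i from [h_i; h_final]
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, episodes):               # (batch, n_episodes, feat_dim)
        h, _ = self.lstm(episodes)             # h_i: (batch, T, 2*hidden)
        h_final = h[:, -1:, :].expand_as(h)    # pair each h_i with h_final
        w = self.att(torch.cat([h, h_final], dim=-1))   # (batch, T, 1)
        alpha = torch.softmax(w, dim=1)        # attention scores alpha_i
        c = (alpha * h).sum(dim=1)             # context C: weighted sum
        return torch.sigmoid(self.out(c))      # P(depressed)

model = TUDSketch()
prob = model(torch.randn(2, 30, 516))          # 2 users, 30 episodes each
```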

Experimental Setup
We report the accuracy scores (described in the next section) for the user-level depression detection task individually for each of our three datasets (i.e., CLPsych-2015-Users, IJCAI-2017-Ongoing-Users and Mixed-Users) for slide length = 1, because this provides the best results.

Fig. 1 Detailed TUD model architecture

(Section 5.2) and a BiLSTM-Attention model, and (2) an All-Historic-Tweets-Semantic-Representation based model (HTS): this model uses the ATEA representation (Section 5.2) followed by a fully connected layer for the binary depression detection task. 5. Non-Clinical vs. Clinical Setting: We also report whether following the strict clinical criteria for depression detection, i.e., verifying the presence of either "Anhedonia" or "Low Mood," makes any difference in user-level depression detection compared to the non-clinical setting (described in Section 5.1). 6. Minimal Depression Expression (MDE) based temporal modelling: Based on the depression level feature distributions, we confirm that the "None" level is higher in control than in depressed users (Section 5), which suggests trying MDE to observe any increase in accuracy for DS.

Evaluation
Since our task is binary classification, for accuracy analysis we report Precision, Recall and F1 scores for each of our three datasets on the corresponding held-out sets and a test set. To enable 10-fold cross validation (CV), we create 10 (train set, held-out set) pairs. We then report the average Precision, Recall and F1 scores and their standard deviations across these 10 folds. We also report how the models trained on each fold perform on a separate test set, i.e., the IJCAI-2017-Today-Users dataset (Section 3.1.2). This tells us how generalizable our model is on a dataset with a totally different data distribution. We use a two-tailed paired t-test and consider the difference between two accuracy scores significant if the p-value is < 0.05, as sketched below.
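A minimal sketch of the significance test over fold-wise scores; the F1 values here are placeholders, not results from the paper.

```python
from scipy import stats

# Two-tailed paired t-test over per-fold F1 scores of two models;
# the score lists are illustrative 10-fold results.
f1_model_a = [0.71, 0.68, 0.74, 0.70, 0.69, 0.72, 0.73, 0.70, 0.68, 0.71]
f1_model_b = [0.66, 0.67, 0.70, 0.65, 0.66, 0.69, 0.68, 0.66, 0.64, 0.67]
t_stat, p_value = stats.ttest_rel(f1_model_a, f1_model_b)
significantly_different = p_value < 0.05
```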

Results Analysis
In this section we provide results analysis along the following dimensions (the corresponding experiments are reported in Tables 9-14). Underlining indicates that a score is significantly worse than that of the model which uses all the features (the all-feats model).
1. Feature ablation study: For all three dataset experiments, we do not see any significant accuracy difference between the ablated models and all-feats on either the held-out or test sets, except for the avg-embedding-ablated model, which performs significantly worse in the majority of cases.
2. Single feature study: We report single features, i.e., depression-score (DS), absence-ratio (AR) and temporal patterns (TP), for all the experiment datasets. TP is a vector of two scores, IS and DRFS; unlike a series of scores such as DS and AR, TP (and hence IS and DRFS) is calculated once over a user's timeline. As these are single values, we do not believe IS and DRFS would produce any better predictive value than TP, so we do not report their performance individually. We see that these models are highly unstable, i.e., they have high variability in accuracy scores across different folds in the held-out and test sets. Performance becomes expectedly worse when the training and test sets are from different distributions, i.e., where the number of episodes varies by a large margin (Tables 9, 10).
7. MDE settings: In the experiment results with the MDE settings, we find that if, instead of concatenating DS, we element-wise multiply it with the temporal embedding representation (ES) to create the all-feats model, there is some accuracy improvement over the original concatenation-based all-feats model. In general, in this mode, the different TUD models become more stable with increased accuracy; however, comparing the best models for each dataset on both the held-out and test sets under MDE and non-MDE modes, the difference is not statistically significant.
8. Precision vs. recall: We see that in the held-out sets, precision and recall scores are close. However, in the test sets, recall becomes higher and precision lower, resulting in more sensitive models. A change in training data distribution (i.e., training on more temporal episodes) results in more sensitive models (as evaluated on the test data).
All the above observations can be summarized as follows:
1. The performance of DS depends on dataset characteristics; if in a particular dataset DS has significantly more discriminatory power, then in that dataset DS may add more value.
2. In general, single-feature models perform worse than all-feature and ablated-feature models.
3. Language-only models (i.e., the baseline models) are overall quite good in terms of user-level depression detection, compared to the users' posting behavior, the depression expressed in their posts through depression scores, and the related temporal patterns. Note, however, that those features can positively affect model performance, provided that the data distribution is the same in the train and test sets.
4. Mixing two datasets with different distributions makes temporal modelling worse, as indicated by HTS performing better than all other temporal models on the Mixed-Users dataset. Interestingly, when we make the depression score calculation more sensitive in MDE, DS becomes more effective: its accuracy increases significantly compared to the version that strictly follows the clinical thresholds.
5. A larger sliding length can result in model performance similar to more granular sliding lengths, indicating promise for building more compact models in future.

Limitations
Some limitations of our work are listed below: 1. Our model uses a depression score calculated from the output of our Depression Symptoms Detection (DSD) model. That model is trained on a highly imbalanced dataset and is not robust enough to identify all symptoms of depression in text.
2. We do not consider pure transformer models because earlier research does not indicate any extra benefit for this kind of temporal modelling. The memory needed by self-attention [19] in a Transformer is quadratic in the input length, which significantly limits the input size. Another shortcoming of using a transformer is that, to represent a sequence, an explicit mechanism to inform the model of the order of episodes is needed, which is not necessary in our architecture. There is a state-of-the-art transformer model for temporal modelling called the Temporal Fusion Transformer (TFT) [20]; however, it is not yet established whether the TFT architecture is better. Interestingly, TFT is closely related to our BiLSTM-Attention model in its architecture. Our future work will consider other attention mechanisms to see if there is any improvement. 3. We strictly follow the clinical criteria of depression detection, which prevents us from experimenting with various lengths of depressive episodes, i.e., episodes longer or shorter than two weeks. Likewise, we emphasize the expression of depression symptoms in a Tweet; if a candidate Tweet expresses depression but no particular symptom is detected (a rare possibility), that Tweet does not contribute to the depression scoring. We do not explicitly account for other mental health conditions, bereavement or other conditions that can resemble depressive symptoms. We also do not confirm whether any depressive symptom causes a significant change in a user's daily functioning. In future, we would like to investigate optimal thresholds and the other depression criteria mentioned above in our clinical depression modelling. 4. Although an LSTM may not perform well on longer sequences, a BiLSTM followed by attention helps alleviate problems with longer sequences. 5. We largely follow the machine learning evaluation framework used in the seminal work of [1] for social media based depression detection; their sample size is also similar to ours. Moreover, most of the experimental differences we report are not statistically significant, but the paired t-test in 10-fold cross validation is robust against Type-2 error [21], which means that when there is no significant difference between two models' accuracies, we can confidently assume their accuracies are similar. We believe our experimental results on an independent test set (i.e., IJCAI-2017-Today-Users) complement the analysis on the held-out set. Moreover, we find that our clinical features have some discriminatory power and show significant differences between the depressed and control populations (Section 6). This further corroborates the efficacy of our extracted clinical features. We have also focused on the nature of the change in accuracy scores rather than comparing only their values, which sheds further light on the performance of our various model-feature combinations.

Conclusion
We have described the construction of a deep temporal clinical depression model (TUD), using Twitter posts, and all of its sub-components. These sub-components help extract a depression score (and a few clinically relevant features based on it) from temporal social media posts. We then evaluate their efficacy based on their accuracy for user-level depression detection in several modes of analysis. We observe that clinical features are more useful when the training and testing data distributions are the same, and that some features are dataset specific. Also, the semantic embedding representation is the most effective of all.

Appendix A Temporal User-level Clinical Depression Model (TUD) Training Configuration
Here we report the training configuration for TUD. Since TUD is a binary classification task, we use the same settings for the loss function as for DPD, described earlier.