Optimal risk and diagnosis assessment strategies in perinatal depression: A machine learning approach from the Life-ON study cohort

This study aimed to assess the concordance of various psychometric scales in detecting Perinatal Depression (PND) risk and diagnosis. A cohort of 432 women was assessed at the 10th–15th and 23rd–25th gestational weeks, and at 33–40 days and 180–195 days after delivery, using the Edinburgh Postnatal Depression Scale (EPDS), Visual Analogue Scale (VAS), Hamilton Depression Rating Scale (HDRS), Montgomery-Åsberg Depression Rating Scale (MADRS), and Mini International Neuropsychiatric Interview (MINI). Spearman's rank correlation coefficient was used to assess agreement across instruments, and multivariable classification models were developed to predict the values of a binary scale using the other scales. Moderate agreement was shown between the EPDS and VAS and between the HDRS and MADRS throughout the perinatal period. However, agreement between the EPDS and HDRS decreased postpartum. A well-performing model for the estimation of current depression risk (EPDS > 9) was obtained with the VAS and MADRS, and a less robust one for the estimation of current major depressive episode (MDE) diagnosis (MINI) with the VAS and HDRS. When the EPDS is not feasible, the VAS may serve as a reliable tool for rapid and comprehensive postpartum screening. However, a thorough structured interview or clinical examination remains necessary to diagnose an MDE.


Introduction
Perinatal Depression (PND) is generally considered a Major Depressive Episode (MDE) occurring at any time during pregnancy and up to 12 months after delivery (ACOG, 2018). Although adjusted pooled prevalence estimates reach 12% of all pregnancies, considerable variability has been reported by studies using symptom scales or diagnostic instruments (Woody et al., 2017). Early detection of PND is important, but the identification of depressive episodes in the perinatal period poses several clinical challenges. The term depression encompasses a broad range of symptoms that might at times be difficult to distinguish from physiological distress reactions (Snaith, 1996). Although somewhat time-consuming, structured interviews have been shown in many studies to improve performance in case classification and prevalence estimates of disorders (Mitchell et al., 2011; Moussavi et al., 2007). Nonetheless, a recent umbrella review of 69 meta-analyses including 81 prevalence estimates found that only 10% reflected studies that classified depression exclusively through clinician-administered structured interviews, whereas almost 90% used screening or rating tools alone, or in combination with other methods (e.g., medical records, self-report); pooled prevalence rates varied considerably, ranging from 17% for diagnostic interviews, to 22% and 31% for studies based on combinations and screening or rating tools, respectively (Levis et al., 2019b).
Given the abundance of instruments used for PND, direct comparisons in large samples are necessary to select optimal, disorder-specific tools. The self-administered Edinburgh Postnatal Depression Scale (EPDS) is the most widely used screening tool for PND in community-based settings (Bhat et al., 2022). Women scoring above the identified cut-off should be referred to a specialist to verify the diagnosis with a structured clinical interview, such as the Mini International Neuropsychiatric Interview (MINI) (Lecrubier et al., 1997; Sheehan et al., 1997) or the Structured Clinical Interview for DSM disorders (First et al., 2016). In an individual participant data meta-analysis of 57 studies including 17,158 participants, the MINI was recently found to classify more patients with depression than other instruments (Levis et al., 2018).
The Hamilton Depression Rating Scale (HDRS; Hamilton, 1960) and the Montgomery-Åsberg Depression Rating Scale (MADRS; Montgomery and Åsberg, 1979) are clinician-administered rating scales commonly employed to measure treatment response in clinical trials (Cuijpers et al., 2021; Sockol et al., 2011), to assess changes in symptom severity over time in clinical practice, and as relatively brief screening tools for depression. Unlike the HDRS, the MADRS mainly targets core mood symptoms, such as sadness, tension, pessimistic thoughts, and suicidal thoughts.
Recent developments in machine learning (ML) models for predicting postpartum depression risk involve combining clinical, sociodemographic, and biological data. A predominant focus on supervised learning and common ML models like support vector machines, random forest, and logistic regression has been observed (Zhong et al., 2022). Notably, patient psychiatric and gynecological history, along with sociodemographic information, have proven reliable in identifying those at risk (Cellini et al., 2022; Wakefield and Frasch, 2023; Xu and Sampson, 2023). Although fewer studies have explored biological variables, differences in metabolite changes show promise in classifying women with and without postpartum depression (Yu et al., 2022). Recent ML applications to postpartum depression risk include a study utilizing patient-reported survey responses in early pregnancy, demonstrating moderate performance in predicting depression risk across trimesters and postpartum periods (Reps et al., 2022). Another study identified mood status in the first trimester, previous depressive episodes, and marital status as crucial predictors of later onset postpartum depression, with additional factors such as sleep quality, age, previous miscarriages, and adverse life events enhancing predictive model performance (Garbazza et al., under review).
In this study, we aimed to investigate the relationship between a structured clinical interview (MINI) and both clinician-administered and self-report validated scales in a large cohort of women assessed at multiple time points from the first trimester of pregnancy to the first six months postpartum. Our objectives were to compare classification models for perinatal depression risk and diagnosis using a multiscale assessment approach, establish consistency across instruments, and determine an optimal selection of assessment tools both before and after delivery.

Data collection
Data for the present analysis were extracted from the "Life-ON" study, a multicenter, prospective cohort study on sleep and mood changes in the perinatal period, which has been extensively described elsewhere (Baiardi et al., 2016; Garbazza et al., 2022). The recruitment of participants in the 4 centers was carried out over 3 years, between the beginning of 2016 and 2019. The last follow-up visit, falling 18 months after the inclusion of the last participant, took place in June 2020. In total, about 2000 women in the first trimester of pregnancy were contacted and invited to participate in the study.
Four hundred and thirty-nine women (age: mean 33.7, std 4.2) were longitudinally followed up from the first trimester of pregnancy until months postpartum. Main inclusion criteria were age 18-45 years, gestational age between 10 and 15 weeks, and lack of major medical conditions; main exclusion criteria were a diagnosis of bipolar disorder or psychosis, or a current or recent (within 6 months) depressive episode (Baiardi et al., 2016). Five different rating scales for depression were administered at four different time points during the study: visits 1 (10-15th gestational week), 3 (23-25th gestational week), 6 (33-40 days after delivery) and 9 (180-195 days after delivery). From these data, complete combinations of scores on all five depression scales were obtained. When all 5 scales were administered to the participants, their completion required up to 1 h. However, this occurred in 5 out of 11 study visits, while in the remaining 6 follow-up visits the only psychiatric scales administered were the EPDS and VAS, which required an average of 5 min to complete. All clinician-administered scales and the MINI structured interviews were conducted by staff psychologists or physicians. Participants were first asked to complete the self-administered scales (EPDS and VAS) by themselves. This ensured a consistent time sequence, given that in 6 visits the EPDS and VAS were the only two scales administered. Furthermore, in this way we tried to prevent the subsequent clinical interview with the investigator from influencing participants' self-assessment of their mood. Afterwards, when required according to the study protocol, the investigator carried out a clinical interview based on the MINI. Finally, the semi-structured HDRS and MADRS scales were administered by the researcher and discussed together with the participants.
Multi-scale combinations were available for 432 different women, and for 421, 336, 277 and 242 women at visits 1, 3, 6 and 9, respectively. For women, combinations of scores were available at all four visits. In terms of prepartum and postpartum, 757 observations came from the prepartum phase (visits 1 and 3) and 519 observations from the postpartum phase (visits 6 and 9). A total of 1276 complete score combinations were extracted from the Life-ON cohort to constitute the pooled data set. Fig. 1 summarizes the distribution and dispersion of the sample throughout the study visits.
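The reported sample sizes can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the per-visit counts stated in the text; the variable names are illustrative and not part of the study's code.

```python
# Complete five-scale score combinations per visit, as reported in the text.
per_visit = {1: 421, 3: 336, 6: 277, 9: 242}

prepartum = per_visit[1] + per_visit[3]    # visits 1 and 3 (pregnancy)
postpartum = per_visit[6] + per_visit[9]   # visits 6 and 9 (after delivery)
pooled = prepartum + postpartum            # pooled data set

print(f"prepartum={prepartum}, postpartum={postpartum}, pooled={pooled}")
```

The per-visit counts sum to the prepartum, postpartum and pooled totals given in the paragraph above.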
The five scales administered during the study were:
• The Hamilton Rating Scale for Depression (HDRS) has two common versions with either 17 or 21 items and is scored between 0 and points. The first 17 items measure the severity of depressive symptoms, while the extra four items on the extended 21-item scale measure factors that might be related to depression but are not thought to be measures of severity, such as paranoia or obsessional and compulsive symptoms. Here we used the 17-item scale, which yields the following total score ranges: 0-7 (no depression), 8-16 (mild depression), 17-23 (moderate depression), ≥24 (severe depression).
• The Montgomery-Åsberg Depression Rating Scale (MADRS) consists of 10 items evaluating core symptoms of depression. Nine of the items are based upon patient reports, and one is based on the rater's observation during the rating interview. MADRS items are rated on a 0-6 continuum (0 = no abnormality, 6 = severe), yielding the following total score ranges: 0-6 (no depression), 7-19 (mild depression), 20-34 (moderate depression), >34 (severe depression).
• The Edinburgh Postnatal Depression Scale (EPDS) is a 10-item self-administered screening tool tailored for women during pregnancy and postpartum. Responses are scored 0-3 according to the severity of the symptom. The total score is determined by adding together the scores for each of the 10 items. The scale is considered an effective screening tool for major and minor depressive syndromes throughout pregnancy and postpartum above a total score of 9 (Levis et al., 2019a), but its accuracy increases if the cut-off is raised above 12 (Cox, 2019). In this study, we chose the lower cut-off to favor a more inclusive approach because the study cohort was composed of women without a diagnosis of depression at baseline. Moreover, in Italian validation studies optimal cut-offs were found to be 9/10 (Carpiniello et al., 1997), or 8/9 in the context of community screenings (Benvenuti et al., 1999).
For the aims of this study, we considered non-binary, continuous scores for HDRS, MADRS, EPDS and VAS, and binary scores for the MINI (presence or absence of a depressive episode) and the EPDS (above or below the cut-off).
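The score ranges and the EPDS cut-off described above can be expressed as simple lookup functions. The following is a minimal sketch, not the study's code: the function names are hypothetical, and assigning an HDRS total of exactly 24 to the severe band is an assumption.

```python
EPDS_CUTOFF = 9  # at risk when the EPDS total score is strictly above 9


def hdrs17_band(total: int) -> str:
    """Severity band for a 17-item HDRS total score."""
    if total <= 7:
        return "no depression"
    if total <= 16:
        return "mild"
    if total <= 23:
        return "moderate"
    return "severe"  # assumption: a total of exactly 24 counts as severe


def madrs_band(total: int) -> str:
    """Severity band for a MADRS total score (10 items rated 0-6)."""
    if total <= 6:
        return "no depression"
    if total <= 19:
        return "mild"
    if total <= 34:
        return "moderate"
    return "severe"


def epds_at_risk(total: int) -> str:
    """Binary EPDS label used as a classification target."""
    return "Yes" if total > EPDS_CUTOFF else "No"


print(hdrs17_band(12), madrs_band(21), epds_at_risk(9), epds_at_risk(10))
```

Only the binary EPDS label enters the classification models; the severity bands are shown for orientation, since HDRS and MADRS are used as continuous scores in the analyses.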

Statistics
The Mann-Whitney U test was applied to the non-binary scales to assess differences in distribution between the following samples: (i) prepartum and postpartum samples, (ii) samples with positive and negative MINI, (iii) samples with EPDS above 9 and samples with EPDS of 9 or below. Because multiple tests were performed in each case, we adjusted the p-values by means of the Benjamini-Hochberg correction (Benjamini and Hochberg, 2019) to control the false discovery rate.
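To illustrate how the Benjamini-Hochberg adjustment works, the sketch below implements the standard step-up procedure in plain Python. It is an illustration only: the study's analyses were run in R, the function name is hypothetical, and the p-values are invented for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented raw p-values, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each raw p-value is multiplied by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values increase.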
We regarded the presence of a monotonic relationship as indicative of agreement between two non-binary scales, and we assessed this by means of Spearman's rank correlation coefficient. All the analyses were performed in R (R Core Team, 2018).
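Spearman's coefficient is the Pearson correlation computed on ranks, which is why it captures monotonic rather than strictly linear agreement. A minimal sketch in plain Python follows; the helper names and the toy data are illustrative assumptions, and tied values receive their average rank.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# A monotonic but non-linear relationship still yields perfect agreement.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman_rho(x, y))
```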

Classification models
We considered the full available sample (including both pregnancy and the postpartum), and focused on the prediction of a binary scale using the values of all the other available scales for the same patient at the same visit. In particular, we focused on predicting MINI (using HDRS, MADRS, EPDS and VAS), and a binarized form of EPDS (using HDRS, MADRS, VAS and MINI). The latter was defined by considering women with EPDS>9 as at-risk for depression ("Yes") and women with EPDS≤9 as not at-risk for depression ("No").
We developed multivariable classification models for predicting the values of a binary scale using the other scales. Performance was evaluated in five-fold cross validation with 10 repetitions, to assess the stability of the obtained results. Folds were created with stratification with respect to both target and visit values. Moreover, any two models predicting the same target used the same division into folds for the n-th repetition. The optimal probability thresholds for assigning outcome classes to model predictions were selected by optimizing the geometric mean of sensitivity and specificity in a nested cross validation setup. The classification models were implemented by means of the caret R package (Kuhn, 2021).
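The threshold selection criterion can be sketched as follows. This is an illustration in plain Python rather than the caret-based implementation used in the study: the function names and the toy labels and probabilities are assumptions, and the search is shown on a single set of predictions instead of inside a nested cross validation loop.

```python
from math import sqrt


def sensitivity_specificity(y_true, y_prob, threshold):
    """Sensitivity and specificity when predicting 1 for prob >= threshold."""
    tp = fn = tn = fp = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def g_mean(y_true, y_prob, threshold):
    """Geometric mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_prob, threshold)
    return sqrt(sens * spec)


def best_threshold(y_true, y_prob):
    """Candidate thresholds are the observed probabilities themselves."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: g_mean(y_true, y_prob, t))


# Toy predictions, for illustration only.
y_true = [0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.40, 0.70, 0.90]
chosen = best_threshold(y_true, y_prob)
print(chosen, g_mean(y_true, y_prob, chosen))
```

Because the geometric mean is zero whenever either sensitivity or specificity is zero, this criterion penalizes thresholds that ignore the minority class, which matters for targets as imbalanced as those considered here.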
As evaluation metrics for the performance of the classification models, we used the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPR), sensitivity (SEN) and specificity (SPE). We pooled together the predictions for the different folds of the same repetition, calculated the values of the metrics for each repetition, and then calculated the mean and standard deviation across the ten repetitions.
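The pooling scheme described above can be sketched for the AUROC. The example is a plain-Python illustration with invented fold predictions and hypothetical function names; using the sample standard deviation is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def auroc(y_true, y_score):
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pooled_auroc(folds):
    """Pool the held-out predictions of all folds, then score once."""
    y_true = [y for labels, _ in folds for y in labels]
    y_score = [s for _, scores in folds for s in scores]
    return auroc(y_true, y_score)


# Two invented repetitions, each with two folds of (labels, scores).
repetitions = [
    [([0, 1], [0.2, 0.8]), ([0, 1], [0.4, 0.6])],
    [([0, 1], [0.5, 0.4]), ([0, 1], [0.1, 0.9])],
]
per_repetition = [pooled_auroc(folds) for folds in repetitions]
print(per_repetition, mean(per_repetition), stdev(per_repetition))
```

Pooling before scoring yields one metric value per repetition, so the reported standard deviation reflects variability across random fold assignments rather than across individual folds.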

Overall distributions of rating scale scores during pregnancy and the postpartum
We analyzed the overall distributions of the five scales, shown as histograms in Fig. 2A. As highlighted by the plots, there is a large … We assessed differences between prepartum and postpartum distributions of non-binary scales by means of Mann-Whitney tests. In two cases, namely for EPDS and HDRS, the tests revealed significant differences, with adjusted p-values of ~10^(-11) and ~10^(-3), respectively. Such differences also emerge from the boxplots of Fig. 2B, where distributions for the two periods are shown. For both EPDS and HDRS, postpartum observations generally appear to have lower values than prepartum ones, while this difference is less substantial for MADRS and VAS.

Agreement among scales during pregnancy and the postpartum
We considered all possible pairs of non-binary scales and studied their joint distributions, separating the prepartum and postpartum periods. Qualitative observations on differences between prepartum and postpartum and on the agreement among scales can be obtained from the scatterplots displayed in Fig. 3A. Here, by agreement we mean a monotonic relationship: two scales are in good agreement if their scatter plot shows a monotonic upward trend. In order to quantitatively support such observations, we calculated Spearman's rank correlation coefficients between all possible pairs of scales, keeping prepartum and postpartum observations separate. These measures of agreement are presented in the diagrams in Fig. 3B. The main considerations that can be derived are the following:
• HDRS and MADRS show good agreement, especially for values larger than zero. In fact, the correlation coefficients are among the largest (0.65 in prepartum and 0.58 in postpartum).
• HDRS and VAS show poor agreement, especially for low values. In fact, the correlation coefficients are among the smallest (0.3 and 0.23).
• HDRS and EPDS show poor agreement, especially in postpartum. This is reflected in a decrease of the correlation coefficient from 0.39 to 0.21.
• EPDS and VAS show good agreement in general. In particular, there is more agreement on low values between the two in postpartum than during pregnancy. This is reflected in an increase of the correlation coefficient from 0.5 to 0.62.

Prediction of a binary scale using others
As a preliminary study, we performed Mann-Whitney tests to assess whether the distribution of the non-binary scales is different between women with positive or negative MINI and between those with EPDS≤9 or EPDS>9. In all cases but one (values of HDRS on samples determined by MINI), significant differences emerged (Fig. 4A). In general, the distributions for samples determined by values of binary EPDS appear more separated than for those determined by MINI. This can be appreciated in the boxplots depicted in Fig. 4B, where boxes are well apart for binary EPDS.
We trained multivariable classification models to predict both MINI and binary EPDS, using all the other available scales as features. As shown in Table 1, we obtained a well-performing model for the prediction of binary EPDS, and a weaker one for the prediction of MINI. Then, for the prediction of both MINI and binary EPDS, we trained a model for each possible subset of the available scales. In both cases, we identified a model which considers only two scales but performs comparably to the one using all the available scales. In particular, for predicting binary EPDS it is the model using only VAS and MADRS, and for predicting MINI the one using only HDRS and VAS.
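The subset search described above amounts to enumerating every combination of the available scales for each target. A minimal sketch follows, assuming that only non-empty subsets were considered; the function and variable names are illustrative.

```python
from itertools import combinations


def non_empty_subsets(features):
    """All non-empty feature subsets, ordered by increasing size."""
    return [subset
            for size in range(1, len(features) + 1)
            for subset in combinations(features, size)]


# Feature sets per target, as stated in the text.
mini_features = ["HDRS", "MADRS", "EPDS", "VAS"]
epds_features = ["HDRS", "MADRS", "VAS", "MINI"]

mini_subsets = non_empty_subsets(mini_features)
epds_subsets = non_empty_subsets(epds_features)
print(len(mini_subsets), len(epds_subsets))
```

With four candidate scales per target, this gives 15 candidate models each, among which the two-scale subsets (HDRS, VAS) for MINI and (MADRS, VAS) for binary EPDS are the ones singled out in the text.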
When dealing with imbalanced datasets, the AUROC can give overly optimistic results. In our case, this is particularly true for the MINI model, which had only 6% depressed women compared to the EPDS model's 10%. As a result, the distortion caused by the imbalance is likely to have a greater impact on the MINI model than on the EPDS model. This reinforces our conclusion that the EPDS model is superior to the MINI model. Fig. 5 illustrates repetition one (out of ten) of the classification models in greater detail. Fig. 5A shows ROC and PR curves and their underlying area both for the models using all the available scales and the aforementioned models using only two scales. Fig. 5B provides confusion matrices for the models using all scales: these were determined by using the optimal threshold in terms of geometric mean of sensitivity and specificity.

Table 1
Performances of classification models for prediction of binary EPDS and MINI. The first two rows refer to models using all available scales, the second two to models using only two scales. All models were trained in five-fold cross validation for ten repetitions. Average and standard deviation across repetitions for the following performance metrics are shown: area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPR), sensitivity (SEN) and specificity (SPE).

Multiscale prediction of PND risk (EPDS) and diagnosis (MINI)
The main finding of our study is that machine learning models employing several self- and clinician-administered depression scales classify women at risk for PND (EPDS total score > 9) substantially better than women with a MINI-confirmed Major Depressive Episode. Indeed, symptom questionnaires are not designed to ascertain diagnostic status, and our models confirmed their relatively low reliability. This finding has implications for both research and clinical practice. Symptom screening tools are often employed in both settings because administration of diagnostic interviews is time- and resource-consuming (Levis et al., 2018). However, all self-report and clinician-administered questionnaires appear to clearly identify PND likelihood rather than MDE diagnosis. According to some authors, screening for PND through available instruments fails to confer benefits above usual clinical care, i.e., the clinician's inquiry and attention to mental health and well-being during pregnancy and postpartum (Lang et al., 2022). Given the known specificity of EPDS for the perinatal period, our findings could also suggest that the PND construct does not fully overlap with MDE. MINI might fail to capture core aspects of PND, given that previous research has shown that the likelihood of being diagnosed with an MDE increases less for the MINI than for the Structured Clinical Interview for DSM Disorders (SCID) as EPDS total score increases (Levis et al., 2020). The ideal PND screening tool should limit the weight of symptoms that overlap with physiological postpartum experiences, and capture unique symptoms that are missing from MDE screening tools (Batt et al., 2020). Indeed, available instruments fail to clearly differentiate PND from the common experience of "baby blues", and from other clinical conditions with overlapping symptoms, such as generalized anxiety disorder, obsessive-compulsive disorder, and postpartum psychosis (Kettunen et al., 2014).

Fig. 5. (A) ROC and PR curves, with their underlying areas, for repetition one out of ten of models predicting binary EPDS (left) and MINI (right). In both cases a model using all available scales and one using only two scales are shown: for binary EPDS the two-scale one uses VAS and MADRS, for MINI it uses HDRS and VAS. (B) Confusion matrices for repetition one of models using all available scales. The threshold used for assigning a class to a prediction is the one maximizing the geometric mean of sensitivity and specificity.

A. D'Agostino et al.

Agreement across psychometric instruments during pregnancy and postpartum
HDRS total score was relatively consistent with MADRS, but not with VAS or EPDS, especially postpartum. EPDS appeared to be relatively more consistent with VAS, especially when scores were low and postpartum. The low degree of correspondence between HDRS and VAS is not surprising given their very different structure: the former requires an objective assessment of cognitive, affective and neurovegetative symptoms of depression, whereas the latter only measures the subjective experience of "feeling depressed". Indeed, the MADRS appears to correlate more closely with VAS, perhaps due to its focus on "core" depressive symptoms.
HDRS is widely known for its focus on somatic and neurovegetative symptoms of depression (Gibbons et al., 1993; Nixon et al., 2020; Vindbjerg et al., 2019). In the context of PND, HDRS may detect somatic symptoms which are common during pregnancy and/or the postpartum period (i.e., fatigue, body aches, reduced libido, sleep disturbances) but may not necessarily stem from a depressive condition and may not be associated with a subjective experience of negative affect. Hence, HDRS may be less specific and its scores may be inflated when compared to other measures. On the other hand, it may be argued that some women who develop PND may not recognize such symptoms as indicators of depression, but rather attribute them to the physiological stress of the perinatal period. The symptomatic overlap between depression and common physical complaints in pregnancy and after delivery has been previously highlighted by Ross et al. (2003). In their cohort of 150 women followed up between 36 weeks gestation and 16 weeks postpartum, somatic item scores did not correlate with total HDRS score during pregnancy, but increased at 6 weeks postpartum, when the correlation of mood item scores with the total score decreased in comparison to pregnancy. The authors concluded that women may be more inclined to identify their complaints as physical rather than mood-related after childbirth, compared to pregnancy (Ross et al., 2003). The abundance of somatic items on HDRS might therefore act as a confounding factor in the screening process and in the assessment of severity in PND. Indeed, agreement between HDRS and EPDS strongly decreased in our cohort in postpartum observations.
We found greater agreement on low values between EPDS and VAS postpartum than during pregnancy, suggesting that VAS may be employed as a fast screening tool after childbirth. Visual Analog Scales are straightforward, graphical self-reports of emotional states that can overcome linguistic barriers and have been employed in studies of both postpartum blues and depression (Cox et al., 1983; Kendell et al., 1981). Originally developed to assess mood in patients with neurological disturbances such as aphasia or stroke (Stern, 1997), VAS has also been employed in other clinical settings as a screening tool (Bennett et al., 2006). However, the broad range of concurrent validity coefficients (0.12-0.82) limits the interpretation of results and has raised concerns over the scales' psychometric quality in terms of validity and reliability (Athanasou, 2019).

Study limitations
Some limitations of our work must be considered. First of all, our findings might be influenced by the cut-off choice of 9 for EPDS, as higher scores have been shown to progressively yield more cases of MINI-confirmed MDE diagnoses (Levis et al., 2020). However, our choice was driven by the observation that lower cut-offs are most efficiently employed to avoid false negatives and identify most patients who meet diagnostic criteria (Levis et al., 2019a). In addition, from a purely numerical perspective, employing a cut-off value of 12 for EPDS would have resulted in a dataset with a degree of imbalance around 3.6%. Given the limited amount of data, we believe this level of imbalance would be too severe to produce reliable results. Second, relatively low rates of depression risk and MDE were found in our cohort (10% and 6% of all observations, respectively), thus limiting the overall number of positive cases in the binary EPDS and MINI prediction models. This is likely to depend on our choice to exclude women diagnosed with a depressive episode or bipolar disorder at baseline, which has been explained elsewhere (Baiardi et al., 2016). Finally, limited and varying evidence of validity in the identification of antepartum depression has been reported for EPDS (Owora et al., 2016), although it remains the most commonly used instrument in clinical practice.

Conclusion
Overall, our findings suggest that results derived from different scales should be compared with great caution, due to substantial variability across women with low or high symptom scores and between pregnancy and the postpartum period. Whenever the EPDS cannot be employed, the VAS can be reliably administered for ultra-rapid, extensive postpartum screening. On the other hand, commonly employed clinician-administered or self-report tools cannot reliably replace a full structured interview or the clinical examination required to establish a diagnosis of MDE.

Financial support
The "Life-ON" study was funded by the Swiss National Science Foundation (grant: 320030_160250/1) and the Italian Ministry of Health and Emilia-Romagna Region (grant: PE-2011-02348727).

Declaration of competing interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
• The self-reported single-item visual analogue scale (VAS) is used to evaluate depression severity with the following instruction: "On a scale from 0 to 10, where 0 is the worst mood imaginable and 10 is the best mood imaginable, please indicate how you are feeling right now by marking a point on the line". Participants rate their self-perceived level of depression by making a cross on a continuous, straight 10 cm line drawn on paper. Outcome values are calculated by measuring the distance reached from point 0 using a ruler.
• The Mini International Neuropsychiatric Interview (MINI) is a short, fully structured interview designed to identify the 17 most common psychiatric disorders in DSM-III-R, DSM-IV, DSM-5 and ICD-10.

Fig. 1. Alluvial plot displaying the progression of participation throughout the study. Each vertical block's height represents the number of patients for a particular visit, with black blocks indicating patients attending and red blocks representing those who missed the visit. The blocks are interconnected to demonstrate how they change over time.

Fig. 2. Distribution of measurements for the different scales. In each subfigure, for non-binary scales, boxplots show the distribution of values; for MINI, which has the two possible values "Yes" and "No", a bar plot shows the percentage of "Yes". (A) All the measurements together. (B) Measurements split by prepartum and postpartum. For non-binary scales, adjusted p-values from Mann-Whitney tests are shown: these assess whether prepartum and postpartum distributions for a given non-binary scale are significantly different. (C) Measurements split by visit.

Fig. 3. (A) Scatterplots in log-log scale for pairs of non-binary scales. A jitter (0.25 on log-transformed values) was added on both axes in order to make the graph more readable. Prepartum and postpartum visits are shown in different panels. (B) Diagrams showing Spearman's correlation coefficients for pairs of non-binary scales in prepartum and postpartum.

Fig. 4. (A) Boxplots showing distributions of non-binary scales for the two samples determined by dividing observations into two groups according to their value of binary EPDS. Adjusted p-values from Mann-Whitney tests are shown: these assess whether the two samples are significantly different. (B) Analogous to subfigure A, but for samples determined by means of MINI.