Introduction

Parkinson’s disease (PD) is a progressive neurodegenerative condition characterized by three cardinal motor symptoms: rigidity, resting tremor, and bradykinesia. However, there is increasing evidence that PD patients can suffer several non-motor disturbances, such as sleep disorders, cognitive impairment, and behavioral changes, that can appear early and dominate the clinical picture during progression. The relationship between PD and behavioral disturbances has been explained by the connection between the basal ganglia ventral region and the cingulate and orbital cortices, and the connection between the basal ganglia medial region and the orbitofrontal and prefrontal areas (Papagno & Trojano, 2018; Trojano & Papagno, 2018).

Opportunities to use technology to modulate or influence brain circuitry and human behavior have increased in recent years, with deep brain stimulation (DBS) being the most important and accepted treatment (Marsili et al., 2021). Targets for DBS are chosen based on the patient's predominant symptomatology. Neurostimulation of the subthalamic nucleus (STN) allows a reduction of levodopa intake in advanced PD (Deuschl et al., 2006), while globus pallidus internus (GPi) stimulation seems to reduce dyskinesia (Krause, 2001; Zhang et al., 2021) and psychiatric symptoms (Bang Henriksen et al., 2016). Analyzing the subthalamic versus pallidal DBS therapeutic efficacy for PD, Elgebaly et al. (2018) concluded that there were no significant differences between the two procedures. Specifically, both improve motor function and, consequently, daily living (Deuschl et al., 2006). Only psychomotor processing speed, as measured by the Stroop color-naming test, seemed to favor the GPi DBS group (Elgebaly et al., 2018).

Even though DBS (specifically STN or GPi stimulation) was first FDA approved for PD in 2003, few studies have reported a follow-up greater than five years (Hitti et al., 2020). Among them, Bang Henriksen et al. (2016) reported the survival rate and outcomes, such as presence of hallucinations, dementia, and nursing home placement, of PD patients treated with DBS with at least ten years of follow-up. They observed that dementia was present in 46% and hallucinations in 58% of the 79 PD patients. Furthermore, older age at surgery was correlated with an increased prevalence of nursing home placement. Other outcome domains such as patient satisfaction, motor symptom control, and ability to perform activities of daily living (ADLs) were instead investigated by Hitti et al. (2020), who reported an overall improvement. Nevertheless, after a very long follow-up, more than 20 years post DBS, different ethical questions might emerge, such as the ones addressed by Gilbert and Lancelot (2021). They reported a case study and analyzed how extending life span without improving quality of life may result in a burden for patients and families.

Another domain that can be affected in PD is mood. Two randomized prospective studies reported no change in depression between baseline and after six months in a class I trial, while anxiety seemed to improve (Kurtis et al., 2017; Witt et al., 2008). The authors suggest caution in the interpretation of such changes after DBS because the Beck anxiety inventory, used in the evaluation, included several items with a strong somatic connection.

An important concern about DBS is how it affects cognitive performance. Indeed, exclusion criteria include dementia or a significant dysexecutive syndrome and depression or anxiety. This is accurate for studies, but in a clinical setting, cognitive impairment, depression, and similar disorders are considered relative neurobehavioral contraindications to DBS. Therefore, it is crucial to investigate cognitive domains, considering that decline could occur independently as a consequence of the pathological process and of aging.

A number of publications (Kurtis et al., 2017; Parsons et al., 2006; Witt et al., 2013) reported that STN DBS produces a statistically significant but mild decrease in executive functions and working memory. However, no significant cognitive changes were reported in 57% of the studies included in a meta-analysis (Appleby et al., 2007). Of note is that these review conclusions are based on DBS across a variety of conditions, not just PD. Recent reviews of cognitive outcomes after DBS in PD found heterogeneous cognitive effects, although deficits in verbal fluency were consistent and related to micro-lesions (Cernera et al., 2019, 2020; Mulders et al., 2021). In these meta-analyses and reviews, different neuropsychological instruments are used to investigate the same function, but tests investigating the same cognitive domain can in fact involve different components. Moreover, time of testing after DBS is not taken into account. The time of testing, measured as the length of time between DBS and neuropsychological evaluation, can have an important impact on the cognitive domain. At three months after DBS the patient may not be completely recovered (micro-lesions) which may affect performance, and considering that DBS is a long-term stimulation therapy, from our perspective it is important to understand what occurs one year or more after the DBS, as some effects might emerge over time. A too-long interval could possibly include effects of ageing and of disease progression.

Considering these issues, we performed a more careful selection of studies to avoid possible pitfalls (Funkiewiez, 2004). The primary aim of the current review was to provide a comprehensive overview and meta-analysis of the DBS long-term effects on four cognitive functions, namely, memory, executive functions, language, and mood, as measured by specific neuropsychological tests.

This review focuses on i) specific domains that are believed to be affected by DBS based on the previous literature, ii) specific neuropsychological tests (i.e., those used most in the selected papers to obtain more precise information), iii) a well-defined period (12 to 36 months after DBS), and iv) explicit avoidance of the insertion of data from the same clinical population more than once, as some studies are based on overlapping or identical samples.

Methods

The present meta-analysis, conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Liberati et al., 2009; Moher et al., 2015), is based on 48 studies (Table 1) investigating the long-term (12 to 36 months) cognitive changes in PD patients following DBS. Four cognitive domains were considered: (i) memory, namely, delayed recall, working memory (backward digit span), and immediate recall, (ii) executive functions, namely, inhibition control (color-word Stroop test) and flexibility (phonemic verbal fluency), (iii) language (semantic verbal fluency), and (iv) mood (anxiety and depression).

Table 1 Summary of the studies’ characteristics

Literature Search and Study Selection

Two electronic databases, MEDLINE (https://pubmed.ncbi.nlm.nih.gov) and Web of Science, were searched for studies investigating DBS and its impact on the previously mentioned cognitive domains in PD patients. The considered papers were published between January 2000 and June 2021. The following terms were used in our search: (1) “Parkinson’s disease” or “PD” AND (2) “deep brain stimulation” or “DBS” AND (3) “cognition”, “memory”, “executive functions”, “language”, “depression”, “anxiety”. The reference lists of included studies and relevant reviews were searched to identify additional studies (Altinel et al., 2019; Appleby et al., 2007; Barbosa & Charchat-Fichman, 2019; Büttner et al., 2019; Cernera et al., 2020; Combs et al., 2015; Constantinescu et al., 2017).

Titles, abstracts, and full-text articles were screened independently by the authors and evaluated for eligibility based on the following inclusion and exclusion criteria:

Inclusion Criteria

  • interventions designed for adults with advanced Parkinson's disease,

  • DBS stimulation (unilateral or bilateral) of STN or GPi,

  • DBS specified as main intervention or treatment,

  • outcomes were measurable continuous variables including at least one of the four areas (memory, executive functions, language, mood),

  • neuropsychological data were reported before and after DBS surgery, or between PD DBS and control group (PD without DBS),

  • at least one standardized neuropsychological instrument from the following was used: delayed recall, backward digit span, immediate recall, color-word Stroop test, phonemic verbal fluency, semantic verbal fluency, anxiety, and depression scales,

  • the follow-up period was between 12 and 36 months. We considered that a period of 12 months could be long enough to exclude the effects of the surgical procedure, while a period longer than 36 months could include changes in cognitive functions due to ageing or progression of the neurodegenerative disease (and not due to DBS),

  • at least 5 participants in the study,

  • peer-reviewed publications,

  • published in English.

When several papers were derived from the same study, either with increased recruitment or extended follow-up evaluations, we chose the one with the higher number of participants and the most complete data reported at follow-up. If papers deriving from the same study reported results for different neuropsychological tests, then all papers were included in this review but in different meta-analyses, making sure that only one outcome from the same population was included each time.

Exclusion Criteria

  • other cephalic stimulation sites, for example, caudal zona incerta (Philipson et al., 2020),

  • a follow-up period less than 12 months or more than 36 months,

  • pathologies other than PD,

  • case reports and research studies with fewer than five participants,

  • articles from gray literature (i.e., literature that is not formally published in sources such as books or journal articles, e.g., unpublished Ph.D. theses). We chose not to include Ph.D. theses for two main reasons: Ph.D. students often embargo their dissertations (i.e., for 2 to 3 years they are not accessible even upon request), and good theses are published in journals as research articles,

  • studies not published in, nor translated into, English,

  • data could not be extracted because the study lacked data integrity to analyze treatment effects and no reply was obtained when writing to the authors.

As far as we know, there is no consensus regarding which tests or scales are to be used by clinicians nor which cognitive functions are to be evaluated in PD patients undergoing DBS (Papagno & Trojano, 2018; Trojano & Papagno, 2018). Therefore, we chose tests that were reported more often in the literature and that better capture the functions of interest (Dujardin et al., 2016). Specifically, we created a database with more than 50 different neuropsychological tests and selected those most frequently used. As already reported, these tests were: (i) delayed recall, backward digit span (working memory), and immediate recall to measure MEMORY, (ii) the color–word Stroop test and phonemic verbal fluency to measure EXECUTIVE FUNCTIONS, (iii) semantic verbal fluency for LANGUAGE, (iv) anxiety and depression scales to measure MOOD. In this last case, different but equivalent scales were used. Similarly, different delayed and immediate recall tests were employed to reach the highest statistical power (i.e., a higher number of included papers). In the results section, we reported which tests were used in each included publication. Most of the data were based on verbal memory tasks, which, from our perspective, does not mean that all the tests are equivalent but indicates that most of the instruments share a common objective.

We ran a preliminary selection based on title, keywords, and abstract excluding those that clearly did not satisfy our criteria. Subsequently, we made a further selection by inspecting the full manuscripts and applying the inclusion and exclusion criteria. Unresolved papers were discussed by the authors to reach a consensus.

Data Extraction

For each included paper, the relevant information to be extracted concerned i) intervention characteristics, including the target area, unilateral or bilateral DBS implantation (Table 1), ii) study characteristics, including design, language, main objective, and conclusions (Table 1), and iii) patient characteristics including sample size (treatment and control group), gender, age, education, disease duration, time from DBS, before and after DBS levodopa equivalent daily dosage (LEDD), Unified Parkinson's Disease Rating Scale (UPDRS) Part III: clinician-scored motor evaluation (before and after DBS) (Table 2).

Table 2 Summary of patients’ characteristics

The standardized mean difference (SMD) computed as Hedges’ g and sampling variance for each included study were calculated using the Comprehensive Meta-Analysis Software, while summary analyses, the likelihood of publication bias, and heterogeneity tests were computed using the “metafor” package for R, version 2.4-0 (Viechtbauer, 2010). Pre- versus post-SMDs have some limitations: in uncontrolled designs it is often impossible to disentangle which proportion of the SMD (in our case Hedges’ g) is due to the treatment and which to spontaneous recovery or other uncontrolled variables. In other words, the pre-post SMDs are not always informative about the effects of the treatment, and these types of studies often suffer high levels of heterogeneity. Another important issue with pre-post SMDs is that the scores on the outcome measures at pre-test and those at post-test are not independent of each other. To account for the correlation between these two scores, and because the value for this correlation is seldom reported, we assumed Rosenthal's conservative estimate of 0.7 as a fixed value, as many previous studies did (Hofmann et al., 2014; Johnsen & Friborg, 2015). Considering that this value is not based on empirical data, we also ran the same analysis with three other correlation values (0.0, 0.5, and 0.9) to verify whether the results changed when the correlation estimate was modified.

When a control group was included, Hedges’ g and variance were calculated for each study based on the pre-post means and standard deviations, the number of participants from both control and DBS groups, and the estimated correlations of 0.0, 0.5, 0.7, and 0.9. We only report the results with a correlation of 0.7, since we observed that what was statistically significant remained so regardless of the assumed correlation.
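The role of the assumed pre-post correlation can be illustrated with a minimal sketch. The source does not state the exact formula applied by the software, so the code below assumes the standard textbook computation for a one-group pre-post design (change-score SD rescaled to the within-group SD, followed by the small-sample correction). The function name and the input values are hypothetical and serve only as illustration.

```python
import math

def hedges_g_prepost(mean_pre, sd_pre, mean_post, sd_post, n, r=0.7):
    """Return (g, variance) for a one-group pre-post design.

    r is the ASSUMED pre-post correlation (0.7 in the main analysis).
    The formula is a textbook choice, not confirmed by the source.
    """
    # SD of the change scores implied by the two SDs and r
    sd_diff = math.sqrt(sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post)
    # Rescale to the within-group SD so d is on the usual SMD metric
    sd_within = sd_diff / math.sqrt(2 * (1 - r))
    d = (mean_post - mean_pre) / sd_within
    var_d = (1 / n + d**2 / (2 * n)) * 2 * (1 - r)
    # Small-sample correction factor J with df = n - 1
    j = 1 - 3 / (4 * (n - 1) - 1)
    return j * d, j**2 * var_d

# Sensitivity analysis over the four correlation values (made-up scores)
for r in (0.0, 0.5, 0.7, 0.9):
    g, v = hedges_g_prepost(10.0, 2.0, 9.0, 2.0, n=20, r=r)
    print(f"r = {r:.1f}: g = {g:.3f}, variance = {v:.4f}")
```

With equal pre and post SDs, as in this toy input, g itself does not depend on r and only its variance shrinks as r grows. When the two SDs differ, g also shifts with r, which is why a sensitivity analysis over several values is informative.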

Study Quality Assessment

The methodological quality of the included studies was assessed using the Physiotherapy Evidence Database (PEDro) tool and a grade for the level of evidence was assigned to each study according to the modified Sackett Scale (see Table 1 for the level of evidence; Sackett et al., 2000; Moseley et al., 2002). PEDro consists of a checklist of 11 yes-or-no questions (Table S1 in Appendix A, Supplementary data) assessing the quality of clinical trials. The PEDro scale is considered a valid and comprehensive instrument previously applied in systematic reviews (de Morton, 2009; McIntyre et al., 2016). Items can be scored as either present (1) or absent (0), and the total score is obtained by summation. Higher values indicate greater quality (9–10, excellent; 6–8, very good; 4–5, good; < 4, poor; Foley et al., 2003). Criterion 1, which relates to external validity, is not used to calculate the PEDro score.
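The scoring rule described above can be sketched in a few lines. This is only an illustration of the stated rule (criterion 1 excluded, ten remaining items summed, bands as cited from Foley et al., 2003). The function names and the example ratings are hypothetical.

```python
def pedro_score(items):
    """Sum PEDro items 2-11; `items` holds 11 booleans in checklist order."""
    if len(items) != 11:
        raise ValueError("the PEDro checklist has 11 criteria")
    # Criterion 1 (external validity) is recorded but not counted
    return sum(1 for met in items[1:] if met)

def pedro_band(score):
    """Map a 0-10 PEDro score to the quality bands quoted in the text."""
    if score >= 9:
        return "excellent"
    if score >= 6:
        return "very good"
    if score >= 4:
        return "good"
    return "poor"

# Hypothetical study: criterion 1 met, five of the ten scored items met
ratings = [True] + [True] * 5 + [False] * 5
score = pedro_score(ratings)
print(score, pedro_band(score))
```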

The Sackett Scale includes five levels of evidence. Level 1 refers to meta-analysis and “high-quality” RCTs (PEDro score ≥ 6). Level 2 evidence is also derived from RCTs but from those with PEDro scores less than six, while Level 3 evidence refers to case-control studies. Levels 4 and 5 comprise uncontrolled pre- and post-treatment tests, observational studies, case studies, or single-subject series with no multiple baselines. Overall evidence was qualified using the grading of recommendations, assessment, development, and evaluations (GRADEpro GDT, https://gradepro.org) and the Meader et al. (2014) GRADE assessment checklist (Table S2 in Appendix A, Supplementary data). GRADE provides a transparent approach and guidance on rating the overall quality of research evidence indicating four levels of evidence along a continuum (i.e., high, moderate, low, and very low) based on five factors including 1) risk of bias, 2) inconsistency, 3) indirectness, 4) imprecision, and 5) publication bias (Meader et al., 2014).

For each meta-analysis, the pooled effect and the level of heterogeneity, by means of the Q and I2 statistics, were calculated (Higgins & Thompson, 2002). The Q-statistic, representing the ratio of observed variation to within-study variance, indicates how much of the overall heterogeneity can be attributed to between-studies variation. Being a null hypothesis significance test, it assesses the null hypothesis that all studies are examining the same effect. Therefore, when statistically significant, it implies that the included studies do not share a common effect size (Higgins & Thompson, 2002; Quintana, 2015). I2 (i.e., total heterogeneity or total variability) is a percentage which estimates the proportion of the observed variance reflecting a real difference in effect sizes, or the actual difference between studies (Borenstein et al., 2017). I2 values of 25%, 50%, and 75% represent low, moderate, and high inconsistency, respectively. Influential cases were identified using the “inf” function from the “metafor” package for R (Del Re, 2015; Kovalchik, 2013; Polanin et al., 2016). To identify studies that may have disproportionately contributed to heterogeneity and the overall result, we used the Baujat plot (Baujat et al., 2002). The horizontal axis illustrates study heterogeneity and the vertical axis illustrates the influence of a study on the overall result. We also applied a set of diagnostics derived from standard linear regression, available within the “metafor” package, in order to spot potential outliers which could influence the observed heterogeneity (Viechtbauer & Cheung, 2010).
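As a minimal sketch of how the two heterogeneity statistics relate, the code below computes Q as the inverse-variance weighted sum of squared deviations from the pooled effect, and I2 as the share of Q in excess of its degrees of freedom (Higgins & Thompson, 2002). The function name and the input values are hypothetical. The p-value for Q, which requires a chi-square distribution with K - 1 degrees of freedom, is left out to keep the example dependency-free.

```python
def heterogeneity(effects, variances):
    """Return (Q, df, I2 in percent) from study effects and sampling variances."""
    weights = [1.0 / v for v in variances]
    # Inverse-variance weighted mean effect
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted squared deviations of each study from the pooled effect
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I2 is truncated at zero when Q is smaller than its degrees of freedom
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2

# Made-up effects and variances for three studies
q, df, i2 = heterogeneity([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
print(f"Q = {q:.2f} on {df} df, I2 = {i2:.1f}%")
```

The truncation at zero explains why a Q value below its degrees of freedom is reported as an I2 of 0%.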

Meta-analysis publication bias may be due to various elements, such as the fact that we explicitly included only peer-reviewed, English-written papers, or that experiments with small effect sizes are more likely to remain unpublished. The likelihood of publication bias was assessed graphically by using the funnel plot together with Egger’s regression test (Egger et al., 1997) and the rank correlation test (Begg & Mazumdar, 1994). The trim and fill method (Duval & Tweedie, 2000), which imputes “missing” studies to create a more symmetrical funnel plot, was used for bias correction only if the previously mentioned tests were significant, since a p-value < 0.05 is consistent with a non-symmetrical funnel plot.
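To make the logic of the regression test concrete, the sketch below implements the classical formulation of Egger et al. (1997): each standardized effect (effect divided by its standard error) is regressed on precision (one divided by the standard error), and an intercept far from zero signals funnel plot asymmetry. This is an assumption about the general technique and may differ in detail from the variant actually run in metafor. The function name and the data are hypothetical, and the p-value (a t distribution with K - 2 degrees of freedom) is left to a statistics library.

```python
import math

def egger_test(effects, std_errors):
    """Classical Egger regression of (effect / SE) on (1 / SE).

    Returns (intercept, se_intercept, t_statistic, df).
    """
    z = [y / se for y, se in zip(effects, std_errors)]
    precision = [1.0 / se for se in std_errors]
    k = len(z)
    if k < 3:
        raise ValueError("at least three studies are needed")
    mean_x = sum(precision) / k
    mean_z = sum(z) / k
    sxx = sum((x - mean_x) ** 2 for x in precision)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    sxz = sum((x - mean_x) * (zi - mean_z) for x, zi in zip(precision, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    # Ordinary least squares standard error of the intercept
    rss = sum((zi - intercept - slope * x) ** 2 for x, zi in zip(precision, z))
    df = k - 2
    se_intercept = math.sqrt(rss / df * (1.0 / k + mean_x ** 2 / sxx))
    t = intercept / se_intercept if se_intercept > 0 else math.inf
    return intercept, se_intercept, t, df

# Made-up data in which smaller studies (larger SE) show larger effects
ses = [0.10, 0.20, 0.25, 0.50]
effs = [0.42, 0.47, 0.58, 0.78]
intercept, se_int, t, df = egger_test(effs, ses)
print(f"intercept = {intercept:.2f}, t = {t:.2f}, df = {df}")
```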

To avoid a large number of figures and tables, some of the materials, such as the risk of bias evaluation tables and the Baujat and Funnel plots, are found in Appendix A in the Supplementary Data.

Results

We retrieved 2522 citations. Duplicates and studies that did not satisfy the inclusion criteria as revealed by the title or abstract were excluded, and 277 papers underwent full review (Fig. 1), resulting in 48 accepted articles (Table 1).

Fig. 1 Flow diagram of study selection and inclusion

Studies Characteristics

Thirty-four studies were within-subjects designs (uncontrolled pre-post DBS) and fourteen papers had a PD (no DBS) control group. With the exception of two RCTs (Schuepbach et al., 2013; Tramontana et al., 2015), no studies were blinded or randomized. The methodological quality of the RCTs was high, level 1b evidence. The 14 studies with a control group were rated as level 3 on the modified Sackett Scale (McIntyre et al., 2016), while the uncontrolled pre-post studies were considered as level 4 evidence (Tables 1 and A.3). The main risk of bias was due to the methodological limitations of the open-label design. Cognitive decline may occur in patients with PD over time, and in serial (test–retest) neuropsychological assessments, a repeated performance can improve due to practice effects when no parallel versions are used. Another main risk of bias was the lack of randomization and allocation concealment in the between-subjects designs.

Table 3 Summary of all the meta-analysis results, grouped by domains

Participants’ Characteristics and Intervention

This review includes 2039 adults with a clinical diagnosis of PD undergoing DBS surgery and 271 PD control participants (ODT [optimal drug therapy]). Of the DBS patients, 1768 received STN stimulation and 271 received GPi stimulation. Only 36 patients received unilateral DBS. In the DBS participants, LEDD was lowered in almost all cases (Table 2).

The participants’ characteristics were heterogeneous among studies especially concerning the time interval from PD diagnosis, age, and inclusion criteria (Table 2).

Regarding the DBS intervention, unilateral or bilateral STN was targeted in all the included studies, but only seven studies targeted STN and GPi separately. The stimulation parameters were also heterogeneous and not always specified. For this reason, we could not use this type of data as covariates in our analysis. For example, different types of electrodes from monopolar (Mikos et al., 2010) to tetrapolar (Acera et al., 2019) were used, pulse width ranged between 60.5 ± 10.9 μs (Pillon et al., 2000) and 94.0 ± 10.56 μs (Fraraccio et al., 2008), and usually a high-frequency stimulation, for example, 183.5 Hz (Rothlind et al., 2007) or 130–135 Hz (Asahi et al., 2014), was applied, while voltage varied between 2 and 4 V (Boel et al., 2016; Fraraccio et al., 2008; Odekerken et al., 2015; Pillon et al., 2000).

Meta-analysis Results

As previously mentioned, we took into consideration only specific neuropsychological tests that cover four cognitive areas. After inspecting the literature, we kept the tests that were most frequently used in order to have sufficient statistical power. When the instruments for a domain we considered important were too heterogeneous to choose a single test, we combined different neuropsychological tests (this is true for delayed recall, immediate recall, depression, and anxiety).

The standardized mean differences (SMD) were pooled using the random-effects model regardless of the heterogeneity test results (Q or I2), since there was a certain amount of variance between studies due to their characteristics (e.g., stimulation parameters, stimulation areas, patients’ characteristics). All the meta-analytic results are listed in Table 3.
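For readers unfamiliar with random-effects pooling, the following sketch shows one common way to obtain the pooled SMD and its 95% confidence interval. The source does not state which between-study variance estimator was used (metafor defaults to REML), so the simpler DerSimonian-Laird estimator is used here as a stand-in. The function name and the input values are hypothetical.

```python
import math

def random_effects_dl(effects, variances):
    """Pool effects with a DerSimonian-Laird tau^2 (an assumed estimator).

    Returns (pooled, ci_low, ci_high, tau2).
    """
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum_w
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    # Scaling constant of the method-of-moments estimator
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to every sampling variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, tau2

# Made-up Hedges' g values and variances for three studies
pooled, lo, hi, tau2 = random_effects_dl([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
print(f"g = {pooled:.2f} (95% CI = [{lo:.2f}; {hi:.2f}]), tau2 = {tau2:.3f}")
```

When the estimated between-study variance is zero, the random-effects weights reduce to the fixed-effect weights, so choosing the random-effects model for a homogeneous set of studies does not change the pooled estimate.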

DBS Effects on Memory

Delayed Recall

Twenty-four papers were considered and nine different assessment instruments were used, including (a) Rey Auditory Verbal Learning Test – delayed recall (Mulders et al., 2021; Acera et al., 2019; Boel et al., 2016; Odekerken et al., 2015; Heo et al., 2008; Rizzone et al., 2014; Smeding et al., 2011; Williams et al., 2011), (b) Grober and Buschke Verbal Learning Test – delayed free recall (Dujardin et al., 2001; Funkiewiez, 2004; Pillon et al., 2000), (c) Hopkins Verbal Learning Test-Revised – delayed recall (Follett et al., 2010; Mikos et al., 2010), (d) Wechsler Memory Scale – delayed logical memory (Fraraccio et al., 2008; Klempírová et al., 2007; You et al., 2020; Zangaglia et al., 2009), (e) California Verbal Learning Test – long delay free recall (Janssen et al., 2014; Woods et al., 2001), (f) Brief Visuospatial Memory Test–Revised (Rothlind et al., 2007), (g) Repeatable Battery for the Assessment of Neuropsychological Status – delayed memory index (Asahi et al., 2014), (h) Chinese Auditory Verbal Learning Test – delayed recall (Tang et al., 2015), and (i) Story recall test – delayed free recall (Volonté et al., 2021).

The SMD for STN and GPi combined was small but statistically significant (Fig. 2, and in Appendix A – Figs. S1 and S2), Hedges’ g = -0.13 (95% CI = [-0.23; -0.02]; p = 0.02; K = 24, N = 1429).

Fig. 2 Forest plot – DBS effects on delayed recall

Studies including a control group (DBS vs. ODT PD) were analyzed separately in order to exclude possible confounding factors that characterize pre-post data. A statistically significant negative effect was observed in that the DBS PD patients’ scores were lower than those of the ODT PD group (Fig. 2, and in Appendix A Figs. S3 and S4), Hedges’ g = -0.40 (95% CI = [-0.75; -0.05]; p = 0.02; K = 5, N_control = 130, N_DBS = 200). We did not observe a funnel plot asymmetry, but two studies (You et al., 2020; Zangaglia et al., 2009) had a higher impact on the result.

Backward Digit Span

Analysis included nine studies (Contarino et al., 2007; Daniele et al., 2003; Dujardin et al., 2001; Fraraccio et al., 2008; Moretti et al., 2003; Rizzone et al., 2014; Rothlind et al., 2007; Tang et al., 2015; Yamanaka et al., 2012). Pooled data did not provide evidence of significant changes after DBS (Fig. 3), Hedges' g = 0.11 (95% CI = [-0.02; 0.23]; p = 0.09; K = 9, N = 156). The levels of heterogeneity among studies were low: Q test = 6.07, p = 0.64, and I2 was 0.00%, but we applied the random-effects model because the differences between studies in terms of participants’ characteristics were relevant. The Yamanaka et al. (2012) study was identified as an outlier by the Baujat plot (Fig. S5 in Appendix A). Publication bias was evaluated using the funnel plot, the Egger's regression intercept test, and the rank correlation test for funnel plot asymmetry (Kendall's tau). No evidence of publication bias was found (Fig. S6 in Appendix A, Table 3).

Immediate Recall

Analysis included 16 papers and nine different assessment instruments, including (a) Grober and Buschke Verbal Learning Test – free immediate recall (Dujardin et al., 2001; Funkiewiez, 2004; Pillon et al., 2000), (b) Wechsler Memory Scale – logical immediate memory (Fraraccio et al., 2008; Klempírová et al., 2007; Tröster et al., 2017), (c) Hopkins Verbal Learning Test-Revised – logical memory (Follett et al., 2010; Mikos et al., 2010), (d) California Verbal Learning Test – immediate verbal list learning (Woods et al., 2001), (e) Rey–Kim Memory Battery – verbal memory immediate recall (Heo et al., 2008), (f) Repeatable Battery for the Assessment of Neuropsychological Status (Asahi et al., 2014), (g) Rivermead Behavioral Memory Test (Odekerken et al., 2015), (h) Chinese Auditory Verbal Learning Test – verbal memory (Tang et al., 2015), and (i) Rey Auditory Verbal Learning Test (Boel et al., 2016; Acera et al., 2019; Mulders et al., 2021).

The random-effects meta-analysis, for STN and GPi combined, yielded a statistically non-significant result, with an overall effect size of Hedges' g = -0.06 (95% CI = [-0.21; 0.09]; p = 0.645; K = 16, N = 720) (Fig. 3). We further used the Baujat plot to explore heterogeneity (Fig. S7 in Appendix A) and the funnel plot to assess publication bias (Fig. S8 in Appendix A). In the absence of publication bias, studies should be distributed symmetrically with larger studies appearing toward the top of the graph and clustered around the mean effect size and smaller studies toward the bottom. Data showed no potential outliers, and tests for publication bias indicated no need for bias correction, given that neither the rank correlation nor Egger's regression test was statistically significant.

Fig. 3 Forest plot – DBS effects on working memory and immediate recall

DBS Effects on Executive Function

Phonemic Verbal Fluency

Thirty-one studies investigated phonemic fluency post STN DBS. Hedges’ g value was -0.42 (95% CI = [-0.51; -0.33]; p < 0.0001; K = 31, N = 1326). In Appendix A, Fig. S9 shows the forest plot, while Figs. S10 and S11 show the Baujat plot and the funnel plot, respectively. The study by Moretti et al. (2003) was identified as a potential outlier. After its exclusion, the effect size did not significantly change, indicating a decrease in the phonemic fluency performance after DBS, Hedges’ g = -0.40 (95% CI = [-0.49; -0.32]; p < 0.0001; K = 30, N = 1308) (Fig. 4).

When only the GPi stimulation studies were pooled into the analysis, Hedges’ g had a value of -0.30 (95% CI = [-0.55; -0.04]; p = 0.02; K = 6, N = 304) (Fig. 4, and in Appendix A Figs. S13 and S14). This result was statistically significant, suggesting a relevant pre-post change in phonemic fluency after GPi DBS. The heterogeneity was high (I2 = 79%) and the Baujat plot indicated Pillon et al. (2000) as a potential outlier (Fig. S13 in Appendix A). The funnel plot revealed no publication bias (Fig. S14 in Appendix A).

Eight studies compared DBS and ODT PD participants, and the results were statistically significant, characterized by low heterogeneity and no publication bias (Fig. 4, and in Appendix A Figs. S15 and S16), Hedges’ g = -0.56 (95% CI = [-0.79; -0.33]; p < 0.0001; K = 8, N_control = 183, N_DBS = 256).

Fig. 4 Forest plot – DBS effects on phonemic verbal fluency

Color–word Stroop Test

We pooled 21 studies into the meta-analysis, and the result for all STN studies was statistically significant, Hedges’ g = -0.30 (95% CI = [-0.39; -0.22]; p < 0.0001; K = 21, N = 958, I2 = 42.6%) (Fig. 5, and in Appendix A Figs. S17 and S18). Because the funnel plot and regression test for asymmetry suggested a risk of publication bias (Appendix A Fig. S18), the trim and fill method was applied, estimating five missing studies on the right side, Hedges’ g = -0.26 (95% CI = [-0.34; -0.18]; p < 0.0001; K = 26). Data indicate that after STN DBS, the performance decreased.

Considering only GPi stimulation, Hedges’ g had a value of -0.16 (95% CI = [-0.38; 0.05]; p = 0.13; K = 6, N = 304, I2 = 70.6%), a small effect size that did not reach statistical significance (Fig. 5, and in Appendix A Figs. S19 and S20).

Comparing pre-post DBS and ODT PD patients, an effect size of -0.45 was observed (95% CI = [-0.74; -0.15]; p = 0.003; K = 5, N_controls = 117, N_DBS = 182; I2 = 27.9%), a medium value indicating that after surgery the DBS group had statistically significant lower scores (Fig. 5, and in Appendix A Figs. S21 and S22).

Fig. 5 Forest plot – DBS effects on Stroop test (color–word)

DBS Effects on Language

Semantic Fluency

Twenty-eight studies investigated STN DBS effects on semantic fluency. Six papers examined the GPi stimulation, and eight publications included a control group.

In the case of STN stimulation, SMD was -0.48 (95% CI = [-0.55; -0.41]; p < 0.0001; K = 28, N = 1378) (Fig. 6, and in Appendix A Figs. S23 and S24). The I2 of 42.8% indicated a moderate heterogeneity of the effect size. The pooled analysis also revealed a statistically significant SMD for GPi DBS, Hedges’ g = -0.50 (95% CI = [-0.59; -0.40]; p < 0.0001; K = 6, N = 304) (Fig. 6, and in Appendix A Figs. S25 and S26).

An additional subgroup meta-analysis on semantic fluency exploring the differences between DBS and no-DBS PD patients showed a Hedges’ g value of -0.49 (95% CI = [-0.70; -0.27]; p < 0.0001; K = 7, N_control = 139, N_DBS = 217) (Fig. 6, and in Appendix A Figs. S27 and S29). DBS patients obtained lower scores compared to no-DBS PD participants.

Fig. 6 Forest plot – DBS effects on semantic verbal fluency

DBS Follow-up Effects on Emotional State: Depression and Anxiety

All of the average scores of the psychometric scales, in which higher total scores indicate more severe symptoms (e.g., depression or anxiety), were multiplied by -1 to ensure that all scales pointed in the same direction. Specifically, an improvement of the investigated function will be located on the right part of the forest plot with a positive sign. In contrast, a lower score will be located on the left part of the forest plot, having a negative sign, and will indicate a decline of the cognitive function.
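The direction alignment described above amounts to a sign flip applied before the effect size is computed. A minimal illustration follows, with a hypothetical helper name and made-up scale scores.

```python
def align_direction(mean_score, higher_is_worse):
    """Flip symptom-severity scores so that larger always means better."""
    return -mean_score if higher_is_worse else mean_score

# Made-up depression scale means: 14 before DBS, 11 at follow-up
pre = align_direction(14.0, higher_is_worse=True)
post = align_direction(11.0, higher_is_worse=True)
change = post - pre  # positive after the flip: fewer symptoms at follow-up
print(change)
```

After the flip, a drop in symptom scores yields a positive pre-post difference, so it plots on the right side of the forest plot like an improvement on a performance test.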

Depression

Assessment was relatively consistent across the 27 included studies. The neuropsychological tests used for the evaluation follow. Seventeen used the Beck Depression Inventory (BDI) (Castelli et al., 2006; Dietrich et al., 2020; Follett et al., 2010; Funkiewiez, 2004; Heo et al., 2008; Janssen et al., 2014; Kim et al., 2013; Kishore et al., 2010; Liu et al., 2019; Pillon et al., 2000; Pusswald et al., 2019; Rothlind et al., 2007; Tang et al., 2015; Volonté et al., 2021; Witjas et al., 2007; Woods et al., 2001; Zibetti et al., 2011), four papers applied the Montgomery–Åsberg Depression Rating Scale (MADRS) (Acera et al., 2019; Ory-Magne et al., 2007; Schuepbach et al., 2013; Smeding et al., 2011), three used the Hamilton Depression Rating Scale (HAM-D) (Boel et al., 2016; Jiang et al., 2015), two publications applied the Zung Self-Rating Depression Scale (Daniele et al., 2003; Rizzone et al., 2014), and one the Hospital Anxiety and Depression Scale (Jost et al., 2021).

Analysis of DBS publications reporting data immediately before and after treatment (12 to 36 months follow-up) revealed a statistically significant but very small SMD of 0,34 (95% CI = [0,04; 0,65]; p = 0,02; K = 27, N = 1512) for STN (in Appendix A Figs. S29, S30 and S31, and in the manuscript Fig. S32) and an SMD of 0,11 (95% CI = [0,01; 0,21]; p = 0,03; K = 4, N = 231) for GPi stimulation (Fig. 7, in Appendix A Figs. S32 and S33), suggesting an improvement after DBS. The study by Schuepbach et al. (2013) was identified as a potential outlier. After its exclusion, the effect size for STN was 0,21 (95% CI = [0,07; 0,34]; p = 0,002; K = 26, N = 1261) (Fig. 7).

Since the regression test for the GPi funnel plot asymmetry was statistically significant (z = 2,04, p = 0,04), we used the trim and fill method. The new data, with one imputed missing study (Table 3), indicate an SMD of 0,10 (p = 0,05; K = 5), a small effect size that barely reaches statistical significance. Overall, these results suggest that depression was slightly reduced at follow-up compared to pre-surgery.
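A regression test for funnel plot asymmetry can be sketched as a weighted regression of the effect sizes on their standard errors. The variant below (fixed-effect model, standard error as predictor, z-test on the slope) is an assumption: the text reports a z statistic but does not specify the model. The function name and input values are hypothetical.

```python
from math import erfc, sqrt

import numpy as np

def regression_asymmetry_test(effects, std_errors):
    """Weighted regression of effect size on standard error.
    A slope far from zero suggests funnel plot asymmetry."""
    y = np.asarray(effects, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    X = np.column_stack([np.ones_like(se), se])  # intercept + SE predictor
    W = np.diag(1.0 / se ** 2)                   # inverse-variance weights
    cov = np.linalg.inv(X.T @ W @ X)             # coefficient covariance
    beta = cov @ X.T @ W @ y
    z = beta[1] / sqrt(cov[1, 1])
    p = erfc(abs(z) / sqrt(2))                   # two-sided normal p-value
    return z, p

# Hypothetical data in which smaller (less precise) studies report larger effects.
z, p = regression_asymmetry_test([0.3, 0.5, 0.7, 0.9], [0.1, 0.2, 0.3, 0.4])
print(round(z, 2), round(p, 3))
```

When such a test is significant, trim and fill imputes the studies presumed missing from the sparse side of the funnel and re-estimates the pooled effect, which is the adjustment reported above.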

Anxiety

The meta-analysis of 10 STN DBS studies showed a significant improvement after DBS (Fig. 7, and in Appendix A: Figs. S34 and S35), SMD of 0,30 (95% CI = [0,10; 0,50]; p = 0,01; K = 10, N = 290). Anxiety was assessed by means of the State-Trait Anxiety Inventory (STAI) (Castelli et al., 2006; Rothlind et al., 2007; Zibetti et al., 2011), the Beck Anxiety Inventory (Tang et al., 2015; Woods et al., 2001), Zung’s Anxiety Scale (Daniele et al., 2003), the Hamilton Anxiety Scale (Jiang et al., 2015), and the Hospital Anxiety and Depression Scale (Jost et al., 2021; Boel et al., 2016; Kishore et al., 2010). The Baujat plot (in Appendix A Fig. S34) and the Viechtbauer and Cheung influential test identified one study (Tang et al., 2015) as a potential outlier. From the forest plot (Fig. 7), it is evident that its effect size and confidence interval were higher than those reported in the other publications. In conclusion, the current data indicate that after STN DBS, patients have slightly lower anxiety and depression levels.

Fig. 7

Forest plot – DBS effects on mood: depression and anxiety

GRADE Assessment

The GRADE quality of evidence for all pre-post design outcomes (Table 4) was low (i.e., the true effect may differ significantly from the estimate) due to several methodological issues: (i) cognitive decline can occur in PD patients over time independently of treatment; (ii) in serial neuropsychological assessments, an improved performance may result from practice effects, although the relatively long intervals between cognitive assessments should partially reduce this confounding factor; (iii) heterogeneity of patients’ groups; and (iv) small sample size in some studies (Woods et al., 2001).

Table 4 GRADEpro summary: DBS in Parkinson's disease

The level of evidence for study designs that included a control group (Table 5) was moderate (i.e., the true effect is likely to be close to the estimated effect, but it may still differ). The downgrade was due to the lack of randomization, possible publication bias, and the fact that the no-DBS and DBS PD groups were not perfectly matched, especially regarding disease duration, which was shorter in the no-DBS PD group (Table 2).

Table 5 GRADEpro summary: DBS compared to ODT in Parkinson's disease

Discussion

STN and GPi DBS are effective, accepted therapies for PD motor complications, especially when drugs are not effective (Rughani et al., 2018). While DBS has been found to improve motor symptoms (Mao et al., 2019), long-term decrease in cognitive function has been reported (Merola et al., 2014). In terms of neuropsychological performance, the effects of DBS in PD have been investigated in several studies and meta-analyses (Altinel et al., 2019; Castrioto et al., 2014; Combs et al., 2015, 2018; Elgebaly et al., 2018; Liu et al., 2014; Mansouri et al., 2018; Martínez-Martínez et al., 2017; Parsons et al., 2006; Vizcarra et al., 2019). The number of reviews and meta-analyses on this topic is large when compared to the number of actual high quality, randomized clinical trials (or experimental papers), pointing to the difficulty of creating an experimental setting and controlling for confounding variables.

Our meta-analysis differs from the previous ones in three aspects. (i) The analyses concerned only four domains (memory, executive functions, language, and emotional state) that were previously reported as crucial for DBS in PD (Dujardin et al., 2016; Papagno & Trojano, 2018; Trojano & Papagno, 2018), and we tried to select homogeneous neuropsychological tools (the most frequently used neuropsychological tests). We chose this procedure as different tests might target different aspects of the same cognitive function. Previous publications consider the same function explored with different tools (Combs et al., 2015; Elgebaly et al., 2018; Wu et al., 2014). (ii) Besides the strict selection of neuropsychological tests, an additional novelty is a clear follow-up time-point (between 12 and 36 months post-DBS). Because the number of published RCTs with PD patients is small and the methodological quality is lowered by the difficulty of balancing the DBS group with the control group, we chose a specific follow-up period. In fact, in most cases, DBS patients have a longer disease duration and are older than no-DBS PD participants. We considered the 12 to 36 months post-DBS period to be optimal for two main reasons: a) it is long enough to exclude immediate post-surgery effects, allowing a post-intervention recovery, and b) it is short enough to avoid a cognitive decline due to disease progression or ageing. Previous meta-analyses did not control for the follow-up duration (e.g., 3 months, 6 months, 1 year, 6 years, etc. were considered together) (Sako et al., 2014; Wang et al., 2016; Xie et al., 2016). This creates an important bias since the cognitive performance at three months after DBS might be different from the one observed at 12 months. Moreover, ON versus OFF stimulation papers were not differentiated either, while we only considered DBS ON.
The restriction on the follow-up period has reduced the number of included studies but has offered a better perspective of the DBS's long-term effects on cognition. When possible, we compared DBS to a PD control group. Unfortunately, the number of included papers was exceedingly small (four to eight) and, as mentioned, the groups were not matched to each other. (iii) Our data are an update of the previous reviews, including very recent publications (Volonté et al., 2021) and more than 2000 PD DBS participants.

Results showed that delayed recall (long-term verbal and visuospatial) performance after DBS significantly changed in the 1–3-year follow-up. The decline was more relevant in the DBS group as compared to the ODT group (Hedges’ g = -0,40; p = 0,02) but was also present in the pre- versus post-DBS testing (Hedges’ g = -0,13, p = 0,02). This decline is reported in several studies (Mehanna et al., 2017; Nassery et al., 2016; Xie et al., 2016). Because it co-occurs with the decline of other functions, the delayed memory changes could depend on executive function and processing speed deterioration (Higginson et al., 2003).

The immediate recall analysis showed no statistically significant changes one to three years post DBS (Hedges' g = -0,06; p = 0,645). This is in contrast with previous reviews that found a negative change in memory performance. To our knowledge, the meta-analyses that indicated a subtle decline in memory after DBS aggregated both delayed and immediate recall data, so the observed changes might have been influenced by the delayed recall performance (Xie et al., 2016; Wang et al., 2016; Elgebaly et al., 2018; Parsons et al., 2006).

Working memory impairment is often encountered in PD patients (Papagno & Trojano, 2018), and one would predict that surgery, such as DBS, could further disrupt this function. Our results, similar to those published by Martínez-Martínez et al. (2017), indicated a small, statistically non-significant effect size (Hedges' g = 0,11; p = 0,09). We can conclude that there is no long-term negative impact of DBS on working memory, at least as measured by the backward digit span.

Executive function and attention impairments are among the most consistent findings in PD natural progression (Papagno & Trojano, 2018) and are also described after DBS (Martínez-Martínez et al., 2017). A large number of cognitive tests are subsumed under the executive functions heading, but we selected only two tests consistently applied in the literature that better capture flexibility and inhibition. It was not our intention to cover all aspects of executive functions. A future meta-analysis could focus exclusively on executive functions and cover all aspects from attention to working memory and problem solving. A decrease in performance was observed in both the phonemic verbal fluency and Stroop Test (DBS versus control, Hedges’ g = -0,45; p = 0,003), while no significant changes were observed after GPi stimulation on the Stroop Test (Hedges’ g = -0,16; p = 0,13). It has been suggested that the effects on phonological verbal fluency could depend on the position of the electrodes over the left STN (York et al., 2009), PD progression (Muslimović et al., 2007), or the effects of surgical microlesions (Lefaucheur et al., 2012; Mehanna et al., 2017; Wang et al., 2016). It has also been found that DBS can reduce left temporal and inferior frontal cortex activity, thus interfering with verbal fluency (Fasano et al., 2012). Another valid hypothesis could be that disease progression is associated with cognitive deterioration, especially in patients who would be DBS candidates. Meta-analyses that compared STN and GPi stimulation effects on cognition indicate that the electrodes’ locations might be crucial in order to minimize side effects. More specifically, some authors suggest that GPi might be safer than STN and that unilateral stimulation might be preferable to bilateral stimulation (Elgebaly et al., 2018; Liu et al., 2014). However, additional factors, such as the area of active stimulation or the volume of the electrode contact, can affect the outcome.
In our review, we could not control the stimulation type, bilateral vs. unilateral, due to the small number of studies applying unilateral DBS, but, when possible, we differentiated the stimulation target. We found a difference in the Stroop test with the STN group, but not the GPi group, showing decreased performance. In conclusion, there was some decrease in verbal fluency and inhibition after DBS.

Regarding linguistic abilities, although changes in verbal fluency are common in PD patients (Muslimović et al., 2007), the decrease after DBS is more severe. While Elgebaly et al. (2018) identified a slight improvement in the GPi DBS group in verbal fluency, our data show a moderate decrease in both semantic and phonemic fluency performance even with GPi stimulation (e.g., semantic verbal fluency GPi DBS: Hedges’ g = -0,50; p < 0,0001).

An improvement (small to medium effect sizes) of depression (SMD of 0,34; p = 0,02 for STN and a SMD of 0,11; p = 0,03 for GPi stimulation), and anxiety (SMD of 0,30; p = 0,01) was found. It is hard to interpret pre versus post data in terms of causality without a control group. For instance, it is plausible that emotions related to the upcoming intervention influenced pre-DBS anxiety and depression scores or that the improvement of motor symptoms after DBS reduced depression. However, there is also evidence indicating no significant difference between the DBS group and the medically treated group (Nassery et al., 2016). Besides anxiety and depression, another mood change increasingly recognized post-DBS, although not investigated in this review, is apathy (commonly described as loss of motivation, decreased initiative and energy, and an emotional indifference). Two recent meta-analyses (Wang et al., 2018; Zoon et al., 2021) concluded that apathy was more prevalent after STN DBS compared to the pre-operative state or to control groups managed only with medication. No data are available for social cognition, which is another area of impairment in PD (Mattavelli et al., 2021).

We must acknowledge several limitations in the present meta-analysis. As with all meta-analyses, the quality is limited by the number and the level of the included studies (GRADE analysis). Standardized tests are not always used, and some of them are performed in different versions (e.g., memory tests). Often it is not clearly reported how the final scores were calculated, making it hard to properly choose and group the correct means and SD for our purpose. There is also the risk of publication bias, meaning that studies with significant findings are more likely to be published with an overestimation of the effects. In order to control for this last limitation, as reported, we conducted publication bias analysis and when necessary, we corrected for the missing studies applying the trim and fill method. Reviews are also prone to search and selection bias. Based on the current findings and given the low statistical power of the pooled analysis, especially regarding the control group studies, it is clear that further RCTs comparing DBS and PD control groups (best pharmacological treatment) are required, with standardized outcome measures and adequately reported results. An important obstacle in planning RCTs is the patients' reluctance to participate in randomized studies. They are often unwilling to be in a control group for more than a few months. This is especially relevant now that many DBS devices have been approved and surgical centers have proliferated (i.e., patients can access DBS procedures without the onerous demands involved in research trials). The other issue is that control groups in long-term studies have not been constituted by random assignment and provide a false sense of security that extraneous variables are being controlled for.

We acknowledge that variables such as L-dopa response or age and attention at baseline are predictors of cognitive and psychosocial outcome after DBS (Smeding et al., 2011). Unfortunately, this type of data is not always reported, but in Table 2 we have included a selection of variables regarding the participants’ characteristics which may help the reader gain an understanding of the factors that may influence DBS cognitive outcomes.

Summing up, our findings add new data to the existing literature by demonstrating that cognition and emotion show significant changes after DBS, some positive, such as a decrease in anxiety and depression, and some negative, such as impairment in long-term memory, verbal fluency and specific subdomains of executive functions, for example, flexibility and inhibition.

These results have a possible simple explanation, since both GPi and STN are part of a circuit involved in inhibition or disinhibition of frontal areas (Huh et al., 2018). What can be taken for granted is that verbal fluency, long-term memory, and inhibitory control should be intact before submitting a patient to DBS, being crucial aspects to test before treatment. In line with this perspective, a recent publication presented data suggesting that poor presurgical performance in verbal memory recognition, language processing, and visuospatial performance is associated with patient- or caregiver-reported decline following DBS surgery (Mills et al., 2019). Of course, this is not a strict and straightforward recommendation, especially because the GRADE evaluation indicates that new publications could modify the observed effects. Clinicians should carefully balance the potential benefits and risks of a DBS intervention based upon each patient’s characteristics.

Finally, the improvement of motor symptoms probably produces a better perception of quality of life, even in the presence of cognitive worsening (Mehanna et al., 2017; Merola et al., 2014; Wu et al., 2014).