Physical Activity for the Treatment of Chronic Low Back Pain in Elderly Patients: A Systematic Review

Chronic low back pain (CLBP) affects nearly 20–25% of the population older than 65 years, and it is currently the main cause of disability in both developed and developing countries. Optimal management of this condition in older patients is crucial to improving their quality of life. This review evaluates the effectiveness of physical activity (PA) in improving disability and pain in older people with non-specific CLBP. The Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines were used to improve the reporting of the review. The risk of bias of individual studies was assessed using the RoB 2 tool and the ROBINS-I tool. The quality of evidence was assessed using GRADE analysis only for articles that presented full data. Articles were searched in five databases (Medline, Scopus, CINAHL, EMBASE, and CENTRAL). All the included articles met the following inclusion criteria: patients > 65 years old who underwent physical activity for the treatment of CLBP. A total of 12 studies were included: 7 randomized controlled trials (RCT), 3 non-randomized controlled trials (NRCT), 1 pre and post intervention study (PPIS), and 1 case series (CS). The studies showed high heterogeneity in terms of study design, interventions, and outcome variables. In general, post-treatment data showed a trend toward improvement in disability and pain. However, considering the low quality of evidence of the studies, the high risk of bias, the language limitations, the lack of significant results in some studies, and the scarcity of literature on this topic, further studies are necessary to strengthen the evidence.


Introduction
Low back pain (LBP) is a common symptom that can improve spontaneously within a few weeks. However, about 2-7% [1] of cases may evolve into chronic low back pain (CLBP), which may lead to significant disability. Age is a well-known risk factor for CLBP, in association with psychological distress, inactivity, social environment, comorbidity, gender, genetics, and prior work exposure [2,3]. CLBP affects approximately 20-25% of the elderly population (older than 65 years) [4], and it is currently the main cause of disability in both developing and developed countries [5,6]. Its prevalence increases linearly from the third decade of life, and it affects more women than men [7]. After a single episode of LBP, the risk of recurrence is higher [8]. CLBP, which is one of the most important conditions leading to work-related disability, has dramatic consequences for health system costs [9]. It is defined as pain located between the lower rib margins and the buttock that lasts for more than 12 weeks [10,11], and it is often accompanied by neurological symptoms in the lower limbs.

Materials and Methods
We focused our research on studies concerning PA as a treatment for CLBP in elderly patients. The Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines were used to improve the reporting of the review. The Grading of Recommendations Assessment Development and Evaluation (GRADE) [26] approach was used to assess the quality of evidence of the articles that include full data.

Study Inclusion Criteria
• Peer-reviewed studies of any level of evidence according to the Oxford Classification. We included randomized clinical trials (RCT) and non-randomized controlled study (NRCT) designs such as observational studies (OS), pre-post interventional studies (PPIS), and case-series studies (CS). We excluded case reports, technical notes, letters to editors, instructional courses, in vitro studies, cadaver investigations, systematic reviews, and meta-analyses.
• Studies including elderly patients (mean age > 65 years) suffering from CLBP (lasting more than 3 months).
• Clinical outcomes (disability and pain) of patients treated with PA (cardiovascular or aerobic) or exercise programs that included loaded exercise (against gravity or resistance) as a component. To be eligible, a study had to include at least one pain assessment or one disability assessment. The disability outcome had to be evaluated by one or more of the following scales: 36-Item Short Form Health Survey (SF-36), Version 1.0 or 2.0; Roland Morris Disability Questionnaire (RMDQ); Oswestry Disability Index (ODI); and Back function (FFBH-R) [27]. The pain outcome had to be evaluated by one or more of the following scales: Numerical pain rating scale (NRS); Global Rating Change (GRC); Patient Pain Questionnaire (PPQ); and Visual rating scale (VRS).
• Only articles written in English or Italian were included.

Study Exclusion Criteria
• Studies with a mean patient age < 65 years;
• Studies in which PA was part of a multidisciplinary program;
• Studies including participants who had physical problems that did not allow them to perform PA (untreated diabetes, musculoskeletal problems, postural problems, neurological diseases, cardiovascular conditions).
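The inclusion and exclusion criteria above can be read as a study-level screening rule. The following is a minimal sketch of that rule, not the procedure the reviewers actually used: the `Study` record, its field names, and the `is_eligible` function are hypothetical and introduced only for illustration. The participant-level exclusion (physical problems preventing PA) is not modelled.

```python
from dataclasses import dataclass

# Scales and designs named in the inclusion criteria.
DISABILITY_SCALES = {"SF-36", "RMDQ", "ODI", "FFBH-R"}
PAIN_SCALES = {"NRS", "GRC", "PPQ", "VRS"}
ELIGIBLE_DESIGNS = {"RCT", "NRCT", "OS", "PPIS", "CS"}
LANGUAGES = {"English", "Italian"}


@dataclass
class Study:
    """Hypothetical screening record; field names are assumptions."""
    design: str                  # e.g. "RCT", "NRCT", "PPIS", "CS"
    mean_age: float              # mean age of participants, in years
    pain_duration_months: float  # duration of low back pain
    language: str                # publication language
    outcome_scales: frozenset    # scales reported, e.g. frozenset({"ODI"})
    multidisciplinary: bool      # PA delivered within a multidisciplinary program


def is_eligible(s: Study) -> bool:
    """Apply the study-level criteria listed in the text."""
    reports_outcome = bool(s.outcome_scales & (DISABILITY_SCALES | PAIN_SCALES))
    return (
        s.design in ELIGIBLE_DESIGNS
        and s.mean_age > 65
        and s.pain_duration_months > 3
        and s.language in LANGUAGES
        and reports_outcome
        and not s.multidisciplinary
    )


# Illustrative record (not taken from the review).
example = Study("RCT", 71.0, 6, "English", frozenset({"ODI", "NRS"}), False)
print(is_eligible(example))
```

Encoding the criteria as a single predicate makes each exclusion reason explicit, which is convenient when counting reasons for a flowchart.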

Search Protocol
The following databases were searched from inception to March 2019: Medline, Scopus, CINAHL, EMBASE, and CENTRAL. For the search strategy we used the following keywords: "low back pain" OR "chronic low back pain" AND "physical activity" OR "physical therapy" AND "elderly" OR "old aged" OR "older age" AND "Meziere" AND "Souchard" AND "global postural rehabilitation" "Feldenkrais" AND "McKenzie" AND "back school program" AND "Tai-Chi" AND "Pilates" AND "water therapy" OR "hydrotherapy" OR "balneotherapy" OR "hydrokinesis". The keywords were used alone or in combination. We searched for further studies in the reference lists of the selected papers and systematic reviews.

Study Selection
We accepted only English and Italian publications. The initial search for articles was conducted by two reviewers (D.S.S. and C.G.), who used the search protocol described above to identify the literature. In case of disagreement, a third reviewer (R.F.) was consulted to reach consensus. Screening proceeded in the following order: titles first, then abstracts, then full papers. A paper was considered potentially relevant, and its full text reviewed, if, following discussion between the two independent reviewers, it could not be unequivocally excluded on the basis of its title and abstract. The full text of all papers not excluded on the basis of abstract or title was evaluated. The numbers of articles excluded and included were recorded and reported in a PRISMA flowchart (Figure 1). In designing the PRISMA flowchart we followed the rules of Moher et al. [28].

Data Extraction
Data were extracted on: author, number of participants, year of study, content of the intervention and control groups, follow-up, outcomes (disability and pain), and mean age.

Quality of Evidence
To estimate the potential biases that were most relevant for the study, we used the following tools: the Cochrane tool for assessing risk of bias in randomized trials (RoB 2 tool) [29] (Table 1) and the Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool [30] (Table 2). To limit imprecision, the selected papers were rated independently by two reviewers (E.A. and S.D.S.) and verified by a third (G.V.). We used the GRADE approach (Tables 3 and 4) to rate the overall quality of evidence. However, only six articles [31][32][33][34][35][36] reported full post-treatment data; therefore, it was not possible to assess all the included studies using the GRADE approach. The GRADE approach classifies the quality of evidence for each outcome by grading the following domains: study design, risk of bias, inconsistency, indirectness, imprecision, publication bias, magnitude of the effect (not assessed in this study), dose-response gradient (not assessed in this study), and influence of all plausible residual confounding (not assessed in this study). The quality of evidence was then classified as follows:
• High Quality of Evidence: at least 75% of the included articles are considered to be at low risk of bias. Further research is unlikely to change either the estimate or the confidence in the results.
• Moderate Quality of Evidence: one of the GRADE domains is not met. Further studies are required to improve the quality of the study and the evidence.
• Low Quality of Evidence: two of the GRADE domains are not met. Further research is very important.
• Very Low Quality of Evidence: three of the GRADE domains are not met. The results of the study are very uncertain.
For studies with a sample size below 300 subjects, the quality of the study was considered very low if there was also a high risk of bias (assessed in our study with the RoB 2 and ROBINS-I tools).
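The classification rule described above can be sketched as a small function. This is an illustrative reading of the text, not the reviewers' implementation: the function names are hypothetical, and mapping "no domain unmet" to high quality is an assumption, since the text defines the high category through the proportion of low-risk articles rather than through domain counts.

```python
def grade_quality(domains_not_met: int) -> str:
    """Map the number of unmet GRADE domains to a quality label.

    Assumption: zero unmet domains corresponds to "high".
    """
    if domains_not_met < 0:
        raise ValueError("domains_not_met must be non-negative")
    labels = {0: "high", 1: "moderate", 2: "low"}
    # Three or more unmet domains are treated as "very low".
    return labels.get(domains_not_met, "very low")


def overall_quality(domains_not_met: int, sample_size: int,
                    high_risk_of_bias: bool) -> str:
    """Apply the small-sample rule from the text before the domain count."""
    if sample_size < 300 and high_risk_of_bias:
        return "very low"
    return grade_quality(domains_not_met)


# Illustrative inputs (not taken from the review).
print(overall_quality(domains_not_met=1, sample_size=120, high_risk_of_bias=False))
print(overall_quality(domains_not_met=1, sample_size=120, high_risk_of_bias=True))
```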
J. Clin. Med. 2020, 9, x FOR PEER REVIEW
Table legend: low risk; some concern; high risk; critical. MD: mean difference; *: statistically significant; C.I.: confidence interval.
The outcomes assessed were improvement in pain and disability, both evaluated at the end of the treatment. Follow-up durations differed, ranging from 1 month to 48 months. Furthermore, the outcomes were grouped by study type: RCTs, NRCTs, and other studies (pre-post intervention and case series).

Study Selection
We created a flow-chart diagram according to the PRISMA protocol that shows the selection process of the studies (Figure 1). We found a total of 2173 studies (no additional studies were found in gray literature). We obtained 1891 studies when the duplicates were removed. Of the 1891 studies, 1709 articles were excluded from our study through the title screening. We assessed the abstracts of 182 articles and we excluded 94. Then, 88 full-text articles were screened. Out of these studies, 76 were excluded for the following reasons: mean age of patients < 65 years old (n = 64); experimental intervention not meeting the inclusion criteria (n = 8), and comparison group not meeting the inclusion criteria (n = 4). After this process, we included 12 articles in our study. No unpublished studies were retrieved.
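The counts reported in the selection process can be checked with simple arithmetic. The short sketch below only recomputes the figures given in the paragraph above; the variable names are illustrative.

```python
# Counts as reported in the study selection paragraph.
identified = 2173
after_duplicates = 1891
excluded_by_title = 1709
excluded_by_abstract = 94
full_text_exclusions = {
    "mean age < 65 years": 64,
    "intervention not meeting inclusion criteria": 8,
    "comparison group not meeting inclusion criteria": 4,
}

# Derived counts for each stage of the PRISMA flow.
duplicates_removed = identified - after_duplicates
abstracts_screened = after_duplicates - excluded_by_title
full_texts_screened = abstracts_screened - excluded_by_abstract
excluded_full_text = sum(full_text_exclusions.values())
included = full_texts_screened - excluded_full_text

print(duplicates_removed, abstracts_screened, full_texts_screened,
      excluded_full_text, included)
```

The derived values agree with the reported 182 abstracts, 88 full texts, 76 full-text exclusions, and 12 included studies, and imply that 282 duplicates were removed.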

Study Characteristics
A description of the characteristics of the studies that were considered eligible for this review is reported in Table 5. A total of 12 articles were selected for this systematic review. We included 7 RCTs of level of evidence (LOE) I, 3 NRCTs (3 OS of LOE II), 1 PPIS of LOE III, and 1 CS of LOE IV. Studies were published between 1992 [37] and 2016 [31]. Based on the data of the included studies, a total of 1581 patients were treated for CLBP. The mean age of patients at the time of treatment was 71.88 ± 3.01 years and ranged between 67.5 [36] and 76.0 [42].
The studies cited in this review show high heterogeneity in terms of study design, interventions, and outcome variables. The results are presented descriptively, focusing on disability and pain and on further issues of potential interest. In general, post-treatment data showed a moderate improvement in disability and pain. However, these results should be interpreted with caution because of the high risk of bias and the high heterogeneity of the included studies.

Methodological Quality
The RoB 2 tool for RCTs and the ROBINS-I tool for NRCTs, the pre-post intervention study, and the case series were used to assess the methodological quality of each study. For the RCTs we found 3 studies with an overall risk of bias identified as "some concerns," 3 as "high risk," and 1 as "low risk." Concerning the NRCTs, we found 1 study with an overall risk of bias identified as "critical" [38] and 2 studies as "moderate" [34,37]. The pre-post intervention study was assessed as having a "serious" overall risk of bias [41], whereas the case series was identified as "moderate" [42].
The quality of evidence of the studies included in the GRADE analysis ranges from low to moderate. All the studies, except one [34], have a small sample (n < 300). Methodological quality assessments of each study are summarized in Tables 1 and 2. The quality of evidence of the trials with full data was assessed using the GRADE approach (Tables 3 and 4). The data analysis was reported using the mean difference between studies. RevMan 5 (version 5.3) was used to calculate the mean difference of the included studies. Because of the lack of post-treatment results in some studies, we decided to perform a systematic review and not a meta-analysis. We report the outcomes of each study in Table 5.

Results of Individual Studies
The intervention methods are generally well described in the included studies. High heterogeneity in the type of PA was found across the studies. We included all types of PA (walking [32,35], back school and hydrotherapy [39], isotonic resistance exercises [40], yoga and qigong [31], TOTXR [33], and LEXTR). The description of the interventions is divided per outcome (pain and disability) into three subgroups (randomized controlled trials, non-randomized controlled trials, and other studies, including pre-post intervention and case series).

Randomized Controlled Trials (RCTs)
Seven RCTs were included. They were divided per outcome: 2 studies [36,40] examined the improvement in pain (measured by NRS and VRS); 5 studies [31][32][33]35,39] assessed the disability outcome (measured by ODI, RMDQ, PPQ, FRI, FFBH-R, and SF-36). Individual studies were assessed for risk of bias using the RoB 2 tool. Three studies were classified as "high risk," three as "some concerns," and one as "low risk." It was possible to include only 5 articles in the GRADE analysis [31][32][33]35,36]. The overall quality of evidence in these studies ranges from "low" to "moderate" according to GRADE. The quantitative effect estimate was reported as the mean difference between and within studies (when possible). The heterogeneity among studies and the low quality of evidence could lead to an overestimation of the results. The results of the outcome of the other studies are reported in Table 5.

Outcome: Pain
Two RCTs [36,40] presented data on pain at the end of the treatment. The authors used NRS and VRS to evaluate the improvements in pain. Follow-up was 3 months in the study carried out by Holmes et al. [42] and 4 months in the study by Vincent et al. [36]. At the end of the treatment, both reported a reduction of pain in the group treated by PA (isotonic resistance exercises in the Holmes et al. [42] group and TOTXR and LEXTR in the Vincent et al. [36] group). The study by Holmes et al. was classified as "high risk," and the risk of bias of the study by Vincent et al. was assessed as "some concern" using the RoB 2 tool. The study by Vincent et al. [36] was assessed as "moderate" quality using GRADE analysis. It was not possible to evaluate the overall quality of the other study according to GRADE [26] because of the lack of data. Nevertheless, both articles reported an improvement in pain evaluated by NRS and VRS. Vincent et al. [36] reported a better NRS in the intervention group compared to the control group at the end of the treatment (MD −1.73, 95% C.I. −3.11 to −0.35, p = 0.01). Holmes et al. [42] reported a change from 5.3 to 2.1 points in VRS from the beginning to the end of the treatment (full data for the control group were not reported). The authors also reported a difference in pain improvement between the intervention and the control group, but this was not statistically significant (p > 0.05). The results of the outcome of the other studies are reported in Table 5.

Outcome: Disability
Five RCTs [31][32][33]35,39] presented data on disability at the end of the treatment. The authors used ODI, RMDQ, SF-36, PPQ, FRI, and FFBH-R to assess the improvements in disability. Follow-up was heterogeneous: 1 month for Tsatsakos et al. [35]; 1.5 months for Ferrel et al. [32]; 3 months for Teut et al. [31] and Costantino et al. [41]; and 4 months for Vincent et al. [33]. At the end of the treatment, all studies reported an overall improvement in disability. The PA program differed between studies (walking [32,35], back school and hydrotherapy [39], yoga and qigong [31], and TOTXR [33]). In the study by Ferrel et al. [32] the control group was the hydrotherapy group and not a no-intervention group as in the other studies. Also, in this study they reported an overall increase in disability in both groups. The studies by Tsatsakos et al. and Ferrel et al. were classified as "high risk," Teut et al. as "low risk," and Costantino et al. and Vincent et al. as "some concern" using the RoB 2 tool. It was not possible to assess the quality of evidence of the study by Costantino et al. [41] because of the absence of a "no-intervention" control group. The overall quality of the other 4 studies [31][32][33]35] was evaluated as "low" according to GRADE [26]. Specifically, the studies were divided into two subgroups: RCTs measured by ODI and RCTs measured by SF-36. We used only these scales since they were reported in all studies. We found a reduction of disability evaluated by ODI (MD −1.24, 95% C.I. −1.94 to −0.54; p = 0.0005 *). Moreover, an improvement in SF-36 in patients treated by PA was reported, although it was not statistically significant (MD 2.88, 95% C.I. −3.30 to 9.06, p = 0.36). Costantino et al. [41] observed a highly statistically significant difference in SF-36 (13.30 ± 1.44, p < 0.001 *), measured in both intervention groups (back school and hydrotherapy) at the end of the treatment. The results of the outcome of the other studies are reported in Table 5.
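The mean differences in the RCT results are reported with 95% confidence intervals and p-values. As a rough consistency check, the standard error can be recovered from the interval width and a two-sided p-value recomputed. The sketch below assumes a normal approximation, which may differ from the exact test used in the original analysis; the function name `p_from_ci` and the layout of `reported` are illustrative.

```python
import math


def p_from_ci(md: float, lower: float, upper: float,
              z_crit: float = 1.959964) -> tuple:
    """Recover SE, z, and a two-sided p-value from a mean difference and its 95% CI.

    Assumes the interval is symmetric and based on a normal approximation.
    """
    se = (upper - lower) / (2 * z_crit)   # half-width divided by critical value
    z = md / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return se, z, p


# Estimates as reported in the text: (label, MD, lower, upper, reported p).
reported = [
    ("NRS, Vincent et al.", -1.73, -3.11, -0.35, 0.01),
    ("ODI", -1.24, -1.94, -0.54, 0.0005),
    ("SF-36", 2.88, -3.30, 9.06, 0.36),
]

for label, md, lo, hi, p_reported in reported:
    se, z, p = p_from_ci(md, lo, hi)
    print(f"{label}: SE={se:.3f}, z={z:.2f}, p={p:.4f} (reported {p_reported})")
```

Under this approximation the recomputed p-values are close to the reported ones, which suggests the intervals and p-values in the text are internally consistent.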

Non-Randomized Controlled Trials (NRCT)
We included three NRCTs [34,37,38] in our review. They were divided per outcome: 2 studies [34,37] examined the improvement in pain (measured by GRS and PPQ); 1 study [38] assessed the disability outcome (measured by RMDQ). The latter study did not have a control group. Individual studies were assessed for risk of bias using the ROBINS-I tool [30]. Two studies [34,37] were classified as "moderate" overall risk and one [38] as "critical." Because of the lack of data, it was possible to assess the quality of evidence according to GRADE only for the study by Hicks et al. [34], which was classified as "low." The quantitative effect estimate of this study was reported as the mean difference between groups. The high heterogeneity among studies and the low quality of evidence could lead to an overestimation of the results. The results of the outcome of the other studies are reported in Table 5.

Outcome: Pain
Two NRCTs [34,37] presented data on pain at the end of the treatment. The authors used GRS and PPQ to evaluate improvements in pain. Follow-up was 1 month in the study by Khalil et [...]

[...] approach. The quality of the other studies was evaluated by RoB 2 for RCTs and ROBINS-I for the other study types. The lack of data in some articles and the scarcity of literature on this topic could lead to a low quality of evidence. Our research highlighted that older patients with CLBP treated with PA showed an overall improvement in pain and disability in the majority of the studies. However, these conclusions should be interpreted with caution, considering the high risk of bias, the low quality of evidence of the literature, and the language limitations of this study (only English and Italian articles were included). Because of these limitations and the absence of high-quality literature, we decided to perform only a systematic review of the literature and not a meta-analysis.
However, the extreme variability in the type, duration, intensity, and execution modality of the proposed PA, the different body regions targeted by each program, and patient compliance are important variables that make it impossible to recommend a specific protocol for the elderly population. This lack of standardization was also confirmed by Airaksinen et al. [18], who found a considerable variety of PA, such as stretching, aerobic exercises, or muscle reconditioning.
The 4 studies evaluating pain (2 RCTs and 2 NRCTs) showed that both lumbar isotonic resistance exercise cycles and abdominal, thoracolumbar, and upper limb isotonic and isokinetic strengthening exercises improve pain in elderly patients with CLBP. In their RCT, Vincent et al. [33] also reported, at a 4-month follow-up, an improvement in walking speed and endurance. This finding suggests that the physical treatment of CLBP might focus not only on the lumbar muscles but also on the lower limbs and thorax (exercises for the breathing muscles [39]). However, one study [40] reported an improvement in pain that was not statistically significant compared with the control group (p > 0.05).
The studies that assessed disability (5 RCTs, 1 NRCT, 1 pre-post intervention study, and 1 case series) confirmed that walking, back school exercise, hydrotherapy, yoga and Qigong, a bicycle program, a strengthening and stretching program, and a combined PA and cognitive-behavioral program improve the functional performance of elderly people with CLBP. However, the studies were highly heterogeneous: we found a significant reduction of disability evaluated by ODI (p = 0.0005 *), but the improvement of SF-36 in patients treated by PA was not significant (p = 0.36). Moreover, we also found an improvement in patients treated by different types of PA, such as back school and hydrotherapy [39] (p < 0.001 *), at the end of the treatment.
Patient compliance and motivation are other important concerns and may be decisive factors during CLBP treatment in the elderly. Beissner et al. [39] emphasized an interesting treatment option: cognitive-behavioral therapy (CBT) in association with PA to reduce symptoms in patients with CLBP. This novel treatment is becoming increasingly important. In a recent systematic review, Vitoula et al. [45] highlighted that CBT was effective in patients with CLBP, especially in reducing pain perception and helping them to improve their functionality. Furthermore, the review showed that better outcomes can be achieved when treatments are personalized. This is a noteworthy point. In fact, several studies included in our research [34,38,39] showed that patients who maintained prolonged compliance with the rehabilitation protocols and were highly motivated had better outcomes in pain relief and function.
It is crucial to focus on the biological effects of PA [46,47]. One major obstacle to performing PA in older patients is sarcopenia, defined as a loss of muscle mass (lean body mass) with a reduction of muscle function [48]. This process represents a specific condition of normal energy balance in the elderly, with an increase in body fat percentage. The postoperative period after limb surgery, disuse, endocrine diseases (such as type II diabetes), and uncontrolled nutrient intake can lead to sarcopenia [49]. This condition could lead to a state of frailty, with a reduction of PA [50]. Landi et al. [51] conducted a review of the literature reporting that PA has an important role in the reduction of sarcopenia in older people. PA could also increase irisin [52] and osteocalcin [53]. The former is a hormone-like myokine produced by skeletal muscle during PA [54]. Irisin can induce thermogenesis in brown adipocytes. This protein also has an effect on the control of bone mass, with positive effects on cortical mineral density. It has also been demonstrated that irisin plays a crucial role in the reduction of sarcopenia in older people [55,56]. Osteocalcin is a bone-derived hormone-like protein. It could favor physiological functions by increasing bone formation [57], regulating age-related muscle loss [58], and reducing the risk of type II diabetes [4,59]. Chahla et al. [60] reported in their study that osteocalcin is higher in patients who perform regular PA, with an increase in bone mineralization and muscle function and a reduction in the risk of type II diabetes.
Moreover, several studies [61–63] report that PA could also reduce the level of osteoporosis, making it a valid therapeutic approach for this disease in elderly people.

Limitations
The results of this study should be considered with caution, as there was high heterogeneity in terms of follow-up, type of intervention, and standardization of physical protocols. In fact, the follow-up ranged from a minimum of 1 month to a maximum of 48 months, and the number of patients ranged from 49 to 392. The small sample size and the high heterogeneity among trials, as well as the absence of a control group in three studies [38,41,42], make estimating the effect of the intervention extremely challenging. Moreover, the low quality of the studies (from "low" to "moderate") and the high risk of bias of some of the included studies decrease the power of our conclusions. In addition, some studies reported an improvement of outcomes in patients treated by PA even though their results were not statistically significant. These data could lead the authors to overestimate the results considered. Another important limitation of this systematic review is the decision of the authors to include only English and Italian articles. This limitation could have led to the exclusion of relevant studies on this specific topic. Therefore, further high-quality evidence based on standardized methods and similar patient cohorts is desirable. At the same time, this review should promote future investigations, also including other languages, to better understand which type of PA is preferable for treating older patients with CLBP and to help our clinical practice.

Conclusions
In the available literature, PA appears to show a trend toward improvement in pain and disability in elderly patients with non-specific CLBP. However, because the literature is limited and of low quality, it is not possible to state this positive effect as a definitive conclusion. To avoid overestimating the effectiveness of PA for CLBP on the basis of studies at high risk of bias, new high-quality evidence is needed.