Advances in peri-operative care and surgical technique have significantly improved outcomes of liver resections, attaining mortality rates below 5% in high-volume centres [1, 2]. The benefits of minimally invasive surgery (MIS) have been highlighted in several gastrointestinal surgical procedures for the treatment of benign and malignant diseases, mainly owing to reduced pain, blood loss, length of hospital stay and complications [3,4,5]. However, MIS uptake has been slower in liver surgery due to concerns regarding the technical feasibility and safety of the technique [6,7,8,9,10]. Twenty-five years after the first reports of LLR, only a small proportion of liver resections is performed through a minimally invasive approach [11, 12].

It is important to consider the level of complexity of different procedures when working on improving the uptake of LLR. While simpler procedures are being performed laparoscopically more commonly, complex major liver resections that are more challenging remain uncommon [9, 13,14,15]. The ability to objectively define the difficulty of a LLR prior to surgery would support surgeons in (1) identifying appropriate resections to perform laparoscopically at each step of the learning curve for a better ad equation of skills to surgical difficulty, and (2) selecting patients to benefit most from the minimally invasive approach.

Surgical scores have been proposed to rate the difficulty of LLR [16,17,18,19]. However, none of the them is routinely used in practice. To be used by surgical teams, the difficulty scores must be able to accurately predict the outcomes of LLR in a diverse population of patients undergoing this procedure, and must be simple and easy to use in a preoperative clinical setting. The Morioka International Consensus on LLR recognised the usefulness of predictive tools to better select procedures according to surgeons’ skills and experience, but highlighted the need for validations of existing tools prior to clinical application [13].

Therefore, the objective of this study was to systematically review the literature for studies reporting development or validation of scores predicting the difficulty of LLR.

Methods

Search strategy

Medline (1966-December 2017), EMBASE (1974-December 2017), Web Of Knowledge and the Scopus database (2004–December 2017) were systematically searched. No restriction was applied regarding language or the type of publication. As of December 2017, the grey literature using the OpenSIGLE database, the Trip database and Google Scholar, was queried for relevant studies. With assistance from an information specialist, the search strategy was initially developed for Medline and adapted for each data source (see Online Appendix). Keywords and MeSH (or EMTREE) terms were gathered into three categories: (1) liver resection, (2) minimally invasive surgery and (3) predictive score. Medical subject headings (MeSH) did not exist for prognostication tools and so a previously developed filter for predictive models was used [20]. To increase search sensitivity, each keyword was exploded. In addition, bibliographies of all included studies were reviewed for any relevant publications.

Study selection

This review included studies reporting on the development, validation or impact assessment of scores predicting the difficulty or outcomes of LLR (laparoscopic-assisted, totally laparoscopic or robotic). Studies including adult patients (≥ 18 years old) operated for benign and malignant diseases were eligible. Studies were excluded if they reported on the prognostic impact of a single factor (unless it was updating the accuracy of an existing prognostic tool), relied on inappropriate analytic purpose (e.g. multivariate modelling not aimed at prediction, development of novel statistical methods), were not limited to LLR or did not represent original data or research (e.g. editorial, review). Studies that included patients not fulfilling the inclusion criteria (e.g. open liver resection) were excluded if it was not possible to distinguish those patients from the larger study population. In the event of duplicate publication, we included the most relevant and the most informative study.

Data abstraction

A standardised extraction form was developed and pilot-tested using previously published recommendations from the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) checklist and Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines [21, 22]. The characteristics of study design, study population, prognostic factors, outcome measures and tool development (statistics modelling decisions, candidate variable selection) and validation (internal and external validation methods, measures of the model predictive accuracy) were abstracted.

Risk of bias assessment

Considering that there is currently no published validated scoring system to assess the risk of bias of predictive tools’ development and validation studies, a checklist was created using key methodological elements suggested by the TRIPOD guidelines [22]. Similar assessment has been used in prior systematic reviews of prognostic tools [23,24,25].

Two authors (JH and KB) selected studied, extracted data and assessed risk of bias independently. Disagreements were resolved by consensus. A single reviewer (JH) assessed the references from the grey literature and meetings’ proceedings.

Data summary

The statistics related to predictive scores’ development and validation are reported in summary tables. Formal statistical evaluation of the internal or external validity was defined as assessment of discrimination and/or calibration of a score. Discrimination assesses the ability of the score to distinguish between high- and low-risk patients. Calibration assesses how closely the predicted values of the outcome match the observed outcomes in the study sample. Both these are standard in good statistical practices evaluating clinical prediction model [22, 26].

Results

Systematic search

The search results are presented in Fig. 1. Seven articles were initially included. One additional article was included later because it was published immediately after the initial search strategy was performed. Four tools predicting the difficulty of LLR were identified: the Difficulty Scoring System, the Hasegawa score, the Difficulty of Laparoscopic Liver Resections Classification and the Halls tool (Table 1) [16, 27, 28]. Four articles contained solely an external validation of one of the identified scores [29,30,31,32]. We did not identify studies assessing the effectiveness or implementation of tools in clinical practice.

Fig. 1
figure 1

Flow diagram of study selection

Table 1 Characteristics of included studies and predictive tools

All tools were presented as risk scores. One tool provided the statistical equation, including coefficients and the intercept, used to create the risk score [16]. Publication of the equation is recommended by the TRIPOD guidelines in order to allow adequate external validation of the tool with appropriate statistical methods [22].

Score development methods

Details of the tools’ development and validation studies are presented in Table 2. None of the included studies relied on prospectively collected data aimed at creating a clinical prediction tool. Four studies used data from Japan (57%), 2 from Korea, one from France and one from multiple European institutions. Data related to patients undergoing LLR from time periods covering 2003 to 2016 in most studies, except for two studies that used data going back to 1995 and 1997. Three studies included patients from multiple institutions, two reporting on tool development and one on tool validation. The cohort sample sizes ranged from 78 to 2856. The number of events were reported in one out of 4 development studies. Two studies reported the follow-up duration (extrapolated from description of morbidity being assessed at 90 days following surgery).

Table 2 Methodological evaluation of the included studies

The methodology for the included studies is summarised in Table 3. Three out of 4 tools identified predictive factor for inclusion using p-value-based variable selection (univariate and/or multivariate analyses). The fourth tool relied solely on a priori identification of clinically relevant variables.

Table 3 Predictive factors included in and outcomes predicted by the models reported in this review

Two studies reported the frequency and/or handling of missing data. Two tools predicting continuous outcomes used linear regression models to develop their predictive tool, one used logistic regression for categorical outcome and the fourth did not report the statistic model used. None of the studies reported verifying the statistical assumptions of the models. One tool provided the statistical equation used to create the risk score [16].

Populations, predictive factors and outcomes

The populations addressed by each predictive tool are detailed in Table 1. All tools were developed to assess difficulty of LLR in patients undergoing pure LLR for neoplasm. The predictive factors included in the tools and the outcomes being predicted are presented in Table 3. There was significant heterogeneity in both the predictive factors included in the tools and the outcomes being assessed.

Up to 5 factors were included in a single tool. A total of 10 predictive factors were included across all tools. Extent of liver resection and tumour location within the liver were common to three tools.

The Difficulty Scoring System tool was developed to predict a difficulty index whereby experts had assigned a score from 1 to 10 to specific types of LLR and which had been tested for inter- and intra-rater reliability in prior studies. The Hasegawa tool indicated being developed to predict operative time. The Difficulty of LLR Classification tool’s development study did not explicitly mention the outcome to be predicted, but it is understood it was developed to predict 90-day post-operative morbidity defined by the Clavien–Dindo classification and mortality [33]. The Halls tool was developed to predict intra-operative complications defined by the Satava classification [34]. Two tools then tested the developed tool against additional outcomes. The Difficulty Scoring System development study assessed the risk score’s association with operative time, blood loss and post-operative morbidity and mortality. The Hasegawa tool examined the association between the risk score and blood loss, conversion, length of stay and post-operative morbidity.

Internal and external validity

Each tool’s predictive performance is summarised in Table 4. The Halls score included internal validation, using 1/3 of the cohort not used for score development (split-sample) [22], and reported calibration (visual inspection) and discrimination (area under the curve). The Difficultly Scoring System’s performance was externally validated in four studies, of which three were independent validations.

Table 4 Details of evaluations of tools internal and external validity

The outcomes tested in the validation studies varied and did not match the outcomes that the tool was developed to predict (Table 2). The studies examined the association between risk scores from the tool and blood loss, operative time, conversion, blood transfusion, post-operative morbidity, intra-operative hepatic pedicle clamping and length of stay, using univariate and multivariate regression models. One study assessed the model’s calibration with Spearman’s ran correlation statistic and visual inspection of scatter plots. One study examined the model’s calibration by computing the Area Under the Receiving Operating Curve (AUROC), which was 0.53.

Discussion

This systematic review identified 4 tools to estimate the difficulty of LLR, including one subjected to 4 external validations. These scores predicting the difficulty of LLR are limited by methodological flaws and lack of proper validations. This represents a gap in LLR practice.

None of the studies adhered to recommendations for the development and validation of predictive tools (TRIPOD guidelines) [22]. Discrimination and calibration of the tools were not assessed in 3 out of 4 development studies. Development and validation studies used regression analyses to examine the association between score risk categories and outcomes, which is not an appropriate method in predictive score development and validation [22]. When assessed, discrimination was not better than chance along in predicting LLR difficulty (AUROC = 0.53). The lack of information on missing data on predictor variables, number of events and follow-up time were other significant issues. Finally, small sample sizes of some development studies combined with a high number of predictor variables (5 factors for 90 patients in development of the DSS) could lead to overfitting of the models and inaccurate risk prediction.

Similar to other reviews of prognostication and prediction tools, significant variation was documented in the number and types of predictive factors included in the identified tools [23, 24]. Variation in variable inclusion may be driven by the use of statistical rules, rather than clinical knowledge, to select covariates in the model, or the use of data from sources not designed to develop a predictive tool. Indeed, the processes to select covariates included in the scores were not standardised. Some factors previously recognised by international LLR experts were neither considered nor included, especially patient-level factors [18]. Intra-operative variables were also used, which limits usability in the preoperative setting for surgical planning. Selecting prognostic factors using a data-driven strategy risks over-estimating the predictive ability, since this strategy optimises case mix and statistical relationships specific to the development dataset (overfitting). Models and tools developed that way are less likely to perform well in other populations. The ideal predictive tool would incorporate clinically relevant factors a priori in the model to improve face validity and facilitate subsequent tool adoption in the community.

Furthermore, LLR difficulty is complicated to predict; there is no consensus regarding the definition of LLR, as evident by the variety and inconsistency of anchor outcomes predicted by the existing scores. Some scores attempted to predict more than one outcome. Validation studies did not validate against the outcomes for which the score was developed, thereby introducing bias and confusion about validity and applicability. An ideal tool would use an universal, consistent and objective definition for LLR difficulty as the anchor outcome, which is not currently available.

A predictive tool for the difficulty of LLR that can be used reliably during preoperative surgical planning could help in improving the uptake of LLR. The Morioka International Consensus on LLR outlined the steep learning curve for LLR [13]. Therefore, preoperative assessment of the difficulty of LLR is important in selecting appropriate patients according to a surgeon’s skills and experience at each stage of the learning curve. Mentoring from and discussions with more experienced minimally invasive surgeons can improve case selection and incorporation of minimally invasive techniques in one’s practice [35]. However, these opportunities may not be readily available depending on practice settings. The availability and acceptability of a well-developed and validated tool may help to more individually and accurately estimate LLR difficulty, thereby assisting a broader group of surgeons in safely transitioning towards LLR. To be useful in practice, a score should be practical, easy to use and present generalisability proved via robust validation studies [22, 26, 36]. There is a significant opportunity to misclassify LLR difficulty with the existing tools, hence potentially not selecting proper cases for LLR which can expose patients to harm. As many tools were developed and published prior to the circulation and widespread dissemination of the TRIPOD guidelines across the medical community, these limitations may be easily addressed in future research at the time of study design. Finally, a reliable and accurate predictive tool would also improve risk stratification to support the design and conduct of clinical trials focusing on LLR.

This systematic review is limited by the literature available on the difficulty of LLR and the methodological weaknesses of the included studies, most of which are related to lack of adherence to guidelines for predictive tools development and validation [22, 26]. While there is a vast body of literature regarding the results of LLR, the difficulty of LLR has not been defined and few studies have systematically addressed how to estimate it [9]. Studies looking at factors associated with the results of LLR as part of descriptive multivariable analyses were not included as they did not aim to predict the difficulty of LLR, which requires a different methodology. The strengths of this systematic review include a comprehensive, systematic and sensitive search conducted without restriction for language or types of publication, which also considered the grey literature, and was performed in duplicate. The included studies were rigorously and critically appraised following recommendations from the TRIPOD and CHARMS guidelines, which limits the potential for bias in the information captured and conclusions drawn [22, 26]. The current results highlight the limitations of existing tools and the need to systematically develop and validate tools based on a well-defined anchor outcome for difficulty of LLR.

Conclusion

This systematic review identified limitations to existing tools predicting the difficulty of LLR, mostly due to lack of adherence to best practices for predictive tools development and validation. Therefore, existing tools cannot be used confidently or safely to predict LLR difficulty. Consistent objective clinical outcomes to define LLR difficulty should be established, and better-quality tools developed and validated in a wide array of populations and clinical settings. This could contribute to improve uptake of LLR as well as risk stratification for future trials.