Predicting the Risk of Alcohol Use Disorder Using Machine Learning: A Systematic Literature Review

The number of deaths caused by alcohol-related diseases may be reduced by predicting alcohol use disorder (AUD). Many researchers have worked on AUD prediction using machine learning (ML) techniques. However, to the best of our knowledge, there is no comprehensive systematic literature review (SLR) that summarizes the existing studies on AUD prediction using ML over the last ten years. To address this knowledge gap, this article provides an SLR of academic articles on AUD prediction using ML techniques published from January 2010 to July 2021. This SLR highlights technical decisions related to five aspects: data collection sites and the characteristics and types of datasets; data sampling and data pre-processing techniques; feature types and feature engineering techniques; characteristics of ML techniques; and evaluation metrics. Six bibliographic databases were searched, and the identified studies were rigorously reviewed based on these five aspects. In the selected studies, public datasets were rarely used for AUD prediction. Given that, the current paper identified two different types of data collection sites for review. Imbalanced class distribution in datasets was the primary focus of the pre-processing and sampling steps. Various features, including demographics, family history, drinking behaviour, and electronic health records, were identified as the most widely used AUD prediction features. The filter, wrapper, and embedded methods were identified as the primary feature selection methods. The support vector machine was the most widely employed algorithm for predicting AUD; however, a lack of deep neural network techniques is evident in this field. Moreover, considering gender disparities, early detection of AUD, and identifying trajectories towards AUD are suggested for future work.
For the purpose of evaluating the performance of the prediction approaches, most studies considered the overall accuracy and the area under the receiver operating characteristic curve. Nevertheless, external validation was not performed in any of the selected studies. This paper also discusses challenges and open issues of AUD prediction for future research. This SLR represents a valuable resource for scholars investigating the prediction of AUD.


I. INTRODUCTION
Alcohol use disorder (AUD) is a broad term used to refer to problems caused by alcohol consumption. Affected individuals tend to have a lack of control over their alcohol consumption; they continue to drink despite the serious adverse effects of alcohol on their health and on the lives of others, including family, friends, and co-workers. Extreme alcohol use leads to various diseases, such as liver cirrhosis, chronic pancreatitis, upper gastrointestinal cancers, cardiomyopathy, polyneuropathy, and dementia. High alcohol consumption also affects decision-making abilities, in addition to accelerating cases of violence and harmful behaviours. The World Health Organization (WHO) report of 2014 mentioned that approximately 5.9% or 3.3 million deaths were caused by alcohol misuse [1]; alcohol misuse is also the world's fifth leading cause of death [2] and the leading risk factor for premature death and disability.
According to the diagnostic criteria stated in the Diagnostic and Statistical Manual of Mental Disorders (5th edition), 36.0% of adult males and 22.7% of adult females in the US, from 2012 to 2013, fulfilled the criteria for AUD at some point in their lives. In addition, 17.6% of men and 10.4% of women fulfilled these criteria in the past year [3]. In Europe, 3.5% of people aged 18-64 years have been estimated to be alcohol dependent, with 11.1% of them being assessed as heavy drinkers [4]. Although AUD contributes the second-highest burden of all diseases linked to mental disorders after depression [5], treatment rates have been low, and most patients suffering from AUD have never received specialized treatment for their addiction [6]. Different studies have reported different reasons for the low treatment rate of AUD in Europe. The authors of [7] claimed that primary care physicians did not recognize AUD, delaying treatment. One reason for this is that conventional methods tend to detect alcohol-related problems through self-test reports [8]. Patients' dishonesty, lack of memory, the taboo surrounding the issue, and various other factors may render the self-test reports used in the diagnosis of AUD inaccurate.
Numerous factors are related to the increased risk of AUD. These factors have helped scientists predict AUD, for example, a history of alcoholism in biological family members [9], psychological factors, such as level of stress [10], and personality disorder [11], behavioural factors, such as gambling problems [12], and social influences [13]. Health records in hospitals also contain a substantial amount of information that can be related to AUD and thus potentially useful for AUD prediction. Advancements in machine learning (ML) methods may make the prediction of AUD based on health records even more precise and thereby helpful for staff medical decision-making. Through the development of ML methods, researchers can identify target groups for AUD interventions [14].
ML methods can classify documents or reports into one or more predefined categories [15]. These methods have been applied to numerous types of clinical documents and reports, such as electroencephalogram (EEG) reports, radiology reports, electronic health records (EHRs), and biomedical documents. In these settings, ML methods have been used for detecting cancer stages [16], identifying paediatric traumatic brain injury [17], and predicting AUD [10], [18]. Various types of ML methods exist, such as unsupervised machine learning (UML), semi-supervised machine learning (SSML), and supervised machine learning (SML). The primary goal of SML is to build an efficient mapping function to accurately predict or classify a dependent variable from the independent variables [19]. In contrast to SML, UML does not require a specific outcome variable; it is primarily used for clustering and dimensionality reduction. The aim of SSML methods, in comparison, is to optimize classification accuracy using only a few labelled records [20].
The basic methodology for developing a predictive model using ML usually consists of five phases: data collection, pre-processing, feature engineering, predictive model development, and model evaluation (see Figure 1). A dataset, including clinical reports and survey questionnaires, is initially collected from one source or multiple sources. Thereafter, these reports are labelled into specific classes (e.g., AUD positive or AUD negative) by experts or through clustering techniques. Next, pre-processing techniques are employed to remove unnecessary or noisy information from the dataset. After the pre-processing stage, a numeric master feature vector is constructed as a result of feature engineering, by selecting or extracting the most discriminative features from the dataset. This master feature vector is the primary input of the ML algorithms used for developing a predictive or classification model. A variety of validation methods and metrics, such as precision, recall, F1 score, overall accuracy, the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC), can then be employed to evaluate and validate the constructed ML models [21].
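To make the five phases concrete, the workflow can be sketched with scikit-learn on synthetic data. This is only an illustrative sketch: the dataset is simulated, and the specific scaler, selector, and classifier choices are assumptions, not taken from the reviewed studies.

```python
# Minimal sketch of the five-phase ML workflow (synthetic data throughout).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score

# Phase 1: data collection (simulated here instead of clinical records)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Phases 2-4: pre-processing (scaling), feature engineering (selection),
# and predictive model development, chained in a single pipeline
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", SVC(probability=True, random_state=0)),
])

# Phase 5: model evaluation on a held-out test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

In practice, each phase would be far more involved (expert labelling, clinical-code cleaning, and so on); the pipeline above only shows how the phases chain together.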
In recent years, several studies have examined the prediction of AUD in particular. These studies have primarily provided reviews on AUD, predictions of AUD-related outcomes (e.g., duration of treatment and intervention type) [22]-[25], predictions of other addictions (smoking, gambling, and cocaine) [26], and a scoping study on the prediction of AUD [27]. However, the important aspects of AUD prediction using ML techniques, particularly the data collection process, types and characteristics of the datasets, sampling and pre-processing techniques, types of features and variables, the process of handling feature redundancy and high dimensionality, types of ML algorithms used, and evaluation processes and performance metrics, have not been systematically reviewed in the past 10 years. Therefore, the current systematic literature review (SLR) strives to evaluate the academic articles that explored AUD prediction using ML methods in the period from January 2010 to July 2021. In particular, the current SLR examines these articles across the five general phases of building a predictive or classification model (Figure 1). To the best of our knowledge, this article is the first SLR to focus on AUD prediction over the last decade.
The primary contributions of this SLR are as follows:
• The paper reviews studies from the past decade using five different dimensions.
• It explores the different types of datasets used in predicting AUD using ML techniques; it also reviews the data pre-processing and sampling techniques that have been used to prepare datasets for AUD prediction. Moreover, it analyses the types of features and variables that contribute to the development of AUD and the techniques used for the extraction, selection, and reduction of the intended features, as well as the ML algorithms that have been used for AUD prediction and their performances.
• Finally, it outlines open issues and research challenges related to ML-based AUD prediction.
The structure of this SLR is as follows: Section II presents the methodology applied in this study to systematically select the primary studies. Section III presents the comprehensive review and findings of the selected studies. Section IV presents a discussion of the findings. Section V offers some suggestions and research directions for future work. Finally, Section VI concludes this review.

II. METHODS
The methodology of this paper was inspired by the SLR guidelines proposed by Kitchenham and Charters [28]. As shown in Figure 2, we adopted three steps from this methodology: planning, implementation, and reporting. The planning phase was discussed in Section I above. The implementation phase will be discussed in Section II, parts A to C. Reporting, which includes critical analysis and data synthesis, serves as the final phase and is presented in Section III.

A. SEARCH STRATEGY AND SEARCH RESULTS
We organized the search strategy into five blocks of search terms, as shown in Table I. Anne Faber Hansen (a professional librarian at the University of Southern Denmark) and Ali Ebrahimi (the corresponding author) prepared a list of potential keywords and formulated queries to search the literature. Medical Subject Headings (MeSH) and Boolean logic (OR, AND) were used to narrow and expand the search terms. The formulated queries were applied to the keywords, titles, and abstracts of articles to identify potential journal and conference articles published (in English) from January 2010 to July 2021. Six high-quality academic databases in the medical and engineering fields, namely Medline, Embase, Inspec, ScienceDirect, Web of Science, and IEEE Xplore, were considered for the extraction of relevant literature. More details on query formation for each academic database can be found in the supporting materials.
A total of 3,736 studies were retrieved using the abovementioned search query. Next, all extracted data were stored in Endnote [29], and duplicate studies were removed, which resulted in the exclusion of 1,381 studies. The details of the search results are presented in Figure 3.

B. DATA SCREENING AND SELECTION CRITERIA
Twelve principal reasons were used as the basis of the inclusion and exclusion criteria, which are listed in Table II. The most important criteria were based on the primary aim and goal of the studies. First, the primary aim of some studies was not AUD prediction. For example, one study [30] proposed a novel approach for automatically selecting features from a multivariate time series using a trace-based class separability criterion. Although an EEG dataset containing alcoholic and non-alcoholic patients was used, the study's primary aim was not to analyse patients for AUD prediction. Second, many studies aimed to detect or diagnose AUD or alcoholism [31]. Prediction is an analysis of patterns in past and present data used to anticipate a future outcome, whereas detection is the extraction of information from a dataset as it is being processed. For example, processing past clinical records of patients can be used for the prediction of AUD, whereas the detection of AUD can be accomplished by analysing a survey questionnaire [32]. Given that the primary focus of such studies was AUD detection rather than prediction, these studies were also excluded [33]-[36].
Third, some studies, for example, [37] and [38], focused on prediction, but it was geared towards the abuse of other substances (e.g., smoking, cocaine, and gambling) instead of alcohol. Fourth, some studies considered ML techniques for assessing alcoholism treatment, for example, [39]. Fifth, a few studies, such as [40], used ML techniques to predict specific diseases in patients with an alcohol drinking problem. Other studies, such as [41], used ML techniques to analyse patterns in psychiatric disorders. All of these were excluded, including studies that used ML techniques on animals. Studies whose full texts could not be accessed at the time of the SLR were also excluded. Moreover, a few studies used alcoholic and non-alcoholic patient datasets merely to demonstrate the performance of improved ML techniques [42]; these studies were also removed since they were not aimed at predicting AUD. Following duplicate removal, the remaining 2,355 articles were screened by three of the authors (AE, AN, and GMS) based on the inclusion and exclusion criteria, using the titles, abstracts, and keywords. Discrepancies regarding whether articles should be included or excluded were resolved through majority voting; in the case of a tie, the authors (AE, UKW, ASN, and MM) made the final decision. Figure 3, based on the PRISMA guidelines [43], illustrates the screening of articles. The title and keyword screening resulted in the inclusion of only 283 of the 2,355 articles. Next, these 283 articles entered the abstract and full-text screening phase, which was performed by three of the authors (AE, AN, and GMS) based on the inclusion and exclusion criteria (see Table II). Finally, twelve articles were retained.

C. QUALITY ASSESSMENT AND DATA EXTRACTION
The quality assessment criteria based on the primary objective of this study were adjusted to determine the final quality of the selected studies. A quality checklist of questions is shown in Table 1 in the Appendix. Three of the authors (AE, AN, and GMS) assessed the selected studies using 'Yes' or 'No' responses, which carried weights of '1' and '0', respectively. The final results of this evaluation were then discussed using the Delphi method [44], and the threshold for inclusion of any study was set at five. All selected studies scored six or above (50% or above); therefore, they were all included in the review. Three tables based on the five previously mentioned aspects were created to tabulate the twelve selected primary studies: Table III briefly describes the datasets, data collection sites, and pre-processing and sampling techniques; Table IV describes the characteristics of the features and the feature engineering process; and Table V describes the ML approaches used in the primary studies and the performance metrics used to evaluate them. In Section III, the extracted information from the primary studies is critically reviewed.

Table II. Inclusion and exclusion criteria

Inclusion Criteria
1. The main goal of the article must be the prediction of AUD.
2. The paper should describe the use of machine learning algorithms for the prediction of AUD.
3. The paper should report research on finding the features that may facilitate the prediction of AUD using machine learning algorithms.
4. Clinical reports, medical documents, or other types of datasets should have been used.
5. The paper should be written in English. Nonetheless, papers on the processing of non-English documents were included.
6. The article must have been published between 2010 and 2021 as either a conference proceeding or a journal article.

Exclusion Criteria
1. Studies that aimed at the detection or diagnosis of AUD.
2. Studies that assessed the outcome of alcoholism treatment or the quality of post-treatment care.
3. Studies that aimed at the prediction of substance abuse other than alcohol.
4. Studies that aimed to predict alcoholism lapses.
5. Studies on animals.
6. Studies for which the full text is not available.

III. REVIEW ON THE PREDICTION OF ALCOHOL USE DISORDER
Data from the twelve studies were critically reviewed based on the five aspects mentioned above. Subsection A presents a review of the different datasets used for the prediction of AUD, as well as the pre-processing techniques used to prepare the datasets for further analysis. Subsection B reviews the feature engineering techniques and the types of features used for the prediction of AUD, and Subsection C reviews the various ML techniques used for the prediction of AUD and the different performance metrics used to evaluate them.

A. DATASET CHARACTERISTICS AND THEIR COLLECTION SITES, PRE-PROCESSING AND SAMPLING TECHNIQUES
ML methods have been applied to several types of datasets, such as survey questionnaires, clinical and educational datasets, and historical personal data, for the prediction of AUD. As shown in Table III, most of the studies considered data from survey questionnaires [10], [45]-[52]. However, other sources of data were used in other studies, such as a combination of EEG samples and family history (FH) [53], a combination of students' health and clinical records and educational datasets [54], EHRs [49], [50], [52], genetic data [51], and MRI [47], [48]. In terms of collection sites, the primary selected studies can be categorized into single and multiple data collection sites; the category of each dataset is also shown in Table III. Among all studies, two datasets were collected from a single site, one used by [45] and [10] and the other used by [54]. On the other hand, one study [55] combined datasets from two different public schools in Portugal, while another [53] collected data from six different sites. More information regarding the datasets, such as year of study, age of participants, gender, and period of follow-up, can be found in Table III.
One of the primary processes of the data mining task is data pre-processing. This is the process of formatting data so that they are usable by ML algorithms. Pre-processing tasks differ depending on the dataset, but all aim to improve the quality of the final predictive model by cleaning the dataset prior to model development. Data usually contain a high level of noise and sparsity, which can exist among medical measures (such as blood pressure), clinical scores (such as heart rate), and clinical codes (such as ICD diagnosis codes). In this respect, noisy data can emerge when participants provide invalid responses or when the data are incorrectly encoded into spreadsheets or databases. Therefore, comprehensive pre-processing of the datasets is necessary to make the variables of the collected data more meaningful for predictive models based on ML algorithms.
In the related literature, different pre-processing techniques were used to overcome these issues. Two studies, [45] and [10], applied the K-Medoids clustering algorithm [56] to label the datasets. However, some studies [48]-[52], [54] employed the results of the questionnaires to label their datasets. In terms of handling missing values, two studies, [10] and [47], used imputation techniques, one [51] applied the interpolation method, and two, [49] and [45], removed patients and participants with missing values from the dataset.
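The two missing-value strategies mentioned above can be contrasted in a small sketch. The toy array and the choice of mean imputation are illustrative assumptions; the cited studies describe their own imputation and interpolation procedures.

```python
# Sketch: imputation versus complete-case removal of missing values.
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: 4 records, 2 features, with two missing entries (NaN)
X = np.array([[25.0, 3.0],
              [np.nan, 5.0],
              [40.0, np.nan],
              [31.0, 2.0]])

# Strategy 1: mean imputation fills the gaps, keeping all four records
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Strategy 2: complete-case analysis drops any record with a missing value
X_dropped = X[~np.isnan(X).any(axis=1)]
```

Imputation preserves sample size at the cost of fabricated values, while removal keeps only genuine measurements at the cost of statistical power; which trade-off is acceptable depends on the dataset.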
In terms of dealing with imbalanced class distribution in datasets, several studies [48]-[50], [55] addressed class imbalance in their datasets. Most of these studies applied techniques based on the synthetic minority oversampling technique (SMOTE) [57] to handle this problem. Those studies that considered the data scaling process [10], [45], [48]-[50] used the normalization technique to rescale all feature values to a range between 0 and 1 [58].
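Both steps can be illustrated in a few lines. The SMOTE-style synthesis below is hand-rolled for clarity (new minority samples interpolated between existing ones); the reviewed studies typically relied on library implementations, and the class sizes here are arbitrary.

```python
# Sketch: SMOTE-style minority oversampling plus min-max normalization.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_major = rng.normal(0, 1, size=(90, 3))   # class 0: 90 samples
X_minor = rng.normal(2, 1, size=(10, 3))   # class 1: only 10 samples

# Synthesise new minority samples by interpolating between random pairs
n_new = len(X_major) - len(X_minor)
i = rng.integers(0, len(X_minor), n_new)   # base minority samples
j = rng.integers(0, len(X_minor), n_new)   # interpolation partners
lam = rng.random((n_new, 1))               # interpolation weights in [0, 1)
X_synth = X_minor[i] + lam * (X_minor[j] - X_minor[i])

X = np.vstack([X_major, X_minor, X_synth])
y = np.array([0] * len(X_major) + [1] * (len(X_minor) + n_new))

# Normalization: rescale every feature to the range [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)
```

After these steps, both classes have 90 samples and every feature value lies between 0 and 1, as described in the studies above.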

B. REVIEW OF FEATURE ENGINEERING TECHNIQUES AND TYPE OF FEATURES
One of the most important aspects of classification and prediction using ML techniques is the selection of the right features or variables. A feature is an individual measurable property of the process being observed. Using a set of features, some ML algorithms can perform classification and prediction tasks [59]. In one study [27], variables such as FH and psychological and genetic factors were the features most widely used for the prediction and detection of AUD. Demographic features, including age, sex, family status, education level, income, and occupation, constitute another set of features widely used in ML studies [60]. A summary of the types of features used in the studies identified in the present systematic review is shown in Table IV. Two studies, [45] and [10], employed factors such as drinking motives, academic performance, psychological factors, demographic items, and drinking behaviour during the last month as features. In comparison, another study [46] considered factors such as a history of drug use at some time in the participants' lives, frequency of alcohol use in their lifetime, in the past 12 months, and in the past 30 days, self-reported impulsivity, motor impulsivity, choice impulsivity, self-reported sensation seeking, and executive function. One study [53] considered FH and EEG signals, while another [54] considered demographics together with clinical records and a history of drug and alcohol use, and another [55] considered social, demographic, and school-related variables. One study [47] also considered demographic variables and FH of patients along with their psychological and health statuses. Among all selected studies, only one [51] considered genetic variables along with demographic variables. Two studies, [50] and [49], considered discharge data based on ICD-10 codes, while one [52] considered laboratory results. One of the critical steps in developing a predictive model using ML techniques is feature engineering [61], [62]. In the literature, this process is divided into two categories: feature extraction and feature selection [63]. Their aims are to solve problems such as high dimensionality, data sparsity, feature redundancy, a vast number of features, and high noise levels. Feature extraction projects features into a new feature space with lower dimensionality, and the newly constructed features usually comprise combinations of the original features.
Multiple correspondence analysis (MCA), principal component analysis (PCA), linear discriminant analysis (LDA), and canonical correlation analysis (CCA) are examples of feature extraction and dimensionality reduction techniques. In the literature, [45], [10], and [46] employed feature extraction techniques to overcome dimensionality problems in their datasets.
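Of the techniques listed above, PCA is the most common and serves as a compact illustration of feature extraction. The data and the choice of five components below are illustrative assumptions, not settings from the cited studies.

```python
# Sketch: feature extraction with PCA on synthetic data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))      # 100 samples, 20 original features

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)    # 5 new features, each a linear
                                    # combination of the original 20
explained = pca.explained_variance_ratio_.sum()
```

Unlike feature selection, none of the five extracted features corresponds to a single original variable, which is why extracted features can be harder to interpret clinically.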
Feature selection approaches aim to select a small subset of features that minimizes redundancy and maximizes relevance to the target. These approaches can be divided into three groups: filter, wrapper, and embedded [63]. The filter method selects features based on their statistical characteristics [64]. In the filter method, features are ranked according to specific criteria, and the highest-ranked features are then used to compose a feature subset. Some of the most common filter methods are information gain, mutual information, chi-square, the Fisher score, and relief [65]. The filter method does not rely on the performance of the ML algorithms; therefore, classifier bias cannot be taken into account in this feature selection method [66]. Among the primary studies, one [53] used the least absolute shrinkage and selection operator (LASSO) [67], one [47] employed t-tests and permutation testing, one [51] applied Pearson correlation coefficients, one [49] applied chi-square tests and considered the frequency of ICD-10 codes, and one [52] employed information gain as filter methods to select the best features.
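Two of the filter criteria named above, chi-square and mutual information, can be sketched with scikit-learn's `SelectKBest`; both rank features without ever fitting a classifier. The synthetic dataset and the choice of k=5 are illustrative assumptions.

```python
# Sketch: filter feature selection with chi-square and mutual information.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=15, n_informative=4,
                           random_state=0)

# chi2 requires non-negative inputs, so rescale features to [0, 1] first
X_pos = MinMaxScaler().fit_transform(X)

X_chi = SelectKBest(chi2, k=5).fit_transform(X_pos, y)        # chi-square
X_mi = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)  # mutual info
```

Because the ranking depends only on the statistics of each feature against the target, the same subset is produced regardless of which classifier is trained afterwards, which is both the strength (speed) and weakness (classifier bias ignored) noted above.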
The wrapper method uses the accuracy of the predictive model to evaluate the quality of the selected features. In the wrapper method, subsets of the features are searched, and their importance is evaluated based on the performance of the classifier; these steps are repeated until the best features remain [68]. The primary disadvantage of this method is that the classifier has to be run many times to identify the best features, which makes it computationally expensive for datasets with a large number of features [69]. Among all identified studies, one [10] employed a 1-norm support vector machine as a wrapper method for selecting the best features. Finally, the embedded method uses a filter method to pick candidate features and then employs a wrapper method to select the best features based on the accuracy of the ML model [70]; consequently, the disadvantages of the filter and wrapper methods are mitigated by such a hybrid method.
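Forward sequential selection is a simple instance of the wrapper idea: the classifier is refit to score every candidate subset at each step. This is only an illustrative sketch; the study cited above used a 1-norm SVM, not this procedure, and the dataset and parameters here are assumptions.

```python
# Sketch: wrapper feature selection via forward sequential selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=150, n_features=10, n_informative=3,
                           random_state=0)

# At each step, every remaining feature is tried and the classifier is
# refit and cross-validated -- this repeated refitting is exactly what
# makes wrapper methods computationally expensive.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=4,
                                direction="forward")
X_selected = sfs.fit_transform(X, y)
```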
In the literature, one study [54] applied the embedded method in the form of stepwise multivariable logistic regression [71], [72]. Moreover, one [45] applied recursive feature elimination as an embedded method to select the best feature set. More information regarding the feature engineering process of each study is provided in Table IV.
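An embedded method can also be sketched briefly: an L1-penalised logistic regression selects features as a by-product of model fitting, so selection and classifier construction happen in a single step. This particular technique is an illustrative stand-in, not the stepwise regression or recursive elimination used in the cited studies, and all settings below are assumptions.

```python
# Sketch: embedded feature selection via L1-penalised logistic regression.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=12, n_informative=3,
                           random_state=0)

# The L1 penalty drives the coefficients of uninformative features to
# zero during training; SelectFromModel keeps only the non-zero ones.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
selector = SelectFromModel(l1_model).fit(X, y)
X_selected = selector.transform(X)
```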

C. REVIEW OF PREDICTIVE MACHINE LEARNING APPROACHES AND PERFORMANCE METRICS
ML is an important domain of artificial intelligence; it enables machines to learn and act on specific tasks. As discussed earlier, SML consists of techniques and algorithms that can predict future events or classify data by learning the patterns of existing data. Generally, discriminative and generative algorithms can be employed in SML models [73]. Logistic regression is an example of an SML algorithm widely used in classification problems; it builds a classifier by fitting a logistic curve to the variables [74]. Another well-known classification algorithm is the support vector machine (SVM) [75], which attempts to identify the best decision boundary for the given data and can achieve excellent predictive performance. K-nearest neighbours (KNN) is an algorithm that classifies an unknown instance based on its neighbours' classifications [76]; that is, it labels targets by checking the class labels of the K nearest points in the feature space. The decision tree is another powerful algorithm for modelling data that benefits from tree-like structures to classify given data [77]. Random forest is an example of ensemble learning that consists of several decision trees [78]. A neural network (NN) is another ML approach that has been identified as a powerful method for some complex ML tasks [79]. In NN algorithms, a network of cells is produced, and the connections between the cells are adjusted in a way that enables the network to learn the structure of the training data. NNs can extract higher-level features from the input data.
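The algorithm families described above can be compared side by side on the same data, mirroring the model-comparison studies reviewed below. The dataset and cross-validation setup are illustrative assumptions, not configurations from any of the selected studies.

```python
# Sketch: comparing the SML algorithm families on one synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "neural network": MLPClassifier(max_iter=2000, random_state=0),
}
# Mean 5-fold cross-validated accuracy per algorithm
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
```

As the 'no free lunch' theorem discussed later implies, the ranking produced by such a comparison holds only for the dataset at hand.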
In the literature related to the prediction of AUD, SVM was the most popular algorithm among researchers, followed by logistic regression. As shown in Table V, [10], [45], [47]-[49], [51]-[53], [55] employed SVM to build predictive models. Logistic regression was identified as the second most common algorithm in the literature related to the prediction of AUD [10], [48], [52], [54]. Some studies aimed to compare different algorithms to identify the one with the highest accuracy. In two studies, [55] and [52], six different ML algorithms were compared to identify the best accuracy. This was also done by other studies [10], [47], [48], [51] on a smaller scale by comparing two, three, and five different algorithms. Three studies, [51], [52], and [50], employed different types of NN algorithms for the prediction of AUD: one [50] developed a deep NN (DNN) application for the prediction of patients with AUD, one [52] used several ML algorithms, including NN, to identify unhealthy drinkers, and another [51] developed an NN application as well as a convolutional neural network (CNN) combined with long short-term memory (LSTM) to classify alcohol-dependent and non-alcohol-dependent groups.
The performance of a constructed predictive ML model can be measured using several evaluation metrics, such as accuracy, the F1 score, the ROC curve and AUC, sensitivity, specificity, and macro- and micro-averaging of accuracy. Four primary counts can be used to compute the values of these metrics: true positives (TPs), false positives (FPs), false negatives (FNs), and true negatives (TNs). A detailed discussion of these performance metrics can be found in a previous study [80]. Different types of performance assessments were used to determine the predictive performance of the constructed ML models in the selected primary studies. The most commonly used performance metrics were the ROC curve and AUC, which were used by eight studies. Table V shows the ML algorithms used in the experiments of the relevant identified studies together with the performance measures, as well as the proportions of the training and test sets.
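How the four primary counts yield these metrics can be shown on a small worked example. The labels and scores below are invented for illustration only.

```python
# Sketch: computing evaluation metrics from TP, FP, FN, and TN counts.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # 4 positives, 6 negatives
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])   # predicted labels
y_score = np.array([.9, .8, .7, .4, .6, .3, .2, .2, .1, .1])  # probabilities

tp = int(((y_true == 1) & (y_pred == 1)).sum())     # 3
fp = int(((y_true == 0) & (y_pred == 1)).sum())     # 1
fn = int(((y_true == 1) & (y_pred == 0)).sum())     # 1
tn = int(((y_true == 0) & (y_pred == 0)).sum())     # 5

precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.75 (= sensitivity)
specificity = tn / (tn + fp)                        # 5/6
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.8
f1 = 2 * precision * recall / (precision + recall)  # 0.75

# AUC works on the continuous scores rather than the hard labels
auc = roc_auc_score(y_true, y_score)
```

Note that the AUC is computed from the predicted probabilities, not the thresholded labels, which is why it can rank models more finely than accuracy alone.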

IV. DISCUSSION
This systematic review examined the application of ML-based methods for the prediction of AUD. The primary studies used for this SLR were extracted from six academic databases based on different exclusion and inclusion criteria and were examined from different aspects, including data collection sites and the characteristics and types of datasets. Moreover, pre-processing and sampling techniques, types of features and techniques for feature reduction and selection of the most relevant subset of features, the ML methods utilized, and the evaluation metrics used in the performance evaluation of ML models were evaluated.
The findings showed that researchers used different sources to collect datasets. This review revealed that many studies developed datasets based on single collection sites, in which two major weaknesses were detected. First, the predictive model was developed using a one-dimensional dataset, collected from a single source of information and a single type of subject. A classification model trained on this kind of dataset cannot be used on a wide scale, because many sources and subjects are available for creating sample datasets and predicting AUD, and each source may have its own population characteristics, with different features and parameters. Therefore, multidimensional datasets, collected from different sources and from a variety of subjects related to the prediction of AUD, are recommended for the development of a predictive model. Multidimensional datasets can produce more accurate predictive models and can be used on a wider scale.
Second, some studies used datasets that contained insufficient sample data [81]; hence, the reported predictive models may have suffered from overfitting or underfitting. Although the studies employed many techniques to overcome these limitations, many of the datasets require additional records to achieve excellent predictive accuracy [82] for AUD. Some datasets that were used for predicting AUD in the selected studies suffered from class imbalance. This refers to a problem in which the sample size is not equally distributed across the classes, an issue that usually results in a high rate of false negatives [83] in the final predictive model. For instance, one study [55] used a publicly available dataset to predict AUD in which the training set consisted of 595 samples of the majority class and only 67 samples of the minority class. In other words, approximately 89% of the sample data carried class label 0 (non-event), and approximately 11% carried class label 1 (event). Therefore, it is recommended that researchers who use imbalanced data apply an appropriate sampling method. The basic idea of sampling techniques is to produce balanced classes by either adding or removing sample data from the primary dataset. Sampling methods may reduce learning time while accelerating execution time in supervised learning algorithms [84] and can outperform bagging and boosting [85]. Sampling methods can be categorized into three groups: oversampling (increasing the size of the minority class to obtain balanced classes), undersampling (drawing a random subset of samples from the majority class to balance the classes), and hybrid sampling (a combination of undersampling and oversampling) [86].
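The first two sampling strategies can be sketched directly on the 595/67 split cited above, using simple random resampling (the SMOTE-style synthesis of new samples is omitted here; the class sizes match the cited example, but the feature values are simulated assumptions).

```python
# Sketch: random over- and undersampling of an imbalanced dataset.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_major = rng.normal(0, 1, size=(595, 4))   # majority class (label 0)
X_minor = rng.normal(1, 1, size=(67, 4))    # minority class (label 1)

# Oversampling: grow the minority class to the majority size by
# sampling its records with replacement
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major),
                      random_state=0)

# Undersampling: draw a random subset of the majority class down to
# the minority size
X_major_down = resample(X_major, replace=False, n_samples=len(X_minor),
                        random_state=0)
```

Oversampling keeps all majority information but duplicates minority records, while undersampling discards majority records; hybrid sampling balances these trade-offs by doing some of each.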
In this paper, several features were identified from the studies on AUD prediction. The most commonly used features included demographic features, such as age and gender, drinking behaviour, school-related variables, psychological variables, and health-related information. The feature engineering techniques used in the process of predicting AUD were also reviewed. Filter methods were the most widely used feature selection approaches [47], [49], [51]-[53]. Feature selection for classification tasks can generally be categorized into filter, wrapper, and embedded methods [87]. In filter methods, features are selected according to general characteristics of the dataset and measures (such as consistency, dependency, correlation, distance, and information) of the features with respect to the target values, independently of any classifier [64]. Wrapper methods use a specific classifier to evaluate the quality of candidate feature subsets and report the features with the highest quality. Previous studies [68], [88] reported that the wrapper model achieves better predictive accuracy than the filter model but is computationally expensive. Therefore, some studies used embedded models to extract the best features for classification tasks. Embedded models are hybrid techniques that combine the advantages of filter and wrapper models: because they integrate feature selection into classifier construction and make use of filter criteria, they are less computationally intensive than wrapper methods [89]. In the identified studies, only one [54] used an embedded model for feature selection.
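The three families of feature selection methods can be illustrated with scikit-learn on synthetic data; the estimators, parameter values, and dataset below are illustrative choices, not those of the reviewed studies:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Filter: rank features by an ANOVA F-score, independent of any classifier.
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper: recursive feature elimination driven by a specific classifier.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: selection happens inside model training via L1 regularization.
emb = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)
```

The filter run is cheapest (no classifier training per candidate subset), the wrapper run the most expensive, mirroring the computational trade-off noted above.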
The 'no free lunch' theorem [90] indicates that obtaining the most accurate ML-based application in a domain requires testing a variety of ML algorithms to determine which one performs best on the collected dataset. Each of the identified studies used its own customized dataset, with its own experimental settings; therefore, statistically comparing the performance of these studies is not practical. Out of the twelve primary studies, four [10], [45], [49], [53] used SVM for predicting AUD. One [46] used linear regression, one [54] used logistic regression, and one [50] employed a DNN for the prediction of patients with AUD. Among the studies that compared several algorithms, two [55], [52] compared six different classifiers, one [47] compared four classifiers, and one [51] compared three different classifiers for predicting AUD. These results indicate that traditional machine learning algorithms were more dominant than NN algorithms in AUD-related studies. The primary benefit of NNs, in comparison to traditional machine learning algorithms, is that feature engineering by human experts is not needed: during the learning process, features are automatically learned from the training dataset. However, a predictive model based on NNs is a black box [91], which creates adoption difficulties for clinicians in health care-related studies, since clinicians prefer to fully understand and justify the actions for which they are ultimately responsible [92].
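The 'no free lunch' implication, namely that candidate algorithms should be compared on the dataset at hand, can be sketched as follows; the candidate set and synthetic data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in for a collected AUD dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate algorithms, compared under identical 5-fold cross-validation.
candidates = {
    "SVM": SVC(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(random_state=0),
}
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in candidates.items()}
best = max(scores, key=scores.get)
```

On a different dataset a different candidate may win, which is precisely the point of the theorem.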
A review of the primary studies' findings showed that researchers prefer standard evaluation metrics, such as accuracy, sensitivity, and specificity, for determining the success of the developed predictive models. Moreover, almost all studies used one part of the selected dataset as a training set and the other part as a test set. Furthermore, none of the primary studies conducted external validation using unseen datasets.

V. FUTURE RESEARCH DIRECTIONS
This SLR reviewed twelve studies that used ML techniques for predicting AUD. The reviewed aspects were inspired by the basic ML development method shown in Figure 1. In this section, we present various research directions and problems that the selected studies were unable to address. These directions are organized according to the four steps of building a predictive model using ML techniques: dataset collection; pre-processing and sampling techniques; feature engineering, covering feature types, dimensionality, and feature selection techniques; and the ML technique utilization process. Addressing these research problems and challenges with different techniques is necessary to increase the feasibility and performance of AUD prediction. The research challenges that need to be addressed are discussed below.

A. DATASET COLLECTION, PRE-PROCESSING AND SAMPLING FOR THE PREDICTION OF AUD
A dataset's quality is one of the primary research problems that must be considered in future research. As shown in Table III, only three primary articles used a dataset collected from multiple sites and a combination of multiple data sources. One of the main problems arising from the lack of multi-site, multi-source (multidimensional) datasets is losing the chance to capture different documentation patterns or styles in the development of the predictive model; without this variety, the final classifiers may generalize poorly. Data collected from multiple sources, or the use of multidimensional datasets, may reduce the risk of poor generalization in the final classification models. Moreover, multidimensional datasets make it possible to build a predictive model on one dimension and validate it on another. For example, a predictive model developed from a dataset combining patients' EHRs and interviews from a Western country could be validated using a dataset from an Eastern country.
The lack of multidimensional datasets may increase the risk of bias in the developed predictive models. Although the methods proposed in the primary studies achieved reasonably accurate results, those results might be biased, since most studies collected data from a single site. Nevertheless, since multidimensional datasets usually take the form of big data, big data tools and techniques must also be considered. Moreover, when collecting data from multiple sources, researchers must address security and privacy issues, especially when dealing with data collected from hospitals.
The availability of public datasets is another issue that needs to be considered. Although a few studies did not provide details of the datasets they used, all reported reasonably high prediction accuracy for AUD. This may expose the domain to publication bias, since experiments with poor results can remain undisclosed. To overcome this issue, public standard datasets must be made available for benchmarking. Therefore, future works should publish their datasets in the form of public corpora to advance AUD prediction.
Furthermore, class distribution in many of the datasets was imbalanced. Numerous studies tried to overcome the bias towards the majority class using sampling techniques. However, imbalanced data remains a major challenge in ML studies [93]. Future studies on the prediction of AUD may employ various techniques, including resampling and reweighting, to overcome the challenges of imbalanced data [86].
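As an illustration of the reweighting alternative to resampling, the following sketch uses scikit-learn's class_weight='balanced' option on synthetic imbalanced data; the estimator, parameters, and data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 9:1, echoing the distributions noted above.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Unweighted baseline versus a model whose loss upweights the minority class.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

# Reweighting typically raises minority-class recall (fewer false negatives),
# usually at some cost in precision.
r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
```

Unlike resampling, reweighting leaves the dataset itself untouched and instead adjusts the per-sample contribution to the training loss.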

B. FEATURE ENGINEERING IN THE PREDICTION OF AUD
Several studies reported that demographic features, drinking behaviour, and educational background were the most useful features for AUD prediction. The review of the selected studies showed that features were extracted primarily from survey questionnaires and that filter and wrapper models were the dominant feature selection techniques. Future studies should therefore consider new datasets that include different types of features and apply a wider variety of feature extraction and feature selection techniques. This would help them present potential features for the prediction of AUD to clinical scientists and improve the performance of ML applications. Future researchers may also design feature selector frameworks that select the best subset of features and expose clinical biomarkers and risk factors for the development of AUD. This would help the research discipline keep up with new trends in the prediction of AUD.
The limited use of EHRs as datasets for the prediction of AUD is another issue that must be considered. As shown in Table IV, only three primary articles considered EHRs for the development of predictive models for AUD. EHR datasets are the medical records of patients stored in hospital databases. They contain information on patient admissions and hospital visits, along with diagnoses and treatments, which can be stored in the form of ICD codes. Moreover, these datasets include clinical records such as laboratory results in the form of numeric values, magnetic resonance imaging of the brain with textual descriptions of the results, and clinical reports by general practitioners in the form of text or audio. Extracting features from such datasets can give clinicians a better understanding of the most important features, which could have a large impact on the accuracy of ML-based predictive models for AUD.
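A minimal sketch of turning such ICD-coded records into model-ready features is shown below; the patient identifiers and ICD-10 codes are purely illustrative:

```python
# Hypothetical admission records: each patient maps to the ICD-10 codes
# recorded across their hospital visits (codes here are illustrative).
records = {
    "patient_1": ["F10.2", "K70.3", "I10"],
    "patient_2": ["I10", "E11.9"],
    "patient_3": ["F10.2", "E11.9", "K70.3"],
}

# Build a binary (one-hot) feature matrix: one column per observed ICD code,
# 1 if the patient's record contains that code and 0 otherwise.
vocab = sorted({code for codes in records.values() for code in codes})
matrix = {pid: [int(code in codes) for code in vocab]
          for pid, codes in records.items()}
```

Real EHR pipelines would additionally handle code hierarchies, visit timestamps, and free-text fields, but the binary code matrix above is a common starting representation.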

C. MACHINE LEARNING TECHNIQUES IN THE PREDICTION OF AUD
As shown in Table V, only three studies used NN algorithms for the prediction of AUD. Thus, future research may consider investigating NNs and DNNs for the prediction of AUD. One of the primary advantages of NN algorithms is that the feature engineering step by a human expert can be skipped [94]: during the learning process, features are automatically learned from the training data. Therefore, NN approaches might be the most suitable choice for ML-based predictive models in clinical problems with high-dimensional data, where a human expert is unable to reduce the dimensionality of the datasets.
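The point that NNs learn intermediate representations directly from the data can be illustrated with a small multilayer perceptron in scikit-learn; the architecture and synthetic dataset are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in for a high-dimensional clinical dataset.
X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The hidden layers learn intermediate representations directly from the
# training data, so no hand-crafted feature engineering step is required.
net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)
score = net.score(X_te, y_te)
```

For genuinely high-dimensional data (e.g., imaging), deeper architectures in a dedicated framework would be the natural next step, but the representation-learning principle is the same.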
Primary health care facilities are normally the first place where most patients with AUD have their initial contact with the health care system [95]. Nevertheless, most of the world's primary health care systems lack systematic screening for AUD [96]. One study [95] stated that interventions for the treatment of AUD could begin with brief advice or guidance to discourage hazardous drinking behaviour. However, at harmful levels of drinking, more serious treatments, such as lifestyle changes and additional psychological and pharmacological treatments, are required [95], [97]. Integrating these into primary health care remains a challenge in many countries, owing to the shortage of qualified staff and general practitioners' lack of familiarity with formal psychological therapies [97]. One of the factors enabling the early detection or prediction of harmful drinking is the detection of the early initiation of hazardous drinking [96], [98].
Our findings indicate that this issue has not been considered in the primary studies. Therefore, the early detection of AUD in patients based on the progression of their alcohol misuse is a necessity, since it has been neither thoroughly analysed nor reported. For this to happen, the temporal sequence of AUD data must be fed into the ML algorithms. Assuming that a national dataset were collected and labelled with sequences indicating the progression of AUD levels in patients (based on blood tests, clinical courses, or questionnaires), early detection of patients with AUD could be achieved by designing and developing a sequential ML application.
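One simple way to feed such a temporal sequence into an ML algorithm is to convert it into sliding-window training pairs; the severity scores below are hypothetical and only illustrate the data preparation step, not a validated clinical measure:

```python
import numpy as np

def lagged_features(series, window=3):
    """Turn a per-patient temporal sequence of severity scores into
    (sliding-window, next-step) training pairs for a sequential model."""
    X, y = [], []
    for t in range(len(series) - window):
        X.append(series[t:t + window])   # the last `window` observations
        y.append(series[t + window])     # the value to be predicted next
    return np.array(X), np.array(y)

# Hypothetical yearly severity scores for one patient (e.g., derived from
# questionnaires or blood tests), showing progression towards hazardous use.
scores = [1, 1, 2, 2, 3, 4, 4, 5]
X, y = lagged_features(scores, window=3)
```

The resulting pairs can be fed to any supervised learner; recurrent architectures would instead consume the raw sequences directly.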
Alcohol consumption has traditionally been a male-dominated activity, with men consuming more alcohol and causing more alcohol-related damage to themselves and others than women [99], [100]; female drinkers consume only approximately one-third of the overall amount of alcohol consumed by male drinkers each year [100]. However, the gap in AUD between men and women is shrinking [99]. The rising rate of AUD in women has become a concern, since women encounter the adverse health and behavioural effects of alcohol use earlier and at lower consumption levels than men [101]. Several factors distinguish AUD between the sexes. Women are typically smaller than men, with lower total body water and higher total body fat; as a consequence, alcohol is absorbed more readily in a woman's body, and women's blood alcohol concentration rises more rapidly and remains elevated for longer than men's [102]. Additionally, one study [102] noted that gender variations exist in brain structure, neurochemistry, and function. Therefore, we strongly suggest that researchers consider gender disparities when building predictive models for the prediction of AUD.
EHRs usually contain patient admission and discharge data for each admission episode, stored in the form of ICD codes. Such data often contain information about risk factors for the development of a disease. Recently, owing to the growing number of EHRs, social network analysis (SNA) has gained attention among scientists for predicting disease risk [103] and for identifying patterns and the nature of disease comorbidities [104]. SNA can help in understanding AUD progression by considering the comorbidities that occur over a period prior to the development of AUD. This would also yield an AUD trajectory, which can be used for the prediction of AUD with ML techniques.
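A minimal sketch of building such a comorbidity network from per-patient diagnosis histories is shown below; the patients and ICD-10 codes are purely illustrative, and edge weights simply count how often two diagnoses co-occur in the same patient:

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-patient diagnosis histories (illustrative ICD-10 codes:
# F32 depression, F10.2 AUD, K70.3 alcoholic cirrhosis, I10 hypertension).
patients = [
    ["F32", "F10.2", "K70.3"],
    ["F32", "F10.2"],
    ["I10", "F10.2", "K70.3"],
]

# Edges of the comorbidity network: weighted by within-patient co-occurrence.
edges = Counter()
for codes in patients:
    for pair in combinations(sorted(set(codes)), 2):
        edges[pair] += 1

# Diagnoses most strongly connected to AUD (F10.2) suggest trajectory steps.
aud = "F10.2"
aud_links = {b if a == aud else a: w
             for (a, b), w in edges.items() if aud in (a, b)}
```

In a full SNA study, graph measures such as centrality over this network, combined with the temporal ordering of diagnoses, would feed the trajectory analysis described above.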
In the primary studies, several evaluation metrics, such as precision, recall, F1-score, predictive accuracy, ROC, and AUC, were used to evaluate the performance of the developed predictive models. However, almost all models were trained and tested on datasets split by a simple holdout or k-fold cross-validation scheme. If such internal performance were the sole evaluation criterion, the developed model could not be justified as a real-world predictive model. The performance of a predictive model depends on a variety of considerations, including the specific application, the collected data, the sample size, and the data quality. We suggest that, in future work, researchers establish standard evaluation protocols, encourage comparative studies, and use external datasets (national and international, or multidimensional) to validate their predictive models' performance.
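The distinction between internal validation (cross-validation on the development data) and external validation (evaluation on a dataset never touched during development) can be sketched as follows; here a second synthetic dataset merely stands in for an external cohort from another site:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# "Development" cohort: internal validation via 5-fold cross-validation.
X_dev, y_dev = make_classification(n_samples=400, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)
internal_auc = cross_val_score(clf, X_dev, y_dev, cv=5,
                               scoring="roc_auc").mean()

# "External" cohort: simulated here with a different random seed standing in
# for data from another site; in practice this would be a truly unseen dataset.
X_ext, y_ext = make_classification(n_samples=200, n_features=10, random_state=1)
clf.fit(X_dev, y_dev)
external_auc = roc_auc_score(y_ext, clf.predict_proba(X_ext)[:, 1])
```

A large gap between the internal and external scores is exactly the generalization failure that external validation is designed to reveal.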

VI. THREATS TO VALIDITY
This section discusses potential threats to our SLR's validity.

A. SEARCH STRATEGY AND SEARCH RESULT
The search strategy included selecting digital libraries and searching for predefined keywords. This step may be jeopardized by factors such as missing or excluding relevant articles. To mitigate this risk, we used three strategies. First, the first author and a research librarian identified the primary search keywords and search queries. Second, we created search queries with different strings by combining the selected keywords and their synonyms using MeSH terms and Boolean logic. Third, to increase the likelihood of identifying relevant articles, we ran the search queries on the six digital libraries most relevant to our scope. We did not apply the snowballing process because our search yielded 3,355 papers, which we believe covered most of the papers relevant to our scope. Although we considered six digital libraries, we did not consider the Scopus digital library, which, together with possible human error, may have increased the chance of missing relevant studies. Therefore, we cannot state that the results of this SLR cover all published studies on the prediction of AUD.

B. DATA SCREENING AND SELECTION CRITERIA
Article screening and selection may also pose a threat to validity, as the authors' subjective judgement in selecting research could result in the exclusion of important articles or the inclusion of irrelevant publications. To mitigate this risk, we established the inclusion and exclusion criteria in advance, with all authors participating in their validation, and then adhered strictly to these criteria throughout the article selection process. As stated in Section II, this SLR applied twelve main inclusion and exclusion criteria for data screening; nevertheless, a misapplication of the exclusion procedure could have led to an article being erroneously excluded.

C. QUALITY ASSESSMENT AND DATA EXTRACTION
In the process of data extraction and quality assessment, threats may arise from incomplete information extraction from the primary studies and from their poor quality. To avoid threats to the validity of the quality assessment, a procedure was created and voted on by three authors; however, the existence of bias in the primary studies was not discussed in this SLR. To avoid threats to the validity of data extraction, we strictly followed the five aspects of building predictive models presented in Figure 1, and the extracted information was then discussed by all authors.

VII. CONCLUSIONS
This SLR presented a comprehensive review of twelve studies that used ML techniques to predict AUD. The articles, published from 2010 to 2021, were systematically extracted from six academic databases. These studies were comprehensively reviewed from five aspects: collection sites; types and characteristics of datasets; data pre-processing and data sampling techniques; feature types, feature selection, and feature extraction techniques; and ML algorithm utilization and performance evaluation metrics. Several datasets with unique characteristics were identified and reported; in most of the selected studies, the investigators used their own collected datasets. Different sampling and pre-processing techniques were used to overcome the imbalanced class distribution in the datasets and to remove noise and irrelevant information. In terms of features, demographic features together with FH, study-related features, and clinical features in the form of ICD codes were identified. To overcome high dimensionality and feature redundancy, several feature selection methods, including filter, wrapper, and embedded methods, were used. With respect to ML algorithms, most studies used SVM as the main algorithm for predicting AUD; however, the lack of deep learning techniques for predicting AUD was evident and is suggested as one of the future research challenges in this field. Moreover, the lack of predictive models for the early detection of AUD, the need to consider gender disparities, and the lack of a trajectory network or path towards AUD based on EHRs are other important research directions suggested in this study. Regarding performance metrics, the overall accuracy and the ROC curve were the most popular evaluation methods among the studies. However, the lack of external validation in the primary studies is an important issue that must be addressed in future works.
The significance of these review findings was discussed in the separate discussion section. This comprehensive literature review provides unique insights into AUD prediction studies using ML techniques published during the past decade and outlines challenges and open issues that require additional attention in the future.

Table 1: Quality assessment criteria of the twelve included studies.