Speech and language processing with deep learning for dementia diagnosis: A systematic review

Dementia is a progressive neurodegenerative disease that burdens the person living with the disease, their families, and medical and social services. Timely diagnosis of dementia can be followed by interventions that may slow its progression or reduce its burdens. However, the diagnostic process for dementia is often complex and resource intensive. Access to diagnostic services is also an issue in low- and middle-income countries. The abundance and easy accessibility of speech and language data have created new possibilities for utilizing Deep Learning (DL) technologies as part of the dementia diagnostic process. This systematic review included studies published between 2012 and 2022 that utilized such technologies to aid in diagnosing dementia. We identified 72 studies using the PRISMA 2020 protocol, extracted and analyzed data from these studies, and reported the related DL technologies. We found that these technologies effectively differentiated between healthy individuals and those with a dementia diagnosis, highlighting their potential in the diagnosis of dementia. This systematic review provides insights into the contributions of DL-based speech and language techniques to supporting the dementia diagnostic process. It also offers an understanding of the advancements made in this field thus far and highlights some challenges that still need to be addressed.


Introduction
As a neurodegenerative disease characterized by cognitive decline and functional impairment, dementia is a major challenge in clinical practice. The major causes of dementia include Alzheimer's disease (AD) and vascular and frontotemporal dementia (Knopman et al., 2003). Dementia affects cognitive functions, such as language, perceptual, and executive functions, and often affects individuals' behavior (Thabtah et al., 2020). It is an irreversible and progressive disease ranging from preclinical to mild, moderate, and severe stages (Vermunt et al., 2019). Dementia is a relatively common condition in adults aged 65 years and above (Jarvik et al., 1982), and no cure has yet been discovered (Nistico and Borg, 2021). Similarly, Mild Cognitive Impairment (MCI) is a diagnosis that refers to mild subjective and objective cognitive dysfunction without significant functional impairment. MCI increases the risk of developing dementia, but not all individuals with MCI will progress to dementia. Dementia is often under-diagnosed, and with the aging population globally, there is an increased need for smart technologies that facilitate remote diagnosis and improve access to healthcare services (Gauthier et al., 2021; Ibáñez et al., 2021).
Detecting dementia at an early stage allows for the implementation of timely interventions and treatment strategies, which may help manage symptoms, slow the progression of the disease, and improve the individual's quality of life. Various medications and therapies are more effective when initiated early, potentially providing better outcomes (Livingston et al., 2017; Ministry of Health New Zealand, 2021). However, the early and accurate diagnosis of dementia can be challenging since it requires clinical judgment, including history taking, cognitive assessment, and investigations to exclude other medical and psychiatric causes of cognitive impairment (Johnson et al., 2021). Globally, it is estimated that 75 % of individuals with dementia are not diagnosed (Gauthier et al., 2021).
Using data-driven tools such as Artificial Intelligence (AI) and Machine Learning (ML) technologies in healthcare provides new opportunities for clinicians to deliver their services more efficiently. ML technologies can provide readily accessible and cost-effective clinical tools. They can form part of the dementia diagnostic process and support clinicians in diagnosing dementia accurately. In particular, Deep Learning (DL) algorithms, a subfield of ML, have become the most effective technology for processing unstructured big data and learning complex data patterns.
DL focuses on developing Artificial Neural Networks (ANNs) that mimic the structure and functionality of the human brain and are capable of learning and making intelligent decisions without being explicitly programmed. In DL, the term "deep" refers to the multiple hidden layers in a neural network architecture. Deeper layers allow the network to learn increasingly complex patterns and relationships within the data, enabling it to capture intricate and nonlinear representations (Tirumala and Shahamiri, 2017). Deep ANNs are also known as Deep Neural Networks (DNNs). DL applications in dementia diagnosis have progressed gradually in recent years, leading to notable achievements. For instance, Vuong et al. (2020) employed DNNs in wearable sensing devices to track the wandering patterns of individuals, thereby enabling the recognition of dementia. Computer vision techniques have also been used to identify dementia by analyzing magnetic resonance imaging (MRI) brain scans (Ucuzal et al., 2019).
Automatic Speech and Language Processing (SLP) approaches can effectively identify cognitive impairment (Mueller et al., 2018). These approaches leverage the power of DL algorithms and can easily collect speech and language biometrics to identify indicators of dementia. DL technologies can automate, to a significant extent, the entire pipeline for processing speech and language data and learning dementia indicators. Once DL models capture how dementia affects speech and language patterns, they can assist clinicians in diagnosing dementia.
Unlike other forms of artificial intelligence, DL systems have demonstrated superior capabilities and accuracy in processing and modeling speech and textual data. In contrast, systems that rely on manual interventions incur substantial provisioning costs to process complex speech and textual data, potentially limiting access to such systems. While the initial development and design of DL technologies can be expensive and intricate, their automation and deployment via software solutions and cloud services can make them affordable and widely accessible once designed. Thus, considering the convenience of collecting speech and language data, automated technologies for detecting speech traits and language use affected by cognitive impairment have become attractive options. In this study, we systematically report the application of DL algorithms to support the dementia diagnostic process using speech and language data processing.
A few studies have reviewed the use of AI technologies in supporting the dementia diagnostic process. For instance, the potential of AI in detecting AD and MCI by utilizing speech and language features and picture description tasks was discussed by Mueller et al. (2018). These tasks assess the participants' cognitive status by having them describe what they see in a given picture, such as the Cookie Theft Picture (Kaplan and Goodglass, 1981). Other studies have explored the detection of dementia using multimodal AI technologies that integrate various data modalities, including images, audio, and text (Graham et al., 2020; Li et al., 2021; Palliya Guruge et al., 2021). The application of AI and robotics in healthcare has also been examined; the results suggest the potential of conversations as interventions with promising outcomes in patients' speech (Gochoo et al., 2020). Martínez-Nicolás et al. (2021) and Gulapalli and Mittal (2022) reviewed the progress made in automatic SLP studies for detecting AD.
However, these studies did not employ a systematic approach to paper selection and analysis. Consequently, their comprehensiveness, precision, and potential biases could not be determined. Among the related systematic reviews, Petti et al. (2020) systematically reviewed 33 papers and reported the progress made in automatic AD detection from speech and language data. Similarly, de la Fuente Garcia et al. systematically reviewed 51 papers on AI and SLP approaches used for monitoring AD, emphasizing the efficiency of natural language processing (NLP) for detecting AD (de la Fuente Garcia et al., 2020).
However, both systematic reviews focused on the broader concept of ML methods, in which DL was only briefly covered. Furthermore, studies published in the 2020 and 2021 INTERSPEECH dementia recognition challenges were omitted because the last search dates of the two systematic reviews were in 2019. INTERSPEECH is a prestigious annual conference that serves as a prominent international forum for researchers, professionals, and practitioners in speech science and technology.
In contrast, our systematic review focused on using DL approaches for dementia diagnosis through SLP rather than broader ML approaches. This emphasis was driven by the notable success of advanced DL algorithms, which have proven highly effective in handling unstructured sequential data, such as speech and text. DL offers several advantages over other ML techniques in this context, including the automatic extraction of language features, the elimination of manual feature engineering, the ability to model complex relationships that may elude traditional ML algorithms, and a deeper understanding of intricate and subtle changes in speech and language (Goldberg, 2017; Young et al., 2018). Furthermore, DL enables hierarchical representations of the input data, allowing the extraction of meaningful information at different levels of abstraction. These factors contribute to the superiority of DL in diagnosing dementia using SLP.
This systematic review differs from the previous ones in that we a) systematically focused on DL-based speech and language technologies that could be used to diagnose dementia, b) included both the 2020 and 2021 INTERSPEECH challenges in the area of interest, c) narrowed the scope to experiments on the English language for automatic processing, and d) included the latest research up to the end of 2022, which is particularly important as 2020 to 2022 were the most productive years of the field at the time of this writing.
This systematic review answered the following research questions: 1) What were the focus areas of the DL-based SLP research conducted to support dementia diagnosis? 2) What DL algorithms were applied in these studies, and how? 3) What performances were achieved by these diagnostic technologies?
The rest of this paper is structured as follows: Section 2 introduces the methodology used in this systematic review. Section 3 reports the search and selection results. Section 4 presents the findings of our data synthesis. Section 5 discusses our findings. Section 6 concludes the systematic review.

Methodology
In this study, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 protocol described by Page et al. (2021). It was selected because of its suitability for reviewing multidisciplinary studies with clinical considerations. In summary, PubMed, Web of Science, IEEE Xplore, Science Direct, Springer, Engineering Village, and the International Speech Communication Association (ISCA) were queried using the dementia- and deep learning-related keywords explained in Section 2.2. Papers that met the eligibility criteria mentioned in Section 2.3 were considered for inclusion, and their titles and abstracts were studied. Studies that met the inclusion criteria were selected for full systematic data extraction, as explained in Section 2.6. The quality assessment checklist proposed by Kitchenham and Charters was applied to assess bias risks (Kitchenham and Charters, 2007). As a result, the risks were identified as low. The rest of this section describes the PRISMA methodology and process in detail.

Information sources
We searched six online research databases: PubMed, Web of Science, IEEE Xplore, Science Direct, Springer, and Engineering Village. An additional search was performed in the ISCA online archive, a popular venue where speech and language processing researchers present and share their findings annually. All INTERSPEECH 2020 ADReSS (Alzheimer's Dementia Recognition through Spontaneous Speech) and 2021 ADReSSo (Alzheimer's Dementia Recognition through Spontaneous Speech only) challenge papers closely related to the topic of interest are also archived here. We manually searched the archives using a defined timeline to identify the studies that met our inclusion criteria.
We conducted the first database search on 1 September 2021, the first updated search on 31 December 2021, and the last search on 20 January 2023.

Search strategy
The search terms were: (dementia OR "cognitive impairment" OR "cognitive deficit" OR "Alzheimer's Disease") AND ("deep learning" OR "machine learning") AND ("speech processing" OR "automatic speech recognition" OR "natural language processing"). When searching the selected databases, we initially set "year" to "2012-2021" and later updated it to "2022-2023," with "type" as "articles," "in the English language," and "peer-reviewed" as our search filters.

Eligibility criteria
We excluded other review papers and specified our criteria based on experimental studies. We included studies that applied DL algorithms and all types of DNNs or ANNs to provide a complete picture of developments in the field. The inclusion criteria were as follows.
a) Applied DL/DNNs/ANNs for data processing or modeling, b) performed detection or classification of dementia, c) used speech and/or language processing to detect dementia, d) included an English language corpus, and e) were peer-reviewed, accessible, full-text papers in English.

Study selection
First, the titles and abstracts of the retrieved papers were examined using the inclusion criteria. The first screening was more inclusive if the titles and abstracts did not specify the cognitive impairments processed, whether English speech and language datasets were processed, or whether DL techniques were applied. We then strictly screened full-text papers according to the inclusion criteria. We excluded papers that did not provide full-text access or whose authors did not respond to our clarification requests. We carefully reviewed the papers that did not meet the eligibility criteria to ensure that our study selection was comprehensive. For instance, we re-examined studies in which the ML model was not explicitly identified as DL or ANNs or in which cognitive impairments were mentioned without clearly indicating a focus on dementia. Our team, comprising a speech and language processing expert, a DL expert, and a clinical psychiatrist specializing in dementia, facilitated this evaluation. Our collective expertise enabled a thorough discussion to determine whether the papers met the inclusion criteria.

Data collection
Information was collected directly from the selected papers and saved in Excel tables for subsequent analysis. One author searched and extracted all the required data to maintain the consistency of the criteria for the review.

Data items
Inspired by the SPICMO tables (Study aims and designs, Population, Interventions, Comparison groups, Methodology, and Outcomes) (de la Fuente Garcia et al., 2020), we systematically extracted information from the selected papers following a similar pipeline. Each study's research focus and aims were examined to answer the first research question, which explored the research focus of the field and the study context (Champiri et al., 2015). Our second research question was then answered by extracting the datasets used for processing, information relevant to computational feature extraction, the DL techniques, and the overall systems built by the researchers, to explore how DL was applied in building state-of-the-art technologies. Finally, to answer the third research question, we examined the outcomes of each study and summarized their performance in dementia detection.
We present and interpret DL systems in both narrative and tabular forms to explain complex architectures, because they are often composed of different types of layers that cannot be easily classified as one DNN type, such as Long Short-Term Memory (LSTM) networks or Convolutional Neural Networks (CNNs).

Risk of bias assessment
We identified and classified the possible risks of bias as low because they did not severely influence the objectivity and quality of this review. The quality assessment checklist proposed by Kitchenham and Charters was applied to the paper selection and data extraction processes (Kitchenham and Charters, 2007). We identified and applied the 14 questions in Table 1 and excluded papers that failed the assessment. Disagreements were resolved by the authors of this review through group discussions when applicable.

Effect measures
As we considered DL the intervention method for this study, we selected evaluation metrics that explained how DL systems were evaluated in the resulting studies. The performance of DL models is usually evaluated using hold-out or cross-validation procedures, in which evaluation metrics such as error rate, accuracy, precision (the ratio of correctly predicted dementia patients out of all participants predicted as dementia-positive), recall (the ratio of correctly identified dementia patients out of all patients with a dementia diagnosis), F-scores (such as the F1 score, which combines the precision and recall of a model), and receiver operating characteristic or area under the curve (ROC-AUC) scores are measured. These metrics were therefore extracted and compared in this study.
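To make these definitions concrete, the following minimal Python sketch (illustrative only; it is not drawn from any reviewed study, and the example labels are hypothetical) computes accuracy, precision, recall, and F1 from binary predictions, where 1 denotes dementia-positive and 0 denotes a healthy control:

```python
def classification_metrics(y_true, y_pred):
    """Compute standard binary classification metrics from label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Precision: correct positives out of all predicted positives.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: correct positives out of all actual positives.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels for six participants (1 = dementia, 0 = HC).
metrics = classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

In practice, the reviewed studies typically relied on established toolkits to compute these scores, but the underlying arithmetic is as shown.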

Synthesis methods
After the data were extracted following the sequence described in Section 2.6, we further synthesized them using categorical groupings and graphical displays.
Table 1. Quality assessment questions.

1. Are the aims clearly stated?
2. Was the study designed with their research questions in mind?
3. What population was studied?
4. Was the sample representative of the studied population?
5. Was the population size reported?
6. Was there a comparison system or control group?
7. Did the researchers explain the data types?
8. Was the technology clearly defined?
9. Were the technological methods clearly described?
10. Were the measures in the study fully defined?
11. Was statistical significance assessed?
12. Have all study questions been answered?
13. Were negative findings presented?
14. What implications did the report have for practice?
Results

This section reports the results obtained using the previously explained methodology. Information on the publication years and research countries of the resulting studies is also provided.

Search and screening
The search results are shown in Fig. 1. The initial database search yielded 3150 papers across all databases. Initial title and abstract screening and duplicate removal resulted in 216 studies. After applying the eligibility and inclusion criteria as well as the quality assessment explained previously and conducting full-text screening, 72 studies were selected for detailed analysis.

Basic publication information
Fig. 2 shows the distribution of the selected studies by publication year across the period considered in this review. Although we reviewed studies from 2012 to 2022, research meeting our inclusion criteria first emerged in 2015, and 2021 and 2022 were the most productive years in the field.
Table 2 displays the countries associated with the research organizations mentioned in the selected papers, as determined by the affiliations of all authors involved in the studies. We found that 41 papers had research teams from four developed native English-speaking nations, while 39 involved research teams from twenty-three non-English-speaking regions. Except for one study conducted by industry, all papers were published by researchers from universities, educational institutes, or their laboratory collaborations.

Data synthesis
This section provides detailed information on and analyses of the 72 selected studies. We describe the reported research focuses, the datasets, the input features and representations used for processing, the applied DL models, and the results achieved by the selected research.

Research focus areas
We observed and categorized four aspects of research focus identified by the selected studies.

Methodology
By studying the selected papers, we identified three main methodological focuses for the automatic diagnosis of dementia using speech and language data with DL. First, more than half of the selected papers showed that using DL to build SLP-based dementia diagnosis systems could be more efficient, although they recognized that traditional ML algorithms also provided promising results. Second, the choice of modality for dementia detection is debatable. By modality, we refer to whether DL technologies process acoustic features, linguistic features, or both in multimodal systems. More than 61 % of the selected papers emphasized modality choices in their experiments. The choice of modality was also influenced by the datasets and data types available to researchers for their experiments. The third focus was on using manual transcriptions and manually extracted features versus automatically extracting features via embedding techniques and automatic speech recognition (ASR)-produced transcriptions. A focus on automatically generated transcriptions or features in a DL system can assist in constructing end-to-end systems that fully automate the process. Embedding techniques refer to methods used in natural language processing and machine learning to represent words, sentences, or documents as mathematical vectors that capture semantic and contextual relationships between words. They enable computational models to understand and analyze textual data more effectively.

Data
Although speech data are cheaper and easier to collect than other types of data, obtaining clinical speech data requires complex processes for data collection, privacy management, and ethical approval, which add difficulties to data collection and processing. Reviewing the literature, we identified five types of data-related limitations that require technological resolution.

Implementation
Computational implementation was the primary focus of the selected studies. Researchers have attached significance to the computational realization and implementation of knowledge and theories derived from standard clinical diagnostic research in automatic dementia diagnosis. Specifically, in the selected studies, speech impairment and language misuse were believed to be detectable by implementing digitalized acoustic and linguistic features for automatic processing. Most studies emphasized the importance of searching for, digitalizing, or processing features. Correspondingly, implementing these features with automatic speech processing (Lopez-de-Ipina et al., 2018; Rosas, 2019), NLP techniques (Chen et al., 2019; Pappagari et al., 2021), or a combination of both (Qiao et al., 2021; Sarawgi et al., 2020) has been a major research focus. Researchers believe that pauses during speech are significantly informative in detecting dementia via speech (Rohanian et al., 2021). Second, 46 % of the selected studies held that different acoustic and linguistic features should be extracted automatically using computational algorithms (Ammar, 2018; Vats et al., 2021; Yuan et al., 2020). Third, researchers focused on applying and selecting suitable DL algorithms and architectures (Pan et al., 2021; Zhu et al., 2021). Finally, Lindsay et al. pointed out that the application of these automatic diagnostic methods in clinical diagnostic settings has yet to be studied (Lindsay et al., 2021). However, Rekha et al. demonstrated that their system could be considered indispensable for hospitals to adopt (Rekha et al., 2022).

Verification
First, verification focused on evaluating the trained models' predictions of cognitive impairment-related tests and scores commonly adopted to identify dementia stages, such as the Mini-Mental State Examination (MMSE) score (Folstein et al., 1983). The MMSE is a widely used cognitive screening test for assessing cognitive impairment and determining the severity of cognitive decline. In this respect, the Root Mean Squared Error (RMSE) was most commonly used to measure model performance, given its ability to measure the average difference between the values predicted by a model and the actual values. As mentioned in Section 2.8, other standard DL classification metrics were used to measure model performance when models were trained on picture descriptions or similar diagnostic tasks.
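As a brief illustration of the RMSE metric described above (the function and the example MMSE scores below are hypothetical, not taken from any reviewed study):

```python
import math

def rmse(predicted, actual):
    """Root Mean Squared Error: the square root of the mean squared
    difference between predicted and actual values."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Hypothetical predicted vs. clinician-assessed MMSE scores (scale 0-30)
# for four participants.
error = rmse([28.0, 21.0, 15.0, 26.0], [30.0, 20.0, 12.0, 27.0])
```

A lower RMSE indicates that the model's predicted MMSE scores lie closer, on average, to the clinically assessed scores.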
Another challenge for researchers is verifying the diagnostic SLP results as a tool for dementia diagnosis. This verification is challenging because of the difficulties caused by cross-database benchmarking (Pompili et al., 2020), and the different features used for processing may require various evaluation metrics (Lindsay et al., 2021; Pompili et al., 2020). Therefore, although digitalized methods make dementia detection more objective, the researchers of the selected studies explored ways to standardize the speech and language data collection process and evaluate the results generated by DL models.

Experimental aims
In general, we found that the researchers of the selected studies sought solutions for three sets of tasks: 1) discriminating dementia patients from healthy control (HC) subjects, 2) predicting the severity or stages of dementia in patients, and 3) developing language- and task-agnostic systems. Multiple studies attempted to solve more than one task simultaneously (Pérez-Toro et al., 2022). Among these tasks, the development of language- and task-agnostic systems has rarely been studied.
The main purpose of detecting dementia using SLP was to automatically classify and label the speech and language data collected from the participants as HC or dementia. For instance, the third task of the 2021 ADReSSo Challenge, completed by some of the selected papers (Z. Liu et al., 2021; Rohanian et al., 2021; Syed et al., 2021; Zhu et al., 2021), was to predict the progression of cognitive decline within a given timeline; this task also required models to produce binary outputs. When data on more cognitive conditions were available, multiclass classifications were considered (Padhee et al., 2020), such as discriminating between AD, MCI, and HC, or between Probable AD, Possible AD, MCI, and HC cases. Among the selected papers, only Farzana and Parde (2020) focused not on classification but on regression tasks. Predicting the severity or stages of dementia in patients can be framed as a regression task in which DL algorithms are applied to predict MMSE scores (Koo et al., 2020; Z. Liu et al., 2021; Searle et al., 2020).

Speech datasets for dementia detection
Appendix 1 summarizes the datasets used in the surveyed studies. Note that we did not include the PGA-CITA dataset (CITA Alzheimer, 2012), one of the datasets used by Lopez-de-Ipina et al. (2018), because we could not identify the language(s) of the dataset. Audio samples, occasional transcriptions, and videos were obtained from the datasets.
Notably, using multiple datasets for experiments and comparisons was common among the selected studies. Among the datasets used, the DementiaBank database (DementiaBank, 2007) was used more often than the others. The dataset includes audio recordings of the Cookie Theft Picture Description Task and Mini-Mental State Examination (MMSE) scores. Participants were mainly labeled as AD or HC after performing the required diagnostic tasks. However, MCI and the types of speech impairments of different speech tasks can also be found in this database. Among the subsets of the DementiaBank database, the age-, sex-, and label-balanced INTERSPEECH 2020 ADReSS (Luz et al., 2020) and 2021 ADReSSo (Luz et al., 2021) datasets sampled from the Pitt Corpus were the most frequently used.
It is worth noting that the study conducted by Guo et al. (2021) was the only study using the Wisconsin Longitudinal Study (WLS) corpus (Herd et al., 2014) that met our inclusion criteria. The authors labeled participants with a verbal fluency score below a threshold as "Cases," given that dementia-related diagnoses were yet to be provided in the WLS. Therefore, our review adopts the same terminology when referring to WLS participants.
There were also multilingual and non-English datasets, namely the AZTIAHO dataset (K. López-de-Ipiña et al., 2015), the iFLY (N. Liu et al., 2021) and NCMMSC (Ying et al., 2022) Chinese datasets, a Chilean Spanish dataset (Sanz et al., 2022), and the EIT Digital French dataset (König et al., 2018), which enabled experiments that process multiple languages. They were included for comprehensiveness because some selected studies used them in cross-dataset experiments with English datasets, which meant they met our eligibility criteria.
Other datasets also allowed the selected studies to include participants from different backgrounds because of the different methods and standards applied when collecting speech and language data. To expand the DementiaBank dataset for training DNNs, out-of-domain non-dementia datasets, such as the Wall Street Journal (WSJ) (Paul and Baker, 1992) and Visual Storytelling (VIST) (Holzinger, 2016), were used by Kong et al. (2021). The model-agnostic English GLUE Benchmark (Wang et al., 2018) was applied to evaluate the experiment conducted by Youxiang Zhu et al. (2022). Finally, although open-access datasets were available, the public availability of several datasets used in the selected studies was unclear.
Except for the unnamed datasets, we summarize in Appendix 2 the participants' background information that clinicians could consider when diagnosing dementia. Appendix 3 provides further information on the datasets in the context of SLP. The majority of the data were collected using picture description tasks and spontaneous speech, except for Soni et al. (2021), who used a verbal fluency task, and Alkenani et al. (2021), who built a multimodal fusion system on natural speech datasets and a handwriting dataset called the Alzheimer's Disease Blog Corpus (ADBS) (Masrani et al., 2017).
Since DL algorithms typically require big data, the amount of data in each dataset used in the selected studies was considerably small for DL. Even though the DementiaBank and FHS datasets are larger than the others, their data contain significant noise, which requires preprocessing procedures that lead to potentially less usable data. Among the datasets, only ADReSS and ADReSSo were demographically balanced. Therefore, many researchers experimented with demographically unbalanced samples.

Input features and embeddings
Features and embeddings represent the extracted attributes and representations derived from speech and language data, which are subsequently modeled and mapped to enable diagnostic predictions made by DL algorithms.Below, we describe and list the features and embeddings used in the reviewed studies.

Acoustic and linguistic features
The selected papers reported extracting and using various linguistic and acoustic features to build DL models. Appendix 4 lists the features summarized from the selected studies, along with the embeddings generated by non-DL algorithms. It also identifies the studies that manually provided features or embeddings to the DL algorithm rather than using embeddings extracted and learned automatically by DL algorithms.
Although not all participants' information was provided, demographic features such as age and gender were still used for processing in a few of the selected papers, in addition to acoustic and linguistic features (Mahajan and Baths, 2021; Yangyang, 2020).
Many studies were restricted to a limited set of features, whereas some considered many relevant features (Warnita, 2018). According to the surveyed papers, both acoustic and linguistic features and embeddings were informative in indicating signs of dementia. However, using various extracted features requires professional domain knowledge (i.e., speech processing or linguistics) and further feature selection procedures.

Deep learning embeddings
As previously stated, DL embeddings provide automatically generated informative representations from raw data and can be used directly as inputs to DL neurons. Speech embeddings are fixed-size acoustic representations of speech sequences. Language embeddings follow a similar mechanism but represent words, phrases, and sentences for processing natural language text. Embeddings are often pre-trained before they are applied to target datasets. Table 3 summarizes the embeddings pre-trained using DL algorithms in the surveyed literature.
Language embeddings were applied more frequently than speech embeddings. VGGish (Hershey et al., 2017), a deep CNN pre-trained for large-scale audio classification tasks, was used in four studies. The X-vector (Snyder et al., 2018), an embedding that uses Mel-Frequency Cepstral Coefficients (MFCCs), was utilized in five studies. MFCCs are acoustic features that capture the spectral characteristics of speech signals and provide a compact representation suitable for analysis and modeling. Wav2Vec (Schneider et al., 2019), a pre-trained speech representation that uses CNNs with unsupervised methods, and its updated version Wav2Vec 2.0 (Baevski et al., 2020), a self-supervised speech representation, were applied in eight studies; in addition to representing acoustic features, Wav2Vec and Wav2Vec 2.0 also served as ASRs. The TRILL (Shor et al., 2020) and Allosaurus (Li et al., 2020) models were also applied to better represent acoustic information.
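To illustrate the mechanism behind MFCCs, the following is a simplified NumPy sketch of the standard pipeline (framing, power spectrum, triangular mel filterbank, log compression, and DCT). The parameter values (25 ms frames, 26 mel filters, 13 coefficients) are common defaults rather than those of any reviewed study, and the random signal merely stands in for real speech.

```python
import numpy as np

def mfcc_sketch(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Simplified MFCC pipeline: framing -> power spectrum -> mel filterbank -> log -> DCT."""
    # Frame the signal into 25 ms windows with 10 ms hops, Hamming-windowed.
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Power spectrum of each frame (zero-padded FFT).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filterbank spaced evenly on the mel scale.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies into cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T

rng = np.random.default_rng(0)
sig = rng.standard_normal(16000)   # one second of noise as a stand-in for speech
coeffs = mfcc_sketch(sig)          # shape: (frames, coefficients)
```

Each row of the result is a compact 13-dimensional summary of one 25 ms speech frame, which is the representation consumed by embeddings such as the X-vector.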
Among the language embeddings, Bidirectional Encoder Representations from Transformers (BERT) (Devlin and Chang, 2018; Devlin et al., 2018) was the most frequently implemented model, appearing in 31 studies, followed by RoBERTa, which appeared in eight papers. Other variants include transformers (Vaswani et al., 2017), Transformer-XL (Dai et al., 2019), and Enhanced Representation through Knowledge Integration (ERNIE) (Sun et al., 2019). Note that "Generic Word Embeddings" (Jain et al., 2020) is not included in the table because of the lack of experimental studies and of a sufficient explanation of its mechanism.

Diagnostic performances
DL uses neural layers and algorithms to decompose the complexities observed in the features and representations extracted from the data as the hierarchical neural layers become deeper. Once the information is learned and refined, DL algorithms recompose this "knowledge" into representations and make predictive decisions accordingly; in dementia diagnosis, the decision generates the diagnostic result. DL models use different types of neurons and layers and are usually designed manually before they can make predictions automatically. This section reviews the different DL architectures and systems constructed in the selected studies. Appendix 5 summarizes the top performances reported by the authors of the selected studies; it also shows the tasks, datasets, and feature types used for processing, the neural networks constructed, the testing techniques applied, and the performances achieved in all 72 studies. We found that 34 studies considered linguistic features, language embeddings, or language models by examining the participants' language, indicating that researchers regard spoken language as particularly informative for manifesting dementia-related characteristics. Language embeddings such as BERT were used more frequently than acoustic or linguistic features, and in some instances very high performance was reported using BERT. For example, using an unnamed dataset from a referential communication task, Z. Liu et al.
achieved an accuracy of 99.8 %, a sensitivity of 99.7 %, and a specificity of 100 % on the AD-HC classification task using features from combined transcripts of all familiar and unfamiliar images (Z. Liu et al., 2022). The constructed model was rather simple: the language embedding output by BERT was passed to a ReLU and then a Sigmoid dense layer (ReLU and Sigmoid are two neuron activation functions that squash neuron outputs into specific ranges). However, because the availability of the dataset is unknown, replicating the experiment or benchmarking the model's efficacy is unlikely. Nonetheless, introducing new datasets with tasks other than the Cookie Theft picture description task demonstrates that there could be more effective ways to explore new solutions. It is important to note that data and data collection methods could be key to addressing this issue.
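To make the simplicity of such a classification head concrete, the NumPy sketch below shows how a sentence embedding passes through a ReLU dense layer and then a Sigmoid output to yield an AD probability. It is illustrative only, not the authors' implementation: the weights are randomly initialized stand-ins for trained parameters, and a random 768-dimensional vector stands in for a BERT embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # ReLU squashes negative activations to zero.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical stand-in for a 768-dimensional BERT sentence embedding.
embedding = rng.standard_normal((1, 768))

# Randomly initialized weights; in a real system these would be learned
# end-to-end on the AD/HC transcripts.
W1, b1 = rng.standard_normal((768, 64)) * 0.02, np.zeros(64)
W2, b2 = rng.standard_normal((64, 1)) * 0.02, np.zeros(1)

hidden = relu(embedding @ W1 + b1)      # ReLU dense layer
prob_ad = sigmoid(hidden @ W2 + b2)     # Sigmoid output: probability of AD
label = "AD" if prob_ad[0, 0] >= 0.5 else "HC"
```

The entire classifier on top of BERT is thus two matrix multiplications and two activation functions, which underscores that the reported performance derives largely from the pre-trained embedding rather than from the head's architecture.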
Multi-Layer Perceptrons (MLPs) are ANNs composed of multiple layers of interconnected artificial neurons organized into input, hidden, and output layers (Shahamiri et al., 2022). Among the studies that considered MLPs, López-de-Ipiña et al. (2015) achieved an accuracy of 96.89 % on the multilingual AZTIAHO dataset using a simple MLP model, which is close to the best performance achieved by traditional language models to date. The stacked fusion model with MLPs proposed by Alkenani et al. (2021) achieved an accuracy of 97.37 % and an F1 score of 97.67 %, similar to the 97.18 % accuracy and 97.09 % F1 score achieved by BERT models (Liu et al., 2021). Appendix 5 also indicates whether the transcripts were produced automatically via ASR; if not indicated, manual transcription was used to obtain the raw text materials for linguistic/language models when applicable. An advantage of combining ASR and BERT is the construction of end-to-end systems that offer more automation, an approach adopted in nine of the selected studies.
Acoustic and linguistic features and embeddings were considered in 26 of the selected studies. The best reported accuracy was 94.4 %, using a best-first greedy algorithm with an MLP model (Sadeghian et al., 2017). On average, most models achieved accuracies greater than 80 % with both feature types. Many authors have pointed out that these two feature types can provide complementary information when used together; however, linguistic features and language models are likely the most informative tools for diagnosing dementia from speech. Notably, Bertini et al. (2022) converted acoustic features into spectrogram images, which were then processed to detect dementia, achieving an accuracy of 93.3 %, an F1 score of 88.5 %, a precision of 90.7 %, and a recall of 86.5 %. A spectrogram is a visual representation of the spectrum of frequencies in a time-varying signal, such as an audio signal, as it evolves over time. Thus, an emerging trend in 2022 is the combination of visual acoustic features (via spectrograms) and linguistic features to diagnose dementia.
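As an illustration of the spectrogram representation described above, the following NumPy sketch computes a magnitude spectrogram via a windowed short-time Fourier transform; the frame and hop sizes are arbitrary illustrative choices, and the pure tone stands in for recorded speech.

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping windowed frames and transform each one.
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 8000
sig = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone
spec = spectrogram(sig)   # energy concentrates in the bin nearest 440 Hz
```

Rendered as an image (e.g., on a log scale), such an array is exactly the kind of visual input that CNN-based approaches like Bertini et al.'s process in place of raw audio.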
For MMSE score prediction, the lowest RMSE of 3.74 was reported by Koo et al. (2020) on the ADReSS dataset with manual transcriptions, the best performance reported for this task. It was followed by an RMSE of 3.76 on the ADReSSo dataset with audio data only (Z. Liu et al., 2021). Generally, DL performs very well in both the binary classification and MMSE score prediction tasks.
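For reference, RMSE here is the root-mean-square difference between the true and predicted MMSE scores, so an RMSE of 3.74 means predictions are off by roughly 3.7 points on the 0-30 MMSE scale on average. A minimal illustration, with made-up scores rather than data from any reviewed study:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted MMSE scores."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only; MMSE scores range from 0 (severe) to 30 (normal).
true_scores = [28, 21, 17, 25, 9]
predicted   = [26, 24, 15, 25, 13]
print(rmse(true_scores, predicted))  # ≈ 2.57
```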
Finally, among the language- and task-agnostic studies, an accuracy of 96.89 % was achieved on a multilingual dataset (López-de-Ipiña et al., 2015). Lindsay et al. (2021) also reported AUCs of 85 % and 84 % on monolingual and English-French multilingual datasets, respectively.
Notably, Studies 71 and 72 are presented as one study in Appendix 5 because the models, datasets, and results reported in the two experiments were identical.

Discussion
Among the 72 selected studies, six were also reviewed by Petti et al. (2020), and four overlapped with the systematic review conducted by Garcia et al. (2020). A few DL-related studies covered by previous reviews did not meet our inclusion criteria and were excluded from our study, either because they considered non-English corpora or because they did not focus on dementia. Nevertheless, the findings of our systematic search indicate a surge in interest in applying SLP with DL to diagnose dementia, considering that more than two-thirds of the identified studies were published after 2020.
To address the first research question, we identified four major research foci: methodology, data, implementation, and verification. We also identified three main tasks/aims among the studies: dementia-HC classification, regression for predicting the severity of dementia using MMSE scores, and language- and task-agnostic studies. To address the second and third research questions, related to the DL techniques applied and the performances achieved in SLP with DL, we summarized the datasets, features and embeddings, and DL models used, along with the best performances reported. One trend we observed was the greater automation that DL offers for diagnosing dementia from speech and language, which could also increase the accessibility of diagnosis.
This section explains the empirical findings from the data synthesis conducted on the 72 analyzed studies. In particular, we explain the progress made by SLP with DL in detecting dementia with respect to the research foci identified in the area. We also present the research gaps that are of interest to researchers. Finally, we summarize the contributions of our systematic review.

Datasets and demographic information
As explained previously, the amount of publicly available standardized speech and language data that can be used to detect dementia with DL systems is insufficient. The trend we noticed in the literature was the use of the open-source DementiaBank database and its subsets, particularly the balanced 2020 ADReSS and 2021 ADReSSo datasets, demonstrating that these datasets have become a standard in the field. All the previously mentioned datasets are suitable for SLP with DL, although many do not indicate public availability. However, data remained scarce relative to the demands of data-intensive DL architectures, which led several studies to experiment with more than one dataset.
It was also common among the surveyed DL studies to adopt spontaneous speech obtained from the Cookie Theft picture description tasks.
This approach is similar to the findings reported by Garcia et al. (2020), in which non-deep-learning algorithms were mostly analyzed on picture-description speech datasets. However, Z. Liu et al. (2022) reported that experimenting with referential communication tasks and using features from combined transcripts of all familiar and unfamiliar images achieved an accuracy of 99.8 % on the AD-HC classification task. Therefore, it is highly recommended that researchers consider different speech tasks when collecting data for dementia diagnosis.
Regarding multimodality, some datasets provide only audio data, posing a challenge for building multimodal systems. However, this challenge inspired some studies to use spectrogram images converted from audio data and to consider transcripts generated by ASR systems. Another advantage of DL is its ability to use pre-trained embeddings, which can ease issues caused by domain-specific data sparsity and enable a streamlined pipeline via end-to-end architectures for building diagnostic speech and language technologies. For instance, over two-thirds of the studies considered state-of-the-art ASR technologies using transformers and embeddings.
Although building very deep DNNs for dementia detection is limited by data scarcity, the use of data from different modalities indicates that constructing multimodal systems beyond SLP can be the next research step. Similarly, building models with augmented speech and language data sources, such as clinician notes (Myszczynska et al., 2020), is also worth experimenting with. Therefore, data gaps and techniques to augment the available data, generate synthetic data, and leverage transfer learning are worth exploring in future studies. In this regard, DL has proven effective in constructing diagnostic speech and language technologies for dementia diagnosis, at least on the datasets reported here.
We noticed that a few studies, such as Sadeghian et al. (2017), Yangyang (2020), and Kong et al. (2021), included demographic features such as age and gender during model training and reported improved performance, whereas other studies may have neglected valuable patient demographic information. Given that clinicians value such information during the diagnostic process, future researchers should study whether including demographic features in addition to speech data can further improve the generalization capabilities of DL models. Additionally, analyzing classification results with respect to the patient's gender, ethnicity, socioeconomic status, first language, level of education, or other patient-specific features that may confound a clinical diagnosis of AD can help researchers better understand the results and study potential bias in DL models.
Regarding dementia detection tasks, few studies investigated decline prediction, although approximately half attempted to distinguish the stages of dementia using MMSE scores. Hence, another useful future direction is to use automated SLP to identify and predict patients who will develop the disorder. As stated previously, Task 3 in the ADReSSo Challenge addressed cognitive decline by predicting changes in cognitive status over time for a given speaker, a task made possible because the speech data in the DementiaBank Pitt Corpus were collected as a cohort study. However, only a few participating teams (Studies 29, 41, 44, and 47 in Appendix 5, noted by Decline Inference) engaged in this task, and the results were weaker than for the other tasks in the ADReSSo Challenge. Interested researchers could advance in this direction and study whether SLP can detect subtle speech and language abnormalities that could indicate MCI in the general population or identify patients with MCI who may further develop AD. We recommend that public challenges such as ADReSS and ADReSSo be held more frequently to encourage more researchers and developers to study the issue and to propose and experiment with possible solutions to a target problem. We observed a decline in 2022 in the variety of tasks tested beyond the AD-HC classification task.
Almost all the reviewed studies performed binary AD-HC classification; however, few explored MCI cases or performed multiclass classification to discriminate MCI from AD and HC, which is consistent with the observations reported by Garcia et al. (2020). Therefore, dementia progression modeling and MCI detection should be explored further. The detection of MCI should be considered for two reasons. First, a quick, accessible, and accurate diagnosis of early AD and MCI is important for managing early cognitive concerns, which clinicians themselves can find challenging (Edmonds et al., 2016). Second, satisfactory performance on this task is yet to be achieved (Xue et al., 2021).
Finally, exploring various types of dementia deserves more attention, considering that almost all the dementia datasets used in the reviewed studies mainly include AD, the leading cause of dementia. Similarly, experiments using multiple languages and different English accents should be considered.

Implementations and performances
Firstly, the use of acoustic features, linguistic features, or both was recommended in the review conducted by Garcia et al. (2020). Our selected studies also applied various feature extraction methods, especially embedding techniques, indicating that searching for and implementing the most informative digital acoustic and linguistic feature representations remains significant to researchers, as shown in Appendix 4 and Table 3. However, non-embedding approaches are less attractive because they require manual intervention and domain expertise. We also observed a new trend in 2022 of converting acoustic features into visual features; researchers would then either process the visual data alone or combine it with acoustic or linguistic features to detect dementia indicators in speech.
In addition, although broader ML techniques were the focus of the study conducted by Petti et al. (2020), the authors included DL classifiers in their review because DL is a branch of ML, and they covered a few deep networks that overlapped with our selected studies. The findings of their review also demonstrated the advantage of neural networks in achieving better average accuracies than traditional ML methods. Moreover, the ability of DL models to leverage pre-trained embeddings further enables DL techniques to obtain better overall efficacy in diagnosing dementia, which indicates the improvement of automated SLP and the advantage of applying DL techniques. Notably, the studies we analyzed had not yet considered the rapid advancement of large neural language models, such as the GPT models (Brown et al., 2020) and ChatGPT (OpenAI, 2022). These technologies could enable researchers to provide more intuitive and useful tools to support dementia diagnosis and automated memory tests using a more natural conversational method.
The performance achieved by multimodal systems with acoustic and linguistic features for AD-HC classification shows that the two feature types can compensate for each other. We also noticed that linguistic features and language models, especially DL embeddings, are more capable of learning dementia characteristics in speech than acoustic or multimodal approaches. However, a challenge for the linguistic approach is acquiring accurate transcriptions: manually transcribed texts allow better processing by language models (Cummins et al., 2020) but significantly increase processing costs. Multiple experiments applied ASRs to generate transcripts, especially on the audio-only ADReSSo datasets. However, considering that dementia can degrade speech quality and lead to inaccurate ASR transcription, future studies should further explore which ASRs to use and what makes an ASR most suitable for processing speech from older people, especially those with different levels of cognitive impairment.
Regarding the best performance achieved in AD-HC classification, Z. Liu et al. (2021) presented an end-to-end DL model utilizing ASR-generated transcripts and BERT language embeddings, resulting in an accuracy of 97.18 %. A micro-F1 of 81 % for 3-class (AD-MCI-HC) classification and 80 % for 4-class (Probable AD-Possible AD-MCI-HC) classification was reported using BERT language embeddings and an SVM classifier (Padhee et al., 2020). Nevertheless, despite these decent performances, none of the selected studies conducted clinical investigations to evaluate their models, measure how well they generalize in clinical settings, or assess their acceptability among clinicians, patients, and caregivers. Thus, verifying and refactoring these technologies in clinical trials is recommended.
We also identified another problem related to the lack of measurable explainability in the DL models constructed in the selected papers. While reviewing these studies to determine whether they explicitly addressed explainability, we found that although most of the 72 reviewed studies offered qualitative interpretations and explanations of their DL models, architectures, and reported performances, only four mentioned explainability (Bertini et al., 2022; Ilias et al., 2022; Hali Lindsay et al., 2021; Zheng et al., 2022). None of these studies, however, measured relevant metrics such as Interpretable Architectures, Counterfactual Explanations, and Uncertainty Estimation (Chaddad et al., 2023).
The interpretability and explainability of DL models could be crucial for their practical use and acceptance in clinical settings, as they provide transparency to healthcare professionals and patients. Despite the impressive performance of the DL models reported in this study, their inherent complexity makes it challenging to understand how they generate predictions. Clinicians themselves may not always be able to explain the reasons behind their diagnoses, and historically, some highly effective drugs, such as aspirin and penicillin, were widely used before their mechanisms of action were fully understood (Wang et al., 2020). In line with this, influential DL researcher Geoff Hinton has argued against the insistence on explainability by clinicians and regulators, stating that "People can't explain how they work for most of the things they do" (Google's AI Guru Wants Computers to Think More Like Brains, 2018). This raises the question of whether artificial intelligence models should be held to a higher standard of explainability than drugs and physicians, a topic that remains an open debate, as mentioned by de la Fuente Garcia et al. (2020). We acknowledge that Explainable Deep Learning and AI are ongoing research questions that require further investigation.
Finally, regarding the evaluation of DL systems, most of the surveyed studies conducted cross-validation. We also recommend that researchers consider cross-database validation, which could further establish confidence in the generalizability of the results; however, as shown in Appendix 5, very few studies have considered this. Classification accuracy is the most widely used metric, but other important metrics, such as specificity, sensitivity, and AUC-ROC, are not always provided. Similarly, when a dataset is class-imbalanced, per-class metrics could be reported to improve confidence in the results, considering that standardized evaluation metrics are very important for research comparison and benchmarking. Hence, future studies should include these factors in their evaluations.
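To make the per-class reporting recommendation concrete, the sketch below computes one-vs-rest sensitivity (recall) and specificity for each class from predicted labels; the AD/MCI/HC labels are invented purely for illustration.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, classes):
    """One-vs-rest sensitivity and specificity for each class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    out = {}
    for c in classes:
        tp = np.sum((y_true == c) & (y_pred == c))   # correctly detected c
        fn = np.sum((y_true == c) & (y_pred != c))   # missed c
        tn = np.sum((y_true != c) & (y_pred != c))   # correctly rejected c
        fp = np.sum((y_true != c) & (y_pred == c))   # falsely labeled c
        out[c] = {"sensitivity": tp / (tp + fn),
                  "specificity": tn / (tn + fp)}
    return out

# Illustrative 3-class labels, not data from any reviewed study.
truth = ["AD", "AD", "MCI", "MCI", "HC", "HC", "HC", "AD"]
preds = ["AD", "MCI", "MCI", "HC", "HC", "HC", "AD", "AD"]
metrics = per_class_metrics(truth, preds, ["AD", "MCI", "HC"])
```

Reporting such a table alongside overall accuracy makes it immediately visible when a model achieves high accuracy simply by favoring the majority class of an imbalanced dataset.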

Summary of contributions
We summarize the contributions of this systematic review in six main aspects: 1) We comprehensively searched and selected studies published between 2012 and 2022 on the development of diagnostic DL-supported SLP for detecting dementia in English. 2) We summarized and listed the research foci considered by researchers for the automatic detection of dementia via SLP with DL. 3) We comprehensively listed and described the speech datasets applied to diagnose dementia and the features and DL embeddings used to build neural networks. 4) We provided a detailed list of the DL architectures constructed for diagnosing dementia via SLP. 5) We compared and analyzed the outcomes of these DL systems. 6) We identified and reported the research gaps and recommendations for future studies.

Conclusion
In this study, we systematically followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses protocol to review research on the diagnosis of dementia using speech and language processing, focusing on deep learning technologies. We systematically synthesized and discussed the findings of the 72 selected studies. By summarizing and analyzing the research foci, aims, datasets, and techniques used for building deep learning systems, along with the reported performances, we conclude that deep learning has clear advantages in engineering features from speech and language data for building diagnostic speech technologies and detecting dementia. State-of-the-art performances were achieved compared with more traditional machine learning approaches. Therefore, we believe that deep learning has pushed research on automated dementia diagnosis via speech and language technologies in English to the point of being ready for clinical investigation. We recommend that researchers interested in this subject continue investigating the remaining gaps, which arose from insufficient data and insufficient demographic information about the participants in the data collection.
Furthermore, exploring novel approaches to constructing automated end-to-end systems would be beneficial. We also recommend that future studies consider per-class metrics to better portray the performance and reliability of their systems. Finally, our review shows that detecting more types of dementia and distinguishing mild cognitive impairment from both dementia and healthy individuals require further investigation to achieve better performance.

Declaration of generative AI and AI-assisted technologies in the writing process
During the preparation of this study, the authors used ChatGPT to improve the readability and language of the manuscript.After using this tool, the authors reviewed and edited the content as needed and took full responsibility for the publication's content.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table 2
Countries of research organizations.

Table 3
Deep learning embeddings used in the selected papers.