The use of artificial intelligence systems in diagnosis of pneumonia via signs and symptoms: A systematic review

Artificial intelligence (AI) systems that use symptoms and signs to detect respiratory diseases may improve diagnosis, especially in resource-limited settings. Heterogeneity among such AI systems creates an ongoing need to analyse their performance to inform future research. This systematic literature review aimed to investigate the performance and reporting of diagnostic AI systems that use machine learning (ML) to detect pneumonia based on symptoms and signs.


Introduction
Pneumonia is a form of acute lower respiratory infection, generally characterized by symptoms such as fever, chills, cough with sputum production, chest pain and shortness of breath [1]. Many factors affect how serious pneumonia is, such as the type of pathogen causing the lung infection, age, and overall health status. Pneumonia tends to be more serious for children under the age of five, adults over the age of 65, people with certain conditions such as heart failure, diabetes, or chronic obstructive pulmonary disease (COPD), and people who have weakened immune systems due to HIV/AIDS, chemotherapy (a treatment for cancer), or organ or blood and marrow stem cell transplant procedures [2].
When an individual has pneumonia, the alveoli, the small air sacs within the lungs, fill with pus and fluid, which makes breathing painful and limits oxygen exchange [3]. There are more than 30 different causes of pneumonia, and they are grouped accordingly: bacterial pneumonia, viral pneumonia, mycoplasma pneumonia and other pneumonias. Moreover, pneumonias can also be categorized as community-acquired (CAP); hospital-acquired (HAP), excluding ventilator-associated pneumonia [4]; pneumonia occurring in immunocompromised patients, such as those with human immunodeficiency virus (HIV) infection (see Pneumocystis jirovecii pneumonia [5]); or aspiration pneumonia, which occurs when large volumes of upper airway or gastric secretions enter the lungs [6][7][8][9]. An accurate definition and diagnosis of pneumonia is contentious for several reasons [2]: the low specificity of symptoms of lower respiratory tract infections; the difficulty of identifying the underlying pathogen in individual patients; and the lack of widespread availability of laboratory tests and imaging. Diagnosis is suggested by a history of cough, dyspnoea, pleuritic pain, or acute functional or cognitive decline, with abnormal vital signs (e.g., fever, tachycardia) and lung examination findings, and should be confirmed by chest radiography or ultrasonography.
This uncertainty and the above-mentioned categorizations lead to empirical treatment selection. Pneumonia is a leading cause of hospitalization in both children and adults. Most cases can be treated successfully, although full recovery can take weeks [2]. In many instances pneumonia is severe, requiring hospitalization, and in some cases people with severe disease need to be treated in intensive care units (ICUs). In recent decades, new complications of viral infections by coronaviruses have been identified. Coronaviruses are a large family of viruses that cause illness ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). SARS-Coronavirus 2 (SARS-CoV-2) is a new strain first identified in humans in 2019 that causes Coronavirus Disease 2019 (COVID-19), which can spread to the lungs, causing pneumonia. It presents predominantly with fever, persistent cough, fatigue, dyspnoea, and loss of smell and taste [10][11]. While many people recover, some develop severe acute respiratory syndrome requiring hospitalisation, with escalation to intensive care support with oxygen and mechanical ventilation and, in some cases, death [12]. A novel approach to improving the diagnosis and prognosis of pneumonia is the use of biomarkers [13][14]. The diagnostic and prognostic roles of procalcitonin (PCT) and mid-regional pro-adrenomedullin (MR-proADM) have been investigated in patients with pneumonia, with high positive predictive value.
Confirmation of pneumonia is not trivial and will be by nature context dependent, relying on a combination of what is available from clinical presentation, laboratory tests and diagnostic imaging.
Recent studies push towards the adoption of artificial intelligence (AI) models to improve diagnostic accuracy in radiology [15]. However, radiography has several disadvantages: low sensitivity to early-stage pneumonia, lack of standardised interpretation [16], inter-rater variability [17][18], absence of abnormalities in the chest radiographs of some children [19] and potential harm from x-ray exposure. The biggest shortfall is that radiography is not widely available in low-income settings, which carry the highest disease burden.
The breadth of challenges to diagnosing pneumonia, especially in low- and middle-income countries (LMICs), highlights the potential benefit of a specific, sensitive diagnostic tool for pneumonia. In particular, a major issue is mistaken diagnosis of respiratory diseases due to overlapping symptoms, which may offer similar clinical presentations but have differing underlying causes and respond best to different treatments [20]; for example, pneumonia may be caused by bacteria and require antibiotics, whereas viruses are the most likely cause of bronchitis [21].
One subset of AI, known as machine learning (ML), which is able to learn, reason, and self-correct without explicit programming, has the potential to provide such a solution. ML could play a major role within the practice of clinical medicine. Moreover, in the last few decades a particular subset of ML, so-called deep learning (DL) based on artificial neural networks (ANNs), has been expanding the potential of ML in clinical practice [22].
In the case of pneumonia, ML has been shown to be promising in strengthening diagnostic accuracy when applied to hospitalized patients [23][24]. Despite numerous publications in this field, there are few cases of successful translation of ML techniques to clinical settings across the board [25].
In light of this, it is of great importance that researchers consider the clinical setting and end user of their models.
As such, a set of predictors which are easily recognised or even self-reported, and a model which is suitable for incorporation into a referral or diagnostic tool such as a mobile phone app, will be key requirements for assisting the diagnosis of pneumonia in low- or middle-income settings. Therefore, the use of ML systems to detect respiratory diseases via non-invasive measures such as signs and symptoms is gaining momentum. Indeed, such diagnostic tools are emerging as a route to facilitating successful task redistribution and improving access to accurate diagnosis in areas with few qualified clinical staff [26]. However, due to the heterogeneity and diversity of ML systems, there is an ongoing need to assess their performance in order to identify gaps in the research, drive improvements in practice and facilitate future comparative studies. To the best of our knowledge, this is the first review of the application of ML to symptom-based detection of pneumonia. The research question we addressed is: what symptom-based ML predictive models have been developed and how well do they perform? Accordingly, the aims of this study were to assess both the performance of published ML methods to diagnose pneumonia based on symptoms or signs, and the reporting quality of these studies.
Therefore, the main contributions of our work are: (1) a systematic synthesis of the existing studies proposing ML algorithms to diagnose pneumonia based on signs and symptoms, (2) identification of the symptoms most frequently used as ML features, and (3) identification of the best-performing ML methods.
Based on our findings, we provide a recommended pipeline for designing and implementing predictive algorithms, with critical steps to follow to achieve a generalisable and robust ML model. We anticipate that our findings and recommendations will be constructive in guiding future research and facilitating its translation into clinical tools.

Materials and methods
This systematic review was conducted and reported in accordance with PRISMA guidelines for systematic reviews and meta-analyses (PRISMA checklist) [27][28] and the recommendations from the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy [29]. The methods of the literature review as well as the inclusion and exclusion criteria were specified beforehand in a protocol (available from the authors on request). All types of studies were included if they reported on the use of artificial intelligence (AI) systems such as machine learning (ML) or deep learning (DL) techniques applied to distinguishing pneumonia based on signs and symptoms. The STARD 2015 tool [30], which has been developed to assess the reporting quality of diagnostic accuracy studies, was used to rate the quality of the studies included in the systematic review.

Search strategy
Potentially relevant studies published from 2010 to May 2021 were identified by searching the PubMed, Scopus, and Embase (through OvidSP) electronic databases. It has been shown that searching multiple databases increases the overall recall in systematic reviews; however, there is a limit to the practicality of searching many databases [31]. Therefore, we selected three databases with good evidence of recall, which are appropriate for multidisciplinary research [32][33]. Only studies in English were selected for screening.
A broad search strategy including combinations of search terms for (i) the index test under evaluation and (ii) the target condition of interest was first developed for PubMed, and then adapted to all other databases. The full search strategy is reported in Supplementary Table 1. During the search, no methodology search filters to identify diagnostic test accuracy studies were used to avoid missing relevant records. In addition, in order to identify recent diagnostic accuracy test studies concerning the diagnosis of pneumonia in patients affected by the recent pandemic of SARS-CoV-2 coronavirus infection, the MedRxiv server of preprints was searched using different combinations of search terms including "COVID-19", "SARS-CoV-2", "diagnosis", "Pneumonia", "signs", "symptoms", "artificial intelligence", and "machine learning". Finally, a linear reference search was conducted by checking the references of the studies identified in the index search that met the review's inclusion criteria. Two researchers (CF and KS), who were blinded to the author information of the articles, independently screened all identified records for inclusion. In case of disagreements, a third author (RC) was consulted.

Inclusion and Exclusion criteria
A set of inclusion and exclusion criteria were defined among the authors before the study.
Studies were considered for inclusion if they were classifiable as diagnostic test accuracy studies and their declared objective was to differentiate individuals with clinically diagnosed pneumonia from controls or other diseases (e.g., bronchitis). Studies focusing only on determining the severity of pneumonia or its aetiology were considered out of the scope of this review, to reduce heterogeneity among studies.
Data collection could be both prospective and retrospective, i.e., planned either before or after the reference tests were performed. All types of study methods to recruit participants were allowed, including studies using a single set of inclusion criteria for patients with and without the target condition (cohort type accuracy studies) and studies using different sets of criteria (case-control type accuracy studies). Prognostic accuracy studies, such as those using AI systems to identify patients who may develop pneumonia in the future or experience pneumonia-related adverse events, were excluded from the present review.
The target condition had to be defined as pneumonia, without any limitation regarding its pathogenesis (e.g., viral, including SARS-CoV related pneumonia or bacterial pneumonia), or the classification system used (e.g., the WHO IMCI classification). Study participants of all ages and clinical characteristics were admitted. Studies not on human subjects were excluded.
The characteristics of the index test evaluated in the studies were required to (i) use algorithms, defined as machine learning in the study, including appropriate apparatus, such as learning or training, aimed at seeking optimal answers and (ii) include signs and symptoms as predictors in the machine learning algorithms.
Signs and symptoms were defined as any subjective (symptoms) or objective (signs) abnormality that may indicate the presence of pneumonia, such as cough, fever, dyspnoea, chest pain, chest indrawing, sweating and shivering, breathing rate, etc. No limitations were set concerning other predictors that might have been used alongside signs and symptoms, including epidemiological and demographic parameters (e.g., age, gender, rural/urban site, season, region), imaging or laboratory test results.
Studies had to report at least one accuracy measure of the index tests, such as sensitivity, specificity, accuracy or the area under the curve (AUC). Lastly, no pre-defined limitations were applied to the type of test used as reference standard (e.g., imaging examinations, microbiological tests) or to the spectrum of study participants with and without the target condition. Review articles were not included directly, but their references were screened and included individually if they met the review's inclusion criteria.
In summary, the inclusion criteria were as follows: 1. studies classifiable as diagnostic test accuracy studies; 2. the objective of the study was to differentiate individuals with clinically diagnosed pneumonia from controls or other diseases (e.g., bronchitis); 3. the study used algorithms described as machine learning, including appropriate apparatus, such as learning or training, aimed at seeking optimal answers; 4. studies reported at least one accuracy measure of the index tests, such as sensitivity, specificity, accuracy or the area under the curve; 5. studies included signs and symptoms as predictors in the machine learning algorithms.
The following exclusion criteria were also applied: 1. all review articles, letters, comments, abstracts, conference papers and case reports; 2. studies only focusing on determining the severity of pneumonia or its aetiology and/or without diagnostic confirmation; 3. prognostic accuracy studies; 4. non-human subjects (e.g., animals); 5. non-English papers.

Data extraction and outcomes of interest
Two review authors (KS, CF) performed the title and abstract screening and extracted the data from included studies, and a third author (RC) checked the extracted data. For the final set of included records, the following information was retrieved: (i) literature data: title, first author and publication date; (ii) study design; (iii) study participants: kind of pneumonia, mean age and age class, clinical setting (e.g., primary, secondary or tertiary care), sample size, signs and symptoms, and other diseases; (iv) information regarding the reference standard, i.e., the methodologies used to distinguish pneumonia patients from the control group or other respiratory diseases. In addition, data were also recorded on (v) the specific methodologies used to process and classify data for use in machine learning algorithms, including feature selection methods, ML parameters and final predictors. Finally, data were extracted on (vi) the summary measures for the predictive ability of the identified AI systems, including sensitivity, specificity, accuracy and AUC. References to relevant ethical issues in the studies were also extracted. This study was not meant to estimate an overall measure of the accuracy of ML systems to diagnose pneumonia based on symptoms, but rather to provide a broad overview of the characteristics of the different approaches proposed. Therefore, only a qualitative synthesis of the study results was planned, anticipating broad heterogeneity in the types of study design, participants, test methods, types of analysis and reported accuracy measures in the included studies.

Certainty assessment
The reporting quality of the included studies was rated via the STARD 2015 tool [30], which consists of a checklist of 30 items that should be included in reports of diagnostic accuracy studies in order to ensure the interpretability of results, enhance the reproducibility of research and improve completeness and transparency.
Given the characteristics of ML tests, items 22 and 25 were considered not applicable and excluded from the quality assessment. When assessing adherence to the STARD 2015 checklist, each reporting requirement was rated as yes, no, maybe, or not applicable, with all disagreements resolved by consensus between the two reviewers. If, for each item, information was fully reported in the relevant section of the manuscript or provided in the supplementary material (including online-only material), the item was scored as "yes". If an item was only reported partially, it was scored as "maybe", whereas if an item was not applicable to the study it was scored as "NA". To optimize interobserver agreement, a training session was held for all reviewers using two articles.
Three reviewers (KS, CF and RC) completed the study checklist for one third of the included records each. A cross-check by another author was done for 10% of the studies and any disagreement was resolved by discussion. Results of the quality assessment were analysed qualitatively through a narrative summary of the main reporting issues identified in the studies.

Study selection
According to the search strategy described above, 876 titles were identified in PubMed, Scopus, Embase (through OvidSP) and pre-print servers. After removing duplicates, 775 titles were considered. Of these, 726 were excluded after reading the titles and abstracts as they did not meet the inclusion criteria. Of the remaining 49 full-text articles, 34 were removed under the exclusion criteria. One article was identified through a linear search of the references of the included studies. Finally, 16 full texts were included in the qualitative analysis. A flow chart of the literature search results is shown in Fig. 1.

Certainty assessment of included studies
A summary of STARD 2015 adherence by item is presented in Fig. 2. The STARD items reported for each study are listed in Supplementary Table 2. The STARD items are grouped into macro-categories: title or abstract, introduction, methods, results, discussion and other information. Each item is coloured green if information was fully reported in the study ("Yes"); light blue if an item was only reported partially ("Maybe"); red if information was not reported ("No"); and grey if an item was not applicable to the study ("NA").
Overall, studies had moderate reporting quality for all subitems in the sections of the STARD tool concerning the title and abstract, the description of the study design and participants, and the discussion, but less so in the sections of the methods concerning the description of the test methods (including the index and reference tests) and the analysis of the data, and in the results sections, including the description of the study participants and the results of the tests.
STARD items were described as frequently reported (if ≥66% of the total studies reported a specific item), moderately reported (33%-66% of the total studies reported a specific item), and infrequently reported (≤33% of the total studies reported a specific item) [34].
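These frequency bands amount to a simple thresholding rule, sketched below as a hypothetical helper (the review applied the thresholds manually; the function name and signature are ours, not part of the published methodology):

```python
def reporting_band(n_reporting: int, n_total: int) -> str:
    """Classify a STARD item by the share of studies reporting it,
    using the review's bands: >=66% frequent, 33-66% moderate, <=33% infrequent."""
    share = 100 * n_reporting / n_total
    if share >= 66:
        return "frequently reported"
    if share > 33:
        return "moderately reported"
    return "infrequently reported"

# e.g., an item reported by 11 of the 16 included studies (~69%)
print(reporting_band(11, 16))
```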
Seventeen of the 28 items were frequently reported, in whole or in part (i.e., "Yes" or "Maybe"), by the included studies. Some of the frequently reported items are of particular relevance to this review. In the methods section related to the test methods, subitem 10.a, which relates to the description of the machine learning method used in the study (i.e., the index test), was fully reported by 11 studies (69%) and partially by 1 study (6%), whereas insufficient information was provided in 4 studies (25%). Similarly, subitem 10.b, related to the description of the reference standard used to calculate the accuracy of the index test, was fully reported by 9 studies (56%). Moreover, in the results section, item 24, related to the estimates of diagnostic accuracy and their precision (such as 95% confidence intervals), was reported by 8 studies (50%).
Six of the 28 items were moderately reported, in whole or in part, by the included studies. These include item 11 in the methods macro-area (i.e., rationale for choosing the reference standard), which was reported in full by 7 studies (43%). Another important item is item 12 (i.e., definition of and rationale for test positivity cut-offs or result categories of the index test, distinguishing pre-specified from exploratory), which was reported in full or partially by only 6 studies (37%). In the results macro-area, item 20, which concerns baseline demographic and clinical characteristics of participants, was reported by 10 studies (62%).
Five of the 28 items were infrequently reported, in whole or in part, by the included studies. These include item 15 (i.e., how indeterminate tests were handled; reported by 3 studies (18%)), item 17 (i.e., whether analyses of subgroups and heterogeneity were pre-specified or exploratory; reported by 2 studies (12%)), item 18 (i.e., the intended sample size and how it was determined; reported by none of the studies (0%)), item 19 (i.e., flow of participants, using a diagram; reported by only 2 studies (12%)) and item 23 (i.e., cross-tabulation of the index test results; fully reported by 1 study (6%)).

Characteristics of the included studies
Study characteristics regarding study design and subject population as well as machine learning methods and performance measures were extracted and presented in Tables 1 and 2.

General Study Characteristics
A summary of types of pneumonia, reference standards, study populations and other clinical characteristics of the included papers is given in Table 1.
The vast majority of the studies identified concerned CAP, with patients presenting with symptoms of pneumonia to EDs or other healthcare facilities. In one case (Rother et al. [41]) it is unclear whether HAP or CAP is considered. Interestingly, one study by Porter et al. [46] included pneumonia identified through presentation to the ED, inpatient wards and ambulatory care units, suggesting inclusion of both HAP and CAP. Two studies focused specifically on the detection of COVID-19 pneumonia.
Unsurprisingly, radiography, which is widely considered the gold standard for confirmation of pneumonia, was the most commonly used reference test. In several papers the reference standard was unclear [36][37]41,[48][49]. Grigull et al. [36], Yu et al. [48] and Huang et al. [49] did not mention how the diagnosis of pneumonia was made; Bejan et al. [37] described diagnosis as being performed by a 'nurse with 6 years of experience' but not the criteria used for classification; and Rother et al. [41] mentioned 'standard diagnostic criteria' but offered no further detail. Further, as well as omitting the details of how pneumonia-positive cases had been established, Yu et al. [48] and Huang et al. [49] did not make clear how cases of COVID-19 and pneumonia had been confirmed, i.e., whether PCR test results were available.
Fourteen of the 16 studies provided information on study population age. Of these, 8 focused on childhood pneumonia, 5 on pneumonia in adults and one on a mixed-age population. Three of the included studies focused on LMIC settings and the specific challenges of diagnosing childhood pneumonia in such areas [23,42,44], indicative of those most vulnerable to the disease. Of these studies, only Pervaiz et al. [44] included other respiratory diseases, which may be difficult to distinguish and are likely to be highly common in presenting patients. Indeed, only 6 studies concerned differential diagnosis of pneumonia from other diseases, while the remainder either did not specify or specifically excluded other respiratory conditions from the population. Inability to distinguish between other, similarly presenting respiratory diseases is a clear limitation to the utility of any proposed diagnostic tool. Finally, in 11 of 16 studies the data used were neither provided in the paper nor registered in any open-source database, although in 6 of these studies availability was granted on request to the authors. In 3 cases data were made available in public repositories.

Artificial intelligence study characteristics
Summaries of feature selection, machine learning methods, validation processes and their performance in automatically detecting pneumonia are presented in Table 2; for more details see Supplementary Table 3. The best ML method was chosen as the one with the highest AUC, which summarises the trade-off between sensitivity and specificity. Where a study explored several problems connected to pneumonia, such as classification of different grades of severity (e.g., high, moderate and mild), only the methods employed to automatically distinguish pneumonia from controls or other respiratory diseases were extracted and tabulated in Table 2.
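This AUC-based selection criterion can be illustrated with a short sketch that ranks candidate models by cross-validated AUC on synthetic data (a hedged example using scikit-learn; the data and model choices are ours, not any included study's pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a symptom/sign feature matrix with a binary
# outcome (pneumonia yes/no)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Rank each candidate by its mean cross-validated AUC and pick the highest,
# mirroring how the "best" method is identified per study in Table 2
aucs = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
        for name, m in candidates.items()}
best = max(aucs, key=aucs.get)
print(best, round(aucs[best], 3))
```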
ML model choice varied from relatively simple methods such as logistic regression (31% of the selected papers) [50] to deep learning algorithms such as artificial neural networks (ANNs) [51] or convolutional neural networks (CNNs) [52] (19% of the selected papers), as also shown in Fig. 3.
Logistic regression is a simple technique for binary classification problems, often used to model the probability of a certain class or event [50]. ANNs and CNNs, by contrast, are more advanced machine learning models capable of learning arbitrary nonlinear functions; in particular, a CNN is a type of ANN mainly used in image recognition and processing, specifically designed to process pixel data [52].
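As an illustration, a symptom-based logistic regression might be sketched as follows (synthetic data; the four binary "symptom" features and their association with the label are invented for the example, not taken from any included study):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary symptom/sign features: fever, cough, dyspnoea, chest pain
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))
# Toy label loosely associated with fever and dyspnoea
y = ((X[:, 0] + X[:, 2] + rng.random(200)) > 1.5).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
# predict_proba yields the modelled probability of pneumonia for a new presentation
p = model.predict_proba([[1, 1, 1, 0]])[0, 1]
print(f"modelled probability of pneumonia: {p:.2f}")
```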
Several studies [40,42,[44][45]47] (5 out of 16) selected regression-based models as the method achieving the best overall performance. These ranged from simple or multivariate logistic regression [53] to more sophisticated techniques such as LASSO regression [54]. In particular, LASSO is a penalized regression approach that estimates the regression coefficients by maximizing a penalized log-likelihood (or, equivalently for linear regression, minimizing a penalized sum of squared residuals) and automatically removes unnecessary covariates by shrinking their coefficients to zero.
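The covariate-deletion behaviour of LASSO can be demonstrated with an L1-penalised logistic regression on synthetic data (a sketch only; the penalty strength C = 0.1 is an arbitrary illustrative choice, not a value from the included studies):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))  # 8 candidate predictors, only 2 informative
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=300) > 0).astype(int)

# The L1 penalty shrinks uninformative coefficients exactly to zero,
# performing feature selection as part of model fitting
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_[0])
print("predictors retained:", kept)
```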
Five studies [23,[38][39]43,55] out of 16 employed a tree-based model to automatically detect pneumonia. The majority of tree-based algorithms were Random Forest (RF) [56] and CART trees [57]. The CART algorithm is a decision tree using Gini's impurity index as the splitting criterion; it is a binary classifier built by repeatedly splitting nodes into child nodes. RF, on the other hand, is a bootstrapping algorithm built on the CART tree model: it creates multiple CART trees from "bootstrapped" samples of the data and then combines their predictions by averaging. Random Forest can achieve better predictive power than a single CART model, but its rules are not easily interpretable.

Two studies investigated the combination of several ML methods via voting [36,41]. Voting is one of the simplest ensemble methods; ensemble methods are techniques that create multiple models and then combine them to produce improved results [58]. Grigull et al. [36] and Rother et al. [41] showed that combining different ML methods such as SVMs, ANNs, fuzzy logic and more traditional ML models (RF, LR, etc.) achieved higher accuracy in discriminating pneumonia from other diseases.
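Both ideas, bootstrap aggregation of CART trees and voting over heterogeneous models, can be sketched with scikit-learn (illustrative only, not the implementation of any included study):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Random Forest: many CART-style trees fitted on bootstrapped samples,
# with predictions combined across trees
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Voting ensemble: heterogeneous models whose (soft) class votes are pooled,
# in the spirit of the combinations explored by Grigull et al. and Rother et al.
vote = VotingClassifier([
    ("cart", DecisionTreeClassifier(random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True, random_state=0)),
], voting="soft").fit(X, y)

print(rf.score(X, y), vote.score(X, y))
```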
Two studies [38][39] employed probabilistic ML methods to distinguish patients with and without pneumonia. DeLisle et al. [38] reproduced a previously reported model [59], whereas Haug et al. [39] developed Bayesian networks, built around directed links reflecting mathematical relationships between variables. Support vector machines (SVMs) were employed in several studies, but only one study [37] reported SVM as the best-performing model. SVMs belong to the general field of kernel-based machine learning methods and are used to efficiently classify both linearly and nonlinearly separable data [60].
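The kernel idea can be illustrated on a toy problem that is not linearly separable (a sketch on synthetic data, unrelated to the configuration used by Bejan et al. [37]):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by any line in the original feature space
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # RBF kernel implicitly maps to a richer space

print(f"linear kernel accuracy: {linear.score(X, y):.2f}")
print(f"RBF kernel accuracy:    {rbf.score(X, y):.2f}")
```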
Only one study achieved its best performance using a deep learning algorithm (i.e., an artificial neural network) compared with other traditional machine learning methods [46]. Two recent studies [48][49] also employed deep learning methods, achieving high accuracy and AUC values. Yu et al. [48] presented a novel deep learning algorithm for the disease-identification stage, including adaptive feature infusion and multi-modal attentive fusion to fuse structured and text data. Huang et al. [49] explored a deep-learning-based dual-task network, named FaNet [61], performing both diagnosis and severity assessment for COVID-19 based on the combination of CT imaging and clinical symptoms.
The predominance of linear classifiers may be explained by the fact that these models are mainly developed to be implemented in decision support or CAD systems. In such systems, models that are less complex and more interpretable to clinicians and non-AI experts are preferred over more advanced AI methods such as deep learning, which is often referred to as a "black box", along with more complex ML algorithms such as SVMs.
Among the selected studies, two [47,49] investigated patients with suspected COVID-19 pneumonia. Feng et al. [47] employed LASSO logistic regression to discriminate between COVID-19 pneumonia and healthy patients using combinations of symptoms and laboratory parameters, whereas Huang et al. [49] detected COVID-19 pneumonia using a state-of-the-art deep learning algorithm (FaNet) applied to CT images and symptoms.

Ethical aspects
Several ethical issues were addressed by the included studies. Some were not fully investigated, such as informed consent, reference to minors or, more generally, to the age of patients (considering that the youngest and the oldest are the most affected by pneumonia) or to gender issues [36,[41][42][43][44]46,48].
Frequently mentioned aspects related to the allocation of resources [23,43,62], in particular in limited-resource settings (LRSs) [42][43][44]48] where morbidity and infant mortality from pneumonia are high [42]; the lack of resources [42,48]; and the need for specific training of staff [44], doctors and/or nurses [42,48] on more advanced tools. This raises an important ethical question that is a global challenge [47]: the difficulty LRSs face in complying with international medical and technological standards.
Another recurring theme was the human-machine relationship [36,39], still controversial from an ethical point of view. There is unanimous recognition that technology is a support tool for doctors [36,39,41,43] that adds objectivity, precision [44][45][48][49] and speed to the diagnosis. However, the entry of technology into the moment of diagnosis changes the doctor-patient relationship somewhat, objectifying the care relationship [36][37]40], which on one hand leads to its depersonalization and on the other to greater rigour (in sensitivity and specificity), which could limit medical malpractice and the consequent litigation [36]. The possibility of machines replacing doctors in decision making is also mentioned in [23,36,39], but it is made clear that doctors themselves are aware of the urgent clinical need for such algorithms and do not perceive them as competition [41,45].

Discussion
This systematic literature review provided a comprehensive overview of the existing studies proposing ML algorithms to diagnose pneumonia based on signs and symptoms. The use of AI for image-based detection of pneumonia, and particularly COVID-19, has been reviewed [63][64], but to the best of our knowledge a systematic review of symptom-based models is lacking in the existing literature. This is particularly timely as AI-based diagnostic tools begin to appear in medical devices. However, the practicality of AI in current medical practice is still not fully understood by clinicians. AI could help to reduce misdiagnosis. In fact, respiratory diseases can present overlapping symptoms, offering similar clinical presentations despite differing underlying causes that respond best to different treatments. Advances in machine learning models could therefore assist clinicians in diagnosing pneumonia rapidly by considering a high number of variables related to patient care and medical history. To address these difficulties effectively and efficiently, it may be worth considering the inclusion of AI in medical practice. This could positively contribute to patients' condition by informing treatment personalization through the prediction of clinical situations that could deteriorate patients' health. With the dramatically fast spread of COVID-19, analysing complex medical datasets with machine learning can provide opportunities for developing a simple and efficient COVID-19 diagnostic system. Nevertheless, several issues, such as poor realised performance in clinical settings, as discussed by van Schalkwyk et al. [65], may be alleviated by proper ethical, contextual and performance evaluation during conception and design.
Of the hits retrieved in the systematic review, many studies were published from 2020 to 2021 and concerned detection of COVID-19. However, the majority of these articles did not meet the inclusion criteria as they either focused on symptomatic detection of early disease (not associated with pneumonia) or imaging-based detection with no input from symptoms or signs. The included studies were highly heterogeneous concerning the study design, the healthcare setting, the study population and the ML algorithm employed. Specifically, three papers focused on diagnosing childhood pneumonia in low- and middle-income country (LMIC) settings. This is a very relevant context which warrants more research, as application of AI algorithms in countries with highly constrained healthcare settings and deprived populations may be of even higher value compared with higher-income countries.
The reporting quality was satisfactory for some sections of the STARD checklist, but less so for key sections such as the description of the index test (the ML model) and reference standard and the data analysis in the methods section, as well as the description of the study participants and the test results in the results section.
For example, items which were less frequently reported included the details of the reference and index tests, such as a clear description of the reference standard used as benchmark, the definition of rationale for test positivity cut-offs or result categories, or the way that indeterminate results of the reference test were handled. Noteworthy and concerning was the fact that such details remained absent even in the most recent publications, which focused on providing improved detection of COVID-19 pneumonia, with only one study providing any details on either diagnosis of pneumonia or method for confirmation of SARS-CoV-2 infection. In addition, the characteristics of the study participants, such as the distribution of severity of disease in those with the target condition, or the distribution of alternative diagnoses in those without the target condition were also less frequently reported. All these aspects warrant more careful consideration and higher reporting standards to allow a clear judgment on the risk of bias in the accuracy estimates and to allow replication and validation of the proposed ML-algorithm in other settings or populations. Similar issues and deviations from best practice have been highlighted concerning the relative explosion in the publication of ML algorithms for diagnosis and management in response to the spread of COVID-19, the result of which clouds the most clinically beneficial routes and prevents the realisation of benefits to patients [66].
The references to ethics found in the selected texts suggest that there is an overall awareness of the importance of ethical principles and guidelines to guarantee the protection of people's rights. However, the urgency of adopting a shared ethical reference framework emerges (e.g., the European Commission's Ethics Guidelines for Trustworthy AI). Furthermore, in order to make ethics a real tool of concrete support and not just a humanitarian embellishment, it should be considered a decisive reference also in the design and implementation phases of AI algorithms, to better guarantee users' rights.
Concerning the ML algorithms, there was huge heterogeneity among studies, and many pitfalls were identified in the development of a reliable and generalisable ML model to diagnose pneumonia via symptoms and signs. Few studies employed a feature selection step in the development of the ML model. Nevertheless, feature selection is a critical step in developing a robust classifier for medical and health applications. In fact, to minimize the risk of over-fitting in an ML model, the number of features used should be limited by the number of subjects presenting the event to detect (i.e., pneumonia) in the training subset or in a separate subset specifically designed for the feature selection process [67][68][69]. Splitting the dataset into subsets is crucial to avoid bias and over-fitting and to increase the external validity of the model. If data availability is not a problem, the dataset could be split into three different subsets: subset 1 for feature selection via one of several existing techniques [70][71]; subset 2 to train and validate the model; and an independent subset 3 to test the final model and assess overall performance [67][68][69]. Although the best approach is to select the minimum set of features using a subset different from the one adopted to train the machine learning model [67][68][69], when the dataset is small, feature selection and model training can be performed on the same subset. As reported in Table 2, some studies did not employ a clear feature selection method, and those that did performed feature selection on the whole dataset or during the training of the algorithm. It is important to bear in mind that a small set of clinically meaningful features strongly simplifies the physiological interpretation of results by directing attention to only the most informative ones [67].
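The three-subset strategy described above can be sketched as follows. This is a minimal illustration on synthetic data: the 20/60/20 proportions follow the recommendations later in this review, while the specific selection technique (univariate `SelectKBest`) and classifier are illustrative assumptions, not methods used by any of the reviewed studies.

```python
# Sketch: feature selection on a dedicated subset, training on a second,
# final evaluation on an untouched third subset. Data are synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))            # 30 candidate symptom/sign features
y = (X[:, 0] + X[:, 1] + rng.normal(size=1000) > 1.5).astype(int)  # synthetic label

# Subset 1 (20%): feature selection only; subsets 2 and 3 come from the rest.
X_rest, X_fs, y_rest, y_fs = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Subset 2 (60% overall) for training, subset 3 (20% overall) for the final test.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Select a small feature set on subset 1 only, then reuse those features elsewhere.
selector = SelectKBest(f_classif, k=5).fit(X_fs, y_fs)
clf = LogisticRegression(max_iter=1000).fit(selector.transform(X_train), y_train)
auc = roc_auc_score(y_test, clf.predict_proba(selector.transform(X_test))[:, 1])
print(round(auc, 2))
```

Because the selector never sees subsets 2 or 3, the reported AUC is not inflated by feature-selection leakage.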
In the detection of pneumonia, the identification of symptoms that can be used as final predictors is of extreme importance to physicians. Therefore, hand-crafted features and the use of PCA are not recommended. As reported in Table 2, there is a mixture of manual and automated approaches to feature selection in the selected studies. Manual methods have a clear focus on clinical utility and application. Some key criteria used were: (i) measurability in a point-of-care setting [23]; (ii) parameters frequently investigated [36]; (iii) ease of availability [35,62]; and (iv) reliability [35]. Haug et al. [39] make an interesting comparison between a fully automated ML model, from feature selection to performance, and a semi-automated model in which features are chosen manually by clinicians based on medical relevance. The large dataset available in this study allowed selection of 40 features by both methods. There was considerable overlap between the two feature sets; symptoms picked up by both methods included heart rate, respiratory rate, temperature, abnormal breath sounds, moderate cough, wheezes, productive cough and rales. Certain features such as 'not oriented to place', which were selected by the automated process, were absent from the manual one, perhaps owing to a lack of direct clinical/biological relevance to pneumonia. Interestingly, slightly better performance was achieved by the manually created model, which may highlight the motivation for a firm evidence basis in ML design. Other popular methods were uni-/multivariate analysis and logistic regression. One technique appearing in the most recent publication [47] was the Least Absolute Shrinkage and Selection Operator (LASSO). LASSO builds on classic regression models and is emerging as a more interpretable, clinically useful method for selecting predictors, as by nature it strives to create sparse models (fewer predictors) [72].
Five studies [38,41,44,46,49] did not employ any feature selection process.
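LASSO-style sparse selection, as mentioned above, can be sketched with an L1-penalised logistic regression. The data, feature names and regularisation strength below are synthetic assumptions for illustration; they are not taken from any of the reviewed studies.

```python
# Sketch of LASSO-style predictor selection: L1 penalty shrinks
# uninformative coefficients exactly to zero. Data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
features = ["fever", "cough", "dyspnoea", "crp", "heart_rate", "runny_nose"]
X = rng.normal(size=(500, len(features)))
# Only "fever" and "crp" drive the synthetic label.
y = (1.5 * X[:, 0] + 1.0 * X[:, 3] + rng.normal(size=500) > 0).astype(int)

# Smaller C -> stronger shrinkage -> sparser model (fewer predictors kept).
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = [f for f, w in zip(features, lasso.coef_[0]) if abs(w) > 1e-6]
print(kept)
```

Tuning `C` (typically by cross-validation) trades off sparsity against fit, which is what makes the resulting models comparatively interpretable for clinicians.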
As far as the validation process is concerned, the training dataset has no sub-categories, whereas the validation dataset can be of several types: (i) internal validation, whose sample originates from the same source as the training dataset; (ii) external validation, whose sample is composed of independently sampled data; (iii) internal-split validation, which uses a sample separated from the original dataset for the purpose of validation; and (iv) internal cross-validation, which repeats the validation process over a sample left out of the training dataset. Five studies [38,[42][43][44][45] did not employ either internal or external validation techniques, making the developed models difficult to generalize and compare with other diagnostic tools. Only three of the 16 identified studies [23,36,48] employed both cross-validation and testing on an independent set of data. Two studies [47,49] tested their models on an independent subset of data. The remaining studies developed their ML models using training and internal validation techniques.
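The strongest internal scheme above, combining cross-validation with a final test on an independent hold-out set, can be sketched as follows. The data and the choice of classifier are illustrative assumptions on synthetic data.

```python
# Sketch: internal cross-validation on a development set, then a single
# evaluation on an untouched hold-out set. Data are synthetic.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 10))
y = (X[:, 0] + rng.normal(scale=0.8, size=600) > 0).astype(int)

# Hold out 25% of the data; it is never touched during model development.
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2)

# Internal cross-validation on the development set ...
cv_auc = cross_val_score(RandomForestClassifier(random_state=0), X_dev, y_dev,
                         cv=StratifiedKFold(5), scoring="roc_auc")
# ... then one final fit and one evaluation on the hold-out set.
final = RandomForestClassifier(random_state=0).fit(X_dev, y_dev)
hold_auc = roc_auc_score(y_hold, final.predict_proba(X_hold)[:, 1])
print(round(cv_auc.mean(), 2), round(hold_auc, 2))
```

A large gap between the cross-validated and hold-out AUC is itself a useful warning sign of over-fitting or leakage.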
The majority of the included studies employed large datasets that were highly imbalanced. In medicine, a well-balanced dataset is vital to develop a good prediction model [73]. In fact, when the imbalance is large, it is hard to build a good classifier using conventional learning algorithms. In imbalanced datasets the cost of mispredicting the minority class is higher than that of the majority class; this is particularly relevant in medical datasets, where high-risk patients tend to form the minority class (e.g., pneumonia cases). There is therefore a need for good sampling techniques for medical datasets. Among the selected studies, only four of the 16 [23,36,40,43] adopted a bootstrapping or oversampling technique to address the problem of imbalanced datasets.
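One simple rebalancing approach is random oversampling (bootstrapping) of the minority class, sketched below on synthetic data; SMOTE, undersampling or class weighting are common alternatives. The 5% prevalence is an illustrative assumption.

```python
# Sketch: bootstrap the minority (pneumonia) class up to the majority size.
# In practice this must be applied to the training data only, after
# splitting, to avoid leaking duplicated cases into the test set.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:50] = 1                     # 5% minority class, e.g. pneumonia cases

X_min, X_maj = X[y == 1], X[y == 0]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=3)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_up))])
print(int(y_bal.sum()), len(y_bal))  # 950 1900: classes are now balanced
```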
Among the selected studies, there are a variety of predictors used to develop the machine learning algorithms. Eight studies used a combination of laboratory results and symptoms as their final predictors, with only five papers using symptoms alone. Symptoms/signs which occurred often included: fever (5 studies), temperature (5), abnormal breathing (4), cough (3), productive cough (2), dyspnoea (2), absence of runny nose (2) and chest indrawing (2). Other population differences are also reflected in the final predictors; for example, chest indrawing is only used in studies concerning childhood pneumonia. This is consistent with the known age-specific presentations of the disease [74] and thus highlights a potential challenge in producing a general model. The utility of C-reactive protein (CRP) level as a biomarker in classifying pneumonia was addressed by four studies. As already recognised in the literature [13][14], there were some contradictions between studies regarding its performance as a pneumonia predictor. Naydenova et al. [23] and Groeneveld et al. [45] specifically investigated the addition of CRP to models based on symptoms, vital signs and age, and found that CRP worsened model performance in the diagnosis of pneumonia. It is worth noting, however, that both described the utility of CRP as a beneficial predictor for pneumonia severity and aetiology. Interestingly, the reference standard of pneumonia in these studies is not the same: Naydenova et al. [23] had a subject population of children and used clinical evaluation based on WHO and IMCI guidelines, whereas the subject population in Groeneveld et al. [45] was adult and their reference was consolidation on X-ray. In contrast, Steurer et al. [35] and van Vugt et al. [40] found CRP to be a useful predictor. Indeed, Steurer et al. [35] found CRP to be the strongest indicator of radiographically confirmed pneumonia in adults from a set of mostly symptomatic predictors.
Together, this highlights the need for further investigation of biomarkers as candidate features for diagnostic classifiers, to better understand the seemingly complex behaviour of these markers. Only one study [49] used a combination of 3D CT imaging and clinical symptoms via a deep learning model (FaNet) to detect patients affected by COVID-19 pneumonia. Their experimental results illustrated that FaNet achieves fast clinical assessment for COVID-19 with an accuracy of 98.28%. The proposed framework consisted of four modules: symptom encoding, feature extraction from CT image sequences, fusion, and prediction. They developed a Symptom-fused Channel Attention Module to fuse the clinical symptoms with the CT image sequences. Finally, the prediction module produces the clinical assessment based on the fused features.
Some studies used a combination of many signs and symptoms; even though they employed large datasets, the class to predict (i.e., pneumonia) was often the minority class. According to Foster et al. [67], as a rule of thumb, at least 10 observations and/or patients presenting the event to detect are needed for each predictor. In some studies, the number of predictors exceeded the number of patients in the target class, increasing the risk of over-fitting of the model.
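The rule of thumb above translates into a simple sanity check at design time. The helper below is hypothetical (not from the reviewed studies); it merely applies the 10-events-per-predictor heuristic [67].

```python
# Hypothetical helper applying the "at least 10 events per predictor"
# rule of thumb: cap the feature count by the minority-class size.
def max_predictors(n_minority_cases: int, events_per_variable: int = 10) -> int:
    """Upper bound on the number of predictors the data can support."""
    return n_minority_cases // events_per_variable

# E.g., a training set with 120 confirmed pneumonia cases supports
# roughly 12 predictors under this heuristic.
print(max_predictors(120))
```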
Comparison of predictors and ML model performance across all studies is severely limited for several reasons: (i) variation in pneumonia type/reference standard; (ii) variation in subject population; and (iii) differential reporting of performance metrics. Overall AUC performance varied from 75% to 99%; however, not all the included studies reported the AUC measure. Moreover, there is great heterogeneity in the performance reporting of the diagnostic tools used in the included studies. The reference standard for reporting the performance of ML methods is described in [75]. The lack of homogeneity among the selected studies in ML development and performance reporting was the main reason for conducting a qualitative systematic review, as meta-analysis was not possible with the available data.
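A consistent minimal metric set (AUC, sensitivity, specificity at a stated threshold) would make such comparisons possible. The sketch below computes these on tiny synthetic labels and scores; the 0.5 threshold is an illustrative assumption that, per the STARD items discussed earlier, should always be reported with its rationale.

```python
# Sketch: a consistent metric set for a diagnostic classifier.
# Labels, scores and the 0.5 positivity threshold are synthetic examples.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.7, 0.9, 0.4, 0.35, 0.15])
y_pred = (y_score >= 0.5).astype(int)      # prespecified positivity cut-off

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "auc": roc_auc_score(y_true, y_score),  # threshold-free discrimination
    "sensitivity": tp / (tp + fn),          # threshold-dependent
    "specificity": tn / (tn + fp),          # threshold-dependent
}
print({k: round(v, 2) for k, v in metrics.items()})
```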

a. Recommendations when designing and implementing AI tools
In light of this scenario, recommendations on how to develop an ML method are given to researchers to improve the efficacy of AI tools to automatically detect pneumonia or other respiratory diseases. The recommended pipeline is formed by:
1. Pre-processing. For building any ML model, it is important to have a sufficient amount of data to train the model. The data are often collected from various sources and may be available in different formats. For this reason, data cleaning and pre-processing become a crucial step, which includes imputing missing values, encoding categorical variables (e.g., symptoms), and normalizing and/or scaling the data if required. Moreover, clinical information and reference standard results should be available to the developers of the ML model. More importantly, explanations of how indeterminate reference standard results were handled should be provided.
2. Dataset splitting. Where the dataset presents an adequate number of instances, the whole dataset can randomly be split, by subjects and/or instances, into two or more subsets. For instance, one subset (usually 20% of the total data) can be used for feature selection; a second subset (usually the majority of the data, 60%) can be used for training and validating the classification models; and a third subset (e.g., 20% of the data) can be adopted to evaluate the performance of the developed classification models. In the case of a highly imbalanced dataset, each subset should contain the same proportion of minority instances, and techniques to address the imbalance should be employed (e.g., SMOTE, oversampling, undersampling or boosting).
3. Identifying features to predict the target. The number of features used in a machine learning algorithm should be strongly limited by the number of subjects and/or instances presenting the event to detect in each subset, in order to minimise the risk of over-fitting. However, selecting the minimum set of features using the same subset utilised to train the machine learning algorithm can reduce the generalisability of the final decisional algorithm. Researchers using manual feature selection based on clinical usage, or more advanced techniques to reduce the number of features, should always bear in mind that the maximum number of features that can be used in the classification process is strongly limited by the number of subjects (i.e., belonging to the minority class) presenting the event to detect or predict.
4. Designing the ML pipeline using the best model. Different ML methods can be used to develop classifiers aiming to automatically detect the event based on the selected combinations of features. Algorithm parameters should be tuned during training and carefully reported in the study to guarantee the reproducibility of the results. The training of the ML methods should be performed using a cross-validation procedure repeated K times, with K equal to or greater than the number of instances belonging to the minority class. This procedure needs to be performed for each machine learning method used to develop predictive algorithms.
5. Predicting the target on unseen data.
6. Reporting performance according to standards. Researchers are highly encouraged to define the rationale for test positivity cut-offs or result categories of the ML method, distinguishing prespecified from exploratory results.
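The pre-processing and cross-validated training steps of the recommended pipeline can be sketched compactly with a scikit-learn `Pipeline`, which guarantees that imputation, scaling and encoding are fitted within each training fold only. The column names and synthetic data below are illustrative assumptions.

```python
# Sketch of pipeline steps: imputation, scaling, categorical encoding,
# then cross-validated training. Columns and data are synthetic examples.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "temperature": rng.normal(37.5, 1.0, 300),
    "resp_rate": rng.normal(22, 5, 300),
    "cough": rng.choice(["none", "dry", "productive"], 300),
})
df.loc[rng.choice(300, 30, replace=False), "temperature"] = np.nan  # missing values
y = (df["resp_rate"] > 24).astype(int)     # synthetic target

pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["temperature", "resp_rate"]),
    ("cat", OneHotEncoder(), ["cough"]),    # encode categorical symptoms
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
scores = cross_val_score(model, df, y, cv=StratifiedKFold(5), scoring="roc_auc")
print(round(scores.mean(), 2))
```

Because every pre-processing step lives inside the pipeline, no statistics (medians, scaling factors, category lists) leak from validation folds into training, which directly supports the reproducibility goal of step 4.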

b. Limitations of the study
This study has provided several new insights on the existing approaches to predicting pneumonia based on signs and symptoms and the aspects that warrant consideration both in the design and implementation phases of the tests and in the reporting of the findings. However, several limitations must also be acknowledged. First, to the authors' knowledge, there is no available and reliable tool for the quality assessment of studies incorporating ML; as a result, the quality of the studies found in this area could not be systematically assessed. Second, while the medRxiv preprint database was included in the search strategy in order to capture all possible recent contributions addressing SARS-CoV-2-related pneumonia, the search was conducted using simple combinations of search terms due to the limited flexibility of the available search options. Relevant records in this database may therefore have been missed. Lastly, we limited our search of the bibliographic databases to the last 10 years. This choice was driven by a motivation to evaluate the recent use of ML techniques.

Conclusion
This systematic literature review found huge heterogeneity among studies using ML to detect pneumonia based on symptoms and signs. Many differing study designs, healthcare settings, populations and ML