A case-based reasoning framework for early detection and diagnosis of novel coronavirus

Coronavirus disease, also known as COVID-19, has been declared a pandemic by the World Health Organization (WHO). At the time of this study, over 11,301,850 confirmed cases and more than 531,806 deaths had been recorded, with these figures rising daily across the globe. The burden of this highly contagious respiratory disease is that it presents in both symptomatic and asymptomatic patterns in those infected, leading to an exponential rise in infections and fatalities. It is therefore crucial to expedite early detection and diagnosis of the disease across the world. The case-based reasoning (CBR) model is a compelling paradigm that allows previously experienced case-specific knowledge, concrete problem situations or specific patient cases to be used in solving new cases. This study therefore aims to leverage the rich database of COVID-19 cases to solve new cases. The approach adopted in this study employs an improved CBR model for the reasoning task of classifying suspected cases of COVID-19. The CBR model leverages a novel feature selection method and the semantic-based mathematical model proposed in this study for case similarity computation. An initial population of the case archive was obtained from 71 cases (67 adult and 4 pediatric) drawn from the Italian Society of Medical and Interventional Radiology (SIRM) repository. Results revealed that the proposed approach successfully classified suspected cases into their categories with an accuracy of 94.54%. The study found that the proposed model can support physicians in diagnosing suspected cases of COVID-19 from their medical records without subjecting specimens to laboratory tests.
As a result, the contagion rate occasioned by slow testing can be minimized globally, in addition to reducing the false-positive rates of diagnosed cases observed in some parts of the globe.


Introduction
The novel coronavirus disease, also referred to as COVID-19, was first reported in China in December 2019. The virus has so far affected 213 countries and territories around the world and 2 international conveyances. It is now considered a major global health concern due to its pathogenicity and widespread distribution across the globe, and this highly contagious respiratory disease has spread rapidly since it was first reported [53]. According to the WHO's official reports on COVID-19 [13], by July 6, 2020, it had affected more than 11,301,850 people and caused more than 531,806 deaths, with a total of 6,586,354 recovered. The exponential growth in confirmed cases and deaths of COVID-19 has expedited efforts by the scientific and research community to propose and develop several novel epidemiological modeling approaches to mitigate the spread of the COVID-19 outbreak.
Some mathematical and statistical models have been developed recently to critically analyze the transmission pattern of the ongoing COVID-19 and other related disease outbreaks [14][15][16][17][18][19]. It is equally important to recognize the different epidemiological contributions towards estimating the transmission dynamics of the virus. Still, most of the existing proposed models are parameter-dependent and rely mainly on multiple assumptions [20] to be effective. Moreover, during an epidemic outbreak it is often neither easy nor reliable to estimate parameters, since real data sets are not readily available for experimental testing of such proposed models [21,22]. Furthermore, in most of the reported model parameter settings, one can observe that rather than using actual parameter values derived from the statistical properties of real-world data sets, the authors of those models opted for hypothesized parameter values. The use of hypothesized parameters, however, is highly limited because such parameters do not fit the data well [20]. Considering these challenges associated with the existing mathematical and statistical epidemiological models, it would be difficult to attribute a high predictive accuracy to their estimates and forecasts of the exponential growth of COVID-19 outbreaks. As it stands, despite all these measures and attractive modeling proposals, the virus has maintained its capacity to spread exponentially from country to country and continent to continent, stretching the functionality and capability of even the most robust healthcare systems. The exponential spread of the virus has placed an enormous burden on the health facilities (testing and laboratory centers, and ICUs) of nations across the globe.
Therefore, there is an urgent need for automated systems to speed up the classification of suspected cases rather than manual approaches to diagnosis. Besides, delayed diagnosis has built up apprehension among cases of common flu and pneumonia, which share some characteristics with COVID-19.
Although many related artificial intelligence (AI) based studies in the literature appear well-designed for handling the current coronavirus pandemic in terms of estimating confirmed cases and forecasting the speed of COVID-19 spread, these models may deteriorate in performance and accuracy due to their heavy reliance on many inaccurate decision variables and imprecise parameter estimations. For instance, the use of deep learning models in the detection of COVID-19 cases is primarily based on digital images [23,[56][57][58], with little effort geared towards exploiting the useful, revealing and abundant knowledge domiciled in patients' electronic health records (EHRs). Consequently, the limitations mentioned above can lead to conflicting forecasting outcomes, which may invariably produce unsatisfactory and imprecise results. This would have a negative impact on public health planning and policymaking. Therefore, to overcome these limitations of the existing epidemiological and AI-based model approaches, the current paper presents a promising alternative diagnostic and forecasting framework that aims to achieve more accurate results by combining the strengths of ontology-based natural language processing with case-based reasoning for early detection and diagnosis of the novel coronavirus pandemic. The rich database of confirmed COVID-19 cases supports the adoption of the case-based reasoning (CBR) paradigm as an authentic reasoning structure for improving diagnosis.
CBR is an artificial intelligence paradigm that has proven useful in medical systems; it exploits the similarity of cases in its knowledge base to provide a solution to a new case or problem. Cases closely related to the new case are usually retrieved using similarity computational models such as Euclidean distance, which have been adopted in different studies. However, CBR systems share the challenge of feature extraction and formalization. Furthermore, selecting the best distance measure for computing case similarity is a problem demanding an optimal solution, considering the sensitivity of medical cases. CBR means using old experiences to understand and solve new problems: a reasoner remembers a previous situation similar to the current one and uses it to solve the new problem [38]. CBR and expert systems have a long tradition in artificial intelligence; CBR has been formulated since the late 1970s as an approach to problem solving and learning by humans and computers [39]. Empirical evidence has shown that reasoning by re-using past cases is a powerful and frequently applied way for humans to solve problems. An essential feature of case-based reasoning is its coupling to learning and its strong association with machine learning [40]. Ben-Bassat et al. [41] enumerated some features of CBR, including that cases presenting similar symptoms and findings result from the same faults/diseases, and that a nearest-neighbor algorithm is used to identify an unknown diagnosis from known ones.
Moreover, CBR avoids the knowledge-acquisition bottleneck of rule-based reasoning (RBR): it compiles past solutions, mimics the diagnostic experience of human experts, avoids past mistakes, interprets rules, supplements weak domain models, facilitates explanation, supports knowledge acquisition and learning, and exploits the database of solved problems in order to learn.
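As a concrete illustration of nearest-neighbor case retrieval over a case archive, the following minimal Python sketch ranks archived cases by Euclidean distance. The toy binary feature encoding and case base are hypothetical; this is not the semantic similarity model proposed later in this paper.

```python
import math

def euclidean_distance(x, y):
    """Distance between two cases encoded as equal-length numeric feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def retrieve_nearest(case_base, new_case, k=3):
    """Return the k archived cases most similar to the new case."""
    return sorted(case_base, key=lambda c: euclidean_distance(c["features"], new_case))[:k]

# Toy case base: features = [fever, cough, contact_history] encoded as 0/1.
case_base = [
    {"label": "positive", "features": [1, 1, 1]},
    {"label": "negative", "features": [0, 1, 0]},
    {"label": "positive", "features": [1, 0, 1]},
]
nearest = retrieve_nearest(case_base, [1, 1, 0], k=1)
```

The retrieved case's label can then serve as the proposed solution for the new case, subject to the revise and retain phases of the CBR cycle.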
In this paper, we introduce the concept of combining natural language processing, ontology learning and artificial intelligence techniques to address the most critical challenges in responding to the novel coronavirus (COVID-19) pandemic. Consequently, the main goal of this study is to apply natural language processing (NLP) to the ontology learning and population task before using an improved CBR technique to classify cases of COVID-19 as either positive or negative, even when the disease is still in its early stage of manifestation in the presented case. An NLP model for feature extraction of the presented case was designed and implemented. The originality of the current study lies in the robustness and efficiency of the sentence-level extraction of feature-value pairs for all a priori declared features. Furthermore, the case retrieval similarity metric applied to the proposed NLP-based CBR framework contributes to the strong performance of the proposed system. Specifically, the technical contributions of this study are as follows: i. Design of an ontology learning algorithm for feature extraction and mapping from suspected cases of COVID-19. ii. Proposal of a novel mathematical model for semantic-based and feature-based case similarity computation. iii. Incorporation of the proposed mathematical model into an improved CBR framework. iv. Implementation of the CBR framework, which allows for the detection or classification of suspected cases of COVID-19 as either positive or negative.
The remainder of the paper is organized into six sections, namely: related works, proposed approach, experimentation, results, discussion, and conclusion. The related works section presents a comprehensive review of related studies on COVID-19. In Section 3, details of the approach proposed for the CBR framework are presented, while Section 4 discusses the experimentation and its system configuration. In Section 5, we present a comparison of the performance of the proposed approach with some related studies, and we conclude the study in Section 6.

Related works
In recent times, artificial intelligence (AI) has been considered a potentially powerful tool in the fight against many evolving pandemics such as Ebola hemorrhagic fever (2014-2016), Swine flu (2009-2010), SARS (2002-2003), Middle East respiratory syndrome coronavirus (MERS-CoV) (2012-present), and the novel coronavirus (COVID-19) (2019-ongoing). Regarding the ongoing 2019-2020 novel coronavirus pandemic, dozens of research efforts have emerged, and most of the published papers focus on the importance of harnessing artificial intelligence technologies to curb the global COVID-19 pandemic. This section provides a selective review of recent articles that have discussed the many significant contributions of AI technologies in the fight against COVID-19, as well as the current constraints on these contributions. Specifically, in Ref. [1], six areas where artificial intelligence technologies have emerged as key solutions to combatting coronavirus were identified: i) early warnings and alerts, ii) tracking and prediction, iii) data dashboards, iv) diagnosis and prognosis, v) treatments and cures, and vi) social control. Therefore, most of the subsequent discussions presented in this section investigate to what extent AI has been partly or fully utilized in combatting the spread of the aforementioned pandemic. The review only covers articles that have been published in peer-reviewed journals; preprinted articles are outside the scope of the current review discussion.
In [2], the analysis of confirmed cases of COVID-19 through binary classification using artificial intelligence and regression analysis was investigated. The authors employed binary classification modeling with a group method of data handling (GMDH) type of neural network as an artificial intelligence method for accurately predicting confirmed cases of the COVID-19 epidemic. The study chose the Hubei province in China for model construction. Some important factors such as maximum, minimum, and average daily temperature, city density, relative humidity, and wind speed were considered as the input dataset, while the number of confirmed cases over 30 days was selected as the output dataset. The outcome of the investigation revealed that the proposed binary classification model provided a high performance capacity in predicting confirmed cases in the province. The analysis of the results also showed that certain weather conditions among the input variables, namely relative humidity (average 77.9%), had a positive impact on confirmed cases, while maximum daily temperature (average 15.4 °C) had a negative effect.
Mohammed et al. [3] presented the application of two metaheuristic optimization techniques to enhance the predictive accuracy of an adaptive neuro-fuzzy inference system used for estimating and forecasting the number of confirmed cases of novel coronavirus in the upcoming ten days, based on previously confirmed cases recorded in China. The developed hybrid metaheuristic-based system comprises an adaptive neuro-fuzzy inference engine and two metaheuristic algorithms, namely the flower pollination algorithm and the salp swarm algorithm. The enhanced flower pollination algorithm was utilized to train the neuro-fuzzy inference system by optimizing its parameters, while the salp swarm algorithm was incorporated as a local search method to enhance the quality of the solutions obtained by the model. The results of the model implementation show a high capability of predicting the number of confirmed cases within the projected ten days. It was further established that the hybrid system, when compared with other methods, obtained superior performance in terms of root mean square error, mean absolute error, mean absolute percentage error, root mean squared relative error and coefficient of determination.
Ting et al. [4] explored the potential application of four inter-related digital technologies in combating the widespread transmission of the novel coronavirus. These technologies include the Internet of Things, big-data analytics, artificial intelligence and blockchain. The authors in [4] presented valid reasons why these four digital technologies can be employed to augment the already strained traditional public-health strategies for tackling COVID-19. Some of the conventional public healthcare strategies that have been put in place and are continually being used across the globe include (1) monitoring, surveillance, detection and prevention of COVID-19; and (2) mitigation of the impact on healthcare indirectly related to COVID-19. The authors further suggested that digital technologies can be helpful in the following ways. Internet of Things technology can provide a platform that allows public-health agencies access to data for monitoring the COVID-19 pandemic. Big data technology can be instrumental in providing opportunities for modeling studies of viral activity and for guiding individual countries' healthcare policymakers to enhance preparation for the outbreak. Blockchain technology can be vital in the manufacturing and distribution of COVID-19 vaccines once they are available; similarly, blockchain can be utilized to facilitate the delivery of patients' regular medication to the local pharmacy or the patient's doorstep. AI and deep learning technology can be used to enhance the detection and diagnosis of COVID-19, and the utilization of various AI-based triage systems could potentially alleviate the clinical load of physicians.
Vaishya et al. [5] highlighted the significant roles that new technologies such as artificial intelligence, Internet of Things, Big Data and Machine Learning are likely to play in the fight against new diseases and in the forecasting of future pandemics. The authors in Ref. [5] focused on presenting a brief review of the utilization of artificial intelligence platforms as a decisive technology to analyze, prepare for, prevent and fight COVID-19 and any other similar pandemics. In their findings, seven significant application areas of artificial intelligence technology were identified for tackling the spread of COVID-19. These areas, as mentioned in Ref. [5], include early detection and diagnosis of the infection, monitoring the treatment, contact tracing of individuals, projection of cases and mortality, development of drugs and vaccines, reducing the workload of healthcare workers, and prevention of the disease. Furthermore, the technology was also identified as having the capability to detect clusters of cases and predict the possible location of the virus spread by collecting and analyzing all previous data.
Leung and Leung [6] presented a discussion on the way forward in terms of crowdsourcing data to mitigate epidemics. The authors surveyed different and varied sources of possible line lists for COVID-19. The sources considered by the authors include data clearinghouses or secondary repositories and official websites or social media accounts of various Health Commissions at the provincial and municipal levels in mainland China. Some of the main bottlenecks attributed to the process of crowdsourcing were linked to the rigorous tasks involved in carefully collating as much relevant data as possible, sifting through and verifying the data, extracting intelligence to forecast and inform outbreak strategies, and thereafter repeating this process in iterative cycles to monitor and evaluate progress [6]. However, a possible methodological breakthrough in alleviating these challenges would be to develop and validate algorithms for automated bots to search through cyberspaces of all sorts, by text mining and natural language processing to expedite these processes. Next, we present a brief discussion of some applications of CBR to healthcare with a specific focus on its utilization for analysis, prediction, diagnosis, and recommending treatment for patients.
CBR is an appropriate methodology to apply in the diagnosis and treatment of a wide range of health issues. Research in CBR has grown considerably, from the early explorations in the medical field by Koton [7] and Bareiss [8] in the late 1980s to Gierl et al. [9] in the late 1990s. However, there are still shortcomings in the design and implementation of CBR, especially in the adaptation mechanism. Blanco et al. [10] reported the results of a systematic review of CBR applications in the health sector. In their work, the authors proposed enhancement procedures to overcome some of the limitations of CBR, focused on preparing the data to create association rules that help reduce the number of cases and facilitate the learning of adaptation rules.
CBR has equally received noticeable attention in disease prediction and diagnosis. In Ref. [11], a hybrid implementation of neural networks and case-based reasoning was proposed for the prediction of chronic renal disease among the Colombian population. The neural-network-based classifier, trained with the demographic data and medical care information of two population groups, was developed to predict whether a person is at risk of developing chronic kidney disease. The classifier identified about 3,494,516 people as being at risk of developing chronic renal disease in Colombia, roughly 7% of the total population.
Benamina et al. [12] proposed the integration of fuzzy logic and data mining techniques to improve the response time and accuracy of the retrieval step of case-based reasoning over similar cases. The Fuzzy CBR proposed in Ref. [12] is composed of two complementary parts, namely classification by a fuzzy decision tree realized with FisPro, and case-based reasoning realized with the jCOLIBRI platform. The main function of the fuzzy logic is to reduce the complexity of calculating the degree of similarity that can exist between diabetic patients who require different monitoring plans. The authors compared their results with some existing classification methods using accuracy as the performance metric. The experimental results generated by the proposed system revealed that the fuzzy decision tree is very effective in improving the accuracy of diabetes classification and hence the retrieval step of CBR.
Ozturk et al. [23] adopted a deep learning technique for the task of detecting COVID-19 by proposing the use of you only look once (YOLO) in combination with DarkNet. Their approach successfully classified cases of COVID-19 through binary (COVID vs No-Findings) and multi-class (COVID vs No-Findings vs Pneumonia) classifications, which yielded an accuracy of 87.02%. Although the deep learning approach is gaining attention among researchers, it has not obtained widespread deployment for practical use compared to CBR systems. Besides, some studies which have combined CBR with deep learning have leveraged the latter for the acquisition of domain knowledge or extraction of feature weights, while the former performs the role of detection [24,25]. Although the combined approach yielded good results [52], it is often limited by unavailable datasets, especially in the case of the COVID-19 disease [54].

Proposed approach
A detailed presentation of the methods adopted and adapted in this study is covered in this section: an overview of the entire approach, feature extraction using a natural language processing technique, the formalism of cases in the proposed case-based reasoning (CBR) method, and lastly the CBR engine.

An overview of the approach
The proposed NLP-Ontology-CBR method accepts a text-based patient file as input and then extracts and formalizes the new case using an ontology representation, as shown in Fig. 1. The extracted case features are passed to the domain-based feature extraction component, which maps each feature extracted at the previous layer to domain-based features. The extracted and mapped features are formalized using a description logic (DL) based knowledge representation format to allow for efficient computational operations in the CBR engine. Finally, the formalized features are passed on to the CBR engine as a new case (nc) that supports the application of the CBR reasoning paradigm.
The pipeline of information flow and processing described in Fig. 1 was therefore adapted to detect positive COVID-19 cases from the early stage to the advanced stage. The following subsections detail the components of the framework.
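The information flow just described (raw text, feature extraction, domain mapping, formalized new case) can be sketched as a chain of placeholder functions. All names and the single toy feature below are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class NewCase:
    """A formalized new case (nc) handed to the CBR engine; structure is illustrative."""
    features: dict = field(default_factory=dict)

def extract_raw_features(text: str) -> dict:
    # Stand-in for the NLP feature-extraction component of Fig. 1.
    return {"fever": "fever" in text.lower()}

def map_to_domain_features(raw: dict) -> dict:
    # Stand-in for the domain-based feature mapping component.
    return {name.capitalize(): value for name, value in raw.items()}

def formalize(mapped: dict) -> NewCase:
    # Stand-in for the description-logic-based formalization step.
    return NewCase(features=mapped)

def pipeline(patient_record: str) -> NewCase:
    return formalize(map_to_domain_features(extract_raw_features(patient_record)))

nc = pipeline("Patient presents with fever and dry cough.")
```

Each stand-in would be replaced by the corresponding component of the framework; the sketch only shows how the stages compose.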

The NLP method for feature extraction
The proposal in this study is a text-CBR (TCBR) whose datasets were derived from patient medical records archived in natural language. Since NLP techniques have achieved outstanding performance for textual CBR, we decided to build on the state-of-the-art approach by using NLP to drive a better case representation through feature extraction. Natural language processing is an exciting and relevant aspect of artificial intelligence (AI) with a wide range of applications in medicine, given also the large number of text-based documents on the internet. Moreover, electronic health record (EHR) systems are now pervasive and are provided as services to other automated healthcare delivery systems. The NLP method for feature extraction described in this section adopts some components and algorithms from Dasgupta et al. [26]. The ontology learning method is widely used for mining information from natural language text to generate an ontology representation of the mined data. Such an ontology representation provides formal expressivity and a platform for reasoning over a natural-language text document. Although this study assumes a similar procedure, we implemented a skeletal outline of the entire procedure. Fig. 2 shows the modified model of the patient text-based medical record natural language processing (NLP) and feature extraction pipeline. The model is called a pipeline because it processes raw file-based text (in the English language) through different procedures which eventually yield the features (COVID-Fs) for further processing in the CBR engine.
The following is a breakdown of the components of the NLP processing pipeline, as shown in Fig. 2

File Loader and Text Input (FLTI):
The FLTI is a simple component that supports file-format and safety authentication, file loading, and unloading of the text content into a buffer.

NL Pre-processing (NL-P):
The second component consists of sub-modules named Spelling Checker, Lexical Normalizer, and Sentence Normalizer, which pre-process the text buffered in the FLTI layer. Generally, the NL-P carries out operations such as spelling correction, tokenization, sentence boundary detection, text singularization, POS-tagging, co-reference resolution, and named-entity recognition (NER) by leveraging the Stanford CoreNLP toolkit [27]. Applying NL-P to the buffered text allows the spell-corrector to scan through the complete buffer and correct wrongly spelt words, enabling efficient and intelligent mining of features from the buffered text (the improved output of FLTI). This output was then converted into a list of sentential forms (SFs), sorted according to their appearance in the original document. In each SF, we attempted to normalize plural forms of its constituents into singular forms through the use of a singularizer. The SFs were extracted from the buffered text using the sentence boundary detector, annotated with POS tags, and preserved in an orderly manner to sustain the semantics of the health records. Meanwhile, due to the translation task from raw text to ontology format, we further employed NER models to identify and mark entities and thereafter their instances, which form the elements of the taxonomy box (TBox) and assertion box (ABox), respectively, in the resulting ontology. Once the SFs had been pre-processed, we applied them to the next sub-module, the lexical normalizer (LN). The LN identifies quantifiers and special symbols (such as >, <, =, +, -, and other medical symbols which may carry meaning in usage) of subjects/objects appearing in the SFs.
Our approach in LN allows such quantifiers/numeric representations and symbols to be normalized into normal forms supportive of the token-to-feature translation in the RTCF component of Fig. 2. The role of the sentence normalizer (SN) is to ensure that complicated sentences are broken down into simple forms, so that an element of SFs, say sf_i, is normalized into simpler forms conforming to the template of the NSC component discussed later. The resulting simplified sentences then replace sf_i in SFs.
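A minimal Python sketch of these pre-processing steps (sentence boundary detection, lexical normalization of symbols, plural singularization, and tokenization into an irregular 2D token array) is given below. It uses simple regular expressions as a stand-in for the Stanford CoreNLP toolkit used in the paper, and the symbol map and singularization rules are deliberately crude assumptions:

```python
import re

# Illustrative symbol-to-word map; the paper's LN handles a richer medical symbol set.
SYMBOL_MAP = {">": "greater than", "<": "less than", "=": "equal to",
              "+": "positive", "-": "negative"}

def split_sentences(text):
    """Naive sentence boundary detection (stand-in for the CoreNLP detector)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def singularize(token):
    """Very rough plural-to-singular normalization, for illustration only."""
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def lexical_normalize(sentence):
    """Expand quantifier symbols so later token-to-feature translation sees words."""
    for symbol, word in SYMBOL_MAP.items():
        sentence = sentence.replace(symbol, f" {word} ")
    return re.sub(r"\s+", " ", sentence).strip()

def preprocess(text):
    sfs = [lexical_normalize(s) for s in split_sentences(text)]
    # Tokenize each normalized sentence into an irregular 2D array of tokens.
    return [[singularize(t) for t in re.findall(r"[A-Za-z0-9]+", s)] for s in sfs]

tokens = preprocess("Temperature > 38. Symptoms include coughs and headaches.")
```

The nested list produced here corresponds to the irregular 2D token array described for the NSCaT component.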
Normalized Sentence Component (NSC): Based on the structural formation of a sentence in the English language, Dasgupta et al. [26] described a particular template or syntax in their study. We adopted two of the templates, namely the simple and complex sentences, with the following notation: Q (underlined) indicates an optional component with at most one occurrence in the template, e.g. quantification; M* (underlined with an asterisk) indicates zero or more consecutive occurrences in the template, e.g. adjectives; Q_1 is a subject quantifier that includes lexical variations of the set {a, an, the, some, all}; Q_2 is an object quantifier that includes lexical variations of the set {the, some, all}. Finally, we ensured that all the sentences in SFs were adapted to this template, and then applied the Template-Fitting algorithm of [26] to all elements of SFs.

Normalized Sentence Component as Token (NSCaT):
The CBR engine to be described in Sub-section 3.4 does not expect input in sentential format but tokenized features that maintain sentence syntax and semantics. Therefore, each sf_i in SFs is further tokenized into a list of raw (un-normalized) tokens of the form t_ij, where i represents the position of the sentence in SFs and j represents the position of the token in sf_i. The output of NSCaT is therefore an irregular 2D array of raw tokens.
Mapping Tokens to Domain Knowledge (MTDK): We assumed that not all tokens from NSCaT are correctly represented with respect to the domain knowledge. As a result, we proposed an MTDK layer aimed at mapping each token in the NSCaT output to its correct recognized name in the domain. We relied strongly on WordNet and the domain-based lexicon model in this study, as shown in Fig. 3. The role of the WordNet lexicon is to generate all likely synonyms of each t_ij in NSCaT; each t_ij thus indexes into a sub-array of its synonyms. Thereafter, our mapping algorithm aligns each t_ij to its respective sub-array.
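The mapping idea can be sketched with a tiny hand-rolled synonym lexicon standing in for WordNet; the entries and domain terms below are illustrative assumptions, not the study's actual lexicon:

```python
# Tiny synonym lexicon standing in for WordNet; entries are illustrative only.
SYNONYMS = {
    "pyrexia": ["fever", "febrile"],
    "dyspnea": ["breathlessness", "shortness of breath"],
}

# Hypothetical recognized names in the COVID-19 domain lexicon.
DOMAIN_TERMS = {"fever", "cough", "shortness of breath"}

def map_token_to_domain(token):
    """Align a raw token t_ij to its recognized domain name via synonym expansion."""
    for candidate in [token] + SYNONYMS.get(token, []):
        if candidate in DOMAIN_TERMS:
            return candidate
    return None  # no recognized counterpart for this token in the domain
```

In the actual framework, the synonym sub-arrays would be generated by WordNet rather than declared by hand.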

Represent Tokens as COVID-19 Features (RTCF):
The output of MTDK is further refined to conform to the standard feature categorization and typing listed in Table 1. The implication is that we attempted to extract known features of COVID-19 from the output of MTDK and assigned their values, as illustrated in Fig. 4.
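A sketch of this token-to-feature translation is shown below; the feature names, patterns, and types are hypothetical stand-ins in the spirit of Table 1, not the paper's declared schema:

```python
import re

# Hypothetical feature schema in the spirit of Table 1 (names/patterns assumed).
FEATURE_PATTERNS = {
    "Fever": (r"\bfever\b", bool),
    "Cough": (r"\bcough\b", bool),
    "Temperature": (r"temperature (?:of )?(\d+(?:\.\d+)?)", float),
}

def sentence_to_features(sentence):
    """Assign a typed value to each declared feature found in a normalized sentence."""
    features = {}
    for name, (pattern, ftype) in FEATURE_PATTERNS.items():
        match = re.search(pattern, sentence.lower())
        if match:
            # Boolean features record presence; typed features parse the captured value.
            features[name] = ftype(match.group(1)) if match.groups() else True
    return features

fs = sentence_to_features("Patient reports cough and a temperature of 38.5")
```

The resulting feature-value pairs are what would be buffered in the RFB component for ontology formalization.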

Raw Features Buffer (RFB):
This last component simply buffers the output of raw features collected from previous layers. The RFs buffered in RFB are then translated into ontology formalism described in Section 4.2.
The features described in Table 1 were based on recent studies on COVID-19 that were discussed by Michelen et al. [28] and Yang et al. [55].

Ontology-based formalization of extracted features
In this stage of our proposed CBR framework, we processed the raw features buffered in the RFB component of Fig. 2 into an ontology formalism. Recall that the proposed framework relies on the CBR paradigm to reason over the cases presented to it; hence, each case was modeled using a formalism supporting computational reasoning operations. Fig. 4 illustrates a case denoted Case N. We assumed that, based on clinical protocols of COVID-19, a case representation must have a relationship to Diagnosis Case (Suspected, Confirmed, Presumed status); Symptoms (as listed in Table 1); Epidemiology (as listed in Table 1); Radiology/Laboratory manifestations (as listed in Table 1); Clinical Diagnosis (Mild, Acute, Severe); and Treatment (as listed in Table 1). Each case of COVID-19 extracted by the NLP pipeline described in Fig. 2 was formalized into this structure, as illustrated in Fig. 4. The Diagnosis Case entity assumes a one-to-one (1-1) relationship with every case; Symptoms presents a one-to-many (1-M) relationship for each case; the Epidemiology entity likewise allows each case a 1-M relationship; the Radiology/Laboratory manifestations entity also presents each case in a 1-M relationship, given the number of laboratory tests and radiological operations that might be performed for each case; Clinical Diagnosis allows only a 1-1 relationship because a case can assume only one of the listed clinical states; finally, the Treatment entity allows a 1-M relationship because one case may respond to one or more treatments/therapies administered to it.
Moreover, each entity illustrated in Fig. 4 consists of variables/features which are expected to have values. For instance, the Symptoms entity may have variables/features like Cough, Fever, Chest Pain, and others. Each of those variables is expected to take values of a particular data type. The potential data types captured in Fig. 4 are numeric, nominal, ordinal, datetime, and Boolean (which covers the majority of variable values in the representation).

The CBR model
All previous stages of the proposed CBR-based framework may be classified as data/input pre-processing and formalization operations. However, the main reasoning task is embodied in the CBR engine described in this section. First, we briefly present the clinical classification of COVID-19 cases.
Mild case: Upper respiratory symptoms such as pharyngeal congestion, sore throat, and fever for a short duration, or asymptomatic infection; positive RT-PCR test for SARS-CoV-2; no abnormal radiographic or septic presentation.
Moderate case: Mild pneumonia; symptoms such as fever, cough, fatigue, headache, and myalgia; and absence of complications and manifestations related to severe conditions.
Severe case: A case presenting with the mild or moderate clinical features described above; rapid breathing (≥70 breaths per min for infants aged <1 year; ≥50 breaths per min for children aged >1 year); hypoxia; lack of consciousness, depression, coma, convulsions; dehydration, difficulty feeding, gastrointestinal dysfunction; myocardial injury; elevated liver enzymes; coagulation dysfunction, rhabdomyolysis, and any other manifestations suggesting injuries to vital organs.
Critical illness case: Respiratory failure with need for mechanical ventilation; persistent hypoxia that cannot be alleviated by inhalation through nasal catheters or masks; septic shock; organ failure that needs monitoring in the ICU; and acute respiratory distress syndrome (ARDS). Cases presenting with ARDS may show: i. Mild ARDS: 200 mmHg < PaO2/FiO2 ≤ 300 mmHg in non-ventilated patients or in those managed through non-invasive ventilation. These clinical types of COVID-19 have been described to allow for their use in the CBR engine, which is described below.
The CBR method is a reasoning paradigm that depends on a knowledge base of archived cases with proven and tested valid solutions for handling new cases/problems that may share features with those archived. As earlier stated, this study builds on this paradigm to carry out the detection and diagnosis of COVID-19 both in patients manifesting symptoms of the disease and in those presenting asymptomatically. Fig. 5 illustrates our concept of the CBR engine embedded in Fig. 1. The major components of the model are similar to those of the conventional CBR model, which usually consists of the RETRIEVE, REUSE, REVISE, and RETAIN steps (the 4Rs). In addition, the model shows the knowledge base of archived cases, which allows for carrying out computational reasoning on the new case presented. The distinctiveness of our proposed CBR model lies in its ability to model its cases using an ontology formalism, as well as to measure the similarity of cases using the features listed in Table 1 together with two other important factors: time (temporal) and location (spatial). We detail the operations at each level of the 4Rs in the following discussion.

Retrieve
Based on the general concept of the CBR paradigm, the RETRIEVE procedure simply uses an efficient distance or similarity computation model such as the Euclidean distance, Cosine similarity [30], or Manhattan distance. Our approach for the RETRIEVE algorithm is as follows. Consider a new case nc and an archive of stored cases in the CBR knowledge base SC = {sc_1, sc_2, sc_3, …, sc_n}, such that the CBR model retrieves the most similar sc_i (or some sc_i) from SC. The retrieval of some sc_i depends on Eq. (1): the smaller the value of Sim(nc, sc_i), the more acceptable the case sc_i becomes for adoption in REUSE. The RETRIEVE step proceeds as follows:
i. A query generator and parser construct a query that fetches all similar cases from the case archive SC. The queries are generated based on the features extracted in the previous stage of the framework, described in Fig. 8.
ii. The Semantic Query-Enhanced Web Rule Language (SQWRL) (detailed later) is employed for modeling the query constructed in the preceding step.
iii. The output of the SQWRL query is sorted from the most similar to the least similar cases. Cases are assumed to be similar if their measure of likeness (based on the corresponding features) is non-negligible. The smaller the value of the similarity/distance metric, the higher the likelihood that the new case (nc) closely matches sc_i or some sc_i, while the bigger the value, the lower its tendency to match nc.
iv. Hence, our problem can be modeled as a classification problem whereby some sc_i ∈ SC are classified as sharing similarity with nc, while another class of sc_i ∈ SC are categorized as dissimilar cases based on a threshold. For instance, if some sc_i ∈ SC yields a meagre similarity value, say k where k < (threshold/4), then such cases are not used.
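The retrieval procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: cases are reduced to plain feature sets, the distance-like value Sim(nc, sc_i) is approximated by the proportion of non-shared features, and the acceptance threshold is an arbitrary choice.

```python
def retrieve(nc_features, archive, threshold=0.5):
    """Rank archived cases by a distance-like similarity value
    (smaller = more similar) and keep only acceptable ones.

    nc_features : set of feature names of the new case (nc)
    archive     : dict mapping case id -> set of feature names (SC)
    """
    scored = []
    for case_id, sc_features in archive.items():
        union = nc_features | sc_features
        common = nc_features & sc_features
        # Proportion of non-shared features: 0 when the cases are identical.
        sim = 1.0 - len(common) / len(union) if union else 0.0
        scored.append((sim, case_id))
    scored.sort()  # most similar (smallest Sim) first
    return [(case_id, sim) for sim, case_id in scored if sim <= threshold]
```

For example, `retrieve({'cough', 'fever', 'anosmia'}, {'sc1': {'cough', 'fever'}, 'sc2': {'rash'}})` retains only sc1, since sc2 shares no features with the new case.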
• Cosine similarity: This metric measures the normalized dot product of the two feature vectors compared. Because the cosine yields 1 for 0° and less than 1 for other angles, a cosine similarity of 1 signals that features A and B belong to similar cases, while a value of −1 indicates non-similarity. Eq. (2) models Cosine similarity, which has strong application to data with sparse vectors. Moreover, Cosine similarity (CS) can outperform Euclidean distance (ED): where ED judges two cases to be distantly similar, CS might observe a closer similarity between them based on their orientation.
where ||x|| and ||y|| are the Euclidean norms of the vectors x = (x_1, x_2, …, x_p) and y = (y_1, y_2, …, y_p), respectively.
• Manhattan distance: Another similarity/distance metric, also known as the Manhattan length, which measures the distance between points along axes at right angles. Eq. (3) models the Manhattan distance.
MD = |x_ak − x_bk| + |y_ak − y_bk|
• Other similarity measures are the Jaccard similarity (used for sets), the inverse exponential function, and the Minkowski distance; the inverse exponential function is given in Eq. (4). Now, because the cases in the proposed framework were modeled/formalized in an ontology representation, we needed a quantitative measure of similarity between features of cases; hence the need for ontology-based semantic similarity between terms. There are six (6) significant techniques for computing such similarities of features in ontologies: the ontology hierarchy approach, information content, semantic distance, the approach based on properties of features, the approach using ontology hierarchy, and hybrid methods [31]. Therefore, to compare two cases, we make the following assumptions: i. Two cases are similar if their ontologies demonstrate similarities in both feature values and structure (of their ontological representations). ii. An arbitrary weight w_i is assigned to each property (object and data properties), and these weights may sum up to at most a particular maximum value, say M1, in order to represent a case; for example, all object properties as shown in Fig. 4. However, a case may present n features (related to data properties on the second lower level) associated with the presentsSymptom object property, e.g., a case with the symptoms cough, fever, anosmia, and ARDS. We therefore compute d_i·f·w_i, where f denotes the weight of each symptom (cough, fever, anosmia, ARDS) and d_i the weight of the data property. This is shown in Eq. (6), which sums all (d_i·f·w_i) terms. We normalized the weight summations at each level to 1, so that the product across the three levels evaluates to 1 × 1 × 1 = 1.
Note: all our object properties (such as presentsSymptom and hasEpidemiology) and data type properties are detailed in Section 4.3. Having established our distance/similarity functions and the underlying assumptions for case retrieval, we now give the formula for computing the similarity of archived cases to the new case (nc). This study adopts the approach based on properties and features described in Ref. [32]. The adapted similarity measure is that of Tversky (1977), as shown in Eq. (5). The model in Eq. (5) assumes that nc and sc_i are cases whose features are collected in D_1 and D_2, respectively. The similarity between nc and sc_i is then computed using three components of Eq. (5): features distinct to nc relative to sc_i, features distinct to sc_i relative to nc, and features common to nc and sc_i, where 0 ≤ μ ≤ 1 is a function defining the relative importance of the non-common features. D_1 and D_2 represent the target and the base, respectively, while |·| stands for the cardinality of a set. Although we adopt the similarity model in Eq. (5), we observed its limitation: it omits the effect of weights on the selected features. We therefore modified Eq. (5) so that we use not the elements of each set alone, but the weight-values of the elements in each D_i as computed by Eq. (6). Our modification of Eq. (5) is shown in Eq. (7); afterwards, the most similar sc_i is RETRIEVED and forwarded to REUSE after applying Eq. (8), where D_1 and D_2 are computed using Eq. (6). According to Eq. (6), the resulting value of D is numeric. Hence, D_2 ∩ D_1 represents the computation of the values of the data property (d_i), the object property (w_i), and all f after obtaining the features common to nc and sc_i. Similarly, D_1 \ D_2 represents the computation of the values of d_i, w_i, and all f after taking the set difference of the features of nc and sc_i.
Hence, in both cases, we compute values resulting from obtainable features from their respective set operations.
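Since Eqs. (5)-(7) are not reproduced here, the following sketch shows one plausible reading of the weighted, distance-like similarity: the weighted mass of non-shared features, balanced by μ, relative to the total weighted mass. Collapsing the d_i, f, and w_i levels into a single weight per feature is a simplifying assumption of ours, not the paper's exact formulation.

```python
def weighted_mass(features, weights):
    """Eq.-(6)-style numeric value of a feature set D: the sum of the
    per-feature weights (one collapsed weight per feature, an assumption)."""
    return sum(weights.get(f, 0.0) for f in features)

def sim(nc, sc, weights, mu=0.5):
    """Distance-like similarity in the spirit of the modified Tversky
    model: 0 when all weighted features are shared, larger otherwise."""
    total = weighted_mass(nc | sc, weights)
    if total == 0:
        return 0.0
    # mu balances the two set differences (the non-common features).
    distinct = (mu * weighted_mass(nc - sc, weights)
                + (1 - mu) * weighted_mass(sc - nc, weights))
    return distinct / total
```

Identical feature sets thus yield Sim = 0, the most-similar value, matching the retrieval rule stated above.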
Furthermore, since Sim(nc, sc_i) represents the similarity between a new case (nc) and an arbitrary archived case sc_i, we can compute the similarity score (SS), also known as the degree of similarity between nc and sc_i, using Eq. (8).
Similarity Score (SS) = T − Sim(nc, sc_i), where T is a pre-computed threshold value representing the maximum summation of all possible features a case can have; in our case, T = 1. The closer Sim(nc, sc_i) is to zero (0), the more similar nc and sc_i are. Hence, cases with SS close to T are similar to nc and are therefore retrieved. Our approach finds the cases with the highest similarity to the new case by discarding cases with low similarity values; the solutions of the retained cases are then utilized to solve the problem. The higher-similarity condition is enforced by Eq. (9), which narrows the number of retrieved cases such that only those with SS ≥ SS_acceptable are kept. We assumed that since the similarity measures of the best similar cases should be close to 0, the acceptable SS (SS_acceptable) should suffer only minimal reduction, so that it tends close to T. Logically, 0.9 is closer to 1.0 than 0.5; therefore, a retrieved case with SS close to 1.0 is enlisted as an element of SS+, while the remaining retrieved cases are classed into SS−.
where α is a function that keeps SS_acceptable close to T. If no case is retrieved by Eqs. (7) and (8), we conclude that nc may not have any similar case.
We further compute SS for positive cases and apply Eqs. (10) and (11) to determine the outcome: when SS_covid19+ > SS_covid19−, the case is classified as a positive case of COVID-19, while if SS_covid19+ < SS_covid19−, the case is concluded to be negative. However, SS_covid19+ = SS_covid19− indicates an inconclusive diagnosis, necessitating the retrieval of more similar case(s).
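The scoring and decision rule of Eqs. (8), (10), and (11) amount to the following sketch (T = 1 as stated above; the function names are ours, not the paper's):

```python
def similarity_score(sim_value, T=1.0):
    """Eq. (8): SS = T - Sim(nc, sc_i); an SS close to T flags a similar case."""
    return T - sim_value

def classify(ss_covid19_pos, ss_covid19_neg):
    """Decision rule of Eqs. (10)-(11): compare the scores drawn from
    COVID-19-positive and COVID-19-negative archived cases."""
    if ss_covid19_pos > ss_covid19_neg:
        return "positive"
    if ss_covid19_pos < ss_covid19_neg:
        return "negative"
    return "inconclusive"  # equal scores: retrieve more similar case(s)
```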

Reuse
The REUSE procedure allows the system to modify the RETRIEVED cases sc_i so that only one similar case remains. The similar case is constructed to maintain an ontology structure similar to that of nc. This is achieved by rebuilding an anonymous case (ac), extracting all similar features of the cases presented in sc_i until ac assumes the form of nc. The modified ac is then presented as a temporary solution to nc. The approach proposed here differs from the methods used by Gu et al. [33], which relied on clinical protocol guidelines and medical experts, respectively. The ac case is therefore considered a solved case, which is passed on to the REVISE step for processing.
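The construction of the anonymous case can be sketched as follows; representing cases as flat feature-value dictionaries is our simplification of the ontology structure described above.

```python
def reuse(nc_features, retrieved_cases):
    """Sketch of REUSE: pool features shared with nc from the retrieved
    cases (most similar first) into an anonymous case (ac)."""
    ac = {}
    for case in retrieved_cases:
        for feature, value in case.items():
            # Keep only features nc also has, taking the first (most
            # similar) case's value for each.
            if feature in nc_features and feature not in ac:
                ac[feature] = value
    return ac  # temporary solution presented for nc
```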

Revise
The evaluation of the ac case at this stage ensures that the summation of the case features of the proposed solution case is not greater than 1. If it evaluates to more than 1, some non-essential features are dropped and the weights of the remaining features are recomputed until an appropriate value is obtained. The revised and evaluated case becomes a candidate case for use, called the repaired case (rc). Furthermore, rc is then used to solve the new problem nc presented to the system, by offering it as a candidate solution for adaptation by the physician. The solution to nc is then passed to RETAIN.
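The check-and-repair loop just described can be sketched as follows; which features count as "non-essential" and the drop order (lightest weight first) are illustrative assumptions.

```python
def revise(ac_weights, essential=()):
    """Sketch of REVISE: drop non-essential features while the summed
    weights exceed 1, then renormalize whatever remains if still above 1."""
    weights = dict(ac_weights)
    for feature in sorted(weights, key=weights.get):  # lightest first
        if sum(weights.values()) <= 1.0:
            break
        if feature not in essential:
            del weights[feature]
    total = sum(weights.values())
    if total > 1.0:  # recompute the remaining weights
        weights = {f: w / total for f, w in weights.items()}
    return weights  # the repaired case (rc)
```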

Retain
Finally, the RETAIN procedure simply stores the solution to nc as a learned case that is fit to be added to the knowledge base of the CBR model for future use. However, not all new cases need to be stored. We propose a retention policy that retains only solutions showing a tangible difference from, or improvement over, the solutions already archived in the system.

Algorithm for case retrieval
Algorithm 1 details the complete procedure outlined in the preceding subsections and describes how a new case of COVID-19 is classified as a positive or negative case using the CBR method. The input to the algorithm is the EHR of the new case, and the output is the Diagnosis Case (Suspected, Confirmed, Presumed status).

Algorithm 1. Pseudocode using NLP-Ontology CBR framework for detecting and diagnosing COVID-19
In addition, Algorithm 2 outlines the procedures for ontology learning/population to formalize suspected cases presented in natural-language representation. The procedures described in Algorithm 2 were defined at a high level in Algorithm 1. Whereas Algorithm 1 describes the flow of data in the framework of Fig. 1, Algorithm 2 details the task of translating raw text in natural language (English) into an ontology formalism (ontology learning). The task of ontology learning here is simply to learn terms/concepts and their instances from raw natural-language text. The learned concepts are encoded in the terminology box (Tbox), while their instances and assertions (class and object) are encoded in the assertion box (Abox). Although Section 4.3 describes the domain ontology (largely the Tbox) engineered in this study, we note that this does not include the formalization of unknown suspected cases of COVID-19, which the framework needs to translate into a feature-based representation.
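As a toy illustration of the Tbox/Abox split (not the paper's Algorithm 2, whose concepts live in an OWL ontology), extracted feature instances can be asserted against learned concepts like this; the concept names in the example Tbox are assumptions:

```python
def populate_abox(features, tbox):
    """Assert each extracted feature instance under its Tbox concept;
    terms absent from the Tbox are left unformalized."""
    abox = []
    for feature, value in features.items():
        concept = tbox.get(feature)
        if concept is not None:
            abox.append((concept, feature, value))  # class assertion + value
    return abox

# Illustrative Tbox: learned concepts mapped to their parent classes.
tbox = {"Cough": "Symptom", "Fever": "Symptom", "TravelHistory": "Epidemiology"}
abox = populate_abox({"Cough": True, "Fever": True, "Rash": True}, tbox)
# "Rash" is skipped: it is not a learned concept in this toy Tbox.
```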

Algorithm 2.
Pseudocode detailing the ontology learning/population approach for case-to-feature representation

Experimentation
In this section, the clinical data and experimentation environment used in this study are described. In addition, we develop the domain ontology (for COVID-19 and other related diseases) and the case-based ontology for new cases. Finally, we demonstrate the implementation of the framework shown in Fig. 1.

Clinical data
The COVID-19 pandemic is currently a global emergency with limited access to health facilities and computerized patient records, which could otherwise have provided datasets for computational research. Although there are statistics-based datasets accessible in structured, semi-structured, and unstructured forms (e.g., from the WHO, Johns Hopkins University, mainstream news media, and even social media), such datasets remain unfit for tasks like the one described in this study. After a thorough search for publicly available, benchmarked, EHR-based patient datasets of COVID-19 yielded none, we decided to curate a new COVID-19 dataset from open data in standard domains.
The curated data was obtained from the Italian Society of Medical and Interventional Radiology (SIRM). SIRM is a scientific association which includes the majority of Italian radiologists and aims to encourage the progression of diagnostic imaging by promoting studies and research. The data source (https://www.sirm.org/en/italian-society-of-medical-and-interventional-radiology/) listed English-like records (itemizing age, symptoms and signs manifested, and other laboratory details) and CT scans for each of the seventy-one COVID-19 patients (67 adult cases and 4 pediatric cases, the latter lumped into one case). We anonymized and cleaned the datasets where necessary, extracted the essential information, and stored it in a format appropriate for this study. Fig. 6 shows snapshots of a randomly selected case.
The data needed for this study is the EHR-based dataset in natural language (NL) format. Hence, we focused on processing the English-like statements extracted for each patient, leaving the image-based data for a future study using a deep learning approach for the classification of COVID-19 cases.
A careful examination of the curated datasets revealed that only 2 cases (case numbers 51 and 60) were confirmed negative, 19 cases (case numbers 6, 13, 14, 15, 17, 33, 38, 45, 47, 53, 54, 58, 59, 62, 66, 67, and three pediatric cases) were presented as unconfirmed, and the remaining 50 cases were confirmed positive. We therefore modeled the 2 negative and 50 positive cases accordingly in the archive of the CBR engine. The 19 unconfirmed cases served as input to our framework. Furthermore, we normalized the medical records of the positive and negative cases, removed the diagnoses made by physicians, and passed each of them as input into the proposed CBR framework. This allowed us to subject our system to an examination similar to that carried out by the experts.

Computational environment setup
The implementation was carried out on a personal computer with an Intel(R) Core i5-4210U CPU at 1.70 GHz (up to 2.40 GHz), 8 GB of RAM, and the Windows 10 OS. Furthermore, we deployed Anaconda shipped with Python 3.7.3 and Spyder 3.3.6, and also installed NetBeans IDE version 8.1. The Python platform allows for the implementation of the NLP feature extraction pipeline shown in Fig. 2, while the NetBeans IDE provides support for implementing the feature-to-ontology representation and the CBR engine. Modeling of ontologies in this study was achieved using Protégé.

Domain ontology modeling
Ontologies are a formalism for the specification of concepts or the abstract description of a system in a domain-specific knowledge composition. Ontologies stem from description logic (DL), and with their support for reasoning they have received increasing attention in computational biology and bioinformatics. There are different ontology languages such as RDF/RDFS, DAML+OIL, and OWL. OWL is a DL-based ontology language with high expressivity and has three variants: OWL-DL, OWL-Full, and OWL-Lite. This study models ontologies using OWL 2 [34,23], an improved version of OWL (sometimes known as OWL 1). We have modeled three different ontologies: the first represents domain knowledge, the second is a formalism of the archived cases, and the third formalizes new cases.
In Fig. 7, we show a visualization of the ontology representing a new case (nc). The ontology captures concepts/classes such as Symptoms, ClinicalDiagnosis, ClinicalManifestation, Epidemiology, RadiologyFeatures, LaboratoryFeatures, Case (to denote a new case), DiseaseCase, Cause (to capture the likely causes of the disease in a case), and Treatment (which represents treatments administered to a case). Each of these concepts is related/linked to another concept by a property (object property), with almost all concepts linked to Case. To the right is a list of the object properties. For example, the line connecting Case to Symptoms is the object property presentSymptom; Case is the domain, while Symptoms is the range of this property. Some concepts have a + symbol at the top-left corner of their bounding boxes, indicating that the concept/class has subclasses which can be revealed by clicking on the + symbol. Case formalization is therefore made possible through the case-based ontology file shown in Fig. 7. While that illustrates a case of COVID-19, we made a further effort to use an ontology approach to model the archive of stored cases in the CBR engine. To achieve this, we represented the structure and the semantics of the information content of the archive using the ontology visualized in Fig. 8. As mentioned earlier, the ontology file was modeled and visualized in Protégé. The ontology consists of 459 axioms, 225 logical axioms, 213 declaration axioms, 196 classes, 11 object properties, 8 data type properties, 181 subclasses, and 15 instances (excluding the cases of COVID-19, which form the archive of cases in the ontology). Fig. 8 captures the is-a relationships existing among classes, and Fig. 9 outlines all the classes (and their hidden subclasses), instances (individuals) of the declared classes, and object and datatype properties.
Now that we have a formalism for archiving all cases in the proposed framework and a formalism for modeling new cases extracted from electronic records, we consider how the proposed approach implements its query for similar cases, as modeled mathematically in item A of subsection 3.4. To achieve an optimized and effective query of cases from the archive, we constructed our queries from the mathematical models presented earlier using the Semantic Query-Enhanced Web Rule Language (SQWRL), pronounced "squirrel". SQWRL is a query language native to OWL; it is SWRL-based, with SQL-like syntax and operators for extracting information from OWL ontologies [35,36]. We chose SQWRL over SPARQL because of its suitability for OWL ontologies: it does not require serializing our OWL ontologies into RDF/RDFS, an operation that often causes a knowledge base (ontology) to lose some semantics and expressivity. Moreover, the rule form of SQWRL and its compatibility with the rule language SWRL allow our framework to use an inference engine, thereby improving the knowledge base through inference. Protégé also provides a tab for executing SQWRL queries against the ontology through the SQWRLTab plugin. We may take advantage of this tab to test our generated queries, although the framework proposed in this study has a mechanism for executing queries automatically through the OWL API.
Now, for instance, given a new case (nc) presenting with the following features according to their category, we might be interested in translating our mathematical model into an SQWRL query such that similar cases are retrieved: Symptoms (Cough, Temperature, Nausea and vomiting, Shortness of breath, Contact with case(s)); Laboratory Features (Neutrophil, Lymphocyte, Active partial thrombin time). The following conditions can be assumed for our case: retrieve all cases, ordered by their similarity value (in descending order), which have values for all or some of these features. In Fig. 10, we present a sample SQWRL query for extracting cases similar to the new case described here.
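As an illustration of the shape of such a query (a sketch assembled from predicates named in the text, such as hasCaseID, presentsSymptom, and sqwrl:select; the remaining property and class names are our assumptions, not copied from Fig. 10):

```
Case(?c) ^ hasCaseID(?c, ?cid) ^
presentsSymptom(?c, ?s) ^ Cough(?s) ^ hasWeight(?s, ?ws) ^
hasLabFeature(?c, ?l) ^ Lymphocyte(?l) ^ hasValue(?l, ?lv)
-> sqwrl:select(?cid, ?s, ?ws, ?l, ?lv) ^ sqwrl:orderBy(?cid)
```

The conjunctive left-hand side matches archived cases that present the listed symptom and laboratory features, and the right-hand side selects the bound variables, ordered by case ID.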
The sample query in Fig. 10 was submitted to the Protégé application for execution against the underlying ontology through the SQWRLTab plugin. The syntax of the SQWRL query language aligns itself with the declared classes or entities, properties (both data type and object), and instances/individuals in the ontology. This is why the (unary and binary) predicate names in the query listed in Fig. 10 derive their values from the declared classes or entities, properties, and instances in the ontology; this positions the SQWRL query above the use of SPARQL. A detailed explanation of the query follows. The first line, Case(?c) ^ hasCaseID(?c, ?cid), extracts all cases and their case IDs from the CBR case archive and stores these two values in the ?c and ?cid variables. Lines 2-5 of the query select instances of the following symptoms, which were keywords/features extracted from the natural-language input above, together with their weight values: Cough, Temperature, Vomiting, and Shortness of breath. Our natural-language-based query also contains some laboratory features, whose values we extract for each of the cases retrieved by the query on line 1. We are also interested in the probable points/places of contact visited by the cases retrieved so far, which the next line fetches. Once all existing cases in the archive satisfying the above conditions have been retrieved, we further limit the cases to be extracted with the following conditions: the first and second lines ensure that the retrieved cases carry the time/date when the case manifested and when the patient either died or recovered, and the third line fetches for each case the result of its clinical diagnosis (positive or negative).
Finally, the respondedTo(?c, ?tr) predicate fetches the treatment options (if any) recorded against each case.
Once all these cases are matched by the rule-like left-hand side (LHS) of our query (modeled on the Semantic Web Rule Language, SWRL), the right-hand side (RHS) uses the sqwrl:select predicate to fetch, via the bound variables, all cases (and their attributes/features) satisfied by the LHS. Finally, we count the number of cases retrieved after ordering them by their case IDs. All cases retrieved by the sample query must have their features represented in the ontology for the query to be able to match them. Case representation is covered in Section 3 of this paper; however, we have captured in Fig. 11 a formalization of the sample patient record shown in Fig. 6. The case representation shown there is in the Protégé interface format, although the ontology notation is equally generated.

Implementation and experiments
The implementation of the proposed CBR framework adopted jCOLIBRI [37], a Java library containing APIs for implementing a CBR framework. We therefore used the Java programming language, integrated with jCOLIBRI, to achieve the steps of the CBR (shown in the right component of Fig. 12). Python was used to implement the Natural Language to Normalized Sentence Component (NL-NSC), and the combined use of the two languages made possible the implementation of the feature extraction and formalization components of Fig. 12.
The complete implementation of the proposed framework is accessible through a graphical user interface (GUI) designed for this study and shown in Fig. 12. The file loader and raw-text extraction components of Fig. 2 are implemented in the rightmost panel of Fig. 12, with a box and an 'Open Case File' button. Furthermore, the center panel, containing a box and a 'Map Case' button, captures the implementation of the NL-NSC, feature extraction, and feature formalization components. To achieve this, standard Python libraries and NLP libraries (like NLTK and Stanford CoreNLP) were richly employed to carry out the tasks of sentence disambiguation, spelling correction, lexical normalization, normalization of sentences into their corresponding structures or components, and tokenization of sentences to enhance the process of feature mapping. The feature mapping and formalization of cases in ontology format were achieved using the OWL API, the WordNet API, and the Pellet API (an OWL-based knowledge reasoning plugin), implemented through the combined use of Python and Java.
The extracted and mapped features presented us with the challenge of accurately extracting values from the processed patient record. For example, we could extract features like 'Fever', 'Temperature', and many others which mostly rely on syntactic and semantic parsing of the domain lexicon. The challenge, however, was detecting the semantics/meaning and context of usage of the features within the patient records. To circumvent this, we took advantage of the named entity resolution technique applied to the text.
At the sentential level, an attempt was made to search for values of a feature within the neighborhood of that feature. For instance, given the sentence 'The temperature of the patient was 38 °C', careful parsing using NLP techniques will reveal that the feature (temperature) has the value 38 °C. But consider the sentence '80-year-old male patient with fever and dyspnea.' There are two features in this sentence (fever and dyspnea) which have no explicit values assigned to them. For cases like these, we developed a sentiment analysis component which enabled us to detect whether such features were stated in the affirmative or negative form. The outputs of our sentiment analysis model were positive, negative, and neutral; these outputs were used accordingly to formalize the feature and its value (true or false, as shown in Fig. 9 and Table 1) in the ontology.
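The neighborhood search and the affirmative/negative check just described can be sketched as follows; the regular expression and the negation cues are illustrative assumptions, not the study's actual sentiment model:

```python
import re

# Numeric value expected within 40 non-digit characters of "temperature".
TEMP_PATTERN = re.compile(r"temperature\D{0,40}?(\d{2}(?:\.\d)?)", re.IGNORECASE)
NEGATION_CUES = ("no ", "without ", "denies ", "negative for ")

def extract_feature_value(sentence, feature):
    """Explicit numeric value near the feature if present; otherwise a
    polarity check (True = affirmed, False = negated, None = absent)."""
    if feature.lower() == "temperature":
        match = TEMP_PATTERN.search(sentence)
        if match:
            return float(match.group(1))
    lowered = sentence.lower()
    if feature.lower() not in lowered:
        return None
    if any(cue + feature.lower() in lowered for cue in NEGATION_CUES):
        return False
    return True
```

On the example sentences above, this sketch yields a numeric value for 'temperature' and an affirmative Boolean for 'fever'.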
The leftmost panel of Fig. 12 illustrates the implementation of the feature extraction and formalization process. The CBR engine and the mathematical model presented in Section 3 were implemented in Java using the jCOLIBRI API, which models the Retrieve, Reuse, Revise, and Retain (4Rs) steps of the CBR paradigm and allows users to adapt it to their frameworks. We also added a panel for monitoring the procedure for detecting the status of any presented case of COVID-19; this monitoring spans from the file-loading component to the CBR-engine processes.
Fig. 10. A sample SQWRL query constructed to retrieve similar cases corresponding to the new case (nc) accepted as input. Fig. 11. An illustration of case representation as shown in Protégé for cases 1, 2, and 3 of the 68 cases extracted from the data source.
Experimentation using the datasets discussed in Section 4.1 revealed that the implementation of the proposed CBR framework was successful. Figs. 13 and 14 show a demonstration of the File Loader and Feature Mapping components of the CBR framework. The process of formalizing the feature-value relationships was monitored and is shown in the Progress Monitoring panel in Fig. 15. Lines delimited and prefixed by the <<<Derived>>> symbol represent components of the generated new-case ontology learnt by Algorithm 2; each line is an assertion resulting from the features extracted from the input raw text. The progress-monitoring output shown in Fig. 15 demonstrates how Algorithm 2 successfully extracts features from the first sentence of the patient record shown in Fig. 6. The output is then translated into an ontology formalism representing the new case (or new problem) that the CBR model receives as input for further extraction of similar cases using the SQWRL-based query illustrated in Fig. 10.
The CBR-engine then accepts the ontology representation of the new case for the reasoning operation. The task of classifying any suspected case of COVID-19, as modeled in Fig. 15, now rests on the CBR-engine, which is detailed in Section 3.5. In Table 2, the solutions to the 19 unconfirmed cases are presented.
After a complete testing of the implemented framework using our datasets, we found that the improved CBR model yielded a classification accuracy of 94.54%, as shown in Fig. 16.

Results and discussion
In this section, we present the performance of the proposed CBR framework compared with the performance of other similar systems. The following subsections describe the metrics, and their corresponding formulas, used in analyzing this performance.

Comparison of the proposed ontology with related ontologies
The ontologies developed in this research are central to the performance of the proposed CBR framework. To measure the performance and importance of the knowledge representation formalism used in this study, we compared the efficiency of the proposed ontology with related ontologies used in other CBR studies. Table 3 shows the derivation and description of the metrics used in the computation of the results in Table 4; the performance measurements based on these metrics are presented in the following subsections.
The results in Tables 4 and 5 show the richness of the axioms, properties (object and datatype) and instances of the proposed ontology used in this study.
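For illustration, schema-richness metrics of the kind commonly used to compare ontologies can be computed as below; these are assumed, OntoQA-style measures and the counts are hypothetical, not the figures from Tables 3-5:

```java
// Sketch of ontology schema-richness metrics of the kind used to
// compare ontologies; formulas and counts are illustrative assumptions.
public class OntologyRichness {
    // Relationship richness: non-inheritance relations over all relations.
    public static double relationshipRichness(int objectProperties, int subclassAxioms) {
        return (double) objectProperties / (objectProperties + subclassAxioms);
    }
    // Attribute richness: average number of datatype properties per class.
    public static double attributeRichness(int dataProperties, int classes) {
        return (double) dataProperties / classes;
    }
    // Average population: individuals per class, a measure of instance richness.
    public static double averagePopulation(int individuals, int classes) {
        return (double) individuals / classes;
    }
    public static void main(String[] args) {
        // Hypothetical counts for a small ontology.
        System.out.println(relationshipRichness(20, 30)); // prints 0.4
        System.out.println(attributeRichness(15, 10));    // prints 1.5
        System.out.println(averagePopulation(71, 10));    // prints 7.1
    }
}
```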

Presentation of the accuracy, sensitivity and specificity of the proposed approach
In this section, we present the performance of the proposed CBR model using diagnostic metrics: accuracy, specificity, sensitivity, precision, recall, and F1-score. The choice of these metrics was informed by how their values relate to the disease. For instance, the diagnostic accuracy metric was used to evaluate the ability of a diagnostic test to correctly identify a target condition (COVID-19 in this case). This metric is particularly applicable to diagnosis in medicine since it allows for increased confidence in, and acceptability of, results. In addition, the accuracy of a diagnosis can mean the difference between life and death, so a system that outperforms another in accuracy also offers more reliable diagnostic results. Other metrics considered for performance measurement in this study were sensitivity and specificity, also referred to as the True Positive Rate (TPR) and True Negative Rate (TNR), respectively. The sensitivity and specificity of our system, shown in Table 6, correspond respectively to the number of COVID-19 cases with the condition that received a positive result, and the number of cases without the disease that received a negative result. Diagnostic accuracy is closely related to both, since it summarizes the correct classifications that sensitivity and specificity capture separately.
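The metrics above can be expressed from a confusion matrix as follows; the counts used in the example are illustrative only, not the study's data:

```java
// Sketch of the diagnostic metrics reported in Table 6, computed from
// confusion-matrix counts (TP, TN, FP, FN). Counts here are hypothetical.
public class DiagnosticMetrics {
    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
    public static double sensitivity(int tp, int fn) { // recall / TPR
        return (double) tp / (tp + fn);
    }
    public static double specificity(int tn, int fp) { // TNR
        return (double) tn / (tn + fp);
    }
    public static double precision(int tp, int fp) {   // PPV
        return (double) tp / (tp + fp);
    }
    public static double f1(double precision, double recall) {
        // Harmonic mean of precision and recall.
        return 2 * precision * recall / (precision + recall);
    }
    public static void main(String[] args) {
        int tp = 40, tn = 12, fp = 2, fn = 1; // illustrative counts only
        double p = precision(tp, fp), r = sensitivity(tp, fn);
        System.out.printf("acc=%.4f sens=%.4f spec=%.4f f1=%.4f%n",
                accuracy(tp, tn, fp, fn), r, specificity(tn, fp), f1(p, r));
    }
}
```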
Looking at the results of these metrics as presented in Table 6, we found that the proposed CBR framework had good accuracy. In addition, the sensitivity and specificity values indicated that our system was able to correctly classify cases of COVID-19 as positive or negative, respectively. The precision value means that, on average, 2 out of 10 COVID-19 cases are detected by the proposed CBR framework as negative, while the remaining 8 are positive. Similarly, the recall value of 99% means that approximately 10 out of 10 cases of COVID-19 are correctly classified as positive. The F1-score reflects the ability of our framework to classify cases of the disease, since this metric is the harmonic mean of precision and recall. The precision and recall results therefore portray, respectively, the proportion of predicted positive cases that are truly positive and the proportion of actual positive cases that are correctly identified; these are sometimes referred to as the Positive Predictive Value (PPV) and the True Positive Rate (TPR). In the next section, we compare the performance recorded by the proposed CBR framework with that of similar studies.

Table 3. An outline of the metrics with their respective counts in the COVID-19 ontology modeled in this study.

Table 6. Performance evaluation of the proposed CBR framework using accuracy, sensitivity, specificity, precision, recall and F1-score.

Comparing the proposed approach with similar methods
A comparative analysis of the performance of our proposed approach was carried out against other case-based reasoning studies. Although the domains of application of the CBR models reviewed differ (medical and non-medical), we found that the most important factors lay in the formalism used for cases and in the similarity measures used for retrieving similar cases. Among the studies reviewed, CBR approaches based on fuzzy logic, and those that used ontologies to formalize cases, were dominant. We also observed that some studies investigated the peculiarities and importance of different similarity metrics, such as Euclidean distance and cosine similarity; such choices helped them to discover the performance effect of a selected metric. The choice of a distance/similarity measure is sometimes influenced by the formalism in which case features are represented. Given the wide adoption of ontologies as a tool for formalizing cases and their features, we believe the strong performance of this study draws many benefits from the ontology approach to knowledge modeling. A key consideration that makes this study outperform similar works is the choice of a semantic, ontology-based similarity metric, which we observed allowed for a better comparison of cases during retrieval.
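The two commonly adopted measures mentioned above, Euclidean distance and cosine similarity, can be contrasted over simple binary symptom vectors; this is a generic sketch, not the semantic similarity model of Section 3:

```java
// Sketch contrasting two similarity/distance measures discussed in the
// comparison: Euclidean distance and cosine similarity over feature vectors.
public class SimilarityMeasures {
    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
    public static void main(String[] args) {
        // Binary symptom vectors for two cases (fever, cough, dyspnea).
        double[] c1 = {1, 0, 1}, c2 = {1, 1, 1};
        System.out.println(euclidean(c1, c2)); // prints 1.0
        System.out.println(cosine(c1, c2));    // about 0.816
    }
}
```

A semantic, ontology-based measure differs from both in that it can also credit partial matches between related concepts rather than comparing raw vector components alone.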
Furthermore, the novelty of the approach proposed in this study lies in the framework, the proposed algorithm, the case similarity function and its computation, and the case adaptation/application of the CBR approach in handling novel COVID-19 cases. The comparative analysis carried out against similar studies from the last decade also adds to the contributions of this study. Only our study adopted NLP techniques in a unique way (through the proposed NLP model) for extracting the features represented in a presenting case. This allowed for full, rather than partial, automation of the process of diagnosing/detecting/classifying cases. We argue that such an approach can increase the level of acceptance of the CBR paradigm, a deduction based on the prevailing manual approach to extracting cases and their features from documents written in natural language. Although some of the fuzzy-CBR frameworks reviewed and compared in Table 7 demonstrate good performance, they are surpassed by our model, which combines NLP and machine learning (sentiment analysis) techniques in extracting the features of any presenting case. We therefore expect that an investigation into the hybridization of a CBR model using fuzzy logic, NLP and ontologies may yield very encouraging performance, positioning the CBR paradigm as a competitive option for reasoning tasks in artificial intelligence (AI).
In any medical system, the result of a diagnosis is critical because the patient has much to lose when there is a misdiagnosis. Both under-diagnosis and over-diagnosis are errors in medical systems and have been a source of concern for the wide acceptance of AI-based diagnostic and detection systems in medicine. While under-diagnosis refers to failing to detect a condition that is present, over-diagnosis refers to diagnosing a condition that would never go on to cause symptoms or ill-health; the latter can blur the border between health and disease. Diagnostic accuracy therefore helps to investigate how well a particular diagnostic test identifies a target condition in comparison to a reference test. In this study, we compared the proposed CBR framework with other similar studies using diagnostic accuracy. As seen in Table 7, the accuracy of our framework outperforms those of the previous studies we compared. Most impressive is the capability of our CBR framework to detect the novel coronavirus (COVID-19) with higher accuracy. This positions the present study as a candidate for further improvement of CBR models in future work seeking to diagnose any of the family of coronavirus diseases.

Conclusion
This study focused on adapting the CBR concept to the problem of classifying cases of COVID-19 as positive or negative, using an NLP model for feature extraction. Furthermore, a modified case retrieval similarity metric applied to the CBR framework proved to be a significant contribution to the performance of the proposed system. Knowledge representation in the proposed framework was achieved using an ontology-based knowledge formalization technique. The results obtained show that our proposed framework performed well in comparison with similar state-of-the-art CBR studies using fuzzy-CBR. This positions our Natural Language Processing and Ontology-Oriented Temporal Case-Based Framework for adoption and generalization to diagnostic problems associated with other medical purposes and diseases. In future work, we intend to investigate the performance of our retrieval algorithm over different similarity and distance metrics; this will allow future studies using ontologies and the CBR paradigm to select, or even combine, similarity metrics effectively. We also intend to hybridize the proposed method with machine learning methods, allowing the application of classification algorithms such as support vector machines (SVM).

Declaration of competing interest
The authors declare that they have no financial or personal interests that could have influenced the work reported in this paper.

Table 7
A summary of some case-based reasoning (CBR) models and frameworks, their domains of application, the approaches/techniques used, a description of each approach, and the accuracy of the systems.

Studies [Ref] | Year | Approach used for reasoning or diagnoses | Domain of Application | Accuracy (%)