Augmented Curation of Unstructured Clinical Notes from a Massive EHR System Reveals Specific Phenotypic Signature of Impending COVID-19 Diagnosis

Understanding the temporal dynamics of COVID-19 patient phenotypes is necessary to derive fine-grained resolution of pathophysiology. Here we use state-of-the-art deep neural networks over an institution-wide machine intelligence platform for the augmented curation of 15.8 million clinical notes from 30,494 patients subjected to COVID-19 PCR diagnostic testing. By contrasting the Electronic Health Record (EHR)-derived clinical phenotypes of COVID-19-positive (COVIDpos, n=635) versus COVID-19-negative (COVIDneg, n=29,859) patients over each day of the week preceding the PCR testing date, we identify anosmia/dysgeusia (37.4-fold), myalgia/arthralgia (2.6-fold), diarrhea (2.2-fold), fever/chills (2.1-fold), respiratory difficulty (1.9-fold), and cough (1.8-fold) as significantly amplified in COVIDpos over COVIDneg patients. The specific combination of cough and diarrhea has a 3.2-fold amplification in COVIDpos patients during the week prior to PCR testing, and along with anosmia/dysgeusia, constitutes the earliest EHR-derived signature of COVID-19 (4-7 days prior to typical PCR testing date). This study introduces an Augmented Intelligence platform for the real-time synthesis of institutional knowledge captured in EHRs. The platform holds tremendous potential for scaling up curation throughput, with minimal need for retraining underlying neural networks, thus promising EHR-powered early diagnosis for a broad spectrum of diseases.

In the COVIDpos patients, diarrhea is significantly amplified in the week preceding PCR testing (Table 1; 2.2-fold; p-value = 3.9E-16). Some of these undiagnosed COVID-19 patients who experience diarrhea may be unintentionally shedding SARS-CoV-2 fecally 7. Incidentally, epidemiological surveillance by wastewater monitoring conducted recently in the state of Massachusetts observed copious SARS-CoV-2 RNA 8. The amplification of diarrhea in COVIDpos over COVIDneg patients in the week preceding PCR testing highlights the importance of frequent hand washing.
Respiratory difficulty is enriched in the week prior to PCR testing in COVIDpos over COVIDneg patients (1.9-fold amplification; p-value = 1.1E-22; Table 1). Among other common phenotypes with significant enrichments in COVIDpos over COVIDneg patients, cough has a 1.8-fold amplification (p-value = 9.3E-25), myalgia/arthralgia has a 2.6-fold amplification (p-value = 2E-24), and fever/chills has a 2.1-fold amplification (p-value = 1.3E-36). Rhinitis is also a potential phenotype of COVIDpos patients that requires some consideration (1.9-fold amplification, p-value = 1.3E-07). Conjunctivitis, hemoptysis, and respiratory failure have too few COVIDpos patients currently for them to be statistically meaningful enrichments, but these phenotypes are worth tracking (Table 1). Finally, dysuria was included as a negative control for COVID-19, and consistent with this assumption, 0.94% of COVIDpos patients and 0.93% of COVIDneg patients had dysuria during the week preceding PCR testing.
Next, we considered the 351 possible pairwise conjunctions of 27 phenotypes for COVIDpos versus COVIDneg patients in the week prior to the PCR testing date (Table S1). As expected from the above results, altered sense of smell or taste (anosmia/dysgeusia) dominates in combination with many of the above symptoms as the most significant combinatorial signature of impending COVIDpos diagnosis (particularly along with cough, respiratory difficulty, fever/chills). Examining the other 325 possible pairwise symptom combinations, excluding the altered sense of smell or taste, reveals other interesting combinatorial signals. The combination of cough and diarrhea is noted to be significant in COVIDpos over COVIDneg patients during the week preceding PCR testing; i.e. cough and diarrhea co-occur in 12.4% of COVIDpos patients and in 3.9% of COVIDneg patients, indicating a 3.2-fold amplification of this specific symptom combination (BH corrected p-value = 4.0E-17, Table S1).
While explicit identification of SARS-CoV-2 in patients prior to the PCR testing date was not conducted, such prospective validation of our augmented EHR curation approach is being initiated. Nevertheless, this high-resolution temporal overview of the EHR-derived clinical phenotypes as they relate to the SARS-CoV-2 PCR diagnostic testing date for 30,494 patients has revealed specific enriched signals of impending COVID-19 onset. These clinical insights can help modulate social distancing measures and appropriate clinical care for individuals exhibiting the specific gastro-intestinal (diarrhea, change in appetite/intake), sensory (anosmia, dysgeusia) and respiratory phenotypes identified herewith, including for patients awaiting conclusive COVID-19 diagnostic testing results (e.g. by SARS-CoV-2 RNA RT-PCR).

Discussion
In order to identify potential cells and tissue types that may be associated with the EHR-derived clinical phenotypes observed above for COVID-19 patients, we analyzed Single Cell RNA-seq data using the nferX platform (see Methods) 9. Given recent studies implicating the necessity of both ACE2 and TMPRSS2 for the SARS-CoV-2 lifecycle 10, we scouted for human cells that co-express both genes. This co-expression analysis revealed that specific cell types from the small intestine/colon, nasal cavity, respiratory system, pancreas, urinary tract, and gallbladder co-express both ACE2 and TMPRSS2 (Figure 1, Figure S1). Notably, multiple small intestine cell types co-express the two genes. These cell types include enterocytes, enteroendocrine cells, stem cells, goblet cells, and Paneth cells. In the pancreas, the cell types included ductal cells and acinar cells. The kidney cells co-expressing TMPRSS2 and ACE2 include proximal tubular cells, pelvic epithelial cells and type A intercalated cells. Co-expression of TMPRSS2 and ACE2 is also observed in the epithelial cells of the olfactory nasal cavity and the respiratory tract as well as in type II pneumocytes (albeit at a comparatively lower level). While the identified tissues showing co-expression of ACE2 and TMPRSS2 in the gastro-intestinal, respiratory, and sensory systems correlate with the clinical phenotypes of early COVID-19 infection as described above, these insights are conceivably from normal/healthy tissues. This highlights the need for meticulous bio-banking of COVID-19 patient-derived biospecimens and their characterization via single cell RNA-seq and other molecular technologies. Primary prevention is the most effective method to minimize spread of contagious infectious viruses such as SARS-CoV-2 (Figure S2). In addition to population-based strategies such as social distancing, there are significant ongoing efforts to develop a prophylactic solution (Table S2).
As the immunodominant humoral immune response in patients is directed against the SARS-CoV-2 spike protein, many vaccines under investigation target this viral protein. It remains to be determined whether anti-spike protein antibodies induced by natural infection or by vaccines induce neutralizing antibody responses. Chloroquine and its analogues have been shown to inhibit virus replication in-vitro 28. Whether Chloroquine or Hydroxychloroquine has meaningful effects on SARS-CoV-2 replication in patients remains to be understood and is the subject of clinical trials, both as post-exposure prophylaxis and as treatment (Table S2). Hydroxychloroquine was approved by FDA for emergency use in hospitalized COVID-19 patients who are not eligible for clinical trials on April 7, 2020 based on limited clinical data, but concerns have been raised about toxicity and risk of sudden death 29.
Our findings from the EHR analysis of COVID-19 progression can aid in a human pathophysiology-enabled summary of the experimental therapies being investigated for COVID-19 (Figure 2, Table S2). Some of the earliest phases of intervention attempt to inhibit the entry/replication of SARS-CoV-2 by modulating critical host targets (e.g. renin angiotensin aldosterone system/RAAS inhibitors, ACE2 analogs, serine protease inhibitors) or directly inhibiting the function of viral proteins (e.g. viral RNA-dependent RNA polymerase inhibitors, protease inhibitors, convalescent plasma, synthetic immunoglobulins) (Box 1, Table S3). In patients with more advanced stages of disease progression, who suffer from respiratory abnormalities, therapeutics are being advanced to target the inflammatory response that can lead to Acute Respiratory Distress Syndrome (ARDS) and is associated with high mortality (Box 1). These include anti-GM-CSF agents, anti-IL-6 agents, JAK inhibitors, and complement inhibitors. Another emerging option for patients at this stage is convalescent plasma, which has shown some clinical benefits in cases of COVID-19 and related viral diseases (SARS-1, MERS) at various stages of severity (Box 1). Administration of convalescent plasma containing active specific antiviral antibodies may prevent or attenuate progression to severe disease. Expanded access to convalescent plasma for treatment of patients with COVID-19 has been approved by the FDA for emergency IND use and is available through a nationwide program led by Mayo Clinic (Box 1).
In those who become symptomatic, it is imperative that diagnostic testing is done, at dedicated testing sites if available, to confirm diagnosis (Figure S2). Meanwhile, patients are recommended to self-quarantine at home, use mask protection when social distancing cannot be maintained, and continue supportive measures. For patients with mild symptoms, such measures may be sufficient given the self-limited nature of viral syndromes. In the event of symptom exacerbation, often marked by worsening respiratory distress, medical evaluation, and possibly hospitalization, is warranted. The mainstay of treatment for COVID-19 remains supportive care and, as needed, supplemental oxygen. Experimental therapies intended to block SARS-CoV-2 viral entry and inhibit steps in the viral life cycle necessary for viral replication have been proposed at this early stage (Figure 2). The goal of these therapies is to reduce viral load, thus reducing the chance of overwhelming immune reaction by delaying progression of the disease.
Among the proposed treatment options for COVID-19, corticosteroids should be avoided outside a clinical trial, as suggested by the IDSA, until further clinical evidence can be established (www.idsociety.org/practice-guideline/covid-19-guideline-treatment-and-management). This is because there has been conflicting evidence and guidance on steroid use in COVID-19 30. While steroids can play a role in control of inflammation, a collection of clinical evidence from steroid use in other coronavirus outbreaks suggests that the use of corticosteroids might exacerbate COVID-19-associated lung injury 31.
As patients progress to severe or critical disease, the primary objective of COVID-19 management is to provide respiratory support and control immune overactivation (Figure 3, Figure S3). Patients whose condition deteriorates to critical status primarily decompensate from a respiratory standpoint, but may also develop multi-organ failure (respiratory failure, cardiac failure, renal failure, hypercoagulable state, thrombotic microangiopathy), as well as severe inflammatory responses similar to cytokine release syndrome and eventually reactive hemophagocytic lymphohistiocytosis syndrome. A major manifestation of respiratory decompensation and cytokine release syndrome is acute respiratory distress syndrome (ARDS). Critical care support, such as mechanical support ranging from noninvasive to invasive mechanical ventilation and, in some instances, extracorporeal support, vasopressors, renal replacement therapy, and anticoagulation, is paramount to the survival of these critically ill patients per SCC guidelines (SCCM/ESICM 2020). On the other hand, drugs such as immunomodulatory agents, often used to treat cytokine release syndrome, may allow for some degree of improvement or recovery either leading into or during severe and critical disease (Figure 2).
This study demonstrates how highly unstructured institutional knowledge can be robustly synthesized using deep learning and neural networks 32. We started by leveraging a BERT-based deep neural network that was trained on ~18500 sentences containing 250 different phenotypes and symptoms related to cardiovascular, pulmonary, and cardiopulmonary diseases. We found that the model was translatable to the COVID-19 context, achieving an overall 92.7% true positive rate (see Methods and Table S5 for variations across symptoms). While these results speak to the scalability of our deep learning-based approach to teaching machines how to read and comprehend de-identified clinical text, rigorous statistical assessments of neural network performance across disease areas will be essential to further advancing diagnostic applications for clinical care.
Expanding beyond one institution's COVID-19 diagnostic testing and clinical care to the EHR databases of other academic medical centers and health systems will provide a more comprehensive view of clinical phenotypes enriched in COVIDpos over COVIDneg patients in the days preceding confirmed diagnostic testing. This requires leveraging a privacy-preserving federated software architecture that enables each medical center to retain the span of control of their de-identified EHR databases, while enabling the machine learning models from partners to be deployed in their secure cloud infrastructure. Such seamless multi-institute collaborations over an Augmented Intelligence platform that puts patient privacy and HIPAA-compliance first are being advanced actively over the Mayo Clinic's Clinical Data Analytics Platform Initiative (CDAP). The capabilities demonstrated in this study for rapidly synthesizing over 15.8 million unstructured clinical notes to develop an EHR-powered clinical diagnosis framework will be further strengthened through such a universal biomedical research platform.
A caveat of relying solely on EHR inference is that mild phenotypes that may not lead to a presentation for clinical care, such as anosmia, may go unreported in otherwise asymptomatic patients. However, the augmented curation approach described here allows for the active monitoring of all such symptoms as they emerge in the EHR; this may accelerate the identification of novel disease symptomatology. As a case in point, the CDC recently expanded its list of possible symptom indicators of COVID-19 to include new loss of taste or smell, headache, muscle pain, and repeated shaking with chills (www.cdc.gov/coronavirus/2019-ncov/symptoms-testing/symptoms.html). As awareness of these symptoms grows, we expect their presence in the EHR to also grow, making statistical enrichment observations, such as those presented here, more robust. Another salient consideration is that as serology-based tests for COVID-19 with high sensitivity and specificity are approved, testing will become more aggressive and "day 0" as defined in the present work will likely occur earlier in the COVID-19 illness course. Additionally, as at-home serology-based testing is advanced, capturing symptoms that precede clinical testing will become increasingly important in order to facilitate the continued development and refinement of disease models; EHR-integrated digital health tools may help address this need. Finally, as multiple COVID-19 testing approaches are pursued and patients begin to receive multiple tests of different types, it will be important to leverage EHR curation tools to assess false positive rates of early serology testing and to gain insight into and optimize test sequencing.
As we continue to understand the diversity of COVID-19 patient outcomes through holistic inference of EHR systems, it is equally important to invest in uncovering the molecular mechanisms and gain cellular/tissue-scale pathology insights through large-scale patient-derived biobanking and multi-omics sequencing. As the anecdotal single cell RNA-seq (scRNA-seq) based co-expression analysis of ACE2 and TMPRSS2 on normal human samples conducted here highlights, the rich heterogeneity of cell types constituting various host tissues can be investigated in great detail by scRNA-seq. To correlate patterns of molecular expression from scRNA-seq with EHR-derived phenotypic signals of COVID-19 disease progression, a large-scale bio-banking system has to be created. Such a system will enable deep molecular insights into COVID-19 to be gleaned and triangulated with SARS-CoV-2 tropism and patient outcomes.
Ultimately, connecting the dots between the temporal dynamics of COVIDpos and COVIDneg clinical phenotypes across diverse patient populations to the multi-omics signals from patient-derived bio-specimen will help advance a more holistic understanding of COVID-19 pathophysiology. This will set the stage for a precision medicine approach to the diagnostic and therapeutic management of COVID-19 patients.

RAAS inhibitors and ACE2 analogs:
One class of experimental therapies intended to inhibit viral entry and early disease in COVID-19 includes Renin Angiotensin Aldosterone System (RAAS) inhibitors and recombinant ACE2 (Table S2). ACE2 is the primary host receptor for SARS-CoV-2, while serine protease TMPRSS2 is implicated in the spike protein priming after viral binding 10 . Recombinant ACE2 has been proposed as an early COVID-19 therapy based on in-vitro data 11 . At this time, the effect of RAAS inhibitors is uncertain in the context of COVID-19. Studies have investigated how ACE expression is modulated by coronavirus infection, and how that relates to lung injury 11 . Trials are ongoing with Angiotensin Receptor Blockers (ARBs) for treatment of COVID-19 by diminishing downstream harmful effects of angiotensin receptor activation (Figure 2).

Serine Protease inhibitors:
Given the TMPRSS2 involvement in viral entry (Figure 2), serine protease inhibitors such as Camostat are now under evaluation in trials and should also be considered in the early stages of SARS-CoV-2 infection.

Viral RNA-dependent RNA polymerase inhibitors:
Of these, Remdesivir, a nucleoside analog, has attracted much attention for in-vivo inhibition of SARS-CoV-2, and a recent observational study of 53 patients who received Remdesivir under compassionate use found that 68% of patients demonstrated improvement in respiratory status after a 10 day regimen 12 . Another nucleoside analog, Galidesivir, is also under evaluation in patients. Yet another viral replication inhibitor in clinical trials is Favipiravir (Figure 2). Favipiravir is a broad-spectrum viral RNA dependent RNA polymerase inhibitor that is shown to have in-vivo activity against a wide range of RNA viruses. In one RCT of 240 patients, Favipiravir was found to improve the clinical recovery rate of COVID-19 relative to Umifenovir, a viral entry inhibitor 13 (Table S2).

HIV Protease inhibitors:
This class of medication is widely proposed and used off-label based on postulates that HIV and HCV proteases share structural similarities with those of SARS-CoV-2 14 . Of these, Lopinavir/Ritonavir (combination) has shown promise but was found to have a non-significant benefit in a Randomized Clinical Trial (RCT) of 199 patients in China 15 , while Darunavir has shown no significant activity against SARS-CoV-2 in-vitro (Table  S2) 16 . Multiple randomized, controlled clinical trials are now underway in the USA to determine efficacy of these drugs in the treatment of COVID-19.
Other Antiviral Agents: Another emerging option for patients at this stage is convalescent plasma (Figure  2), which has shown clinical benefits in cases of COVID-19 17 and related viral diseases (SARS-1, MERS) at various stages of severity 18,19 . Administration of convalescent plasma containing specific antiviral antibodies may prevent or attenuate progression to severe disease. Expanded access to convalescent plasma for treatment of patients with COVID-19 is available through a program led by Mayo Clinic 20 . Synthetic hyperimmune globulins are also under development and evaluation.

Agents being advanced that target the inflammatory response in COVID-19
Anti-GM-CSF agents: A xenograft study found that granulocyte-macrophage colony-stimulating factor (GM-CSF) neutralization with Lenzilumab significantly reduced production of inflammatory cytokines 21, offering evidence for efficacy of anti-GM-CSF agents in prevention of CAR-T-induced cytokine release syndrome (CRS). Lenzilumab has been approved by the FDA for emergency IND use for CRS in COVID-19, while others such as Mavrilimumab and Gimsilumab, aimed at controlling undesired inflammation from myeloid activation, will be evaluated in clinical trials.
Anti-IL-6 agents: IL-6 is a pro-inflammatory cytokine, regarded as a driver of CRS 22 (Figure 2A-C). A recent report suggests IL-6 as a biomarker for respiratory failure in COVID-19 22 . As such, anti-IL-6 agents including Tocilizumab, Sarilumab, and Siltuximab are being evaluated in randomized trials (Table S2) and used off-label in severe COVID-19 patients. Tocilizumab was approved for the treatment of CRS in 2017. An observational study of 21 patients with severe COVID-19 pneumonia treated with Tocilizumab showed promising results 23,24 .

Anti-JAK agents:
A number of immunomodulatory agents not linked to CRS are also under trial for COVID-19 (Figure 2). Janus kinase (JAK) inhibitors such as Baricitinib, Fedratinib, and Ruxolitinib, indicated for Rheumatoid Arthritis and Myelofibrosis, have been tested in xenograft models for Chimeric Antigen Receptor (CAR) T-cell therapy induced CRS 25. Ruxolitinib is available under an expanded access program in the USA for severely ill COVID-19 patients (Table S2) and trials are underway in other countries.

Anti-Complement agents:
A recent study found that SARS-CoV-2 also binds to MASP2, a key driver of the complement activation pathway, leading to complement hyperactivation in COVID-19 patients 26. Inhibitors of the terminal complement pathway such as Eculizumab have been tried in individuals in China, with improvements observed after administration.

Agents targeting ventilation/perfusion defects in COVID-19-induced ARDS
Vasodilators: A recent report based on 16 cases in Italy and Germany noted that, contrary to the established understanding in ARDS, COVID-19 patients in ARDS retain relatively high lung compliance 27 and demonstrate ventilation/perfusion defects likely arising from perfusion dysregulation and hypoxic vasoconstriction. Therefore, patients with COVID-19 in ARDS may benefit from vasodilators to address this pathophysiologic mechanism. A trial is underway in China for use of inhaled nitric oxide in patients with mechanical ventilation (Table S2).

Augmented curation of EHR patient charts
The nferX Augmented Curation technology was leveraged to rapidly curate the charts of SARS-CoV-2-positive patients. First, we read through the charts of 100 patients and identified and grouped symptoms into sets of synonymous words and phrases. For example, "SOB", "shortness of breath", and "dyspnea", among others, were grouped into "shortness of breath". We did the same for diseases and medications. For the SARS-CoV-2-positive patients, we identified a total of 26 symptom categories (Table S4) with 145 synonyms or synonymous phrases. Together, these synonyms and synonymous phrases capture a multitude of ways that symptoms related to COVID-19 are described in the Mayo Clinic Electronic Health Record (EHR) databases.
Next, for charts that had not yet been manually curated, we used state-of-the-art BERT-based neural networks 32 to classify symptoms as being present or not present based on the surrounding phraseology. The neural network used to perform this classification was trained using nearly 250 different phenotypes and 18,500 sentences; it achieves over 96% recall for positive/negative sentiment classification (Table S6). We went through individual sentences and either accepted the sentences or rejected and reclassified them. The neural networks were actively re-trained as curation progressed, leading to stepwise increases in curation efficiency and model accuracy. In Step 1 of this process, we labeled 11433 sentences, 8737 of which were labeled as either 'present' or 'not present.' The model trained on this data set (90%-10% training/test split) achieved F1 scores of 0.93 and 0.84 for 'present' and 'not present' classifications, respectively. The model was then applied to an additional 3688 sentences in Step 2, rapidly corrected by a human for classification errors and re-trained to generate a newer version of the model. Step 3 was an iteration of step 2 on an additional 3369 sentences. The model achieved F1 scores of 0.96/0.91 after step 2 and 0.96/0.96 after step 3 for the classification of 'present'/'not present.' Due to the augmented nature of this approach, steps 2 and 3 required successively less input from the human annotator.
This model was applied to 15,775,993 clinical notes across 635 COVIDpos patients and 29,859 COVIDneg patients. First, the difference between the date on which a particular note was written and the PCR testing date of the patient corresponding to that note formed the relative date measure for that note. The PCR testing date was treated as 'day 0' with notes preceding it assigned 'day-1', 'day-2' and so on. BERT-based neural networks were applied on each note to provide a set of symptoms that were present at that point of time for the patient in question. This map was then inverted to determine for each symptom and relative date the set of unique patients experiencing that symptom.
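The relative-date assignment and map inversion described above can be sketched as follows. This is a minimal illustration only: the function names, the `(patient_id, note_date, symptoms)` record layout, and the toy data are assumptions made for the example, and the symptom sets stand in for the output of the neural network classifier.

```python
from collections import defaultdict
from datetime import date


def relative_day(note_date, pcr_date):
    """Days between a note and the patient's PCR test (day 0); earlier notes are negative."""
    return (note_date - pcr_date).days


def invert_symptom_map(notes, pcr_dates):
    """Map each (symptom, relative day) to the set of unique patients showing it.

    notes: iterable of (patient_id, note_date, symptoms) tuples (hypothetical layout),
           where symptoms is the set of symptoms classified as present in that note.
    pcr_dates: dict of patient_id -> PCR testing date.
    """
    patients_by_symptom_day = defaultdict(set)
    for patient_id, note_date, symptoms in notes:
        day = relative_day(note_date, pcr_dates[patient_id])
        for symptom in symptoms:
            # A set keeps each patient once, even with several notes on the same day.
            patients_by_symptom_day[(symptom, day)].add(patient_id)
    return patients_by_symptom_day


# Toy data, not taken from the study.
notes = [
    ("p1", date(2020, 4, 1), {"cough"}),
    ("p1", date(2020, 4, 1), {"cough", "diarrhea"}),
    ("p2", date(2020, 4, 3), {"cough"}),
]
pcr_dates = {"p1": date(2020, 4, 5), "p2": date(2020, 4, 7)}
index = invert_symptom_map(notes, pcr_dates)
print(sorted(index[("cough", -4)]))
```

Using sets as the values means the per-day patient counts used downstream are simply the sizes of these sets.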
Because the model had not yet encountered many COVID-19-related symptoms (Table S4), we performed a validation step in which the classifications of 4000 such sentences from the timeframe of highest interest (day 0 to day -7) were manually verified. Sentences arising from templates, such as patient education documentation, accounted for 10.2% of sentences identified. These template sentences were excluded from the analysis. The true positive rate (TPR), defined as the total number of correct classifications by the model divided by the number of total classifications, achieved by the model for classifying all symptoms was 92.7%; the corresponding false positive rate (FPR) was 13.4%. The model achieved true positive rate (TPR) ranging from 91% to 100% for the major symptom categories of Fever/Chills, Cough, Respiratory Difficulty, Headache, Fatigue, Myalgia/Arthralgia, Dysuria, Change in appetite/intake, and Diaphoresis. Classification performance was lower for Altered or diminished sense of taste and smell; here, the true positive rate was 64.4%. Detailed statistics are displayed in Table S5.
For each synonymous group of symptoms, we computed the count and proportion of COVIDpos and COVIDneg patients that were deemed to have that symptom in at least one note between 1 and 7 days prior to their PCR test. We additionally computed the ratio of those proportions, which indicates the extent of prevalence of the symptom in the COVIDpos cohort as compared to the COVIDneg cohort. A standard two-proportion z-test was performed, and a p-value was reported for each symptom.
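As an illustration of this calculation, the sketch below computes the proportion ratio and a two-proportion z-test for one symptom. The text does not state whether a pooled or unpooled standard error was used or whether the test was one- or two-sided; the pooled, two-sided form here is an assumption, as are the function name and the example counts.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (ratio of proportions, z statistic, two-sided p-value).

    x1/n1: patients with the symptom / cohort size for COVIDpos.
    x2/n2: the same for COVIDneg.
    Assumes a pooled standard error, x2 > 0, and proportions strictly between 0 and 1.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 / p2, z, p_value


# Hypothetical counts; only the cohort sizes (635 and 29,859) come from the text.
ratio, z, p_value = two_proportion_z_test(60, 635, 1200, 29859)
print(f"fold change = {ratio:.2f}, z = {z:.2f}, p = {p_value:.2e}")
```

The normal approximation behind this test is weakest for rare phenotypes, which is consistent with the caution the text applies to phenotypes with few COVIDpos patients.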
To capture the temporal evolution of symptoms in the COVIDpos and COVIDneg cohorts, the process described above was repeated considering counts and proportions for each day independently.
Pairwise analysis of phenotypes was performed by considering 351 phenotypic pairs from the original set of 27 individual phenotypes. For each pair, we calculated the number of patients in the COVIDpos and COVIDneg cohorts wherein both phenotypes occurred at least once in the week preceding PCR testing. With these patient proportions, a Fisher exact test p-value was computed. Benjamini-Hochberg correction was applied to account for multiple hypothesis testing.
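The pairwise procedure can be sketched as below, assuming each cohort is available as a mapping from patient to the set of phenotypes observed in the week before testing. The phenotype labels, helper names, and p-values are hypothetical. The Fisher exact test itself is not reimplemented here; a library routine such as `scipy.stats.fisher_exact` could supply the per-pair p-values.

```python
from itertools import combinations


def cooccurrence_count(patient_phenotypes, pair):
    """Number of patients in whom both phenotypes of the pair were observed."""
    first, second = pair
    return sum(
        1 for observed in patient_phenotypes.values()
        if first in observed and second in observed
    )


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# 27 placeholder phenotype labels give the 351 unordered pairs cited in the text.
phenotypes = [f"phenotype_{i}" for i in range(27)]
pairs = list(combinations(phenotypes, 2))
print(len(pairs))

# Toy cohort and toy p-values, purely illustrative.
cohort = {
    "p1": {"phenotype_0", "phenotype_1"},
    "p2": {"phenotype_0"},
    "p3": {"phenotype_0", "phenotype_1", "phenotype_2"},
}
print(cooccurrence_count(cohort, ("phenotype_0", "phenotype_1")))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Correcting across all pairs at once matters here because several hundred hypotheses are tested simultaneously.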

This research was conducted under IRB 20-003278, "Study of COVID-19 patient characteristics with augmented curation of Electronic Health Records (EHR) to inform strategic and operational decisions".
All analysis of EHRs was performed in the privacy-preserving environment secured and controlled by the Mayo Clinic. nference and the Mayo Clinic subscribe to the basic ethical principles underlying the conduct of research involving human subjects as set forth in the Belmont Report and strictly ensure compliance with the Common Rule in the Code of Federal Regulations (45 CFR 46) on Protection of Human Subjects.

Analysis of cell-types expressing ACE2 and TMPRSS2 using single cell RNAseq
Since the successful entry of the virus into the cell requires priming by the cellular host protease TMPRSS2, we hypothesized that cells that express both TMPRSS2 and ACE2 could harbor SARS-CoV-2 during the course of infection. Thus, we probed for the expression of ACE2 and TMPRSS2 in all the single-cell studies from human tissues available on the nferX Single Cell platform (https://academia.nferx.com/). For all the tissues that we profiled, we ensured that there is a minimum of 100 cells in the cell population and that a minimum of 1% of the cells in the cell population co-express (non-zero expression) both TMPRSS2 and ACE2.
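The two inclusion criteria can be expressed as a small filter. This is a sketch under stated assumptions, not the nferX implementation: it takes per-cell expression values for the two genes as plain lists, and the function and parameter names are hypothetical.

```python
def coexpression_fraction(ace2, tmprss2):
    """Fraction of cells with non-zero expression of both ACE2 and TMPRSS2."""
    both = sum(1 for a, t in zip(ace2, tmprss2) if a > 0 and t > 0)
    return both / len(ace2)


def passes_filter(ace2, tmprss2, min_cells=100, min_fraction=0.01):
    """Apply the two criteria from the text: at least 100 cells in the
    population and at least 1% of them co-expressing both genes."""
    if len(ace2) != len(tmprss2):
        raise ValueError("expression vectors must cover the same cells")
    n_cells = len(ace2)
    if n_cells == 0 or n_cells < min_cells:
        return False
    return coexpression_fraction(ace2, tmprss2) >= min_fraction


# Toy population: 100 cells, exactly one of which co-expresses both genes.
ace2 = [1.2] + [0.0] * 99
tmprss2 = [0.7] + [0.3] * 99
print(passes_filter(ace2, tmprss2))
```

Both thresholds are treated as inclusive here, matching the "minimum of" wording.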

Augmented curation of the unstructured clinical notes from the EHR reveals specific clinically confirmed phenotypes that are amplified in COVIDpos patients over COVIDneg patients in the week prior to the SARS-CoV-2 PCR testing date.
The key COVIDpos amplified phenotypes in the week preceding PCR testing (i.e. day = -7 to day = -1) are highlighted in gray (p-value < 1E-10). The ratio of COVIDpos to COVIDneg proportions represents the fold change amplification of each phenotype in the COVIDpos patient set (phenotypes are sorted based on this column). Phenotypes highlighted with a superscript (*) are still rare in COVIDpos patients at this time, thus mitigating their statistical significance.

Figure S1. Cell-types connected to pathophysiology of COVID-19 as inferred from high expression of ACE2 and TMPRSS2 in human scRNA seq datasets. A scatter plot depicting the expression of ACE2 and TMPRSS2 inferred from the single-cell RNA-seq profiling of human tissues using nferX single cell platform. The x-axis represents the mean ln(cp10k+1) expression of ACE2 in all the cells and the y-axis represents the mean ln(cp10k+1) expression of TMPRSS2 in the corresponding cell-types from respective tissues. The colors on the scatter plot depict the tissue origins. The size of the points on the scatter plot represents the percentage of single cells in the cell-type that co-express ACE2 and TMPRSS2 (non-zero expression).

Figure S2. Disease progression of COVID-19 can be divided into multiple stages, and appropriate therapeutics can be chosen based on the specific pathophysiological mechanisms. Using nferX Knowledge Synthesis, the most associated molecular markers at each step of disease progression are also identified (see Supplementary Methods for details on nferX knowledge synthesis). In order to capture biomedical literature based associations, the nferX platform defines two scores: a "local score" and a "global score", as described previously

Convalescent plasma
Antibodies in recovered patients. Synthetic antibodies under development NCT04264858

Anti-IL6 Agents
Tocilizumab: IL-6 inhibitor proposed to reduce risk of cytokine release syndrome in COVID; NCT04317092, NCT04320615, and more
Siltuximab: IL-6 inhibitor proposed to reduce risk of cytokine release syndrome in COVID; NCT04330638, NCT04329650
Sarilumab: IL-6R inhibitor proposed to reduce risk of cytokine release syndrome in COVID; NCT04315298, NCT04324073, and more

JAK Inhibitors
Baricitinib: JAK inhibitor proposed for anti-inflammatory activity to attenuate immune overactivation in severe COVID; NCT04340232, NCT04320277 and more
Tofacitinib: NCT04332042
Ruxolitinib: NCT04337359, NCT04331665 and more

Symptom/Finding: Synonyms/related entities identified in EHR
Fever / chills: fever, fevers, chill, chills, tactile fever, felt warm, subjective fever
Altered or diminished sense of taste or smell: change in smell, lost her sense of smell and taste, bitter taste in his mouth, no sense of taste or smell, no sense of smell or taste, decrease in smell, decreased sense of taste, change in her sense of taste and smell, decrease in smell and taste, ageusia, change in taste, lost her sense of taste and smell, dysgeusia, everything smells and tastes terrible, anosmia, altered smell, loss of taste and smell, change in his sense of smell and taste, altered sense of taste and smell, decrease in taste and smell, decreased taste, bitter taste in her mouth, taste is altered, lost his sense of smell, decreased smell, altered taste, altered sense of smell and taste, change in taste and smell, lost his sense of smell and taste, change in her sense of smell and taste, everything tastes and smells terrible, lost his sense of taste and smell, lost her sense of smell, loss of smell and taste, decrease in taste, decrease taste, bitter taste, no smell or taste, no taste or smell, anosmia/dysgeusia, decrease smell, change in smell and taste, change in his sense of taste and smell