A Workflow of Integrated Resources to Catalyze Network Pharmacology Driven COVID-19 Research

Motivation: In the event of an outbreak due to an emerging pathogen, time is of the essence to contain or to mitigate the spread of the disease. Drug repositioning is one of the strategies that has the potential to deliver therapeutics relatively quickly. The SARS-CoV-2 pandemic has shown that integrating critical data resources to drive drug-repositioning studies, involving host-host, host-pathogen and drug-target interactions, remains a time-consuming effort that translates to a delay in the development and delivery of a life-saving therapy.
Results: Here, we describe a workflow we designed for a semi-automated integration of rapidly emerging datasets that can be generally adopted in a broad network pharmacology research setting. The workflow was used to construct a COVID-19 focused multimodal network that integrates 487 host-pathogen, 74,805 host-host protein and 1,265 drug-target interactions. The resultant Neo4j graph database named “Neo4COVID19” is accessible via a web interface and via API calls based on the Bolt protocol. We believe that our Neo4COVID19 database will be a valuable asset to the research community and will catalyze the discovery of therapeutics to fight COVID-19.
Availability: https://neo4covid19.ncats.io


Reproducing the Integration Workflow
In order to reproduce the workflow, provided the required Python [1] environment has been set up, a local copy of the neo4covid19 repository needs to be created as follows.
    git clone https://github.com/ncats/neo4covid19
Note that paths referring to files in this manuscript start with "neo4covid19". In this context, neo4covid19 points to the root directory of the local copy of the cloned repository.
The first stage of the workflow is executed as:
In case an error occurs due to an API call, try this command instead:
    python prepare.pytest
This is followed by assembling the SmartGraph subnetwork. For details, please refer to section "Assembly of the SmartGraph Subnetwork".
The last stage of the workflow is executed as:
Or, in the case of an API call error:
    python compile.pytest

Assembly of the SmartGraph Subnetwork
In order to reveal potential connections between histone acetyltransferases (HATs) and SARS-CoV-2 virus implicated host proteins (VIHPs), we performed network analysis with the help of the SmartGraph platform [5]. Since the set of VIHPs is compiled in the integration workflow, it was necessary to implement a breakpoint in the workflow. Upon completion of the first part of the workflow, SmartGraph analysis is performed, and the results are subsequently fed to the second stage of the workflow to finish the integration. While this scenario is not ideal, the SmartGraph platform did not provide API access at the time the workflow was created.
The gene names of VIHPs were mapped to UniProt IDs [6], [7] to comply with the SmartGraph input requirements. First, VIHPs present in the file chembl_uniprot_mapping.txt (distributed as part of the ChEMBL database, version 27 [8]) were identified. Next, the UniProt IDs of these genes were retrieved with the help of the UniProt API [7], [9].
These are the detailed steps to assemble the SmartGraph subnetwork. Assuming you have created a local copy of the neo4covid19 repository (see above), perform the following steps:

Expansion of HHIs via StringApp API
The HHIs present in a preliminary Neo4COVID19 network were expanded in a two-step procedure employing the STRING [10] and stringApp [11] APIs.
In the first step, the gene symbols of human proteins in the pre-expanded Neo4COVID19 network were translated into STRING database identifiers with the STRING API. We utilized the following URL for this API call: https://string-db.org/api/tsv-no-header/get_string_ids . Gene symbols were passed to the parameter identifiers as a newline "\n" separated string (without quotation marks). The mapping of gene identifiers was forced to be one-to-one by setting limit to 1, which selects the "best" STRING ID for a given gene symbol. In addition, we limited the mapping to human genes only by setting species to 9606; we included the original IDs in the results by setting echo_query to 1; and we provided a string of our choosing for caller_identity.
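The parameter settings described above can be sketched as a small request builder. This is a minimal illustration using only the Python standard library, not the code from the neo4covid19 repository: the function names, the example caller_identity value, the timeout, and the choice to send the parameters as a POST body are all assumptions made for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# URL quoted in the text for the identifier-mapping call.
STRING_ID_URL = "https://string-db.org/api/tsv-no-header/get_string_ids"


def build_string_id_params(gene_symbols, caller_identity="example_caller"):
    """Assemble the parameters described in the text.

    `caller_identity` is a free-form string; the default here is a
    placeholder chosen for this sketch.
    """
    return {
        "identifiers": "\n".join(gene_symbols),  # newline-separated symbols
        "species": 9606,        # restrict the mapping to human genes
        "limit": 1,             # keep only the "best" STRING ID per symbol
        "echo_query": 1,        # include the original symbol in the output
        "caller_identity": caller_identity,
    }


def fetch_string_ids(gene_symbols, caller_identity="example_caller", timeout=60):
    """Send the request and return the raw TSV text (hypothetical helper).

    Sending the parameters as a POST body is an assumption of this sketch.
    """
    body = urlencode(build_string_id_params(gene_symbols, caller_identity))
    request = Request(STRING_ID_URL, data=body.encode("utf-8"))
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling build_string_id_params(["ACE2", "TMPRSS2"]) returns the parameter dictionary without touching the network, which makes it easy to inspect the payload before sending it.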
Next, with the returned STRING database IDs we made a second API call to the URL https://api.jensenlab.org/network . The STRING database IDs were passed to the entities parameter as a newline "\n" separated string. The parameter additional, which defines the maximal number of proteins by which the original network can be extended, was set to 100. The parameter alpha was set to its default value of 0.5.
The expansion is based on a connectivity score computed for each protein that is not in the query network. The connectivity score is the ratio of a given protein's total connectivity score to the query proteins to its total connectivity score to all proteins in the STRING database [Ref].
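The ratio described above can be illustrated with a small, self-contained sketch. It assumes the network is available as a list of weighted, undirected edges, which is a simplification made for this example. How alpha enters the score is not stated in this text; the sketch assumes it acts as an exponent on the denominator, so that alpha = 1 gives the plain ratio described above. The function names are hypothetical and this is not the stringApp implementation.

```python
def connectivity_scores(edges, query, alpha=0.5):
    """Score every non-query protein that has at least one edge to the query set.

    edges: iterable of (protein_a, protein_b, confidence) tuples, undirected.
    query: iterable of proteins already in the network.
    alpha: ASSUMED to be an exponent on the denominator (see lead-in).
    """
    query_set = set(query)
    to_query = {}  # summed confidence to query proteins
    to_all = {}    # summed confidence to all proteins
    for a, b, weight in edges:
        # Visit the edge from both endpoints because it is undirected.
        for source, target in ((a, b), (b, a)):
            if source in query_set:
                continue
            to_all[source] = to_all.get(source, 0.0) + weight
            if target in query_set:
                to_query[source] = to_query.get(source, 0.0) + weight
    return {p: to_query[p] / (to_all[p] ** alpha) for p in to_query}


def top_additional(scores, additional=100):
    """Return up to `additional` proteins, highest score first (ties by name)."""
    return sorted(scores, key=lambda p: (-scores[p], p))[:additional]
```

For example, a protein whose edges go almost exclusively to query proteins scores higher than a hub protein that is connected to the query set only incidentally.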
For more details, please refer to the section "Network Expansion" in the study of Doncheva et al. [11].
Of note, the following genes present in the pre-extension network were excluded from the STRING extension process because they produced errors when included in the API call: ELOC, EP300, SLC25A5, TUBA1A, STAT1, ELOB, RBX1, CREBBP, SKP1.
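One way to apply this exclusion and prepare the expansion call is sketched below. It is a hedged illustration rather than the repository code: the helper names are hypothetical, the STRING IDs in the usage note are made-up placeholders, and the HTTP method and response format of the network endpoint are left out because the text does not specify them.

```python
# Genes listed in the text as producing errors in the API call.
EXCLUDED_GENES = frozenset({
    "ELOC", "EP300", "SLC25A5", "TUBA1A", "STAT1",
    "ELOB", "RBX1", "CREBBP", "SKP1",
})

# URL quoted in the text for the network expansion call.
NETWORK_URL = "https://api.jensenlab.org/network"


def drop_excluded(gene_symbols):
    """Remove the error-producing genes while preserving input order."""
    return [gene for gene in gene_symbols if gene not in EXCLUDED_GENES]


def build_expansion_params(string_ids, additional=100, alpha=0.5):
    """Assemble the expansion parameters named in the text."""
    return {
        "entities": "\n".join(string_ids),  # newline-separated STRING IDs
        "additional": additional,           # max. number of proteins to add
        "alpha": alpha,                     # left at its default of 0.5
    }
```

For instance, drop_excluded(["ACE2", "STAT1", "TMPRSS2"]) keeps only ACE2 and TMPRSS2, and build_expansion_params(["9606.EXAMPLE1", "9606.EXAMPLE2"]) yields the payload for the second call.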

Applying Custom Visual Style to the Imported Network in Cytoscape
The file containing the custom Cytoscape [12] visual style (style_Neo4COVID19.xml) is distributed as part of the Neo4COVID19 code repository (neo4covid19/code/style_Neo4COVID19.xml) [4]. The process of importing and applying the custom style is shown in Fig. S2.

Mapping of Viral Gene Names
We have established a mapping between the viral gene names predicted by P-HIPSTer [13], [14] and those reported in the interactome study [15], [16]. The mapping is provided on sheets "ID_Mapping" and "Sheet1_MappedIDs" in the file data/output/Merged.xlsx in the neo4covid19 repository [4].
Where applicable, the target development category (TDL) [18], [19] of proteins is color-coded according to the legend. Screenshots were made from the Cytoscape application.