Expanding opportunities for mining bioactive chemistry from patents

Introduction
Compared to papers, patents in the biological sciences have hitherto been an underexploited information source, principally because retrieval specificity and data extraction are more challenging [1]. However, the World Intellectual Property Organisation (WIPO) indexing of 2.4 million PCT (WO) publications indicates that 8.4% are assigned the International Patent Classification (IPC) code A61K (medical, veterinary science and hygiene), which encompasses bioscience filings [2]. Medicinal chemistry, as C07D (heterocyclic compounds), represents 3.1%. This article focuses on the 1.5% subset of C07D AND A61K filings for two main reasons. Firstly, the data mining challenges for bioscience patents as a whole are too diverse (including millions of sequence listings) to be covered here [3]. Secondly, for medicinal chemistry, patents have a central importance, because they not only underpin over four decades of drug discovery research (both commercial and academic) but also contain substantially more structure-activity relationship (SAR) results than journals. This article will focus on exploring scientific value rather than intellectual property (IP) because, while both aspects are intertwined, the analytical approaches diverge. A short video accompanies this article.

Value
The question needs to be posed as to what data-centric patent mining has to offer practitioners in cheminformatics, pharmacology, medicinal chemistry or chemical biology. To answer this, it is necessary to compare the availability of data from non-patent sources. The appearance of ChEMBL in 2009 substantially increased the scale of results accessible from medicinal chemistry journals, with the current (release 19) count of 1.41 million structures including 0.94 million extracted from 57K papers (n.b. because ChEMBL subsumes CIDs from confirmed PubChem BioAssays, its compound count in situ exceeds the ChEMBL source count inside PubChem) [4]. However, specialised databases had been curating SAR data from the literature for some years prior to this, including BindingDB [5], Guide to PHARMACOLOGY (GtoPdb, formerly IUPHARdb) [6], GLIDA [7] and PDSP [8].
The utility of engaging with patents as an adjunct to non-patent sources can be introduced via a practical example. The WIPO database provides a searchable interface for patents from the major authorities. Executing a simple query (select for BACE* AND inhibitor(s), front page, English PCT applications) gives 280 results. The first two documents, both published on May 1st 2014, were WO2014066132 from Eli Lilly [9] and WO2014065434 from Shionogi [10], both specifying BACE1 inhibitors for Alzheimer's disease (6132 used BACE as a synonym for BACE1, UniProt P56817). These are shown in Fig. 1, together with the extraction of two example structures linked to activity data.
Connections gained from this result included the following:
1. Using the same search parameters, it was established that Shionogi had published nine BACE1 patents since 2007 and Eli Lilly five since 2005 (all potentially extractable as described in this section).
2. The rendering of images and tables in-line with text (except where the Shionogi PDF exceeded the page limit) allowed scanning of results, and full PDFs could be downloaded for checking.
3. The structures and associated IC50 values for selected potent BACE1 inhibitors were discerned using chemicalize.org for the former and the result tables for the latter.
4. As ascertained by a PubChem search, example 8 from 6132 was identified as CID 73603937, deposited as the HCl salt by Thomson Pharma on 12th of May (presumably extracted from the same patent), but no close analogues were found.
It could be at least 18 months before compounds from these patents appear in a journal article, and some time after that before a ChEMBL release surfaces the published structures in PubChem. Since this search was carried out, both documents have been processed in SureChEMBL (n.b. this source is currently still designated SureChem for the pre-2013 entries in PubChem).
Fig. 1. In the lower-left panel, example 72 from 5434 (page 238 in the PDF) was reported to have an IC50 against purified enzyme of 13.6 nM (page 249). The structure was determined from an initial image conversion using OSRA [11] and subsequently edited in the PubChem sketcher [12], from which PubChem searches were launched. The SMILES and InChIKey are shown below. In the lower-right panel, example 8 from 6132 is shown (page 60 in the PDF), which has a reported IC50 of 105 nM (on page 63). Using chemicalize.org [13], the IUPAC name was used to generate a range of molecular outputs including a SMILES string and the InChIKey below.
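The PubChem lookup for an extracted structure can also be scripted. The following is a minimal sketch, assuming the public PubChem PUG REST interface is used to request identifiers for a known CID; the function name and the default property list are illustrative choices, not part of the workflow described above. It only builds the request URL and does not contact PubChem.

```python
from urllib.parse import quote

# Base address of the PubChem PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(cid, properties=("InChIKey", "MolecularFormula")):
    """Build a PUG REST URL that requests the given properties for one CID as JSON.

    The function name and default properties are illustrative assumptions.
    """
    props = ",".join(quote(p) for p in properties)
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{props}/JSON"


# CID 73603937 is the PubChem record matched to example 8 of WO2014066132.
print(compound_property_url(73603937))
```

Fetching the resulting URL with any HTTP client would return the requested properties, which could then be compared with the InChIKey generated from the patent structure.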
Aspects related to the search example are expanded as general points in Box 1. In addition, some of the themes are addressed in the following sections.

Speed
For operations filing patents that include novel compounds with commercially useful bioactivity, rapid interrogation of patent chemistry is an imperative. Because they are also likely to licence commercial databases for prior-art checking and competitive intelligence, the timing at which patent structures surface 'in the wild' is less of an issue (except to note that public sources must now be included in prior-art searches). Nevertheless, the scientific preview opportunities offered by medicinal chemistry patents can also be valuable for those not necessarily committed to filing themselves (Box 1). In this respect, it may not be appreciated how level the information playing field has become, because a patent becomes open and globally accessible only on the day of publication. This means that, as shown in Fig. 1, a newly published patent (i.e. one not yet in PubChem) now has a chemistry extraction time of less than a week. A practical preview example can be given for the case of BACE2 as a new diabetes target [16]. Only one targeted inhibitor has appeared, in a 2011 paper, but since 2010 many hundreds have been exemplified as different chemotypes in patents published by four pharmaceutical companies and one academic institution. In addition, most of these included discrete activity data and comparative BACE1 cross-screening results. Thus, in the five years since the first published patents, no papers describing extensive chemistry directed against this important new therapeutic target have yet appeared.

Scale and quality
'So how do we know more bioactive chemistry is available from patents than papers?' There are different data sets with which to approach this question but each has caveats. An upper limit is provided by the GVKBIO Online Structure Activity Relationship database (GOSTAR https://gostardb.com/gostar/). As a manually curated SAR-focused suite of databases for published and patented inhibitors against biological targets over the past 40 years, it currently includes 6.3 million chemical structures. A caveat is that a proportion of the patent activity measurements are binned rather than discrete values. A recent analysis of a 20-year slice of this data set provided some relevant statistics [17]. These were, firstly, a total document ratio for patents: papers of 58:82 (thousand); secondly, a compound ratio of 2.7:1 (million); and thirdly, an individual extracted structures per-document ratio of 46:12. Additional data slices related to the patents: papers ratio can be made inside PubChem. The sum of all large patent-extracted sources (Table 1) is 15.4 million. The equivalent total for literature-linked compounds (via ChEMBL and PubMed/MeSH) is just over 1.0 million. The intersect (structures common to both) is 0.5 million. While this indicates an approximate patents: papers structure ratio of 15:1, there are caveats to what this represents in bioactivity terms, especially because none of the larger patent sources in PubChem currently connect structures directly to data. The intra-PubChem numbers are informative but it is necessary to ascertain selectivity to understand source complementarity [18]. Aspects of this are detailed in Table 1, along with metrics related to quality.
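The arithmetic behind these ratios can be checked directly. The sketch below uses only the rounded figures quoted in the text; the variable names are illustrative, and reading the 2.7:1 compound ratio as 2.7 million versus 1.0 million structures is an assumption.

```python
# Rounded figures quoted for the 20-year GOSTAR slice [17].
patent_docs, paper_docs = 58_000, 82_000
# Assumption: the 2.7:1 (million) compound ratio means these absolute counts.
patent_cpds, paper_cpds = 2_700_000, 1_000_000

# Extracted structures per document for each source type.
patent_rate = patent_cpds / patent_docs
paper_rate = paper_cpds / paper_docs

# Rounded figures quoted for PubChem.
patent_structures = 15_400_000      # sum of the large patent-extracted sources
literature_structures = 1_000_000   # "just over 1.0 million"
shared_structures = 500_000         # structures common to both

structure_ratio = patent_structures / literature_structures
literature_overlap = shared_structures / literature_structures

print(f"structures per patent: {patent_rate:.1f}")
print(f"structures per paper:  {paper_rate:.1f}")
print(f"patents:papers structure ratio in PubChem: {structure_ratio:.1f}:1")
print(f"fraction of literature structures also in patents: {literature_overlap:.0%}")
```

On these rounded inputs a patent yields roughly four times as many extracted structures as a paper, and about half of the literature-linked structures in PubChem also appear in a patent source.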
The dates indicate that IBM, SureChem and SCRIPDB are currently frozen. Additional date cutting indicates that ChEMBL releases are approximately tri-annual but Thomson (Reuters) Pharma submits every week. The stereo and E/Z filters are quality indicators (e.g. the highly curated source, ChEBI, scores 16%). The other manual extractions (ChEMBL and Thomson Pharma) score higher than the automated chemical named entity recognition (CNER) pipelines but, of the latter, SCRIPDB does better than IBM. Slicing the Mw distribution at 400 is a rough proxy for the length of name strings converted in CNER. Here again, as expected, manual extraction sources score higher (because they select complete structures in the first place). Causes for the low IBM score include R-group inclusions and the splitting of longer IUPAC names. Pragmatically, compared to the major benefits of being able to access them at all, the quality of open patent-extracted structures is of lesser importance (it may matter for prior-art searching but this is a different issue). Reasons for this include: (a) the value lies primarily in the document > assay > result > compound > protein (D-A-R-C-P) relationships, so even if C (compound) has only a similarity match (e.g. due to an error in another C), the relevance of the connection can usually be resolved; (b) mistakes, isomeric variation and other forms of representational 'noise' in the original documents largely determine extraction quality per se; (c) different extracted isomers and tautomers can be connected (i.e. via the C-to-C match) or same connectivity relationships inside PubChem; (d) both objective quality measurements and structures-in-common between independent sources are important to assess for any large sets of structures (i.e. not just from patents) and (e) reassuring levels of extraction concordance are not only formally recorded within PubChem via substances (SIDs) from different patent sources but, in SureChEMBL, also extend to multiple intra-document and inter-document structure identities within the same patent family.
Uniqueness, indicated by structures in only one source, is a useful value indicator but not without caveats. The figures (Table 1) suggest that the SureChem CNER extraction has contributed the most novel patent structures. However, a proportion of these could be alternative representations of the same canonical forms in other sources (although this measurement is confounded for comparisons between ChEMBL and Thomson Pharma, because they share some of the same journal sources). The two-component count identifies 5% mixtures in SCRIPDB and SureChem (mostly salts) but IBM appears to have filtered these out. The next category is a crude lead-like molecular property filter. The sources converge at around 50-60% for patent structures. Thus, even if these do not have explicit assay results, they represent a large and generally synthetically accessible potential bioactive chemical space. The last two columns in Table 1 refer to structure-to-document connectivity. Inside the CID records, those with links to the USPTO website are processed from IBM and SCRIPDB by PubChem. They are also usefully IPC-indexed, by which we can establish that of the 8.5 million structures assigned codes, 7.2 million are under C07 and 6.1 million under C07D. Note that the SureChem records match SureChEMBL externally, where a structure search connects them to patent numbers (also IPC-indexed) from the major offices. For Thomson Pharma, the 4.2 million external links are subscription-only but can be either to a patent and/or a paper (it would seem probable that the chemistry curation split is similar to that of GOSTAR, which is 3:1 patents: papers).
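The uniqueness and structures-in-common measures discussed above reduce to set operations on structure identifiers. The following is a minimal sketch, assuming each source is available as a set of identifier strings such as InChIKeys; the function name is illustrative and the short keys are placeholders, not real structures or real source contents.

```python
from itertools import combinations


def source_overlap(sources):
    """Return (unique, shared) for a mapping of source name -> set of identifiers.

    unique[name] holds identifiers found in that source only.
    shared[(a, b)] holds identifiers common to sources a and b.
    """
    unique = {}
    for name, keys in sources.items():
        others = set().union(*(k for n, k in sources.items() if n != name))
        unique[name] = keys - others
    shared = {(a, b): sources[a] & sources[b] for a, b in combinations(sources, 2)}
    return unique, shared


# Placeholder identifiers standing in for InChIKeys (hypothetical data).
sources = {
    "IBM": {"K1", "K2", "K3"},
    "SureChem": {"K2", "K3", "K4", "K5"},
    "ChEMBL": {"K3", "K6"},
}

unique, shared = source_overlap(sources)
for name, keys in unique.items():
    print(name, "unique:", len(keys))
for pair, keys in shared.items():
    print(pair, "shared:", len(keys))
```

Exact-identifier matching of this kind counts alternative representations of the same compound (e.g. different salts, isomers or tautomers) as distinct, which is the caveat noted above for the uniqueness figures.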

Relationship annotation in databases
The BACE1 examples above (Fig. 1) show that D-A-R-C-P mapping from an individual document requires curation. Any scaled-up availability of this (analogous to that done for papers in ChEMBL) has hitherto been a feature of a limited number of commercial databases. Nevertheless, example entries from new initiatives in two open databases are shown (Fig. 2).
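As an illustration of what a curated D-A-R-C-P mapping captures, the two BACE1 examples from Fig. 1 can be written as simple records. This is a sketch only: the class and field names are hypothetical and do not correspond to the schema of any of the databases discussed; the values are those quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DARCP:
    """One document > assay > result > compound > protein relationship."""
    document: str     # patent publication number
    assay: str        # short assay description
    result_type: str  # e.g. IC50
    value_nm: float   # result value in nM
    compound: str     # example number or structure identifier
    protein: str      # UniProt accession of the target


records = [
    DARCP("WO2014065434", "purified enzyme", "IC50", 13.6, "example 72", "P56817"),
    DARCP("WO2014066132", "not specified here", "IC50", 105.0,
          "example 8 (CID 73603937)", "P56817"),
]


def most_potent(records, protein):
    """Return the record with the lowest result value for the given target."""
    hits = [r for r in records if r.protein == protein]
    return min(hits, key=lambda r: r.value_nm)


best = most_potent(records, "P56817")
print(best.document, best.compound, best.value_nm)
```

Holding all five entities in one record is what allows a compound to be traced back to its document and forward to its target, even when the structure itself is matched only by similarity.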
In GtoPdb the paper and the patent were connected by curatorially establishing identical structures for CID 46861623 and the IUPAC name. The journal publication focuses on the pharmacology of AZD9668 (i.e. this is not an SAR paper and may therefore be picked up neither by ChEMBL nor, consequently, by PubChem BioAssay). For anyone interested in this drug, the curated pointer to the patent complements the paper by providing both an extensive set of analogues and unpublished data. Note that AZD9668, as patent example 94, has Kd results converted to Ki but an IC50 only in the paper. However, seven analogues have IC50 data in the patent, including the most potent (example 32 at 3 nM) as CID 11478818 (n.b. the patent may have been filed before AZD9668 was selected for development). While a limited number of GtoPdb entries have patent connections so far, more are being added, particularly for those clinical candidates with little or no SAR in papers.
The patent curation in BindingDB, initiated in Sept 2013, is also of high utility but takes a different approach. In this case, the selection of recent US patents is protein target-based. The BACE1 filing in Fig. 2 has 42 example structures (via CWUs) manually aligned with their activity data from the patent tables and intersected with PubChem CIDs, BindingDB SIDs and a short assay description. For example, the record (http://www.bindingdb.org/data/mols/tenK10/MolStructure_102939.html) was extracted from US8541427 [20]. While this does not locate the structure within the document (i.e. searching SureChEMBL, via CID 44247663, connects to the image for example 10), it does allow the curated set to be directly retrieved as a CID list with the PubChem query 'US8541427'. This can be done for any of the 367 BindingDB patents (Oct 2014) covering 32 670 compounds with target-mapped activity results.
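A retrieval of this kind could in principle be scripted as well as run interactively. The sketch below builds a request to the NCBI E-utilities ESearch service against the PubChem Compound database; the function name and the result limit are illustrative, and it is an assumption that the programmatic search returns the same CID list as the web query described above. No request is sent.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def patent_cid_search_url(patent_number, retmax=500):
    """Build an ESearch URL querying PubChem Compound (db=pccompound) for a patent number.

    Assumption: the patent number is passed as a plain search term.
    """
    params = {
        "db": "pccompound",
        "term": patent_number,
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(patent_cid_search_url("US8541427"))
```

If the search behaves as assumed, the identifier list in the JSON response would hold the CIDs of the curated set.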

Conclusions
The options to mine patent data, from individual documents up to large extracted structure sets, are expanding in open resources. For example, SureChEMBL has reached 15.6 million structures in situ and is adding 80 K novel structures per month (Dr. G. Papadatos, RDKit UGM, presentation Nov 2014). Paradoxically, patents are fully accessible for text-mining, in contrast to most of the literature. Of the patent-extracted structures already in PubChem, over 9 million are within the property boundaries for potential bioactivity and 0.5 million intersect with identical structures from papers, via ChEMBL and/or PubMed. Future challenges will include abstracting D-A-R-C-P relationships and synergistically intersecting these with the analogous relationships and entities identified from the literature, as already demonstrated by BindingDB and GtoPdb.