BIGCHEM: Challenges and Opportunities for Big Data Analysis in Chemistry

Abstract The increasing volume of biomedical data in chemistry and the life sciences requires the development of new methods and approaches for their handling. Here, we briefly discuss some challenges and opportunities of this fast-growing area of research, with a focus on those to be addressed within the BIGCHEM project. The article starts with a brief description of some available resources for “Big Data” in chemistry and a discussion of the importance of data quality. We then discuss the challenges of visualizing millions of compounds by combining chemical and biological data, the expectations from mining “Big Data” using advanced machine-learning methods, and their applications in polypharmacology prediction and target deconvolution in phenotypic screening. We show that the efficient exploration of billions of molecules requires the development of smart strategies. We also address the issue of secure information sharing without disclosing chemical structures, which is critical to enable bi-party or multi-party data sharing. Data sharing is important in the context of the recent trend of “open innovation” in the pharmaceutical industry, which has led not only to more information sharing between academia and the pharma industry but also to so-called “precompetitive” collaboration between pharma companies. At the end we highlight the importance of education in “Big Data” for further progress of this area.


Introduction
The Wikipedia definition of "Big Data" [1], "a term for data sets that are so large or complex that traditional data processing applications are inadequate", also highlights that not only the size but also the complexity of the data is important. In pharmaceutical research, "Big Data" is emerging from rapidly growing genomics data, thanks to the rapid development of gene sequencing technology. Likewise, people have started to ask whether there is Big Data in chemistry.
Over the past decade there has actually been a remarkable increase in the amount of available compound activity and biomedical data. [2][3][4] The definition of "Big Data" in chemistry is generally not clear. Frequently, "Big Data" in chemistry refers to databases that are orders of magnitude larger than the commonly used ones, [5] which have recently become available thanks to the emergence of new experimental techniques such as high-throughput screening and parallel synthesis, [3,6] or to access to new chemical information as a result of automatic data mining (e.g., of patents, literature or in-house data collections). [7,8] How to efficiently mine such large-scale chemical data has become an important problem for the future development of the chemical industry, including pharmaceutical, agrochemical, biotechnological, fragrance, and general chemical companies.
Big Data collected from the literature is usually quite noisy, for multiple reasons. First, noise can arise from the biological assay itself, for example from experimental errors in the original measurements or assay artifacts in screening. Secondly, there is no standard way of annotating biological endpoints, modes of action and target identifiers. Thirdly, errors occur when extracting data values and units, and in chemical name recognition during automatic literature mining. Several actions have been taken to address these problems, e.g., improving data quality by applying promiscuity filters to clean up screening data, and developing BioAssay Ontology (BAO) tools to better organize and/or standardize the collected data. [9][10][11] In order to support data analysis, collection, sharing, and dissemination, several large projects, such as ELIXIR (https://www.elixir-europe.org), eTox (http://www.etoxproject.eu), BIGCHEM (http://bigchem.eu) and others, were initiated under the sponsorship of the European Commission. Many of these activities are within the core area of chemoinformatics.
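The record-level curation steps mentioned above can be illustrated with a minimal sketch: standardizing units and merging replicate measurements for the same compound/target pair, flagging pairs whose replicates disagree badly. The record format, field names and the order-of-magnitude disagreement threshold are hypothetical, chosen only for illustration, and are not taken from any of the cited projects.

```python
# Minimal curation sketch (hypothetical record format): convert all
# activity values to a common unit, merge replicates, flag noisy pairs.
from statistics import median

# Conversion factors to a common unit (nM); illustrative assumption.
TO_NM = {"nM": 1.0, "uM": 1e3, "mM": 1e6}

def standardize(records):
    """records: list of dicts with 'compound', 'target', 'value', 'unit'."""
    merged = {}
    for r in records:
        key = (r["compound"], r["target"])
        merged.setdefault(key, []).append(r["value"] * TO_NM[r["unit"]])
    # Collapse replicates to their median; flag pairs whose replicates
    # disagree by more than one order of magnitude as potentially noisy.
    out = {}
    for key, vals in merged.items():
        out[key] = {"ic50_nM": median(vals),
                    "noisy": max(vals) / min(vals) > 10}
    return out

records = [
    {"compound": "CHEMBL1", "target": "T1", "value": 50,    "unit": "nM"},
    {"compound": "CHEMBL1", "target": "T1", "value": 0.06,  "unit": "uM"},
    {"compound": "CHEMBL2", "target": "T1", "value": 2,     "unit": "uM"},
    {"compound": "CHEMBL2", "target": "T1", "value": 90000, "unit": "nM"},
]
print(standardize(records))
```

Real curation pipelines additionally standardize chemical structures and endpoint annotations; this sketch covers only the numeric side.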
The BIGCHEM project was recently funded under the EU Horizon 2020 program. The consortium includes academia, big pharma companies, large research societies (Helmholtz, Fraunhofer) and Small and Medium Enterprises (SMEs). The project mainly aims to develop computational methods specifically for Big Data analysis. Below, we briefly review challenges and opportunities in the Big Data analytics area, particularly focusing on several aspects which are going to be addressed in the BIGCHEM project.

Data Repositories
Publicly available databases such as PubChem, [3] BindingDB, [6] and ChEMBL [4] (Table 1) represent examples of large public-domain repositories of compound activity data. ChEMBL and BindingDB contain data manually extracted from tens of thousands of articles. PubChem was originally started as a central repository for the High-Throughput Screening (HTS) experiments of the National Institutes of Health's (USA) Molecular Libraries Program but also incorporates data from other repositories (e.g., ChEMBL and BindingDB). Commercial databases, such as SciFinder, GOSTAR and Reaxys (Table 1), have accumulated a large amount of data collected from publications and patents. Similarly to public and commercially available repositories, industry has produced large private collections. For example, more than 150M data points are available as part of the AstraZeneca International Bioscience Information System (AZ IBIS) just for experiments performed before 2008. [8] Data quality in databases can vary significantly depending on the data source, data acquisition procedures and curation efforts. Accumulated chemical patents represent another rich resource of chemical information. Large-scale text mining has been performed on patent corpora to extract useful information. IBM has contributed chemical structures from pre-2000 patents to PubChem. [12] The SureChEMBL database [13] was launched in 2014, providing the wealth of knowledge hidden in patent documents; it currently contains 17 million compounds extracted from 14 million patent documents.
Under the enormous pressure of developing new drugs with more constrained R&D budgets, recent years have seen large pharma companies increasingly exploring the so-called "open innovation" model for drug discovery research. Collaboration between academia and the pharmaceutical industry in terms of compound and data sharing has largely increased. [18] Examples include the AstraZeneca-Sanger Drug Combination Prediction challenge to develop better algorithms for the treatment of cancer. [19] The European Lead Factory [20] is another collaborative effort of seven pharma companies, SMEs and academic partners to create a diverse library of 500k compounds by combining compounds from partners, external users and newly synthesized libraries, and to screen these libraries against commercial and public targets. Both academia and industry should benefit from these kinds of collaborative efforts, which can result in more chemical and biological data becoming available in the public domain. More interestingly, even collaborations between pharmaceutical companies on the so-called "precompetitive" level, which were hard to imagine ten years ago, have become a trend. These efforts have made the sharing of data across organizations possible and have led to a further increase in the size of "Big Data". [21,22]

Frequent Hitters Analysis
Big Data sometimes also means noisy data. Data coming from HTS experiments can often be contaminated with false positive and false negative results. Errors can appear due to incidental problems such as measurement errors, robotic failures, temperature differences, etc., which can be easily addressed with proper experimental protocols (e.g., by repeated measurements). Unfortunately, there are also systematic problems, such as low solubility in water, degradation of compounds, or "frequent hitters" (FHs).

Table 1 (excerpt). ChEMBL v. 21 [4]: 1,592,191 compounds, 13,968,617 data points, 1,212,831 assays (PubChem HTS assays and data mined from literature). BindingDB [6]: 529,618 compounds, 1,207,821 data points, 6,265 targets.
FHs are usually referred to as compounds that show unspecific activity in different assays. [23] Some of these compounds cause nonspecific binding (e.g., reactive compounds) and/or interfere with a particular assay technology (e.g., light quenching, compounds forming micelles, luciferase inhibitors, formation of complexes with tagged proteins for AlphaScreen [24]). Others are promiscuous binders that interact with different targets in a specific, dose-dependent fashion [25] and could constitute up to 99.8 % of hits. [26] An analysis of such HTS data without filtering out nonspecific binders and compounds that interfere with the assay technology could result in a model that predicts FHs rather than the target activity. Therefore, carrying out FH analysis helps to clean screening data and eventually to build better predictive models.
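The basic bookkeeping behind FH analysis can be sketched in a few lines: a compound tested in many assays with an unusually high hit rate is a candidate frequent hitter. The hit matrix, the minimum number of assays and the hit-rate cutoff below are illustrative assumptions, not values from the cited studies.

```python
# Sketch: flag candidate frequent hitters from a (synthetic) assay hit matrix.
# Thresholds are illustrative; real analyses also group assays by technology.
def frequent_hitters(hit_matrix, min_assays=10, max_hit_rate=0.25):
    """hit_matrix: dict compound -> list of 0/1 outcomes across assays."""
    flagged = set()
    for cpd, outcomes in hit_matrix.items():
        if len(outcomes) >= min_assays:          # enough evidence to judge
            rate = sum(outcomes) / len(outcomes)
            if rate > max_hit_rate:              # suspiciously promiscuous
                flagged.add(cpd)
    return flagged

hit_matrix = {
    "promiscuous":  [1] * 8 + [0] * 2,   # active in 80% of 10 assays
    "selective":    [1] + [0] * 11,      # active in 1 of 12 assays
    "undersampled": [1, 1],              # too few assays to judge
}
print(frequent_hitters(hit_matrix))  # {'promiscuous'}
```

Note that such a purely statistical flag cannot distinguish assay-technology artifacts from genuine promiscuous binders; that distinction needs the mechanistic analysis discussed below.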
Various sources of FH behavior have been proposed, such as chelation, redox activity, membrane disruption, singlet oxygen production, compound fluorescence, cysteine oxidation and non-selective reactivity. [26] It was also estimated that 1-2 % of drug-like compounds can self-associate into colloidal aggregates that non-specifically inhibit enzymes and other proteins at a typical screening concentration of 5 µM. [27] Baell and Holloway looked at compounds with activity in multiple assays, [11] found certain substructures that appeared repeatedly in promiscuous hits, and labeled them "Pan Assay Interference Compounds" (PAINS).
Non-specific binding, however, can also be important and is extensively exploited by nature. A significant overlap between PAINS substructures and natural products for quinones and catechols [28] indicates that these scaffolds were selected by evolution for their shotgun properties. Other substructures, such as 2-aminothiazoles, have been shown to be FHs in the sense of promiscuous binders, but are also present in marketed drugs. [29] An application of alerts developed by chemical providers to flag problematic compounds found that drugs are two- to three-fold enriched with such alerts compared to screening libraries. [30] Thus, a blind exclusion of "undesired" compounds carries a significant risk of missing potentially interesting compounds and of thus throwing the baby out with the bathwater.
In order to develop FH filters it is important to find assays which use similar technology. To be able to better analyze and compare different HTS data, the BioAssay Ontology (BAO) concept was developed. [10,31] It has been used to annotate HTS data both in PubChem and in an industrial setting. [9] The use of BAO makes it easier to group assays according to the technologies used and to identify the relevant FHs. The identified catalogue of FH substructures can be very useful for removing, in future HTS campaigns, chemical matter that will specifically interfere with the assay technology used. Thus, information about the mechanism of action of FHs will be important for designing and correctly interpreting screening campaigns. [24,32] It should be noted that even the best BAO and the best methods of standardizing experiments will never fully address the problem of heterogeneity and complexity of biological and chemical data. By no means can experiments performed in mice and in rats be combined into one single "activity" column associated with a single, standardized structure for all possible experiments. Such a merge can be performed only depending on the conditions of the experiments, the endpoint and, importantly, the properties of the compounds. For example, even for a simple lipophilicity property such as logP, shake-flask values can be merged with logD values obtained from HPLC experiments only for compounds that are non-ionized at the pH of the experiment, but not for all possible structures.
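The condition-dependent merging rule for logP/logD can be made concrete with a small sketch: values may share one column only for compounds that are non-ionized at the experiment pH. The pKa-based ionization check, the two-log-unit margin and the record fields are simplified, illustrative assumptions.

```python
# Sketch of condition-aware merging: logP (shake-flask) and logD (HPLC)
# values may be combined only for compounds non-ionized at the given pH.
# The margin of 2 pKa units is an illustrative rule of thumb.
def non_ionized(pka_acid, pka_base, ph, margin=2.0):
    """True if neither the acidic nor the basic group is appreciably
    ionized at the given pH (None means the group is absent)."""
    acid_ok = pka_acid is None or pka_acid > ph + margin
    base_ok = pka_base is None or pka_base < ph - margin
    return acid_ok and base_ok

def mergeable(record, ph=7.4):
    # logP and logD can be merged into one column only for neutral compounds
    return non_ionized(record.get("pka_acid"), record.get("pka_base"), ph)

print(mergeable({"pka_acid": 10.5, "pka_base": None}))  # True: neutral at pH 7.4
print(mergeable({"pka_acid": 4.2,  "pka_base": None}))  # False: acid is ionized
```

The point of the sketch is that the merge decision is made per compound and per experimental condition, never globally for a whole column.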

Data Visualization and Exploration of Chemical Space
The visualization and compact representation of millions of compounds (such as the > 110M compounds in SciFinder, see Table 1), which is usually the first step of data analysis, represents a significant challenge in Big Data analysis. It is usually done by projecting large compound collections into a low-dimensional space amenable to visual inspection and intuitive analysis by the human brain. It can help to detect chemical entities with novel chemical scaffolds and physicochemical properties (e.g., for compound library design), to compare different libraries, or to identify regions of chemical space that possess a certain pharmacological profile. [33] Exemplary approaches such as principal component analysis (PCA), [34] Generative Topographic Mapping (GTM), [35] Kohonen networks, [36] Diffusion Maps, [37] and interactive maps obtained by projection of high-dimensional descriptor spaces [38,39] are promising techniques in this context. Such visualization methods can also be used to interpret structure-activity relationships. [40] For example, in the "Stargate" version of GTM, the latent space links two different initial spaces: one defined by molecular descriptors and another by experimental activities. [41] This allows, on one hand, the prediction of the whole pharmacological profile of one particular chemical structure and, on the other hand, the identification of new structures corresponding to a given profile. Another example of exploring chemical space is the so-called ChemGPS approach to represent and navigate through drug-like [42] and pharmacokinetic [43] chemical space based on PCA components extracted from molecular 2D descriptors. Its variant ChemGPS-NP [44] characterizes the natural product space in particular. It has been shown that the accuracy of describing molecules in the ChemGPS-NP defined space is similar to the accuracy of structural fingerprints in retrieving bioactive molecules. [45]
Besides the space represented by known and available chemical structures, the chemical space composed of virtual compounds is much bigger. The number of potential molecular structures that could theoretically be enumerated is vast. For example, the database GDB-17 contains 166.4 billion molecules that are possible combinations of up to 17 atoms of C, N, O, S and halogens following simple rules of chemical stability and synthetic feasibility. [46,47] Although GDB-17 is already very large, it would be many orders of magnitude larger if extended to 20-30 heavy atoms, which is the average size of drug-like molecules. [48] These data sets raise new challenges even for the traditional profiling of chemical compound collections, which is used to identify chemicals with favorable properties (e.g., Lipinski's rule, non-toxicity, etc.). Even a fast algorithm able to process 100,000 molecules per minute would require > 1,000 days (~3 years) of calculation on one core to annotate the full GDB-17. If the model supports efficient parallelization, it could be executed on, e.g., the supercomputer of the Leibniz Supercomputing Centre with 241,000 cores; in this case the calculation time could theoretically be decreased to about ten minutes. Instead of a brute-force approach, one can rely on, e.g., a sequential triaging scheme that eliminates undesired regions, such as low solubility or low prediction accuracy due to the limited applicability domain of a model, [49] using very fast algorithms first and then applying more computationally expensive methods to the smaller subspaces. Thus, novel approaches and workflows are needed to efficiently search through this enormous chemical space.
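Such a sequential triaging scheme can be sketched as a pipeline of (filter, cost) stages ordered from cheapest to most expensive, so that expensive models only ever see the survivors of the cheap filters. The toy compounds, filter predicates and per-compound costs below are invented for illustration.

```python
# Sketch of sequential triage over a large virtual library: cheap filters
# run on everything, expensive models only on what survives.
def triage(compounds, stages):
    """stages: list of (predicate, cost_per_compound), cheapest first."""
    total_cost = 0.0
    survivors = list(compounds)
    for keep, cost in stages:
        total_cost += cost * len(survivors)   # pay only for what is left
        survivors = [c for c in survivors if keep(c)]
    return survivors, total_cost

# Toy compounds as (heavy_atom_count, predicted_solubility) tuples.
library = [(15, 0.8), (17, 0.1), (25, 0.9), (16, 0.7)]
stages = [
    (lambda c: c[0] <= 17, 1e-6),   # fast size filter, cheap per compound
    (lambda c: c[1] > 0.5, 1e-3),   # slower solubility model on survivors
]
hits, cost = triage(library, stages)
print(hits)  # [(15, 0.8), (16, 0.7)]
```

Running the expensive stage on all four compounds would cost 4e-3 in this toy accounting; the triage order reduces that to 3e-3 plus a negligible filtering cost, and the savings grow dramatically at GDB scale.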

Structure-Activity Relationship Modeling
Although a plethora of machine learning algorithms is available for SAR studies, [50,51] there is an increasing need for robust and efficient computational methods able to cope with very large and heterogeneous datasets. Current methods already allow building predictive models from hundreds of thousands of compounds and high-dimensional descriptors, with data matrices of > 0.2 trillion entries. [5] Advances in this field can also be expected from data fusion methods, which simultaneously model several related properties. [52] The simultaneous modeling of such incompatible data by exploring the inter-correlation between different properties, e.g., tissue/air partitioning coefficients in humans and in rats, has already successfully contributed models with improved accuracy compared to those built with any single activity data set. [52] Numerous methods have been developed to predict compound polypharmacology. [53][54][55] Prediction of the binding affinity of ligands to multiple proteins allows one to anticipate potential selectivity issues and to discover beneficial multi-target activities as early as possible in the drug discovery process, [56] or to perform target deconvolution for phenotypic screening. [57] Most of these methods rely on building single-target models individually; one future development could be to use all available chemogenomics data to pursue multi-task learning and build one multi-label model that predicts multiple target activities simultaneously. A recent study shows that massive multitask networks obtain predictive accuracies significantly better than single-task methods. [58] Probabilistic matrix factorization (PMF) methods have been found particularly useful for building multi-task models. [59,60] Further injection of ligand and protein information into the PMF method as side information may further improve prediction accuracy. [61] However, in a Big Data setting this requires huge computing power and a dedicated parallel programming model.
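The idea behind such matrix-factorization multi-task models can be shown with a minimal sketch: a sparse compound-by-target activity matrix is factored into low-dimensional latent vectors, and missing entries are predicted from their inner products. This is plain SGD on squared error rather than the probabilistic formulation of the cited PMF methods; all data and hyperparameters below are synthetic and illustrative.

```python
# Minimal matrix-factorization sketch for multi-target activity data:
# learn latent vectors U (compounds) and V (targets) from observed
# (compound, target) -> activity entries; predict the missing ones.
import random

def factorize(observed, n_cpds, n_tgts, k=2, lr=0.05, reg=0.01, epochs=2000):
    random.seed(0)
    U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_cpds)]
    V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_tgts)]
    for _ in range(epochs):
        for (i, j), y in observed.items():
            pred = sum(U[i][f] * V[j][f] for f in range(k))
            err = y - pred
            for f in range(k):          # L2-regularized SGD step
                u, v = U[i][f], V[j][f]
                U[i][f] += lr * (err * v - reg * u)
                V[j][f] += lr * (err * u - reg * v)
    return U, V

# Synthetic pActivity-like values for (compound, target) pairs.
observed = {(0, 0): 5.0, (0, 1): 7.0, (1, 0): 5.2, (2, 0): 5.1, (2, 1): 6.8}
U, V = factorize(observed, n_cpds=3, n_tgts=2)
# Predict the unobserved entry (compound 1, target 1).
pred_11 = sum(U[1][f] * V[1][f] for f in range(2))
print(round(pred_11, 1))
```

Because compound 1 behaves like compounds 0 and 2 on the shared target, its prediction for the missing target is pulled toward theirs; this transfer across tasks is exactly what the multi-task setting exploits.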
Recently, deep learning technology has gained large attention in the public media. In 2015 deep learning models achieved human-level accuracy in handwritten Chinese character recognition, [62] while in 2016 a deep-learning system won a Go [63] tournament against the human champion. Moreover, the recent announcement of the Google Cloud Platform [64] has made the technologies behind the best implementations of machine learning algorithms usable by non-experts. Deep learning neural network technology, which is able to efficiently deal with high-dimensional and complex data, has also been applied in the chemoinformatics area [51,65,66] and is expected to further contribute to the progress of this direction of studies.
Another important question is: "does more data contribute better models?" A consensus model to predict melting point (MP), developed with N = 275k measurements, gave RMSE = 31 ± 1 °C for the Bergström data set [67] of drugs (N = 277). [5] This result is almost a 15 °C improvement over the results of the original study [67] and a 3 °C improvement over a model developed with N = 47k molecules. [68] It should be noted that these models were developed using different descriptors and machine learning methods, which could contribute to the differences in their performance. To exclude the influence of these factors, we used exactly the same protocols as in ref. [5] to develop a model using the Bergström data only. The resulting consensus model gave RMSE = 50 ± 1 °C, confirming, on one hand, that the improvement in the prediction accuracy for MP was contributed by the increase in training set size and, on the other hand, suggesting that modern automatic patent text mining, which contributed > 80 % of the data of the 275k set, produced data of excellent quality.
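The role of consensus averaging in such RMSE comparisons can be illustrated on synthetic predictions: when individual models make partly anticorrelated errors, their average is substantially more accurate than either model alone. The numbers below are invented and unrelated to the melting point data.

```python
# Sketch: RMSE of two individual models vs. their consensus (simple
# average) on synthetic predictions; all values are invented.
from math import sqrt

def rmse(pred, true):
    return sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

true    = [100.0, 150.0, 200.0, 120.0]
model_a = [110.0, 140.0, 215.0, 115.0]   # errors partly opposite to model_b
model_b = [ 90.0, 162.0, 185.0, 130.0]
consensus = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(round(rmse(model_a, true), 1),
      round(rmse(model_b, true), 1),
      round(rmse(consensus, true), 1))
```

In this constructed example the individual errors largely cancel, which is the best case; with correlated errors the consensus gain shrinks, but it rarely hurts.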

De Novo Design
De novo design aims at generating new chemical entities with drug-like properties and desired biological activities in a directed fashion. Compared with normal virtual screening or HTS, which search for active compounds in physically available compound databases, de novo design tries to generate hypothetical candidate compounds in silico. There are mainly two types of methods for de novo molecular design: one is based on similarity to known active compounds, i.e., ligand-based de novo design, and the other is based on protein 3D structure, i.e., structure-based de novo design. Here we mainly discuss ligand-based methods; structure-based methods can be found elsewhere. [69] One way of doing de novo design is to search large virtual compound databases such as the GDB for de novo hits. In order to search the vast virtual chemical space, one needs integrated workflows combining efficient search and multi-parameter optimization strategies to filter out molecules with sub-optimal profiles as early as possible. For example, physicochemical and synthetic feasibility filters can be front-loaded to trim down the number of compounds. Ruddigkeit et al. [70] were able to search the entire GDB-17 database with a workflow combining the MQN 2D structure fingerprint with the ROCS shape-matching method. Another strategy is reaction-driven, fragment-based de novo design. Based on known chemical reactions and commercially available building blocks, chemically diverse and synthetically feasible compounds are generated via a normally multi-step, multi-parameter optimization process searching for candidate compounds that satisfy a certain property profile. These reaction-based methods have been successfully applied to design de novo bioactive compounds. [71][72][73] A third strategy to provide an intelligent search for new compounds is to generate structures which are sufficiently new but still within the chemical space covered by the models.
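The reaction-driven, fragment-based strategy can be sketched as combinatorial coupling of building blocks followed by a front-loaded property filter. The building blocks, the "amide coupling" arithmetic (sum of weights minus water) and the molecular weight cutoff below are all hypothetical, chosen only to show the enumerate-then-filter pattern.

```python
# Sketch of reaction-driven enumeration: couple building blocks via a
# hypothetical two-component reaction, keep products passing a cheap
# front-loaded property filter. All fragments and values are illustrative.
from itertools import product

acids  = [("A1", 122.0), ("A2", 180.0)]               # (id, mol. weight)
amines = [("B1", 59.0), ("B2", 150.0), ("B3", 210.0)]

def couple(acid, amine):
    """Hypothetical amide coupling: product MW = sum of parts - water."""
    return (acid[0] + "-" + amine[0], acid[1] + amine[1] - 18.0)

candidates = [couple(a, b) for a, b in product(acids, amines)]
# Front-loaded multi-parameter filter: keep only a drug-like MW range.
hits = [c for c in candidates if c[1] <= 300.0]
print(hits)  # [('A1-B1', 163.0), ('A1-B2', 254.0), ('A2-B1', 221.0)]
```

Real workflows apply many such filters in sequence (synthetic feasibility, physicochemical profile, predicted activity), but the combinatorial core is the same.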
A group of these methods, known as "inverse QSAR", has received a boost during recent years due to increasing computational power and new theoretical developments. A set of linear constrained Diophantine equations was used by Faulon et al. [74] to exhaustively enumerate new compounds. Wong et al. [75] used kernel methods to map training compounds from the input space to the kernel feature space. In this space the authors generated new data points, which were then used to recover chemical structures. This approach is actually similar to that of the aforementioned "Stargate" GTM, [41] with the exception that the former algorithm does not use supervised learning to create maps. In another approach, Funatsu et al. [76] used Gaussian models and Bayesian inference to exhaustively fill a target region of the model space with new structures. Thus, these methods propose novel chemical structures while still staying within the chemical space of the QSAR models.

Data Sharing and Data Security
Even large pharma companies can accumulate only limited amounts of relevant property information. As mentioned before, sharing data collected by different organizations offers the opportunity to develop computational models on a much broader data basis, thereby increasing model robustness, accuracy and coverage of chemical space. [77,78] The development of approaches to predict ADME/T properties in a collaborative manner is becoming a part of future pharma R&D strategies. Recently, AstraZeneca and Bayer made efforts to compare their entire compound collections in a secure manner, [22] while AstraZeneca and Roche started a data sharing consortium on the topic of matched molecular pairs to improve the metabolism, pharmacokinetics, and safety of their compounds through MedChemica. [21] Moreover, AstraZeneca has already donated some of its ADME/Tox data to ChEMBL. [79] However, collaborative efforts in this field are generally not straightforward. The intellectual property aspects associated with private compound collections and associated data might be relevant for ongoing drug discovery efforts. Secure multiparty computation methods based on modern encryption theory [80,81] provide ways to develop models without the need to share molecular structures or proprietary molecular representations. These methods are compute-intense and bandwidth-demanding, but the fast development of the Internet and the increasing computational power of computers are making them applicable to real-world problems. [82,83]
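One building block of secure multi-party computation, additive secret sharing, can be sketched in pure Python: several parties learn a joint aggregate (here, the sum of private counts, e.g., how many of their compounds pass some filter) without any party revealing its own value. This is a didactic sketch of the general technique, not the protocol of the cited references and not production-grade cryptography.

```python
# Didactic sketch of additive secret sharing (one building block of
# secure multi-party computation): three parties compute the sum of
# their private counts without revealing the individual values.
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(value, n_parties=3):
    """Split value into n random shares that sum to it modulo PRIME;
    any subset of fewer than n shares reveals nothing about value."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

private_counts = [12, 30, 7]                 # each party's secret
all_shares = [share(v) for v in private_counts]
# Party i sums the i-th share it received from every party ...
partials = [sum(s[i] for s in all_shares) % PRIME for i in range(3)]
# ... and only combining all partial sums reveals the joint result.
total = sum(partials) % PRIME
print(total)  # 49
```

Real secure-modeling protocols compose such primitives into full model training, which is where the compute and bandwidth costs mentioned above arise.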
Training Big Data Scientists - The Chemoinformaticians

The "Big Data" challenges require professionally trained experts, "data scientists in chemistry", the chemoinformaticians, who can cope with the complexity and diversity of problems in this field of scientific discovery. Traditional "data scientists" coming from the computer science field, as well as computational chemists with little knowledge of computer science, are unlikely to have sufficient knowledge and expertise to address chemoinformatics questions and will need additional training. Important questions in this regard are the following: How should one balance chemistry and computer science training? How should one ensure a high level of scientific expertise and, at the same time, a practically oriented mindset? Which new and rapidly developing methodologies should be considered? How should one prepare trainees to work at the interface between computing, chemistry, and pharmaceutical research? These questions can be answered only through close interactions between academic partners and end-users and the tight involvement of industrial partners in targeted research training. In this respect, training programs such as those offered through the Marie Skłodowska-Curie Actions provide generous funding support by means of Innovative Training Networks, which foster and promote such interactions.

Conclusions
Both industry and academic partners share high expectations from "Big Data" in chemistry, which is a new, emerging area of research at the border of several disciplines. Advances in this area require the development of new computational approaches and, more importantly, the education of scientists who will further progress this field.

Conflict of Interest
IVT is CEO and founder of BigChem GmbH, which licenses OCHEM [17].